Message ID | 20240529180510.2295118-3-jthoughton@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm: multi-gen LRU: Walk secondary MMU page tables while aging | expand |
On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > Secondary MMUs are currently consulted for access/age information at > eviction time, but before then, we don't get accurate age information. > That is, pages that are mostly accessed through a secondary MMU (like > guest memory, used by KVM) will always just proceed down to the oldest > generation, and then at eviction time, if KVM reports the page to be > young, the page will be activated/promoted back to the youngest > generation. Correct, and as I explained offline, this is the only reasonable behavior if we can't locklessly walk secondary MMUs. Just for the record, the (crude) analogy I used was: Imagine a large room with many bills ($1, $5, $10, ...) on the floor, but you are only allowed to pick up 10 of them (and put them in your pocket). A smart move would be to survey the room *first and then* pick up the largest ones. But if you are carrying a 500 lbs backpack, you would just want to pick up whichever that's in front of you rather than walk the entire room. MGLRU should only scan (or lookaround) secondary MMUs if it can be done lockless. Otherwise, it should just fall back to the existing approach, which existed in previous versions but is removed in this version. > Do not do look around if there is a secondary MMU we have to interact > with. > > The added feature bit (0x8), if disabled, will make MGLRU behave as if > there are no secondary MMUs subscribed to MMU notifiers except at > eviction time. > > Suggested-by: Yu Zhao <yuzhao@google.com> > Signed-off-by: James Houghton <jthoughton@google.com> This is not what I suggested, and it would have been done in the first place if it hadn't regressed the non-lockless case. NAK.
On Wed, May 29, 2024, Yu Zhao wrote: > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > Secondary MMUs are currently consulted for access/age information at > > eviction time, but before then, we don't get accurate age information. > > That is, pages that are mostly accessed through a secondary MMU (like > > guest memory, used by KVM) will always just proceed down to the oldest > > generation, and then at eviction time, if KVM reports the page to be > > young, the page will be activated/promoted back to the youngest > > generation. > > Correct, and as I explained offline, this is the only reasonable > behavior if we can't locklessly walk secondary MMUs. > > Just for the record, the (crude) analogy I used was: > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > but you are only allowed to pick up 10 of them (and put them in your > pocket). A smart move would be to survey the room *first and then* > pick up the largest ones. But if you are carrying a 500 lbs backpack, > you would just want to pick up whichever that's in front of you rather > than walk the entire room. > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > done lockless. Otherwise, it should just fall back to the existing > approach, which existed in previous versions but is removed in this > version. IIUC, by "existing approach" you mean completely ignore secondary MMUs that don't implement a lockless walk?
On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, May 29, 2024, Yu Zhao wrote: > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > > > Secondary MMUs are currently consulted for access/age information at > > > eviction time, but before then, we don't get accurate age information. > > > That is, pages that are mostly accessed through a secondary MMU (like > > > guest memory, used by KVM) will always just proceed down to the oldest > > > generation, and then at eviction time, if KVM reports the page to be > > > young, the page will be activated/promoted back to the youngest > > > generation. > > > > Correct, and as I explained offline, this is the only reasonable > > behavior if we can't locklessly walk secondary MMUs. > > > > Just for the record, the (crude) analogy I used was: > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > but you are only allowed to pick up 10 of them (and put them in your > > pocket). A smart move would be to survey the room *first and then* > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > you would just want to pick up whichever that's in front of you rather > > than walk the entire room. > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > done lockless. Otherwise, it should just fall back to the existing > > approach, which existed in previous versions but is removed in this > > version. > > IIUC, by "existing approach" you mean completely ignore secondary MMUs that don't > implement a lockless walk? No, the existing approach only checks secondary MMUs for LRU folios, i.e., those at the end of the LRU list. It might not find the best candidates (the coldest ones) on the entire list, but it doesn't pay as much for the locking. MGLRU can *optionally* scan MMUs (secondary included) to find the best candidates, but it can only be a win if the scanning incurs a relatively low overhead, e.g., done locklessly for the secondary MMU. IOW, this is a balance between the cost of reclaiming not-so-cold (warm) folios and that of finding the coldest folios. Scanning host MMUs is likely to be a win because 1) there is usually access locality 2) there is no coarsed locking. If neither holds, scanning secondary MMUs would likely be a loss. And 1) is generally weaker for secondary MMUs, since it's about (guest) physical address space.
On Wed, May 29, 2024, Yu Zhao wrote: > On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Wed, May 29, 2024, Yu Zhao wrote: > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > > > > > Secondary MMUs are currently consulted for access/age information at > > > > eviction time, but before then, we don't get accurate age information. > > > > That is, pages that are mostly accessed through a secondary MMU (like > > > > guest memory, used by KVM) will always just proceed down to the oldest > > > > generation, and then at eviction time, if KVM reports the page to be > > > > young, the page will be activated/promoted back to the youngest > > > > generation. > > > > > > Correct, and as I explained offline, this is the only reasonable > > > behavior if we can't locklessly walk secondary MMUs. > > > > > > Just for the record, the (crude) analogy I used was: > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > > but you are only allowed to pick up 10 of them (and put them in your > > > pocket). A smart move would be to survey the room *first and then* > > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > > you would just want to pick up whichever that's in front of you rather > > > than walk the entire room. > > > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > > done lockless. Otherwise, it should just fall back to the existing > > > approach, which existed in previous versions but is removed in this > > > version. > > > > IIUC, by "existing approach" you mean completely ignore secondary MMUs that > > don't implement a lockless walk? > > No, the existing approach only checks secondary MMUs for LRU folios, > i.e., those at the end of the LRU list. It might not find the best > candidates (the coldest ones) on the entire list, but it doesn't pay > as much for the locking. MGLRU can *optionally* scan MMUs (secondary > included) to find the best candidates, but it can only be a win if the > scanning incurs a relatively low overhead, e.g., done locklessly for > the secondary MMU. IOW, this is a balance between the cost of > reclaiming not-so-cold (warm) folios and that of finding the coldest > folios. Gotcha. I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler _code_, but I think it increases the overall system complexity. E.g. distros will likely enable the Kconfig, and in my experience people using KVM with a distro kernel usually aren't kernel experts, i.e. likely won't know that there's even a decision to be made, let alone be able to make an informed decision. Having an mmu_notifier hook that is conditionally implemented doesn't seem overly complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for KVM to nullify its mmu_notifier hook during initialization. The hardest part is likely going to be figuring out the threshold for how much overhead is too much.
On Wed, May 29, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, May 29, 2024, Yu Zhao wrote: > > On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Wed, May 29, 2024, Yu Zhao wrote: > > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > > > > > > > Secondary MMUs are currently consulted for access/age information at > > > > > eviction time, but before then, we don't get accurate age information. > > > > > That is, pages that are mostly accessed through a secondary MMU (like > > > > > guest memory, used by KVM) will always just proceed down to the oldest > > > > > generation, and then at eviction time, if KVM reports the page to be > > > > > young, the page will be activated/promoted back to the youngest > > > > > generation. > > > > > > > > Correct, and as I explained offline, this is the only reasonable > > > > behavior if we can't locklessly walk secondary MMUs. > > > > > > > > Just for the record, the (crude) analogy I used was: > > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > > > but you are only allowed to pick up 10 of them (and put them in your > > > > pocket). A smart move would be to survey the room *first and then* > > > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > > > you would just want to pick up whichever that's in front of you rather > > > > than walk the entire room. > > > > > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > > > done lockless. Otherwise, it should just fall back to the existing > > > > approach, which existed in previous versions but is removed in this > > > > version. > > > > > > IIUC, by "existing approach" you mean completely ignore secondary MMUs that > > > don't implement a lockless walk? > > > > No, the existing approach only checks secondary MMUs for LRU folios, > > i.e., those at the end of the LRU list. It might not find the best > > candidates (the coldest ones) on the entire list, but it doesn't pay > > as much for the locking. MGLRU can *optionally* scan MMUs (secondary > > included) to find the best candidates, but it can only be a win if the > > scanning incurs a relatively low overhead, e.g., done locklessly for > > the secondary MMU. IOW, this is a balance between the cost of > > reclaiming not-so-cold (warm) folios and that of finding the coldest > > folios. > > Gotcha. > > I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler > _code_, but I think it increases the overall system complexity. E.g. distros > will likely enable the Kconfig, and in my experience people using KVM with a > distro kernel usually aren't kernel experts, i.e. likely won't know that there's > even a decision to be made, let alone be able to make an informed decision. > > Having an mmu_notifier hook that is conditionally implemented doesn't seem overly > complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for > KVM to nullify its mmu_notifier hook during initialization. The hardest part is > likely going to be figuring out the threshold for how much overhead is too much. Hi Yu, Sean, Perhaps I "simplified" this bit of the series a little bit too much. Being able to opportunistically do aging with KVM (even without setting the Kconfig) is valuable. IIUC, we have the following possibilities: - v4: aging with KVM is done if the new Kconfig is set. - v3: aging with KVM is always done. - v2: aging with KVM is done when the architecture reports that it can probably be done locklessly, set at KVM MMU init time. - Another possibility?: aging with KVM is only done exactly when it can be done locklessly (i.e., mmu_notifier_test/clear_young() called such that it will not grab any locks). I like the v4 approach because: 1. We can choose whether or not to do aging with KVM no matter what architecture we're using (without requiring userspace be aware to disable the feature at runtime with sysfs to avoid regressing performance if they don't care about proactive reclaim). 2. If we check the new feature bit (0x8) in sysfs, we can know for sure if aging is meant to be working or not. The selftest changes I made won't work properly unless there is a way to be sure that aging is working with KVM. For look-around at eviction time: - v4: done if the main mm PTE was young and no MMU notifiers are subscribed. - v2/v3: done if the main mm PTE was young or (the SPTE was young and the MMU notifier was lockless/fast). I made this logic change as part of removing batching. I'd really appreciate guidance on what the correct thing to do is. In my mind, what would work great is: by default, do aging exactly when KVM can do it locklessly, and then have a Kconfig to always have MGLRU to do aging with KVM if a user really cares about proactive reclaim (when the feature bit is set). The selftest can check the Kconfig + feature bit to know for sure if aging will be done. I'm not sure what the exact right thing to do for look-around is. Thanks for the quick feedback.
On Wed, May 29, 2024 at 7:08 PM James Houghton <jthoughton@google.com> wrote: > > On Wed, May 29, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Wed, May 29, 2024, Yu Zhao wrote: > > > On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > On Wed, May 29, 2024, Yu Zhao wrote: > > > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > > > > > > > > > Secondary MMUs are currently consulted for access/age information at > > > > > > eviction time, but before then, we don't get accurate age information. > > > > > > That is, pages that are mostly accessed through a secondary MMU (like > > > > > > guest memory, used by KVM) will always just proceed down to the oldest > > > > > > generation, and then at eviction time, if KVM reports the page to be > > > > > > young, the page will be activated/promoted back to the youngest > > > > > > generation. > > > > > > > > > > Correct, and as I explained offline, this is the only reasonable > > > > > behavior if we can't locklessly walk secondary MMUs. > > > > > > > > > > Just for the record, the (crude) analogy I used was: > > > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > > > > but you are only allowed to pick up 10 of them (and put them in your > > > > > pocket). A smart move would be to survey the room *first and then* > > > > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > > > > you would just want to pick up whichever that's in front of you rather > > > > > than walk the entire room. > > > > > > > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > > > > done lockless. Otherwise, it should just fall back to the existing > > > > > approach, which existed in previous versions but is removed in this > > > > > version. > > > > > > > > IIUC, by "existing approach" you mean completely ignore secondary MMUs that > > > > don't implement a lockless walk? > > > > > > No, the existing approach only checks secondary MMUs for LRU folios, > > > i.e., those at the end of the LRU list. It might not find the best > > > candidates (the coldest ones) on the entire list, but it doesn't pay > > > as much for the locking. MGLRU can *optionally* scan MMUs (secondary > > > included) to find the best candidates, but it can only be a win if the > > > scanning incurs a relatively low overhead, e.g., done locklessly for > > > the secondary MMU. IOW, this is a balance between the cost of > > > reclaiming not-so-cold (warm) folios and that of finding the coldest > > > folios. > > > > Gotcha. > > > > I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler > > _code_, but I think it increases the overall system complexity. E.g. distros > > will likely enable the Kconfig, and in my experience people using KVM with a > > distro kernel usually aren't kernel experts, i.e. likely won't know that there's > > even a decision to be made, let alone be able to make an informed decision. > > > > Having an mmu_notifier hook that is conditionally implemented doesn't seem overly > > complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for > > KVM to nullify its mmu_notifier hook during initialization. The hardest part is > > likely going to be figuring out the threshold for how much overhead is too much. > > Hi Yu, Sean, > > Perhaps I "simplified" this bit of the series a little bit too much. > Being able to opportunistically do aging with KVM (even without > setting the Kconfig) is valuable. > > IIUC, we have the following possibilities: > - v4: aging with KVM is done if the new Kconfig is set. > - v3: aging with KVM is always done. This is not true -- in v3, MGLRU only scans secondary MMUs if it can be done locklessly on x86. It uses a bitmap to imply this requirement. > - v2: aging with KVM is done when the architecture reports that it can > probably be done locklessly, set at KVM MMU init time. Not really -- it's only done if it can be done locklessly on both x86 and arm64. > - Another possibility?: aging with KVM is only done exactly when it > can be done locklessly (i.e., mmu_notifier_test/clear_young() called > such that it will not grab any locks). This is exactly the case for v2. > I like the v4 approach because: > 1. We can choose whether or not to do aging with KVM no matter what > architecture we're using (without requiring userspace be aware to > disable the feature at runtime with sysfs to avoid regressing > performance if they don't care about proactive reclaim). > 2. If we check the new feature bit (0x8) in sysfs, we can know for > sure if aging is meant to be working or not. The selftest changes I > made won't work properly unless there is a way to be sure that aging > is working with KVM. I'm not convinced, but it doesn't mean your point of view is invalid. If you fully understand the implications of your design choice and document them, I will not object. All optimizations in v2 were measured step by step. Even that bitmap, which might be considered overengineered, brought a readily measuarable 4% improvement in memcached throughput on Altra Max swapping to Optane: Using the bitmap (64 KVM PTEs for each call) ============================================================================================================================ Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec ---------------------------------------------------------------------------------------------------------------------------- Sets 0.00 --- --- --- --- --- --- 0.00 Gets 1012801.92 431436.92 14965.11 0.06246 0.04700 0.16700 4.31900 39635.83 Waits 0.00 --- --- --- --- --- --- --- Totals 1012801.92 431436.92 14965.11 0.06246 0.04700 0.16700 4.31900 39635.83 Not using the bitmap (1 KVM PTEs for each call) ============================================================================================================================ Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec ---------------------------------------------------------------------------------------------------------------------------- Sets 0.00 --- --- --- --- --- --- 0.00 Gets 968210.02 412443.85 14303.89 0.06517 0.04700 0.15900 7.42300 37890.74 Waits 0.00 --- --- --- --- --- --- --- Totals 968210.02 412443.85 14303.89 0.06517 0.04700 0.15900 7.42300 37890.74 FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached. What I don't think is acceptable is simplifying those optimizations out without documenting your justifications (I would even call it a design change, rather than simplification, from v3 to v4). > For look-around at eviction time: > - v4: done if the main mm PTE was young and no MMU notifiers are subscribed. > - v2/v3: done if the main mm PTE was young or (the SPTE was young and > the MMU notifier was lockless/fast). The host and secondary MMUs are two *independent* cases, IMO: 1. lookaround the host MMU if the PTE mapping the folio under reclaim is young. 2. lookaround the secondary MMU if it can be done locklessly. So the v2/v3 behavior sounds a lot more reasonable to me. Also a nit -- don't use 'else' in the following case (should_look_around()): if (foo) return bar; else do_something(); > I made this logic change as part of removing batching. > > I'd really appreciate guidance on what the correct thing to do is. > > In my mind, what would work great is: by default, do aging exactly > when KVM can do it locklessly, and then have a Kconfig to always have > MGLRU to do aging with KVM if a user really cares about proactive > reclaim (when the feature bit is set). The selftest can check the > Kconfig + feature bit to know for sure if aging will be done. I still don't see how that Kconfig helps. Or why the new static branch isn't enough? > I'm not sure what the exact right thing to do for look-around is. > > Thanks for the quick feedback.
On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote: [...] > All optimizations in v2 were measured step by step. Even that bitmap, > which might be considered overengineered, brought a readily > measuarable 4% improvement in memcached throughput on Altra Max > swapping to Optane: That's great, but taking an iterative approach to the problem allows the reviewers and maintainers to come to their own conclusions about each optimization independently. Squashing all of that together and posting the result doesn't allow for this. Even if we were to take the series as-is, the door is wide open to subsequent improvements. > What I don't think is acceptable is simplifying those optimizations > out without documenting your justifications (I would even call it a > design change, rather than simplification, from v3 to v4). No, sorry, there's nothing wrong with James' approach here. The discussion that led to the design of v4 happened on list; you were on CC. The general consensus on the KVM side was that the bitmap was complicated and lacked independent justification. There was ample opportunity to voice your concerns before he spent the time on v4. You seriously cannot fault a contributor for respinning their work based on the provided feedback.
On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote: > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > Secondary MMUs are currently consulted for access/age information at > > eviction time, but before then, we don't get accurate age information. > > That is, pages that are mostly accessed through a secondary MMU (like > > guest memory, used by KVM) will always just proceed down to the oldest > > generation, and then at eviction time, if KVM reports the page to be > > young, the page will be activated/promoted back to the youngest > > generation. > > Correct, and as I explained offline, this is the only reasonable > behavior if we can't locklessly walk secondary MMUs. > > Just for the record, the (crude) analogy I used was: > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > but you are only allowed to pick up 10 of them (and put them in your > pocket). A smart move would be to survey the room *first and then* > pick up the largest ones. But if you are carrying a 500 lbs backpack, > you would just want to pick up whichever that's in front of you rather > than walk the entire room. > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > done lockless. Otherwise, it should just fall back to the existing > approach, which existed in previous versions but is removed in this > version. Grabbing the MMU lock for write to scan sucks, no argument there. But can you please be specific about the impact of read lock v. RCU in the case of arm64? I had asked about this before and you never replied. My concern remains that adding support for software table walkers outside of the MMU lock entirely requires more work than just deferring the deallocation to an RCU callback. Walkers that previously assumed 'exclusive' access while holding the MMU lock for write must now cope with volatile PTEs. Yes, this problem already exists when hardware sets the AF, but the lock-free walker implementation needs to be generic so it can be applied for other PTE bits.
On Fri, May 31, 2024 at 1:03 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote: Let me add back what I said earlier: I'm not convinced, but it doesn't mean your point of view is invalid. If you fully understand the implications of your design choice and document them, I will not object. > > All optimizations in v2 were measured step by step. Even that bitmap, > > which might be considered overengineered, brought a readily > > measuarable 4% improvement in memcached throughput on Altra Max > > swapping to Optane: > > That's great, but taking an iterative approach to the problem allows > the reviewers and maintainers to come to their own conclusions about > each optimization independently. Squashing all of that together and > posting the result doesn't allow for this. That's your methodology, which I respect: as I said I won't stand in your way. But mine is backed by data, please do respect that as well, by doing what I asked: document your justifications. > Even if we were to take the series as-is, the door is wide open to > subsequent improvements. > > > What I don't think is acceptable is simplifying those optimizations > > out without documenting your justifications (I would even call it a > > design change, rather than simplification, from v3 to v4). > > No, sorry, there's nothing wrong with James' approach here. Sorry, are you saying "without documenting your justifications" is nothing wrong? If so, please elaborate. > The discussion that led to the design of v4 happened on list; you were > on CC. The general consensus on the KVM side was that the bitmap was > complicated and lacked independent justification. There was ample > opportunity to voice your concerns before he spent the time on v4. Please re-read my previous emails -- I never object to the removal of the bitmap or James' approach. And please stop making assumptions -- I did voice my concerns with James privately. > You seriously cannot fault a contributor for respinning their work based > on the provided feedback. Are you saying I faulted James for taking others' feedback? If so, where? And I'll make sure I don't give such an impression in the future. Also what do you think about the technical flaws and inaccurate understandings I pointed out? You seem to have a strong opinion on your iterate approach, but I hope you didn't choose to overlook the real meat of this discussion.
On Fri, May 31, 2024 at 10:45:04AM -0600, Yu Zhao wrote: > On Fri, May 31, 2024 at 1:03 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote: > > Let me add back what I said earlier: > > I'm not convinced, but it doesn't mean your point of view is > invalid. If you fully understand the implications of your design > choice and document them, I will not object. Thanks, I appreciate the sentiment. Hopefully we can align here. > > > All optimizations in v2 were measured step by step. Even that bitmap, > > > which might be considered overengineered, brought a readily > > > measuarable 4% improvement in memcached throughput on Altra Max > > > swapping to Optane: > > > > That's great, but taking an iterative approach to the problem allows > > the reviewers and maintainers to come to their own conclusions about > > each optimization independently. Squashing all of that together and > > posting the result doesn't allow for this. > > That's your methodology, which I respect: as I said I won't stand in your way. > > But mine is backed by data, please do respect that as well, Data is useful and expected for changes that aim to improve the performance of a system in one way or another. That is, after all, the sole intention of the work, no? What I'm also looking for is a controlled experiment, where a single independent variable (e.g. locking) can be evaluated against the baseline. All-or-nothing data has limited usefulness. > by doing what I asked: document your justifications. The justification for a series is against the upstream tree, not some out-of-tree stuff. The cover letter explicitly calls out the decision to simplify the patch series along with performance data I can reproduce on my own systems. This is a perfect example of how to contribute changes upstream. > > > What I don't think is acceptable is simplifying those optimizations > > > out without documenting your justifications (I would even call it a > > > design change, rather than simplification, from v3 to v4). > > > > No, sorry, there's nothing wrong with James' approach here. > > Sorry, are you saying "without documenting your justifications" is > nothing wrong? If so, please elaborate. As I mentioned above, the reasoning is adequately documented and the discussion that led to v4 is public. OTOH... > > The discussion that led to the design of v4 happened on list; you were > > on CC. The general consensus on the KVM side was that the bitmap was > > complicated and lacked independent justification. There was ample > > opportunity to voice your concerns before he spent the time on v4. > > Please re-read my previous emails -- I never object to the removal of > the bitmap or James' approach. > > And please stop making assumptions -- I did voice my concerns with > James privately. ^~~~~~~~~ If it happened in private then its no better than having said nothing at all. Please, keep the conversation on-list next time so we can iron out any disagreements there. Otherwise contributors are put in a *very* awkward situation of mediating the on- and off-list dialogue. > > You seriously cannot fault a contributor for respinning their work based > > on the provided feedback. > > Are you saying I faulted James for taking others' feedback? No. Sufficient justification is captured in the public review feedback + series cover letter. Your statement that the approach was changed without justification is unsubstantiated. > Also what do you think about the technical flaws and inaccurate > understandings I pointed out? You seem to have a strong opinion on > your iterate approach, but I hope you didn't choose to overlook the > real meat of this discussion. Serious question: are you not receiving my mail or something? I re-raised my question to you from ages ago about locking on the arm64 MMU. You didn't answer last time, I'd appreciate a reply this time around. Otherwise I couldn't be bothered about the color of the Kconfig bikeshed and don't have anything meaningful to add there. I think the three of you are trending in the right direction.
On Fri, May 31, 2024 at 1:24 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote: > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > > > Secondary MMUs are currently consulted for access/age information at > > > eviction time, but before then, we don't get accurate age information. > > > That is, pages that are mostly accessed through a secondary MMU (like > > > guest memory, used by KVM) will always just proceed down to the oldest > > > generation, and then at eviction time, if KVM reports the page to be > > > young, the page will be activated/promoted back to the youngest > > > generation. > > > > Correct, and as I explained offline, this is the only reasonable > > behavior if we can't locklessly walk secondary MMUs. > > > > Just for the record, the (crude) analogy I used was: > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > but you are only allowed to pick up 10 of them (and put them in your > > pocket). A smart move would be to survey the room *first and then* > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > you would just want to pick up whichever that's in front of you rather > > than walk the entire room. > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > done lockless. Otherwise, it should just fall back to the existing > > approach, which existed in previous versions but is removed in this > > version. > > Grabbing the MMU lock for write to scan sucks, no argument there. But > can you please be specific about the impact of read lock v. RCU in the > case of arm64? I had asked about this before and you never replied. > > My concern remains that adding support for software table walkers > outside of the MMU lock entirely requires more work than just deferring > the deallocation to an RCU callback. Walkers that previously assumed > 'exclusive' access while holding the MMU lock for write must now cope > with volatile PTEs. > > Yes, this problem already exists when hardware sets the AF, but the > lock-free walker implementation needs to be generic so it can be applied > for other PTE bits. Direct reclaim is multi-threaded and each reclaimer can take the mmu lock for read (testing the A-bit) or write (unmapping before paging out) on arm64. The fundamental problem of using the readers-writer lock in this case is priority inversion: the readers have lower priority than the writers, so ideally, we don't want the readers to block the writers at all. Using my previous (crude) analogy: puting the bill right in front of you (the writers) profits immediately whereas searching for the largest bill (the readers) can be futile. As I said earlier, I prefer we drop the arm64 support for now, but I will not object to taking the mmu lock for read when clearing the A-bit, as long as we fully understand the problem here and document it clearly.
On Fri, May 31, 2024 at 1:31 PM Yu Zhao <yuzhao@google.com> wrote: > > On Fri, May 31, 2024 at 1:24 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote: > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > > > > > Secondary MMUs are currently consulted for access/age information at > > > > eviction time, but before then, we don't get accurate age information. > > > > That is, pages that are mostly accessed through a secondary MMU (like > > > > guest memory, used by KVM) will always just proceed down to the oldest > > > > generation, and then at eviction time, if KVM reports the page to be > > > > young, the page will be activated/promoted back to the youngest > > > > generation. > > > > > > Correct, and as I explained offline, this is the only reasonable > > > behavior if we can't locklessly walk secondary MMUs. > > > > > > Just for the record, the (crude) analogy I used was: > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > > but you are only allowed to pick up 10 of them (and put them in your > > > pocket). A smart move would be to survey the room *first and then* > > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > > you would just want to pick up whichever that's in front of you rather > > > than walk the entire room. > > > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > > done lockless. Otherwise, it should just fall back to the existing > > > approach, which existed in previous versions but is removed in this > > > version. > > > > Grabbing the MMU lock for write to scan sucks, no argument there. But > > can you please be specific about the impact of read lock v. RCU in the > > case of arm64? I had asked about this before and you never replied. > > > > My concern remains that adding support for software table walkers > > outside of the MMU lock entirely requires more work than just deferring > > the deallocation to an RCU callback. Walkers that previously assumed > > 'exclusive' access while holding the MMU lock for write must now cope > > with volatile PTEs. > > > > Yes, this problem already exists when hardware sets the AF, but the > > lock-free walker implementation needs to be generic so it can be applied > > for other PTE bits. > > Direct reclaim is multi-threaded and each reclaimer can take the mmu > lock for read (testing the A-bit) or write (unmapping before paging > out) on arm64. The fundamental problem of using the readers-writer > lock in this case is priority inversion: the readers have lower > priority than the writers, so ideally, we don't want the readers to > block the writers at all. > > Using my previous (crude) analogy: puting the bill right in front of > you (the writers) profits immediately whereas searching for the > largest bill (the readers) can be futile. > > As I said earlier, I prefer we drop the arm64 support for now, but I > will not object to taking the mmu lock for read when clearing the > A-bit, as long as we fully understand the problem here and document it > clearly. FWIW, Google Cloud has been doing proactive reclaim and kstaled-based aging (a Google-internal page aging daemon, for those outside of Google) for many years on x86 VMs with the A-bit harvesting under the write-lock. So I'm skeptical that making ARM64 lockless is necessary to allow Secondary MMUs to participate in MGLRU aging with acceptable performance for Cloud usecases. I don't even think it's necessary on x86 but it's a simple enough change that we might as well just do it. I suspect under pathological conditions (host under intense memory pressure and high rate of reclaim occurring) making A-bit harvesting lockless will perform better. But under such conditions VM performance is likely going to suffer regardless. In a Cloud environment we deal with that through other mechanisms to reduce the rate of reclaim and make the host healthy. For these reasons, I think there's value in giving users the option to enable Secondary MMUs participation MGLRU aging even when A-bit test/clearing is not done locklessly. I believe this was James' intent with the Kconfig. Perhaps a default-off writable module parameter would be better to avoid distros accidentally turning it on? If and when there is a usecase for optimizing VM performance under pathological reclaim conditions on ARM, we can make it lockless then.
On Fri, May 31, 2024 at 2:06 PM David Matlack <dmatlack@google.com> wrote: > > On Fri, May 31, 2024 at 1:31 PM Yu Zhao <yuzhao@google.com> wrote: > > > > On Fri, May 31, 2024 at 1:24 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote: > > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote: > > > > > > > > > > Secondary MMUs are currently consulted for access/age information at > > > > > eviction time, but before then, we don't get accurate age information. > > > > > That is, pages that are mostly accessed through a secondary MMU (like > > > > > guest memory, used by KVM) will always just proceed down to the oldest > > > > > generation, and then at eviction time, if KVM reports the page to be > > > > > young, the page will be activated/promoted back to the youngest > > > > > generation. > > > > > > > > Correct, and as I explained offline, this is the only reasonable > > > > behavior if we can't locklessly walk secondary MMUs. > > > > > > > > Just for the record, the (crude) analogy I used was: > > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor, > > > > but you are only allowed to pick up 10 of them (and put them in your > > > > pocket). A smart move would be to survey the room *first and then* > > > > pick up the largest ones. But if you are carrying a 500 lbs backpack, > > > > you would just want to pick up whichever that's in front of you rather > > > > than walk the entire room. > > > > > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be > > > > done lockless. Otherwise, it should just fall back to the existing > > > > approach, which existed in previous versions but is removed in this > > > > version. > > > > > > Grabbing the MMU lock for write to scan sucks, no argument there. But > > > can you please be specific about the impact of read lock v. RCU in the > > > case of arm64? I had asked about this before and you never replied. > > > > > > My concern remains that adding support for software table walkers > > > outside of the MMU lock entirely requires more work than just deferring > > > the deallocation to an RCU callback. Walkers that previously assumed > > > 'exclusive' access while holding the MMU lock for write must now cope > > > with volatile PTEs. > > > > > > Yes, this problem already exists when hardware sets the AF, but the > > > lock-free walker implementation needs to be generic so it can be applied > > > for other PTE bits. > > > > Direct reclaim is multi-threaded and each reclaimer can take the mmu > > lock for read (testing the A-bit) or write (unmapping before paging > > out) on arm64. The fundamental problem of using the readers-writer > > lock in this case is priority inversion: the readers have lower > > priority than the writers, so ideally, we don't want the readers to > > block the writers at all. > > > > Using my previous (crude) analogy: puting the bill right in front of > > you (the writers) profits immediately whereas searching for the > > largest bill (the readers) can be futile. > > > > As I said earlier, I prefer we drop the arm64 support for now, but I > > will not object to taking the mmu lock for read when clearing the > > A-bit, as long as we fully understand the problem here and document it > > clearly. > > FWIW, Google Cloud has been doing proactive reclaim and kstaled-based > aging (a Google-internal page aging daemon, for those outside of > Google) for many years on x86 VMs with the A-bit harvesting > under the write-lock. So I'm skeptical that making ARM64 lockless is > necessary to allow Secondary MMUs to participate in MGLRU aging with > acceptable performance for Cloud usecases. I don't even think it's > necessary on x86 but it's a simple enough change that we might as well > just do it. The obvious caveat here: If MGLRU aging and kstaled aging are substantially different in how frequently they trigger mmu_notifiers, then my analysis may not be correct. I'm hoping Yu you can shed some light on that. I'm also operating under the assumption that Secondary MMUs are only participating in aging, and not look-around (i.e. what is implemented in v4). > > I suspect under pathological conditions (host under intense memory > pressure and high rate of reclaim occurring) making A-bit harvesting > lockless will perform better. But under such conditions VM performance > is likely going to suffer regardless. In a Cloud environment we deal > with that through other mechanisms to reduce the rate of reclaim and > make the host healthy. > > For these reasons, I think there's value in giving users the option to > enable Secondary MMUs participation MGLRU aging even when A-bit > test/clearing is not done locklessly. I believe this was James' intent > with the Kconfig. Perhaps a default-off writable module parameter > would be better to avoid distros accidentally turning it on? > > If and when there is a usecase for optimizing VM performance under > pathological reclaim conditions on ARM, we can make it lockless then.
On Fri, May 31, 2024 at 02:31:17PM -0600, Yu Zhao wrote: > On Fri, May 31, 2024 at 1:24 AM Oliver Upton <oliver.upton@linux.dev> wrote: [...] > > Grabbing the MMU lock for write to scan sucks, no argument there. But > > can you please be specific about the impact of read lock v. RCU in the > > case of arm64? I had asked about this before and you never replied. > > > > My concern remains that adding support for software table walkers > > outside of the MMU lock entirely requires more work than just deferring > > the deallocation to an RCU callback. Walkers that previously assumed > > 'exclusive' access while holding the MMU lock for write must now cope > > with volatile PTEs. > > > > Yes, this problem already exists when hardware sets the AF, but the > > lock-free walker implementation needs to be generic so it can be applied > > for other PTE bits. > > Direct reclaim is multi-threaded and each reclaimer can take the mmu > lock for read (testing the A-bit) or write (unmapping before paging > out) on arm64. The fundamental problem of using the readers-writer > lock in this case is priority inversion: the readers have lower > priority than the writers, so ideally, we don't want the readers to > block the writers at all. So we already have this sort of problem of stage-2 fault handling v. secondary MMU invalidations, which is why I've been doubtful of the perceived issue. In fact, I'd argue that needing to wait for faults is worse than aging participation since those can be trivially influenced by userspace/guest. In any case, we shouldn't ever be starved since younger readers cannot enter the critical section with a pending writer. > As I said earlier, I prefer we drop the arm64 support for now, but I > will not object to taking the mmu lock for read when clearing the > A-bit, as long as we fully understand the problem here and document it > clearly. I'd be convinced of this if there's data that shows read lock acquisition is in fact consequential. Otherwise, I'm not sure the added complexity of RCU table walkers (per my statement above) is worth the effort / maintenance burden.
On Thu, May 30, 2024 at 11:06 PM Yu Zhao <yuzhao@google.com> wrote: > > On Wed, May 29, 2024 at 7:08 PM James Houghton <jthoughton@google.com> wrote: > > > > Hi Yu, Sean, > > > > Perhaps I "simplified" this bit of the series a little bit too much. > > Being able to opportunistically do aging with KVM (even without > > setting the Kconfig) is valuable. > > > > IIUC, we have the following possibilities: > > - v4: aging with KVM is done if the new Kconfig is set. > > - v3: aging with KVM is always done. > > This is not true -- in v3, MGLRU only scans secondary MMUs if it can > be done locklessly on x86. It uses a bitmap to imply this requirement. > > > - v2: aging with KVM is done when the architecture reports that it can > > probably be done locklessly, set at KVM MMU init time. > > Not really -- it's only done if it can be done locklessly on both x86 and arm64. > > > - Another possibility?: aging with KVM is only done exactly when it > > can be done locklessly (i.e., mmu_notifier_test/clear_young() called > > such that it will not grab any locks). > > This is exactly the case for v2. Thanks for clarifying; sorry for getting this wrong. > > > I like the v4 approach because: > > 1. We can choose whether or not to do aging with KVM no matter what > > architecture we're using (without requiring userspace be aware to > > disable the feature at runtime with sysfs to avoid regressing > > performance if they don't care about proactive reclaim). > > 2. If we check the new feature bit (0x8) in sysfs, we can know for > > sure if aging is meant to be working or not. The selftest changes I > > made won't work properly unless there is a way to be sure that aging > > is working with KVM. > > I'm not convinced, but it doesn't mean your point of view is invalid. > If you fully understand the implications of your design choice and > document them, I will not object. > > All optimizations in v2 were measured step by step. Even that bitmap, > which might be considered overengineered, brought a readily > measuarable 4% improvement in memcached throughput on Altra Max > swapping to Optane: > > Using the bitmap (64 KVM PTEs for each call) > ============================================================================================================================ > Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 > Latency p99 Latency p99.9 Latency KB/sec > ---------------------------------------------------------------------------------------------------------------------------- > Sets 0.00 --- --- --- > --- --- --- 0.00 > Gets 1012801.92 431436.92 14965.11 0.06246 > 0.04700 0.16700 4.31900 39635.83 > Waits 0.00 --- --- --- > --- --- --- --- > Totals 1012801.92 431436.92 14965.11 0.06246 > 0.04700 0.16700 4.31900 39635.83 > > > Not using the bitmap (1 KVM PTEs for each call) > ============================================================================================================================ > Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 > Latency p99 Latency p99.9 Latency KB/sec > ---------------------------------------------------------------------------------------------------------------------------- > Sets 0.00 --- --- --- > --- --- --- 0.00 > Gets 968210.02 412443.85 14303.89 0.06517 > 0.04700 0.15900 7.42300 37890.74 > Waits 0.00 --- --- --- > --- --- --- --- > Totals 968210.02 412443.85 14303.89 0.06517 > 0.04700 0.15900 7.42300 37890.74 > > > FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached. > > What I don't think is acceptable is simplifying those optimizations > out without documenting your justifications (I would even call it a > design change, rather than simplification, from v3 to v4). I'll put back something similar to what you had before (like a test_clear_young() with a "fast" parameter instead of "bitmap"). I like the idea of having a new mmu notifier, like fast_test_clear_young(), while leaving test_young() and clear_young() unchanged (where "fast" means "prioritize speed over accuracy"). It seems a little more straightforward that way. > > > For look-around at eviction time: > > - v4: done if the main mm PTE was young and no MMU notifiers are subscribed. > > - v2/v3: done if the main mm PTE was young or (the SPTE was young and > > the MMU notifier was lockless/fast). > > The host and secondary MMUs are two *independent* cases, IMO: > 1. lookaround the host MMU if the PTE mapping the folio under reclaim is young. > 2. lookaround the secondary MMU if it can be done locklessly. > > So the v2/v3 behavior sounds a lot more reasonable to me. I'll restore the v2/v3 behavior. I initially removed it because, without batching, we (mostly) lose the spatial locality that, IIUC, look-around is designed to exploit. > > Also a nit -- don't use 'else' in the following case (should_look_around()): > > if (foo) > return bar; > else > do_something(); Oh, yes, sorry. I wrote and rewrote should_look_around() quite a few times while trying to figure out what made sense in a no-batching series. I'll fix this. > > > I made this logic change as part of removing batching. > > > > I'd really appreciate guidance on what the correct thing to do is. > > > > In my mind, what would work great is: by default, do aging exactly > > when KVM can do it locklessly, and then have a Kconfig to always have > > MGLRU to do aging with KVM if a user really cares about proactive > > reclaim (when the feature bit is set). The selftest can check the > > Kconfig + feature bit to know for sure if aging will be done. > > I still don't see how that Kconfig helps. Or why the new static branch > isn't enough? Without a special Kconfig, the feature bit just tells us that aging with KVM is possible, not that it will necessarily be done. For the self-test, it'd be good to know exactly when aging is being done or not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would help make the self-test set the right expectations for aging. The Kconfig would also allow a user to know that, no matter what, we're going to get correct age data for VMs, even if, say, we're using the shadow MMU. This is somewhat important for me/Google Cloud. Is that reasonable? Maybe there's a better solution.
On Mon, Jun 03, 2024, James Houghton wrote: > On Thu, May 30, 2024 at 11:06 PM Yu Zhao <yuzhao@google.com> wrote: > > What I don't think is acceptable is simplifying those optimizations > > out without documenting your justifications (I would even call it a > > design change, rather than simplification, from v3 to v4). > > I'll put back something similar to what you had before (like a > test_clear_young() with a "fast" parameter instead of "bitmap"). I > like the idea of having a new mmu notifier, like > fast_test_clear_young(), while leaving test_young() and clear_young() > unchanged (where "fast" means "prioritize speed over accuracy"). Those two statements are contradicting each other, aren't they? Anyways, I vote for a "fast only" variant, e.g. test_clear_young_fast_only() or so. gup() has already established that terminology in mm/, so hopefully it would be familiar to readers. We could pass a param, but then the MGLRU code would likely end up doing a bunch of useless indirect calls into secondary MMUs, whereas a dedicated hook allows implementations to nullify the pointer if the API isn't supported for whatever reason. And pulling in Oliver's comments about locking, I think it's important that the mmu_notifier API express it's requirement that the operation be "fast", not that it be lockless. E.g. if a secondary MMU can guarantee that a lock will be contented only in rare, slow cases, then taking a lock is a-ok. Or a secondary MMU could do try-lock and bail if the lock is contended. That way KVM can honor the intent of the API with an implementation that works best for KVM _and_ for MGRLU. I'm sure there will be future adjustments and fixes, but that's just more motivation for using something like "fast only" instead of "lockless". > > > I made this logic change as part of removing batching. > > > > > > I'd really appreciate guidance on what the correct thing to do is. > > > > > > In my mind, what would work great is: by default, do aging exactly > > > when KVM can do it locklessly, and then have a Kconfig to always have > > > MGLRU to do aging with KVM if a user really cares about proactive > > > reclaim (when the feature bit is set). The selftest can check the > > > Kconfig + feature bit to know for sure if aging will be done. > > > > I still don't see how that Kconfig helps. Or why the new static branch > > isn't enough? > > Without a special Kconfig, the feature bit just tells us that aging > with KVM is possible, not that it will necessarily be done. For the > self-test, it'd be good to know exactly when aging is being done or > not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would > help make the self-test set the right expectations for aging. > > The Kconfig would also allow a user to know that, no matter what, > we're going to get correct age data for VMs, even if, say, we're using > the shadow MMU. Heh, unless KVM flushes, you won't get "correct" age data. > This is somewhat important for me/Google Cloud. Is that reasonable? Maybe > there's a better solution. Hmm, no? There's no reason to use a Kconfig, e.g. if we _really_ want to prioritize accuracy over speed, then a KVM (x86?) module param to have KVM walk nested TDP page tables would give us what we want. But before we do that, I think we need to perform due dilegence (or provide data) showing that having KVM take mmu_lock for write in the "fast only" API provides better total behavior. I.e. that the additional accuracy is indeed worth the cost.
On Mon, Jun 3, 2024 at 4:03 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Jun 03, 2024, James Houghton wrote: > > On Thu, May 30, 2024 at 11:06 PM Yu Zhao <yuzhao@google.com> wrote: > > > What I don't think is acceptable is simplifying those optimizations > > > out without documenting your justifications (I would even call it a > > > design change, rather than simplification, from v3 to v4). > > > > I'll put back something similar to what you had before (like a > > test_clear_young() with a "fast" parameter instead of "bitmap"). I > > like the idea of having a new mmu notifier, like > > fast_test_clear_young(), while leaving test_young() and clear_young() > > unchanged (where "fast" means "prioritize speed over accuracy"). > > Those two statements are contradicting each other, aren't they? I guess it depends on how you define "similar". :) > Anyways, I vote > for a "fast only" variant, e.g. test_clear_young_fast_only() or so. gup() has > already established that terminology in mm/, so hopefully it would be familiar > to readers. We could pass a param, but then the MGLRU code would likely end up > doing a bunch of useless indirect calls into secondary MMUs, whereas a dedicated > hook allows implementations to nullify the pointer if the API isn't supported > for whatever reason. > > And pulling in Oliver's comments about locking, I think it's important that the > mmu_notifier API express it's requirement that the operation be "fast", not that > it be lockless. E.g. if a secondary MMU can guarantee that a lock will be > contented only in rare, slow cases, then taking a lock is a-ok. Or a secondary > MMU could do try-lock and bail if the lock is contended. > > That way KVM can honor the intent of the API with an implementation that works > best for KVM _and_ for MGRLU. I'm sure there will be future adjustments and fixes, > but that's just more motivation for using something like "fast only" instead of > "lockless". Yes, thanks, this is exactly what I meant. I really should have "only" in the name to signify that it is a requirement that it be fast. Thanks for wording it so clearly. > > > > > I made this logic change as part of removing batching. > > > > > > > > I'd really appreciate guidance on what the correct thing to do is. > > > > > > > > In my mind, what would work great is: by default, do aging exactly > > > > when KVM can do it locklessly, and then have a Kconfig to always have > > > > MGLRU to do aging with KVM if a user really cares about proactive > > > > reclaim (when the feature bit is set). The selftest can check the > > > > Kconfig + feature bit to know for sure if aging will be done. > > > > > > I still don't see how that Kconfig helps. Or why the new static branch > > > isn't enough? > > > > Without a special Kconfig, the feature bit just tells us that aging > > with KVM is possible, not that it will necessarily be done. For the > > self-test, it'd be good to know exactly when aging is being done or > > not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would > > help make the self-test set the right expectations for aging. > > > > The Kconfig would also allow a user to know that, no matter what, > > we're going to get correct age data for VMs, even if, say, we're using > > the shadow MMU. > > Heh, unless KVM flushes, you won't get "correct" age data. > > > This is somewhat important for me/Google Cloud. Is that reasonable? Maybe > > there's a better solution. > > Hmm, no? There's no reason to use a Kconfig, e.g. if we _really_ want to prioritize > accuracy over speed, then a KVM (x86?) module param to have KVM walk nested TDP > page tables would give us what we want. > > But before we do that, I think we need to perform due dilegence (or provide data) > showing that having KVM take mmu_lock for write in the "fast only" API provides > better total behavior. I.e. that the additional accuracy is indeed worth the cost. That sounds good to me. I'll drop the Kconfig. I'm not really sure what to do about the self-test, but that's not really all that important.
On Mon, Jun 03, 2024, James Houghton wrote: > On Mon, Jun 3, 2024 at 4:03 PM Sean Christopherson <seanjc@google.com> wrote: > > But before we do that, I think we need to perform due dilegence (or provide data) > > showing that having KVM take mmu_lock for write in the "fast only" API provides > > better total behavior. I.e. that the additional accuracy is indeed worth the cost. > > That sounds good to me. I'll drop the Kconfig. I'm not really sure > what to do about the self-test, but that's not really all that > important. Enable it only on architectures+setups that are guaranteed to implement the fast-only API? E.g. on x86, it darn well better be active if the TDP MMU is enabled. If the test fails because that doesn't hold true, then we _want_ the failure.
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..1e578e0c4c0c 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -48,6 +48,10 @@ Values Components verified on x86 varieties other than Intel and AMD. If it is disabled, the multi-gen LRU will suffer a negligible performance degradation. +0x0008 Continuously clear the accessed bit in secondary MMU page + tables instead of waiting until eviction time. This results in + accurate page age information for pages that are mainly used by + a secondary MMU. [yYnN] Apply to all the components above. ====== =============================================================== @@ -56,7 +60,7 @@ E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled - 0x0007 + 0x000f echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 8f9c9590a42c..869824ef5f3b 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -400,6 +400,7 @@ enum { LRU_GEN_CORE, LRU_GEN_MM_WALK, LRU_GEN_NONLEAF_YOUNG, + LRU_GEN_SECONDARY_MMU_WALK, NR_LRU_GEN_CAPS }; @@ -557,7 +558,7 @@ struct lru_gen_memcg { void lru_gen_init_pgdat(struct pglist_data *pgdat); void lru_gen_init_lruvec(struct lruvec *lruvec); -void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); +bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw); void lru_gen_init_memcg(struct mem_cgroup *memcg); void lru_gen_exit_memcg(struct mem_cgroup *memcg); @@ -576,8 +577,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec) { } -static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) { + return false; } static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) diff --git a/mm/rmap.c b/mm/rmap.c index e8fc5ecb59b2..24a3ff639919 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio, continue; } - if (pvmw.pte) { - if (lru_gen_enabled() && - pte_young(ptep_get(pvmw.pte))) { - lru_gen_look_around(&pvmw); + if (lru_gen_enabled() && pvmw.pte) { + if (lru_gen_look_around(&pvmw)) referenced++; - } - + } else if (pvmw.pte) { if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) referenced++; diff --git a/mm/vmscan.c b/mm/vmscan.c index d55e8d07ffc4..0d89f712f45c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -56,6 +56,7 @@ #include <linux/khugepaged.h> #include <linux/rculist_nulls.h> #include <linux/random.h> +#include <linux/mmu_notifier.h> #include <asm/tlbflush.h> #include <asm/div64.h> @@ -2579,6 +2580,12 @@ static bool should_clear_pmd_young(void) return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG); } +static bool should_walk_secondary_mmu(void) +{ + return IS_ENABLED(CONFIG_LRU_GEN_WALKS_SECONDARY_MMU) && + get_cap(LRU_GEN_SECONDARY_MMU_WALK); +} + /****************************************************************************** * shorthand helpers ******************************************************************************/ @@ -3276,7 +3283,8 @@ static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk return false; } -static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr) +static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr, + struct pglist_data *pgdat) { unsigned long pfn = pte_pfn(pte); @@ -3291,10 +3299,15 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned if (WARN_ON_ONCE(!pfn_valid(pfn))) return -1; + /* try to avoid unnecessary memory loads */ + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + return -1; + return pfn; } -static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr) +static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr, + struct pglist_data *pgdat) { unsigned long pfn = pmd_pfn(pmd); @@ -3309,6 +3322,10 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned if (WARN_ON_ONCE(!pfn_valid(pfn))) return -1; + /* try to avoid unnecessary memory loads */ + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + return -1; + return pfn; } @@ -3317,10 +3334,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, { struct folio *folio; - /* try to avoid unnecessary memory loads */ - if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) - return NULL; - folio = pfn_folio(pfn); if (folio_nid(folio) != pgdat->node_id) return NULL; @@ -3343,6 +3356,32 @@ static bool suitable_to_scan(int total, int young) return young * n >= total; } +static bool lru_gen_notifier_test_young(struct mm_struct *mm, + unsigned long addr) +{ + return should_walk_secondary_mmu() && mmu_notifier_test_young(mm, addr); +} + +static bool lru_gen_notifier_clear_young(struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + return should_walk_secondary_mmu() && + mmu_notifier_clear_young(mm, start, end); +} + +static bool lru_gen_pmdp_test_and_clear_young(struct vm_area_struct *vma, + unsigned long addr, + pmd_t *pmd) +{ + bool young = pmdp_test_and_clear_young(vma, addr, pmd); + + if (lru_gen_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE)) + young = true; + + return young; +} + static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, struct mm_walk *args) { @@ -3357,8 +3396,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); DEFINE_MAX_SEQ(walk->lruvec); int old_gen, new_gen = lru_gen_from_seq(max_seq); + struct mm_struct *mm = args->mm; - pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl); + pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl); if (!pte) return false; if (!spin_trylock(ptl)) { @@ -3376,11 +3416,12 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, total++; walk->mm_stats[MM_LEAF_TOTAL]++; - pfn = get_pte_pfn(ptent, args->vma, addr); + pfn = get_pte_pfn(ptent, args->vma, addr, pgdat); if (pfn == -1) continue; - if (!pte_young(ptent)) { + if (!pte_young(ptent) && + !lru_gen_notifier_test_young(mm, addr)) { walk->mm_stats[MM_LEAF_OLD]++; continue; } @@ -3389,8 +3430,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, if (!folio) continue; - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) - VM_WARN_ON_ONCE(true); + lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE); + if (pte_young(ptent)) + ptep_test_and_clear_young(args->vma, addr, pte + i); young++; walk->mm_stats[MM_LEAF_YOUNG]++; @@ -3456,22 +3498,25 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area /* don't round down the first address */ addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first; - pfn = get_pmd_pfn(pmd[i], vma, addr); - if (pfn == -1) - goto next; - - if (!pmd_trans_huge(pmd[i])) { - if (should_clear_pmd_young()) + if (pmd_present(pmd[i]) && !pmd_trans_huge(pmd[i])) { + if (should_clear_pmd_young() && + !should_walk_secondary_mmu()) pmdp_test_and_clear_young(vma, addr, pmd + i); goto next; } + pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat); + if (pfn == -1) + goto next; + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); if (!folio) goto next; - if (!pmdp_test_and_clear_young(vma, addr, pmd + i)) + if (!lru_gen_pmdp_test_and_clear_young(vma, addr, pmd + i)) { + walk->mm_stats[MM_LEAF_OLD]++; goto next; + } walk->mm_stats[MM_LEAF_YOUNG]++; @@ -3528,19 +3573,18 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, } if (pmd_trans_huge(val)) { - unsigned long pfn = pmd_pfn(val); struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat); walk->mm_stats[MM_LEAF_TOTAL]++; - if (!pmd_young(val)) { - walk->mm_stats[MM_LEAF_OLD]++; + if (pfn == -1) continue; - } - /* try to avoid unnecessary memory loads */ - if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + if (!pmd_young(val) && !mm_has_notifiers(args->mm)) { + walk->mm_stats[MM_LEAF_OLD]++; continue; + } walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first); continue; @@ -3548,7 +3592,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, walk->mm_stats[MM_NONLEAF_TOTAL]++; - if (should_clear_pmd_young()) { + if (should_clear_pmd_young() && !should_walk_secondary_mmu()) { if (!pmd_young(val)) continue; @@ -3994,6 +4038,26 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) * rmap/PT walk feedback ******************************************************************************/ +static bool should_look_around(struct vm_area_struct *vma, unsigned long addr, + pte_t *pte, int *young) +{ + bool secondary_was_young = + mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE); + + /* + * Look around if (1) the PTE is young and (2) we do not need to + * consult any secondary MMUs. + */ + if (pte_young(ptep_get(pte))) { + ptep_test_and_clear_young(vma, addr, pte); + *young = true; + return !mm_has_notifiers(vma->vm_mm); + } else if (secondary_was_young) + *young = true; + + return false; +} + /* * This function exploits spatial locality when shrink_folio_list() walks the * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If @@ -4001,7 +4065,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) * the PTE table to the Bloom filter. This forms a feedback loop between the * eviction and the aging. */ -void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) { int i; unsigned long start; @@ -4019,16 +4083,20 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) struct lru_gen_mm_state *mm_state = get_mm_state(lruvec); DEFINE_MAX_SEQ(lruvec); int old_gen, new_gen = lru_gen_from_seq(max_seq); + struct mm_struct *mm = pvmw->vma->vm_mm; lockdep_assert_held(pvmw->ptl); VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); + if (!should_look_around(vma, addr, pte, &young)) + return young; + if (spin_is_contended(pvmw->ptl)) - return; + return young; /* exclude special VMAs containing anon pages from COW */ if (vma->vm_flags & VM_SPECIAL) - return; + return young; /* avoid taking the LRU lock under the PTL when possible */ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL; @@ -4036,6 +4104,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) start = max(addr & PMD_MASK, vma->vm_start); end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1; + if (end - start == PAGE_SIZE) + return young; + if (end - start > MIN_LRU_BATCH * PAGE_SIZE) { if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2) end = start + MIN_LRU_BATCH * PAGE_SIZE; @@ -4049,7 +4120,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) /* folio_update_gen() requires stable folio_memcg() */ if (!mem_cgroup_trylock_pages(memcg)) - return; + return young; arch_enter_lazy_mmu_mode(); @@ -4059,19 +4130,21 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) unsigned long pfn; pte_t ptent = ptep_get(pte + i); - pfn = get_pte_pfn(ptent, vma, addr); + pfn = get_pte_pfn(ptent, vma, addr, pgdat); if (pfn == -1) continue; - if (!pte_young(ptent)) + if (!pte_young(ptent) && + !lru_gen_notifier_test_young(mm, addr)) continue; folio = get_pfn_folio(pfn, memcg, pgdat, can_swap); if (!folio) continue; - if (!ptep_test_and_clear_young(vma, addr, pte + i)) - VM_WARN_ON_ONCE(true); + lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE); + if (pte_young(ptent)) + ptep_test_and_clear_young(vma, addr, pte + i); young++; @@ -4101,6 +4174,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) /* feedback from rmap walkers to page table walkers */ if (mm_state && suitable_to_scan(i, young)) update_bloom_filter(mm_state, max_seq, pvmw->pmd); + + return young; } /****************************************************************************** @@ -5137,6 +5212,9 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c if (should_clear_pmd_young()) caps |= BIT(LRU_GEN_NONLEAF_YOUNG); + if (should_walk_secondary_mmu()) + caps |= BIT(LRU_GEN_SECONDARY_MMU_WALK); + return sysfs_emit(buf, "0x%04x\n", caps); }
Secondary MMUs are currently consulted for access/age information at eviction time, but before then, we don't get accurate age information. That is, pages that are mostly accessed through a secondary MMU (like guest memory, used by KVM) will always just proceed down to the oldest generation, and then at eviction time, if KVM reports the page to be young, the page will be activated/promoted back to the youngest generation. Do not do look around if there is a secondary MMU we have to interact with. The added feature bit (0x8), if disabled, will make MGLRU behave as if there are no secondary MMUs subscribed to MMU notifiers except at eviction time. Suggested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> --- Documentation/admin-guide/mm/multigen_lru.rst | 6 +- include/linux/mmzone.h | 6 +- mm/rmap.c | 9 +- mm/vmscan.c | 144 ++++++++++++++---- 4 files changed, 123 insertions(+), 42 deletions(-)