Message ID: 20240121214413.833776-1-tjmercier@google.com (mailing list archive)
State: New
Series: Revert "mm:vmscan: fix inaccurate reclaim during proactive reclaim"
On Sun, Jan 21, 2024 at 2:44 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
>
> Proactive reclaim on the root cgroup is 10x slower after this patch when
> MGLRU is enabled, and completion times for proactive reclaim on much
> smaller non-root cgroups take ~30% longer (with or without MGLRU). With
> root reclaim before the patch, I observe average reclaim rates of
> ~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
> the nr_retries counter starts to decrement, eventually ending the
> proactive reclaim attempt. After the patch the reclaim rate is
> consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> proactive reclaim doesn't complete after several minutes because
> try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
>
> The docs for memory.reclaim say, "the kernel can over or under reclaim
> from the target cgroup" which this patch was trying to fix. Revert it
> until a less costly solution is found.
>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>

Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
Cc: <stable@vger.kernel.org>
On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
>
> Proactive reclaim on the root cgroup is 10x slower after this patch when
> MGLRU is enabled, and completion times for proactive reclaim on much
> smaller non-root cgroups take ~30% longer (with or without MGLRU).

What is the reclaim target in these pro-active reclaim requests?

> With
> root reclaim before the patch, I observe average reclaim rates of
> ~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
> the nr_retries counter starts to decrement, eventually ending the
> proactive reclaim attempt.

Do I understand correctly that the reclaim target is over estimated and
you expect that the reclaim process breaks out early?

> After the patch the reclaim rate is
> consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> proactive reclaim doesn't complete after several minutes because
> try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.

I do not understand this part. How does a smaller reclaim target manage
to have reclaimed > 0 while a larger one doesn't?

> The docs for memory.reclaim say, "the kernel can over or under reclaim
> from the target cgroup" which this patch was trying to fix. Revert it
> until a less costly solution is found.
>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>
> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e4c8735e7c85..cee536c97151 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6956,8 +6956,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  		lru_add_drain_all();
>
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> -					GFP_KERNEL, reclaim_options);
> +					nr_to_reclaim - nr_reclaimed,
> +					GFP_KERNEL, reclaim_options);
>
>  		if (!reclaimed && !nr_retries--)
>  			return -EAGAIN;
> --
> 2.43.0.429.g432eaa2c6b-goog
On Tue, Jan 23, 2024 at 1:33 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> > This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
> >
> > Proactive reclaim on the root cgroup is 10x slower after this patch when
> > MGLRU is enabled, and completion times for proactive reclaim on much
> > smaller non-root cgroups take ~30% longer (with or without MGLRU).
>
> What is the reclaim target in these pro-active reclaim requests?

Two targets:
1) /sys/fs/cgroup/memory.reclaim
2) /sys/fs/cgroup/uid_0/memory.reclaim (a bunch of Android system services)

Note that lru_gen_shrink_node is used for 1, but shrink_node_memcgs is
used for 2.

The 10x comes from the rate of reclaim (~70k pages/sec vs ~6.6k
pages/sec) for 1. After this revert the root reclaim took only about
10 seconds. Before the revert it's still running after about 3 minutes
using a core at 100% the whole time, and I'm too impatient to wait
longer to record times for comparison.

The 30% comes from the average of a few runs for 2:

Before revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
restarting adbd as root
    0m09.69s real     0m00.00s user     0m09.19s system

After revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
    0m07.31s real     0m00.00s user     0m06.44s system

It's actually a bigger difference for smaller reclaim amounts:

Before revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "3G" > /sys/fs/cgroup/uid_0/memory.reclaim'
    0m12.04s real     0m00.00s user     0m11.48s system

After revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "3G" > /sys/fs/cgroup/uid_0/memory.reclaim'
    0m06.65s real     0m00.00s user     0m05.91s system

> > With
> > root reclaim before the patch, I observe average reclaim rates of
> > ~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
> > the nr_retries counter starts to decrement, eventually ending the
> > proactive reclaim attempt.
>
> Do I understand correctly that the reclaim target is over estimated and
> you expect that the reclaim process breaks out early

Yes. I expect memory_reclaim to fail at some point when it becomes
difficult/impossible to reclaim pages where I specify a large amount
to reclaim. The ask here is, "please reclaim as much as possible from
this cgroup, but don't take all day". But it takes minutes to get
there on the root cgroup, working SWAP_CLUSTER_MAX pages at a time.

> > After the patch the reclaim rate is
> > consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> > scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> > proactive reclaim doesn't complete after several minutes because
> > try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> > tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
>
> I do not understand this part. How does a smaller reclaim target manages
> to have reclaimed > 0 while larger one doesn't?

They both are able to make progress. The main difference is that a
single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
after it reclaims nr_to_reclaim, and before it touches all memcgs. So
a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
pages with MGLRU. Without MGLRU the memcg walk is not aborted
immediately after nr_to_reclaim is reached, so a single call to
try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
even when sc->nr_to_reclaim is 32. (i.e. MGLRU overreclaims less.)
https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/
On Tue 23-01-24 05:58:05, T.J. Mercier wrote:
> On Tue, Jan 23, 2024 at 1:33 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> > > This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
> > >
> > > Proactive reclaim on the root cgroup is 10x slower after this patch when
> > > MGLRU is enabled, and completion times for proactive reclaim on much
> > > smaller non-root cgroups take ~30% longer (with or without MGLRU).
> >
> > What is the reclaim target in these pro-active reclaim requests?
>
> Two targets:
> 1) /sys/fs/cgroup/memory.reclaim
> 2) /sys/fs/cgroup/uid_0/memory.reclaim (a bunch of Android system services)

OK, I was not really clear. I was curious about nr_to_reclaim.

> Note that lru_gen_shrink_node is used for 1, but shrink_node_memcgs is
> used for 2.
>
> The 10x comes from the rate of reclaim (~70k pages/sec vs ~6.6k
> pages/sec) for 1. After this revert the root reclaim took only about
> 10 seconds. Before the revert it's still running after about 3 minutes
> using a core at 100% the whole time, and I'm too impatient to wait
> longer to record times for comparison.
>
> The 30% comes from the average of a few runs for 2:
> Before revert:
> $ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time
> echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'

Ohh, so you want to reclaim all of it (resp. as much as possible).

[...]

> > > After the patch the reclaim rate is
> > > consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> > > scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> > > proactive reclaim doesn't complete after several minutes because
> > > try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> > > tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
> >
> > I do not understand this part. How does a smaller reclaim target manages
> > to have reclaimed > 0 while larger one doesn't?
>
> They both are able to make progress. The main difference is that a
> single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> pages with MGLRU. WIthout MGLRU the memcg walk is not aborted
> immediately after nr_to_reclaim is reached, so a single call to
> try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> even when sc->nr_to_reclaim is 32. (I.E. MGLRU overreclaims less.)
> https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/

OK, I do see how try_to_free_mem_cgroup_pages might over reclaim but I
do not really follow how increasing the batch actually fixes the issue
that there is always progress being made and therefore memory_reclaim
takes ages to terminate?
The revert isn't a straight-forward solution.

The patch you're reverting fixed conventional reclaim and broke
MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.

On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> They both are able to make progress. The main difference is that a
> single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> pages with MGLRU. WIthout MGLRU the memcg walk is not aborted
> immediately after nr_to_reclaim is reached, so a single call to
> try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> even when sc->nr_to_reclaim is 32. (I.E. MGLRU overreclaims less.)
> https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/

Is that a feature or a bug?

 * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
 *    of their max_seq counters ensures the eventual fairness to all eligible
 *    memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().

If it bails out exactly after nr_to_reclaim, it'll overreclaim
less. But with steady reclaim in a complex subtree, it will always hit
the first cgroup returned by mem_cgroup_iter() and then bail. This
seems like a fairness issue.

We should figure out what the right method for balancing fairness with
overreclaim is, regardless of reclaim implementation. Because having
two different approaches and reverting dependent things back and forth
doesn't make sense.

Using an LRU to rotate through memcgs over multiple reclaim cycles
seems like a good idea. Why is this specific to MGLRU? Shouldn't this
be a generic piece of memcg infrastructure?

Then there is the question of why there is an LRU for global reclaim,
but not for subtree reclaim. Reclaiming a container with multiple
subtrees would benefit from the fairness provided by a container-level
LRU order just as much; having fairness for root but not for subtrees
would produce different reclaim and pressure behavior, and can cause
regressions when moving a service from bare-metal into a container.

Figuring out these differences and converging on a method for cgroup
fairness would be the better way of fixing this. Because of the
regression risk to the default reclaim implementation, I'm inclined to
NAK this revert.
On Tue 23-01-24 11:48:19, Johannes Weiner wrote:
> The revert isn't a straight-forward solution.
>
> The patch you're reverting fixed conventional reclaim and broke
> MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.
>
> On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. WIthout MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (I.E. MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/
>
> Is that a feature or a bug?
>
>  * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
>  *    of their max_seq counters ensures the eventual fairness to all eligible
>  *    memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
>
> If it bails out exactly after nr_to_reclaim, it'll overreclaim
> less. But with steady reclaim in a complex subtree, it will always hit
> the first cgroup returned by mem_cgroup_iter() and then bail. This
> seems like a fairness issue.

Agreed. We would need to re-introduce something like we used to have
before 1ba6fc9af35bf.

> We should figure out what the right method for balancing fairness with
> overreclaim is, regardless of reclaim implementation. Because having
> two different approaches and reverting dependent things back and forth
> doesn't make sense.

Absolutely agreed!

> Using an LRU to rotate through memcgs over multiple reclaim cycles
> seems like a good idea. Why is this specific to MGLRU? Shouldn't this
> be a generic piece of memcg infrastructure?
>
> Then there is the question of why there is an LRU for global reclaim,
> but not for subtree reclaim. Reclaiming a container with multiple
> subtrees would benefit from the fairness provided by a container-level
> LRU order just as much; having fairness for root but not for subtrees
> would produce different reclaim and pressure behavior, and can cause
> regressions when moving a service from bare-metal into a container.
>
> Figuring out these differences and converging on a method for cgroup
> fairness would be the better way of fixing this. Because of the
> regression risk to the default reclaim implementation, I'm inclined to
> NAK this revert.

I do agree that a simple revert doesn't seem to be the way to go.
On Tue, Jan 23, 2024 at 8:19 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 23-01-24 05:58:05, T.J. Mercier wrote:
> > On Tue, Jan 23, 2024 at 1:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> > > > This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
> > > >
> > > > Proactive reclaim on the root cgroup is 10x slower after this patch when
> > > > MGLRU is enabled, and completion times for proactive reclaim on much
> > > > smaller non-root cgroups take ~30% longer (with or without MGLRU).
> > >
> > > What is the reclaim target in these pro-active reclaim requests?
> >
> > Two targets:
> > 1) /sys/fs/cgroup/memory.reclaim
> > 2) /sys/fs/cgroup/uid_0/memory.reclaim (a bunch of Android system services)
>
> OK, I was not really clear. I was curious about nr_to_reclaim.
>
> > Note that lru_gen_shrink_node is used for 1, but shrink_node_memcgs is
> > used for 2.
> >
> > The 10x comes from the rate of reclaim (~70k pages/sec vs ~6.6k
> > pages/sec) for 1. After this revert the root reclaim took only about
> > 10 seconds. Before the revert it's still running after about 3 minutes
> > using a core at 100% the whole time, and I'm too impatient to wait
> > longer to record times for comparison.
> >
> > The 30% comes from the average of a few runs for 2:
> > Before revert:
> > $ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time
> > echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
>
> Ohh, so you want to reclaim all of it (resp. as much as possible).

Right, the main use-case here is we decide an application should be
backgrounded and its cgroup frozen. Before freezing, reclaim as much
as possible so that the frozen processes' RAM use is as low as
possible while they're dormant.

> [...]
>
> > > > After the patch the reclaim rate is
> > > > consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> > > > scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> > > > proactive reclaim doesn't complete after several minutes because
> > > > try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> > > > tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
> > >
> > > I do not understand this part. How does a smaller reclaim target manages
> > > to have reclaimed > 0 while larger one doesn't?
> >
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. WIthout MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (I.E. MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/
>
> OK, I do see how try_to_free_mem_cgroup_pages might over reclaim but I
> do not really follow how increasing the batch actually fixes the issue
> that there is always progress being made and therefore memory_reclaim
> takes ages to terminates?

Oh, because the page reclaim rate with a small batch is just much
lower than with a very large batch. We have to restart reclaim from
scratch each time a batch is completed before we get to a place where
we're actually freeing/swapping pages again. That setup cost is
amortized over many more pages with a large batch size, but appears to
be pretty significant for small batch sizes.
On Tue, Jan 23, 2024 at 8:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The revert isn't a straight-forward solution.
>
> The patch you're reverting fixed conventional reclaim and broke
> MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.
>
> On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. WIthout MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (I.E. MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/
>
> Is that a feature or a bug?

Feature!

>  * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
>  *    of their max_seq counters ensures the eventual fairness to all eligible
>  *    memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
>
> If it bails out exactly after nr_to_reclaim, it'll overreclaim
> less. But with steady reclaim in a complex subtree, it will always hit
> the first cgroup returned by mem_cgroup_iter() and then bail. This
> seems like a fairness issue.

Right. Because the memcg LRU is maintained in pg_data_t and not in
each cgroup, I think we are currently forced to have the iteration
across all child memcgs for non-root memcg reclaim for fairness.

> We should figure out what the right method for balancing fairness with
> overreclaim is, regardless of reclaim implementation. Because having
> two different approaches and reverting dependent things back and forth
> doesn't make sense.
>
> Using an LRU to rotate through memcgs over multiple reclaim cycles
> seems like a good idea. Why is this specific to MGLRU? Shouldn't this
> be a generic piece of memcg infrastructure?

It would be pretty sweet if it were. I haven't tried to measure this
part in isolation, but I know we had to abandon attempts to use
per-app memcgs in the past (2018?) because the perf overhead was too
much. In recent tests where this feature is used, I see some perf
gains which I think are probably attributable to this.

> Then there is the question of why there is an LRU for global reclaim,
> but not for subtree reclaim. Reclaiming a container with multiple
> subtrees would benefit from the fairness provided by a container-level
> LRU order just as much; having fairness for root but not for subtrees
> would produce different reclaim and pressure behavior, and can cause
> regressions when moving a service from bare-metal into a container.
>
> Figuring out these differences and converging on a method for cgroup
> fairness would be the better way of fixing this. Because of the
> regression risk to the default reclaim implementation, I'm inclined to
> NAK this revert.

In the meantime, instead of a revert how about changing the batch size
geometrically instead of the SWAP_CLUSTER_MAX constant:

		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+					(nr_to_reclaim - nr_reclaimed)/2,
					GFP_KERNEL, reclaim_options);

I think that should address the overreclaim concern (it was mentioned
that the upper bound of overreclaim was 2 * request), and this should
also increase the reclaim rate for root reclaim with MGLRU closer to
what it was before.
On Wed, Jan 24, 2024 at 09:46:23AM -0800, T.J. Mercier wrote:
> In the meantime, instead of a revert how about changing the batch size
> geometrically instead of the SWAP_CLUSTER_MAX constant:
>
> 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +					(nr_to_reclaim - nr_reclaimed)/2,
> 					GFP_KERNEL, reclaim_options);
>
> I think that should address the overreclaim concern (it was mentioned
> that the upper bound of overreclaim was 2 * request), and this should
> also increase the reclaim rate for root reclaim with MGLRU closer to
> what it was before.

Hahaha. Would /4 work for you?

I genuinely think the idea is worth a shot. /4 would give us a bit
more margin for error, since the bailout/fairness cutoffs have changed
back and forth over time. And it should still give you a reasonable
convergence on MGLRU.

try_to_free_mem_cgroup_pages() already does max(nr_pages,
SWAP_CLUSTER_MAX) which will avoid the painful final approach loops
the integer division would produce on its own.

Please add a comment mentioning the compromise between the two reclaim
implementations though.
On Fri, Jan 26, 2024 at 8:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jan 24, 2024 at 09:46:23AM -0800, T.J. Mercier wrote:
> > In the meantime, instead of a revert how about changing the batch size
> > geometrically instead of the SWAP_CLUSTER_MAX constant:
> >
> > 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > +					(nr_to_reclaim - nr_reclaimed)/2,
> > 					GFP_KERNEL, reclaim_options);
> >
> > I think that should address the overreclaim concern (it was mentioned
> > that the upper bound of overreclaim was 2 * request), and this should
> > also increase the reclaim rate for root reclaim with MGLRU closer to
> > what it was before.
>
> Hahaha. Would /4 work for you?
>
> I genuinely think the idea is worth a shot. /4 would give us a bit
> more margin for error, since the bailout/fairness cutoffs have changed
> back and forth over time. And it should still give you a reasonable
> convergence on MGLRU.
>
> try_to_free_reclaim_pages() already does max(nr_to_reclaim,
> SWAP_CLUSTER_MAX) which will avoid the painful final approach loops
> the integer division would produce on its own.
>
> Please add a comment mentioning the compromise between the two reclaim
> implementations though.

I'll try it out and get back to you. :)
On Tue, Jan 23, 2024 at 9:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The revert isn't a straight-forward solution.
>
> The patch you're reverting fixed conventional reclaim and broke
> MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.

This is not true -- the reverted patch regressed the active/inactive
LRU too, in execution time. Quoting the commit message:

  "completion times for proactive reclaim on much smaller non-root
  cgroups take ~30% longer (with or without MGLRU)."

And I wouldn't call the original patch a fix -- it shifted the problem
from space to time, which at best is a tradeoff.

> On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. WIthout MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (I.E. MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/
>
> Is that a feature or a bug?
>
>  * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
>  *    of their max_seq counters ensures the eventual fairness to all eligible
>  *    memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
>
> If it bails out exactly after nr_to_reclaim, it'll overreclaim
> less. But with steady reclaim in a complex subtree, it will always hit
> the first cgroup returned by mem_cgroup_iter() and then bail. This
> seems like a fairness issue.
>
> We should figure out what the right method for balancing fairness with
> overreclaim is, regardless of reclaim implementation. Because having
> two different approaches and reverting dependent things back and forth
> doesn't make sense.
>
> Using an LRU to rotate through memcgs over multiple reclaim cycles
> seems like a good idea. Why is this specific to MGLRU? Shouldn't this
> be a generic piece of memcg infrastructure?
>
> Then there is the question of why there is an LRU for global reclaim,
> but not for subtree reclaim. Reclaiming a container with multiple
> subtrees would benefit from the fairness provided by a container-level
> LRU order just as much; having fairness for root but not for subtrees
> would produce different reclaim and pressure behavior, and can cause
> regressions when moving a service from bare-metal into a container.
>
> Figuring out these differences and converging on a method for cgroup
> fairness would be the better way of fixing this. Because of the
> regression risk to the default reclaim implementation, I'm inclined to
> NAK this revert.
On Fri, Jan 26, 2024 at 8:41 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Fri, Jan 26, 2024 at 8:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Jan 24, 2024 at 09:46:23AM -0800, T.J. Mercier wrote:
> > > In the meantime, instead of a revert how about changing the batch size
> > > geometrically instead of the SWAP_CLUSTER_MAX constant:
> > >
> > > 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > +					(nr_to_reclaim - nr_reclaimed)/2,
> > > 					GFP_KERNEL, reclaim_options);
> > >
> > > I think that should address the overreclaim concern (it was mentioned
> > > that the upper bound of overreclaim was 2 * request), and this should
> > > also increase the reclaim rate for root reclaim with MGLRU closer to
> > > what it was before.
> >
> > Hahaha. Would /4 work for you?
> >
> > I genuinely think the idea is worth a shot. /4 would give us a bit
> > more margin for error, since the bailout/fairness cutoffs have changed
> > back and forth over time. And it should still give you a reasonable
> > convergence on MGLRU.
> >
> > try_to_free_reclaim_pages() already does max(nr_to_reclaim,
> > SWAP_CLUSTER_MAX) which will avoid the painful final approach loops
> > the integer division would produce on its own.
> >
> > Please add a comment mentioning the compromise between the two reclaim
> > implementations though.
>
> I'll try it out and get back to you. :)

Right, so (nr_to_reclaim - nr_reclaimed)/4 looks pretty good to me:

root - full reclaim       pages/sec  time (sec)
pre-0388536ac291      :       68047       10.46
post-0388536ac291     :       13742         inf
(reclaim-reclaimed)/4 :       67352       10.51

/uid_0 - 1G reclaim       pages/sec  time (sec)  overreclaim (MiB)
pre-0388536ac291      :      258822        1.12              107.8
post-0388536ac291     :      105174        2.49                3.5
(reclaim-reclaimed)/4 :      233396        1.12               -7.4

/uid_0 - full reclaim     pages/sec  time (sec)
pre-0388536ac291      :       72334        7.09
post-0388536ac291     :       38105       14.45
(reclaim-reclaimed)/4 :       72914        6.96

So I'll put up a new patch.
On Tue, Jan 30, 2024 at 12:58:12PM -0800, T.J. Mercier wrote:
> On Fri, Jan 26, 2024 at 8:41 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > On Fri, Jan 26, 2024 at 8:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Jan 24, 2024 at 09:46:23AM -0800, T.J. Mercier wrote:
> > > > In the meantime, instead of a revert how about changing the batch size
> > > > geometrically instead of the SWAP_CLUSTER_MAX constant:
> > > >
> > > > 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > > +					(nr_to_reclaim - nr_reclaimed)/2,
> > > > 					GFP_KERNEL, reclaim_options);
> > > >
> > > > I think that should address the overreclaim concern (it was mentioned
> > > > that the upper bound of overreclaim was 2 * request), and this should
> > > > also increase the reclaim rate for root reclaim with MGLRU closer to
> > > > what it was before.
> > >
> > > Hahaha. Would /4 work for you?
> > >
> > > I genuinely think the idea is worth a shot. /4 would give us a bit
> > > more margin for error, since the bailout/fairness cutoffs have changed
> > > back and forth over time. And it should still give you a reasonable
> > > convergence on MGLRU.
> > >
> > > try_to_free_reclaim_pages() already does max(nr_to_reclaim,
> > > SWAP_CLUSTER_MAX) which will avoid the painful final approach loops
> > > the integer division would produce on its own.
> > >
> > > Please add a comment mentioning the compromise between the two reclaim
> > > implementations though.
> >
> > I'll try it out and get back to you. :)
>
> Right, so (nr_to_reclaim - nr_reclaimed)/4 looks pretty good to me:
>
> root - full reclaim       pages/sec  time (sec)
> pre-0388536ac291      :       68047       10.46
> post-0388536ac291     :       13742         inf
> (reclaim-reclaimed)/4 :       67352       10.51
>
> /uid_0 - 1G reclaim       pages/sec  time (sec)  overreclaim (MiB)
> pre-0388536ac291      :      258822        1.12              107.8
> post-0388536ac291     :      105174        2.49                3.5
> (reclaim-reclaimed)/4 :      233396        1.12               -7.4
>
> /uid_0 - full reclaim     pages/sec  time (sec)
> pre-0388536ac291      :       72334        7.09
> post-0388536ac291     :       38105       14.45
> (reclaim-reclaimed)/4 :       72914        6.96
>
> So I'll put up a new patch.

That looks great, thanks for giving it a shot. Looking forward to your
patch.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4c8735e7c85..cee536c97151 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6956,8 +6956,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 		lru_add_drain_all();
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-					GFP_KERNEL, reclaim_options);
+					nr_to_reclaim - nr_reclaimed,
+					GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
This reverts commit 0388536ac29104a478c79b3869541524caec28eb.

Proactive reclaim on the root cgroup is 10x slower after this patch when
MGLRU is enabled, and completion times for proactive reclaim on much
smaller non-root cgroups take ~30% longer (with or without MGLRU). With
root reclaim before the patch, I observe average reclaim rates of
~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
the nr_retries counter starts to decrement, eventually ending the
proactive reclaim attempt. After the patch the reclaim rate is
consistently ~6.6k pages/sec due to the reduced nr_pages value causing
scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
proactive reclaim doesn't complete after several minutes because
try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.

The docs for memory.reclaim say, "the kernel can over or under reclaim
from the target cgroup" which this patch was trying to fix. Revert it
until a less costly solution is found.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)