Message ID | 20220423155619.3669555-3-void@manifault.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Fix bugs in memcontroller cgroup tests | expand |
Hello David. On Sat, Apr 23, 2022 at 08:56:19AM -0700, David Vernet <void@manifault.com> wrote: > This unfortunately broke the memcg tests, which asserts that a sibling > that experienced reclaim but had a memory.low value of 0, would not > observe any memory.low events. This patch updates test_memcg_low() to > account for the new behavior introduced by memory_recursiveprot. I think the test is correct, there should be no (not even recursive) protection in this particular case (when the remaining siblings consume all of parental protection). This should be fixed in the kernel (see also [1], no updates from me yet :-/) Michal [1] https://lore.kernel.org/lkml/20220322182248.29121-1-mkoutny@suse.com/
Hi Michal, On Wed, Apr 27, 2022 at 04:09:28PM +0200, Michal Koutný wrote: > Hello David. > > On Sat, Apr 23, 2022 at 08:56:19AM -0700, David Vernet <void@manifault.com> wrote: > > This unfortunately broke the memcg tests, which asserts that a sibling > > that experienced reclaim but had a memory.low value of 0, would not > > observe any memory.low events. This patch updates test_memcg_low() to > > account for the new behavior introduced by memory_recursiveprot. > > I think the test is correct, there should be no (not even recursive) > protection in this particular case (when the remaining siblings consume > all of parental protection). > > This should be fixed in the kernel (see also [1], no updates from me yet > :-/) > > Michal > > [1] https://lore.kernel.org/lkml/20220322182248.29121-1-mkoutny@suse.com/ > I see, thanks for sharing that context. I think I see your point about the implementation of the reclaim mechanism potentially overcounting, but my interpretation of the rest of that discussion with Roman is that we haven't yet decided whether we don't want to propagate memory.low events from children cgroups with memory.low == 0. Or at the very least, some more justification was requested on why not counting such events was prudent. Would you be ok with merging this patch so that the cgroup selftests can pass again based on the current behavior of the kernel, and we can then revert the changes to test_memcg_low() later on if and when we decide that we don't want to propagate memory.low events for memory.low == 0 children? Thanks, David
On Thu, Apr 28, 2022 at 06:03:33PM -0700, David Vernet <void@manifault.com> wrote: > but my interpretation of the rest of that discussion with Roman is > that we haven't yet decided whether we don't want to propagate > memory.low events from children cgroups with memory.low == 0. Or at > the very least, some more justification was requested on why not > counting such events was prudent. I'm not a fan of that original proposal of mine anymore (to be more precise, of _only_ that patch, there's still the RFCness reason 1) to consider). As I shared with the last reply there, there's a problem in the behavior which shouldn't be masked by filtering some events. > Would you be ok with merging this patch so that the cgroup selftests can > pass again based on the current behavior of the kernel, and we can then > revert the changes to test_memcg_low() later on if and when we decide that > we don't want to propagate memory.low events for memory.low == 0 children? I still think that the behavior when there's no protection left for the memory.low == 0 child, there should be no memory.low events (not just uncounted but not happening) and test should not accept this (even though it's the current behavior). What might improve the test space would be to have two configs like Original one (simplified here) parent memory.low=50M memory.current=100M ` child1 memory.low=50M memory.current=50M ` child2 memory.low=0M memory.current=50M New one (checks events due to recursive protection) parent memory.low=50M memory.current=100M ` child1 memory.low=40M memory.current=50M ` child2 memory.low=0M memory.current=50M The second config assigns recursive protection to child2 and should therefore cause memory.low events in child2 (with memory_recursiveprot enabled of course). Or alternative new one (checks events due to recursive protection) parent memory.low=50M memory.current=100M ` child1 memory.low=0M memory.current=50M ` child2 memory.low=0M memory.current=50M HTH, Michal
Sorry for the delayed reply, Michal. I've been at LSFMM this week. On Fri, Apr 29, 2022 at 11:26:20AM +0200, Michal Koutný wrote: > I still think that the behavior when there's no protection left for the > memory.low == 0 child, there should be no memory.low events (not just > uncounted but not happening) and test should not accept this (even > though it's the current behavior). That's fair. I think part of the problem here is that in general, the memcontroller itself is quite heuristic, so it's tough to write tests that provide useful coverage while also being sufficiently flexible to avoid flakiness and over-prescribing expected behavior. In this case I think it's probably correct that the memory.low == 0 child shouldn't inherit protection from its parent under any circumstances due to its siblings overcommitting the parent's protection, but I also wonder if it's really necessary to enforce that. If you look at how much memory A/B/E gets at the end of the reclaim, it's still far less than 1MB (though should it be 0?). I'd be curious to hear what Johannes thinks. > What might improve the test space would be to have two configs like > > Original one (simplified here) > parent memory.low=50M memory.current=100M > ` child1 memory.low=50M memory.current=50M > ` child2 memory.low=0M memory.current=50M > > New one (checks events due to recursive protection) > parent memory.low=50M memory.current=100M > ` child1 memory.low=40M memory.current=50M > ` child2 memory.low=0M memory.current=50M > > The second config assigns recursive protection to child2 and should > therefore cause memory.low events in child2 (with memory_recursiveprot > enabled of course). Something like this would work, though I think it's useful to specifically validate the behavior of the memcontroller when the children overcommit the parent's memory.low protection, which the current test does. So I'm inclined to keep this testcase, and add your next suggestion: > Or alternative new one (checks events due to recursive protection) > parent memory.low=50M memory.current=100M > ` child1 memory.low=0M memory.current=50M > ` child2 memory.low=0M memory.current=50M This definitely sounds to me like a useful testcase to add, and I'm happy to do so in a follow-on patch. If we added this, do you think we need to keep the check for memory.low events for the memory.low == 0 child in the overcommit testcase? It arguably helped to catch the SWAP_CLUSTER_MAX rounding issue you pointed out. Again, curious to hear what Johannes thinks as well. Thanks, David
On Fri, May 06, 2022 at 09:40:15AM -0700, David Vernet wrote: > Sorry for the delayed reply, Michal. I've been at LSFMM this week. > > On Fri, Apr 29, 2022 at 11:26:20AM +0200, Michal Koutný wrote: > > I still think that the behavior when there's no protection left for the > > memory.low == 0 child, there should be no memory.low events (not just > > uncounted but not happening) and test should not accept this (even > > though it's the current behavior). > > That's fair. I think part of the problem here is that in general, the > memcontroller itself is quite heuristic, so it's tough to write tests that > provide useful coverage while also being sufficiently flexible to avoid > flakiness and over-prescribing expected behavior. In this case I think it's > probably correct that the memory.low == 0 child shouldn't inherit > protection from its parent under any circumstances due to its siblings > overcommitting the parent's protection, but I also wonder if it's really > necessary to enforce that. If you look at how much memory A/B/E gets at the > end of the reclaim, it's still far less than 1MB (though should it be 0?). > I'd be curious to hear what Johannes thinks. We need to distinguish between what the siblings declare and what they consume. My understanding of the issue you're raising, Michal, is that protected siblings start with current > low, then get reclaimed slightly too much and end up with current < low. This results in a tiny bit of float that then gets assigned to the low=0 sibling; when that sibling gets reclaimed regardless, it sees a low event. Correct me if I missed a detail or nuance here. But unused float going to siblings is intentional. This is documented in point 3 in the comment above effective_protection(): if you use less than you're legitimately claiming, the float goes to your siblings. So the problem doesn't seem to be with low accounting and event generation, but rather it's simply overreclaim. It's conceivable to make reclaim more precise and then tighten up the test. But right now, David's patch looks correct to me.
On Mon, 9 May 2022 11:09:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Fri, May 06, 2022 at 09:40:15AM -0700, David Vernet wrote: > > Sorry for the delayed reply, Michal. I've been at LSFMM this week. > > > > On Fri, Apr 29, 2022 at 11:26:20AM +0200, Michal Koutný wrote: > > > I still think that the behavior when there's no protection left for the > > > memory.low == 0 child, there should be no memory.low events (not just > > > uncounted but not happening) and test should not accept this (even > > > though it's the current behavior). > > > > That's fair. I think part of the problem here is that in general, the > > memcontroller itself is quite heuristic, so it's tough to write tests that > > provide useful coverage while also being sufficiently flexible to avoid > > flakiness and over-prescribing expected behavior. In this case I think it's > > probably correct that the memory.low == 0 child shouldn't inherit > > protection from its parent under any circumstances due to its siblings > > overcommitting the parent's protection, but I also wonder if it's really > > necessary to enforce that. If you look at how much memory A/B/E gets at the > > end of the reclaim, it's still far less than 1MB (though should it be 0?). > > I'd be curious to hear what Johannes thinks. > > We need to distinguish between what the siblings declare and what they > consume. > > My understanding of the issue you're raising, Michal, is that > protected siblings start with current > low, then get reclaimed > slightly too much and end up with current < low. This results in a > tiny bit of float that then gets assigned to the low=0 sibling; when > that sibling gets reclaimed regardless, it sees a low event. Correct > me if I missed a detail or nuance here. > > But unused float going to siblings is intentional. This is documented > in point 3 in the comment above effective_protection(): if you use > less than you're legitimately claiming, the float goes to your > siblings. So the problem doesn't seem to be with low accounting and > event generation, but rather it's simply overreclaim. > > It's conceivable to make reclaim more precise and then tighten up the > test. But right now, David's patch looks correct to me. So I think we're OK with [2/5] now. Unless there be objections, I'll be looking to get this series into mm-stable later this week.
Hello all. On Mon, May 09, 2022 at 05:44:24PM -0700, Andrew Morton <akpm@linux-foundation.org> wrote: > So I think we're OK with [2/5] now. Unless there be objections, I'll > be looking to get this series into mm-stable later this week. I'm sorry, I think the current form of the test reveals an unexpected behavior of reclaim and silencing the test is not the way to go. Although, I may be convinced that my understanding is wrong. On Mon, May 09, 2022 at 11:09:15AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote: > My understanding of the issue you're raising, Michal, is that > protected siblings start with current > low, then get reclaimed > slightly too much and end up with current < low. This results in a > tiny bit of float that then gets assigned to the low=0 sibling; Up until here, we're on the same page. > when that sibling gets reclaimed regardless, it sees a low event. > Correct me if I missed a detail or nuance here. Here, I'd like to stress that the event itself is just a messenger (whom my original RFC patch attempted to get rid of). The problem is that if the sibling with recursive protection is active enough to claim it, it's effectively stolen from the passive sibling. See the comparison of 'precious' vs 'victim' in [1]. > But unused float going to siblings is intentional. This is documented > in point 3 in the comment above effective_protection(): if you use > less than you're legitimately claiming, the float goes to your > siblings. The problem is how the unused protection came to be (voluntarily not consumed vs reclaimed). > So the problem doesn't seem to be with low accounting and > event generation, but rather it's simply overreclaim. Exactly. > It's conceivable to make reclaim more precise and then tighten up the > test. But right now, David's patch looks correct to me. The obvious fix is at the end of this message, it resolves the case I posted earlier (with memory_recursiveprot), however, it "breaks" memory.events:low accounting inside recursive children, hence I'm not considering it finished. (I may elaborate on the breaking case if interested, I also need to look more into that myself). On Fri, May 06, 2022 at 09:40:15AM -0700, David Vernet <void@manifault.com> wrote: > If you look at how much memory A/B/E gets at the end of the reclaim, > it's still far less than 1MB (though should it be 0?). This selftest has two ±equal workloads in siblings, however, if their activity varies, it can end up even opposite (the example [1]). > This definitely sounds to me like a useful testcase to add, and I'm > happy to do so in a follow-on patch. If we added this, do you think > we need to keep the check for memory.low events for the memory.low == > 0 child in the overcommit testcase? I think it's still useful, to check the behavior when inherited vs explicit siblings coexist under protected parent. Actually, the second case of all siblings having the inherited (implicit) protection is also interesting (it seems that's that I'm seeing in my tests with the attached patch). +Cc: Chris, who reasoned about the SWAP_CLUSTER_MAX rounding vs too high priority (too low numerically IIUC) [2]. Michal [1] https://lore.kernel.org/r/20220325103118.GC2828@blackbody.suse.cz/ [2] https://lore.kernel.org/all/20190128214213.GB15349@chrisdown.name/ --- 8< --- From e18caf7a5a1b0f39185fbdc11e4034def42cde88 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com> Date: Tue, 10 May 2022 18:48:31 +0200 Subject: [RFC PATCH] mm: memcg: Do not overreclaim SWAP_CLUSTER_MAX from protected memcg MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This was observed with memcontrol selftest/new LTP test but can be also reproduced in simplified setup of two siblings: `parent .low=50M ` s1 .low=50M .current=50M+ε ` s2 .low=0M .current=50M The expectation is that s2/memory.events:low will be zero under outer reclaimer since no protection should be given to cgroup s2 (even with memory_recursiveprot). However, this does not happen. The apparent reason is that when s1 is considered for (proportional) reclaim the scanned proportion is rounded up to SWAP_CLUSTER_MAX and slightly over-proportional amount is reclaimed. Consequently, when the effective low value of s2 is calculated, it observes unclaimed parent's protection from s1 (ε-SWAP_CLUSTER_MAX in theory) and effectively appropriates it. What is worse, when the sibling s2 has more (memory) greedy workload, it can repeatedly "steal" the protection from s1 and the distribution ends up with s1 mostly reclaimed despite explicit prioritization over s2. Simply fix it by _not_ rounding up to SWAP_CLUSTER_MAX. This would have saved us ~5 levels of reclaim priority. I.e. we may be reclaiming from protected memcgs at relatively low priority _without_ counting any memory.events:low (due to overreclaim). Now, if the moderated scan is not enough, we must bring priority to zero to open protected reserves. And that's correct, we want to be explicit when reclaiming those. Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection") Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") Reported-by: Richard Palethorpe <rpalethorpe@suse.com> Link: https://lore.kernel.org/all/20220321101429.3703-1-rpalethorpe@suse.com/ Signed-off-by: Michal Koutný <mkoutny@suse.com> --- mm/vmscan.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 1678802e03e7..cd760842b9ad 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2798,13 +2798,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1); - - /* - * Minimally target SWAP_CLUSTER_MAX pages to keep - * reclaim moving forwards, avoiding decrementing - * sc->priority further than desirable. - */ - scan = max(scan, SWAP_CLUSTER_MAX); } else { scan = lruvec_size; }
Hi Michal, On Tue, May 10, 2022 at 07:43:41PM +0200, Michal Koutný wrote: > On Mon, May 09, 2022 at 05:44:24PM -0700, Andrew Morton <akpm@linux-foundation.org> wrote: > > So I think we're OK with [2/5] now. Unless there be objections, I'll > > be looking to get this series into mm-stable later this week. > > I'm sorry, I think the current form of the test reveals an unexpected > behavior of reclaim and silencing the test is not the way to go. > Although, I may be convinced that my understanding is wrong. Looking through your demo results again, I agree with you. It's a tiny error, but it compounds and systematically robs the protected group over and over, to the point where its protection becomes worthless - at least in idle groups, which isn't super common but does happen. Let's keep the test as-is and fix reclaim to make it pass ;) > The obvious fix is at the end of this message, it resolves the case I > posted earlier (with memory_recursiveprot), however, it "breaks" > memory.events:low accounting inside recursive children, hence I'm not > considering it finished. (I may elaborate on the breaking case if > interested, I also need to look more into that myself). Can you indeed elaborate on the problem you see with low events? > @@ -2798,13 +2798,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, > > scan = lruvec_size - lruvec_size * protection / > (cgroup_size + 1); > - > - /* > - * Minimally target SWAP_CLUSTER_MAX pages to keep > - * reclaim moving forwards, avoiding decrementing > - * sc->priority further than desirable. > - */ > - scan = max(scan, SWAP_CLUSTER_MAX); IIRC this was added due to premature OOMs in synthetic testing (Chris may remember more details). However, in practice it wasn't enough anyway, and was followed up by f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim"). Now, reclaim retries the whole cycle if proportional protection was in place and it didn't manage to make progress. The rounding for progress doesn't seem to matter anymore. So your proposed patch looks like the right thing to do to me. And I would ack it, but please do explain your concerns around low event reporting after it. Thanks!
On Wed, May 11, 2022 at 01:53:05PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote: > Can you indeed elaborate on the problem you see with low events? My mistake. I realized I was testing on a system without memory_recursiveprot enabled. Therefore I saw no events in children with memory.low=0. However, it also means that my previous evaluation of the "simple" fix (dropping the SWAP_CLUSTER_MAX rounding) was incorrect and it actually doesn't resolve the problem of two differently active siblings I posted earlier. > So your proposed patch looks like the right thing to do to me. And I > would ack it, but please do explain your concerns around low event > reporting after it. I retract it (at least for now), it doesn't really help. It can be seen (after application) [1] that once (low) protected memory is opened for reclaim, the sibling proportions change suddenly (neither sibling is protected during sc->memcg_low_reclaim, however, the formerly protected suddenly provides good supply of reclaimable pages). OTOH, without memory_recursiveprot [2], the elow growth of the victim sibling is absent and situation stabilizes with only partial reclaim from the (explicitly) protected sibling. In both variants (recursive/non-recursive) the parent ends up with same amount of unreclaimed memory, however, the gradual tranfer of elow with the recursive protection is undesired. (I'm only thinking how to solve it simply.) Michal [1] https://bugzilla.suse.com/attachment.cgi?id=858869 [2] https://bugzilla.suse.com/attachment.cgi?id=858870
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c index dbaa7aabbb4a..e5d8d727bdcf 100644 --- a/tools/testing/selftests/cgroup/cgroup_util.c +++ b/tools/testing/selftests/cgroup/cgroup_util.c @@ -535,6 +535,18 @@ int set_oom_adj_score(int pid, int score) return 0; } +int proc_mount_contains(const char *option) +{ + char buf[4 * PAGE_SIZE]; + ssize_t read; + + read = read_text("/proc/mounts", buf, sizeof(buf)); + if (read < 0) + return read; + + return strstr(buf, option) != NULL; +} + ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, size_t size) { char path[PATH_MAX]; diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h index 628738532ac9..756f76052b44 100644 --- a/tools/testing/selftests/cgroup/cgroup_util.h +++ b/tools/testing/selftests/cgroup/cgroup_util.h @@ -48,6 +48,7 @@ extern int is_swap_enabled(void); extern int set_oom_adj_score(int pid, int score); extern int cg_wait_for_proc_count(const char *cgroup, int count); extern int cg_killall(const char *cgroup); +int proc_mount_contains(const char *option); extern ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, size_t size); extern int proc_read_strstr(int pid, bool thread, const char *item, const char *needle); extern pid_t clone_into_cgroup(int cgroup_fd); diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c index 284d912e7d3e..d37e8dfb1248 100644 --- a/tools/testing/selftests/cgroup/test_memcontrol.c +++ b/tools/testing/selftests/cgroup/test_memcontrol.c @@ -21,6 +21,8 @@ #include "../kselftest.h" #include "cgroup_util.h" +static bool has_recursiveprot; + /* * This test creates two nested cgroups with and without enabling * the memory controller. @@ -521,15 +523,18 @@ static int test_memcg_low(const char *root) } for (i = 0; i < ARRAY_SIZE(children); i++) { + int no_low_events_index = has_recursiveprot ? 2 : 1; + oom = cg_read_key_long(children[i], "memory.events", "oom "); low = cg_read_key_long(children[i], "memory.events", "low "); if (oom) goto cleanup; - if (i < 2 && low <= 0) + if (i <= no_low_events_index && low <= 0) goto cleanup; - if (i >= 2 && low) + if (i > no_low_events_index && low) goto cleanup; + } ret = KSFT_PASS; @@ -1272,7 +1277,7 @@ struct memcg_test { int main(int argc, char **argv) { char root[PATH_MAX]; - int i, ret = EXIT_SUCCESS; + int i, proc_status, ret = EXIT_SUCCESS; if (cg_find_unified_root(root, sizeof(root))) ksft_exit_skip("cgroup v2 isn't mounted\n"); @@ -1288,6 +1293,11 @@ int main(int argc, char **argv) if (cg_write(root, "cgroup.subtree_control", "+memory")) ksft_exit_skip("Failed to set memory controller\n"); + proc_status = proc_mount_contains("memory_recursiveprot"); + if (proc_status < 0) + ksft_exit_skip("Failed to query cgroup mount option\n"); + has_recursiveprot = proc_status; + for (i = 0; i < ARRAY_SIZE(tests); i++) { switch (tests[i].fn(root)) { case KSFT_PASS: