
[v6,1/2] mm/vmscan: Skip memcg with !usage in shrink_node_memcgs()

Message ID 20250414021249.3232315-2-longman@redhat.com
State New
Series memcg: Fix test_memcg_min/low test failures

Commit Message

Waiman Long April 14, 2025, 2:12 a.m. UTC
The test_memcontrol selftest consistently fails its test_memcg_low
sub-test because two of its test child cgroups, which have a memory.low
of 0 or an effective memory.low of 0, still have low events generated
for them, since mem_cgroup_below_low() uses the ">=" operator when
comparing to elow.

The two failing cases are as follows:

1) memory.low is set to 0, but low events can still be triggered and
   so the cgroup may have a non-zero low event count.

2) memory.low is set to a non-zero value but the cgroup has no task in
   it so that it has an effective low value of 0. Again it may have a
   non-zero low event count if memory reclaim happens. This is probably
   not a result expected by the users and it is really doubtful that
   users will check an empty cgroup with no task in it and expect
   some non-zero event counts.

In the first case, even though memory.low isn't set, it may still have
some low protection if memory.low is set in the parent and the cgroup2
memory_recursiveprot mount option is enabled. So low events may still
be recorded. The test_memcontrol.c test has to be modified to account
for that.

For the second case, it really doesn't make sense to have a non-zero
low event count if the cgroup has 0 usage. So we need to skip this
corner case in shrink_node_memcgs() by skipping the !usage case.

With this patch applied, the test_memcg_low sub-test finishes
successfully in most cases, though both the test_memcg_low and
test_memcg_min sub-tests may still fail occasionally if the
memory.current values fall outside of the expected ranges.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/internal.h                                    |  9 +++++++++
 mm/memcontrol-v1.h                               |  2 --
 mm/vmscan.c                                      |  4 ++++
 tools/testing/selftests/cgroup/test_memcontrol.c | 16 +++++++++++-----
 4 files changed, 24 insertions(+), 7 deletions(-)

Comments

Michal Koutný April 14, 2025, 12:42 p.m. UTC | #1
On Sun, Apr 13, 2025 at 10:12:48PM -0400, Waiman Long <longman@redhat.com> wrote:
> 2) memory.low is set to a non-zero value but the cgroup has no task in
>    it so that it has an effective low value of 0. Again it may have a
>    non-zero low event count if memory reclaim happens. This is probably
>    not a result expected by the users and it is really doubtful that
>    users will check an empty cgroup with no task in it and expecting
>    some non-zero event counts.

I think you want to distinguish "no tasks" vs "no usage" in this
paragraph.


> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5963,6 +5963,10 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  
>  		mem_cgroup_calculate_protection(target_memcg, memcg);
>  
> +		/* Skip memcg with no usage */
> +		if (!mem_cgroup_usage(memcg, false))
> +			continue;
> +
>  		if (mem_cgroup_below_min(target_memcg, memcg)) {

As I think more about this -- the idea expressed by the diff makes
sense. But is it really a change?
For non-root memcgs, they'll be skipped because 0 >= 0 (in
mem_cgroup_below_min()) and root memcg would hardly be skipped.


> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> @@ -380,10 +380,10 @@ static bool reclaim_until(const char *memcg, long goal);
>   *
>   * Then it checks actual memory usages and expects that:
>   * A/B    memory.current ~= 50M
> - * A/B/C  memory.current ~= 29M
> - * A/B/D  memory.current ~= 21M
> - * A/B/E  memory.current ~= 0
> - * A/B/F  memory.current  = 0
> + * A/B/C  memory.current ~= 29M [memory.events:low > 0]
> + * A/B/D  memory.current ~= 21M [memory.events:low > 0]
> + * A/B/E  memory.current ~= 0   [memory.events:low == 0 if !memory_recursiveprot, > 0 otherwise]

Please note the subtlety in my suggestion -- I want the test with
memory_recursiveprot _not_ to check events count at all. Because:
	a) it forces single interpretation of low events wrt effective
	   low limit 
	b) effective low limit should still be 0 in E in this testcase
	   (there should be no unclaimed protection of C and D).

> + * A/B/F  memory.current  = 0   [memory.events:low == 0]


Thanks,
Michal

Waiman Long April 14, 2025, 1:15 p.m. UTC | #2
On 4/14/25 8:42 AM, Michal Koutný wrote:
> On Sun, Apr 13, 2025 at 10:12:48PM -0400, Waiman Long <longman@redhat.com> wrote:
>> 2) memory.low is set to a non-zero value but the cgroup has no task in
>>     it so that it has an effective low value of 0. Again it may have a
>>     non-zero low event count if memory reclaim happens. This is probably
>>     not a result expected by the users and it is really doubtful that
>>     users will check an empty cgroup with no task in it and expecting
>>     some non-zero event counts.
> I think you want to distinguish "no tasks" vs "no usage" in this
> paragraph.

Good point. Will update it if I need to send a new version.


>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -5963,6 +5963,10 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>>   
>>   		mem_cgroup_calculate_protection(target_memcg, memcg);
>>   
>> +		/* Skip memcg with no usage */
>> +		if (!mem_cgroup_usage(memcg, false))
>> +			continue;
>> +
>>   		if (mem_cgroup_below_min(target_memcg, memcg)) {
> As I think more about this -- the idea expressed by the diff makes
> sense. But is it really a change?
> For non-root memcgs, they'll be skipped because 0 >= 0 (in
> mem_cgroup_below_min()) and root memcg would hardly be skipped.

I did see some low events in the no-usage case because of the ">="
comparison used in mem_cgroup_below_min(). I was originally planning to
guard against the elow == 0 case but Johannes advised against it.


>
>
>> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
>> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
>> @@ -380,10 +380,10 @@ static bool reclaim_until(const char *memcg, long goal);
>>    *
>>    * Then it checks actual memory usages and expects that:
>>    * A/B    memory.current ~= 50M
>> - * A/B/C  memory.current ~= 29M
>> - * A/B/D  memory.current ~= 21M
>> - * A/B/E  memory.current ~= 0
>> - * A/B/F  memory.current  = 0
>> + * A/B/C  memory.current ~= 29M [memory.events:low > 0]
>> + * A/B/D  memory.current ~= 21M [memory.events:low > 0]
>> + * A/B/E  memory.current ~= 0   [memory.events:low == 0 if !memory_recursiveprot, > 0 otherwise]
> Please note the subtlety in my suggestion -- I want the test with
> memory_recursiveprot _not_ to check events count at all. Because:
> 	a) it forces single interpretation of low events wrt effective
> 	   low limit
> 	b) effective low limit should still be 0 in E in this testcase
> 	   (there should be no unclaimed protection of C and D).

Yes, the low event count for E is 0 in the !memory_recursiveprot case,
but C/D still have low events, so setting no_low_events_index to -1 will
fail the test. It is not the same as not checking low event counts at
all.

Cheers,
Longman

Michal Koutný April 14, 2025, 1:55 p.m. UTC | #3
On Mon, Apr 14, 2025 at 09:15:57AM -0400, Waiman Long <llong@redhat.com> wrote:
> I did see some low event in the no usage case because of the ">=" comparison
> used in mem_cgroup_below_min().

Do you refer to A/B/E or A/B/F from the test?
It's OK to see some events if there was non-zero usage initially.

Nevertheless, which situation does this patch change that is not
already handled by mem_cgroup_below_min()?

> Yes, low event count for E is 0 in the !memory_recursiveprot case, but C/D
> still have low events and setting no_low_events_index to -1 will fail the
> test and it is not the same as not checking low event counts at all.

I added yet another ignore_low_events_index variable (in my original
proposal) not to fail the test. But feel free to come up with another
implementation, I wanted to point out the "not specified" expectation
for E with memory_recursiveprot.

Michal

Johannes Weiner April 14, 2025, 4:47 p.m. UTC | #4
On Mon, Apr 14, 2025 at 03:55:39PM +0200, Michal Koutný wrote:
> On Mon, Apr 14, 2025 at 09:15:57AM -0400, Waiman Long <llong@redhat.com> wrote:
> > I did see some low event in the no usage case because of the ">=" comparison
> > used in mem_cgroup_below_min().
> 
> Do you refer to A/B/E or A/B/F from the test?
> It's OK to see some events if there was non-zero usage initially.
> 
> Nevertheless, which situation this patch changes that is not handled by
> mem_cgroup_below_min() already?

It's not a functional change to the protection semantics or the
reclaim behavior.

The problem is if we go into low_reclaim and encounter an empty group,
we'll issue "low-protected group is being reclaimed" events, which is
kind of absurd (nothing will be reclaimed) and thus confusing to users
(I didn't even configure any protection!)

I suggested, instead of redefining the protection definitions for that
special case, to bypass all the checks and the scan count calculations
when we already know the group is empty and none of this applies.

https://lore.kernel.org/linux-mm/20250404181308.GA300138@cmpxchg.org/

Michal Koutný April 14, 2025, 6:01 p.m. UTC | #5
On Mon, Apr 14, 2025 at 12:47:21PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> It's not a functional change to the protection semantics or the
> reclaim behavior.

Yes, that's how I understand it, therefore I'm wondering what does it
change.

If this is taken:
               if (!mem_cgroup_usage(memcg, false))
                       continue;

this would've been taken too:
                if (mem_cgroup_below_min(target_memcg, memcg))
                        continue;
(unless target_memcg == memcg but that's not interesting for the events
here)

> The problem is if we go into low_reclaim and encounter an empty group,
> we'll issue "low-protected group is being reclaimed" events,

How can this happen when
	page_counter_read(&memcg->memory) <= memcg->memory.emin
? (I.e. in this case 0 <= emin and emin >= 0.)

> which is kind of absurd (nothing will be reclaimed) and thus confusing
> to users (I didn't even configure any protection!)

Yes.
 
> I suggested, instead of redefining the protection definitions for that
> special case, to bypass all the checks and the scan count calculations
> when we already know the group is empty and none of this applies.
> 
> https://lore.kernel.org/linux-mm/20250404181308.GA300138@cmpxchg.org/

Is this non-functional change to make shrink_node_memcgs() robust
against possible future redefinitions of mem_cgroup_below_*()?


Michal

Johannes Weiner April 14, 2025, 6:10 p.m. UTC | #6
On Mon, Apr 14, 2025 at 08:01:42PM +0200, Michal Koutný wrote:
> On Mon, Apr 14, 2025 at 12:47:21PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > It's not a functional change to the protection semantics or the
> > reclaim behavior.
> 
> Yes, that's how I understand it, therefore I'm wondering what does it
> change.
> 
> If this is taken:
>                if (!mem_cgroup_usage(memcg, false))
>                        continue;
> 
> this would've been taken too:
>                 if (mem_cgroup_below_min(target_memcg, memcg))
>                         continue;
> (unless target_memcg == memcg but that's not interesting for the events
> here)

D'oh.

> > The problem is if we go into low_reclaim and encounter an empty group,
> > we'll issue "low-protected group is being reclaimed" events,
> 
> How can this happen when
> 	page_counter_read(&memcg->memory) <= memcg->memory.emin
> ? (I.e. in this case 0 <= emin and emin >= 0.)
> 
> > which is kind of absurd (nothing will be reclaimed) and thus confusing
> > to users (I didn't even configure any protection!)
> 
> Yes.
>  
> > I suggested, instead of redefining the protection definitions for that
> > special case, to bypass all the checks and the scan count calculations
> > when we already know the group is empty and none of this applies.
> > 
> > https://lore.kernel.org/linux-mm/20250404181308.GA300138@cmpxchg.org/
> 
> Is this non-functional change to make shrink_node_memcgs() robust
> against possible future redefinitions of mem_cgroup_below_*()?

No, this was really just aimed to stop low events on empty groups.

But as you rightfully point out, they should not get past the min
check in the first place. So something seems missing here.

Waiman Long April 14, 2025, 6:57 p.m. UTC | #7
On 4/14/25 2:10 PM, Johannes Weiner wrote:
> On Mon, Apr 14, 2025 at 08:01:42PM +0200, Michal Koutný wrote:
>> On Mon, Apr 14, 2025 at 12:47:21PM -0400, Johannes Weiner<hannes@cmpxchg.org> wrote:
>>> It's not a functional change to the protection semantics or the
>>> reclaim behavior.
>> Yes, that's how I understand it, therefore I'm wondering what does it
>> change.
>>
>> If this is taken:
>>                 if (!mem_cgroup_usage(memcg, false))
>>                         continue;
>>
>> this would've been taken too:
>>                  if (mem_cgroup_below_min(target_memcg, memcg))
>>                          continue;
>> (unless target_memcg == memcg but that's not interesting for the events
>> here)
> D'oh.
>
>>> The problem is if we go into low_reclaim and encounter an empty group,
>>> we'll issue "low-protected group is being reclaimed" events,
>> How can this happen when
>> 	page_counter_read(&memcg->memory) <= memcg->memory.emin
>> ? (I.e. in this case 0 <= emin and emin >= 0.)
>>
>>> which is kind of absurd (nothing will be reclaimed) and thus confusing
>>> to users (I didn't even configure any protection!)
>> Yes.
>>   
>>> I suggested, instead of redefining the protection definitions for that
>>> special case, to bypass all the checks and the scan count calculations
>>> when we already know the group is empty and none of this applies.
>>>
>>> https://lore.kernel.org/linux-mm/20250404181308.GA300138@cmpxchg.org/
>> Is this non-functional change to make shrink_node_memcgs() robust
>> against possible future redefinitions of mem_cgroup_below_*()?
> No, this was really just aimed to stop low events on empty groups.
>
> But as you rightfully point out, they should not get past the min
> check in the first place. So something seems missing here.

I think I saw some low events in the !usage case because my original
patch removed the '=' from mem_cgroup_below_low() and
mem_cgroup_below_min(), which let empty cgroups get past the
mem_cgroup_below_min() check. Without touching
mem_cgroup_below_min/low(), the addition of mem_cgroup_usage() in
shrink_node_memcgs() is probably redundant. I can remove it from the
patch.

Thanks for the detailed review.

Cheers,
Longman
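
The redundancy argument made in comments #5 to #7 can be checked with a
small user-space model. This is only a sketch under stated assumptions:
struct memcg_model and the model_* helpers are hypothetical names, not
kernel code. The model uses the "usage <= emin" comparison quoted in
comment #5 and ignores the target_memcg == memcg and root memcg cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two values compared in the thread. */
struct memcg_model {
	unsigned long usage;	/* models page_counter_read(&memcg->memory) */
	unsigned long emin;	/* models memcg->memory.emin */
};

/* Models the comparison quoted in comment #5: usage <= emin. */
static bool model_below_min(const struct memcg_model *m)
{
	return m->usage <= m->emin;
}

/* Loop body without the patch: only the below-min check can skip. */
static bool model_skipped_without_check(const struct memcg_model *m)
{
	return model_below_min(m);
}

/* Loop body with the patch: the !usage check runs first. */
static bool model_skipped_with_check(const struct memcg_model *m)
{
	if (!m->usage)
		return true;
	return model_below_min(m);
}

/* Asserts that both variants skip exactly the same memcgs. */
static void model_check_redundancy(void)
{
	for (unsigned long usage = 0; usage < 4; usage++) {
		for (unsigned long emin = 0; emin < 4; emin++) {
			struct memcg_model m = { .usage = usage, .emin = emin };

			assert(model_skipped_with_check(&m) ==
			       model_skipped_without_check(&m));
		}
	}
}
```

Calling model_check_redundancy() from a main() completes without
tripping an assertion: an unsigned usage of 0 can never exceed emin, so
an empty memcg is already skipped by the below-min check in this model.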

Patch

diff --git a/mm/internal.h b/mm/internal.h
index 50c2f590b2d0..c06fb0e8d75c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1535,6 +1535,15 @@  void __meminit __init_page_from_nid(unsigned long pfn, int nid);
 unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 			  int priority);
 
+#ifdef CONFIG_MEMCG
+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
+#else
+static inline unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
+{
+	return 1UL;
+}
+#endif
+
 #ifdef CONFIG_SHRINKER_DEBUG
 static inline __printf(2, 0) int shrinker_debugfs_name_alloc(
 			struct shrinker *shrinker, const char *fmt, va_list ap)
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 6358464bb416..e92b21af92b1 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -22,8 +22,6 @@ 
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
-
 void drain_all_stock(struct mem_cgroup *root_memcg);
 
 unsigned long memcg_events(struct mem_cgroup *memcg, int event);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b620d74b0f66..a771a0145a12 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5963,6 +5963,10 @@  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
+		/* Skip memcg with no usage */
+		if (!mem_cgroup_usage(memcg, false))
+			continue;
+
 		if (mem_cgroup_below_min(target_memcg, memcg)) {
 			/*
 			 * Hard protection.
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 16f5d74ae762..5a5dcbe57b56 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -380,10 +380,10 @@  static bool reclaim_until(const char *memcg, long goal);
  *
  * Then it checks actual memory usages and expects that:
  * A/B    memory.current ~= 50M
- * A/B/C  memory.current ~= 29M
- * A/B/D  memory.current ~= 21M
- * A/B/E  memory.current ~= 0
- * A/B/F  memory.current  = 0
+ * A/B/C  memory.current ~= 29M [memory.events:low > 0]
+ * A/B/D  memory.current ~= 21M [memory.events:low > 0]
+ * A/B/E  memory.current ~= 0   [memory.events:low == 0 if !memory_recursiveprot, > 0 otherwise]
+ * A/B/F  memory.current  = 0   [memory.events:low == 0]
  * (for origin of the numbers, see model in memcg_protection.m.)
  *
  * After that it tries to allocate more than there is
@@ -525,8 +525,14 @@  static int test_memcg_protection(const char *root, bool min)
 		goto cleanup;
 	}
 
+	/*
+	 * Child 2 has memory.low=0, but some low protection is still being
+	 * distributed down from its parent with memory.low=50M if cgroup2
+	 * memory_recursiveprot mount option is enabled. So the low event
+	 * count will be non-zero in this case.
+	 */
 	for (i = 0; i < ARRAY_SIZE(children); i++) {
-		int no_low_events_index = 1;
+		int no_low_events_index = has_recursiveprot ? 2 : 1;
 		long low, oom;
 
 		oom = cg_read_key_long(children[i], "memory.events", "oom ");
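
The effect of the new index selection on the per-child check can be
sketched as a standalone predicate. low_events_expected_ok() is a
hypothetical helper, not part of test_memcontrol.c. It assumes that the
rest of the loop (not shown in this hunk) requires a positive low count
for children at or below no_low_events_index and a zero count for the
others, which matches the expectation table in the patched comment.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true when the "low" count read from child i's
 * memory.events matches the expectations listed in the patched comment
 * (children 0..3 correspond to A/B/C, A/B/D, A/B/E and A/B/F).
 */
static bool low_events_expected_ok(int i, long low, bool has_recursiveprot)
{
	/* Same selection as the patched loop. */
	int no_low_events_index = has_recursiveprot ? 2 : 1;

	assert(i >= 0 && i < 4);

	if (i <= no_low_events_index)
		return low > 0;		/* these children must report low events */
	return low == 0;		/* the remaining children must report none */
}
```

With memory_recursiveprot, child 2 (A/B/E) is required to report low
events under this patch. Comment #1 argues for leaving that count
unchecked instead, which this predicate does not model.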