
mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim

Message ID 20210817180506.220056-1-hannes@cmpxchg.org (mailing list archive)
State: New
Series: mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim

Commit Message

Johannes Weiner Aug. 17, 2021, 6:05 p.m. UTC
We've noticed occasional OOM killing when memory.low settings are in
effect for cgroups. This is unexpected and undesirable as memory.low
is supposed to express non-OOMing memory priorities between cgroups.

The reason for this is proportional memory.low reclaim. When cgroups
are below their memory.low threshold, reclaim passes them over in the
first round, and then retries if it couldn't find pages anywhere else.
But when cgroups are slighly above their memory.low setting, page scan
force is scaled down and diminished in proportion to the overage, to
the point where it can cause reclaim to fail as well - only in that
case we currently don't retry, and instead trigger OOM.

To fix this, hook proportional reclaim into the same retry logic we
have in place for when cgroups are skipped entirely. This way if
reclaim fails and some cgroups were scanned with dimished pressure,
we'll try another full-force cycle before giving up and OOMing.

Reported-by: Leon Yang <lnyng@fb.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 29 +++++++++++++++--------------
 mm/vmscan.c                | 27 +++++++++++++++++++--------
 2 files changed, 34 insertions(+), 22 deletions(-)

Comments

Rik van Riel Aug. 17, 2021, 6:44 p.m. UTC | #1
On Tue, 2021-08-17 at 14:05 -0400, Johannes Weiner wrote:
> To fix this, hook proportional reclaim into the same retry logic we
> have in place for when cgroups are skipped entirely. This way if
> reclaim fails and some cgroups were scanned with dimished pressure,
> we'll try another full-force cycle before giving up and OOMing.

Good catch.

> Reported-by: Leon Yang <lnyng@fb.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@surriel.com>
Shakeel Butt Aug. 17, 2021, 7:10 p.m. UTC | #2
On Tue, Aug 17, 2021 at 11:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We've noticed occasional OOM killing when memory.low settings are in
> effect for cgroups. This is unexpected and undesirable as memory.low
> is supposed to express non-OOMing memory priorities between cgroups.
>
> The reason for this is proportional memory.low reclaim. When cgroups
> are below their memory.low threshold, reclaim passes them over in the
> first round, and then retries if it couldn't find pages anywhere else.
> But when cgroups are slighly above their memory.low setting, page scan

*slightly

> force is scaled down and diminished in proportion to the overage, to
> the point where it can cause reclaim to fail as well - only in that
> case we currently don't retry, and instead trigger OOM.
>
> To fix this, hook proportional reclaim into the same retry logic we
> have in place for when cgroups are skipped entirely. This way if
> reclaim fails and some cgroups were scanned with dimished pressure,

*diminished

> we'll try another full-force cycle before giving up and OOMing.
>
> Reported-by: Leon Yang <lnyng@fb.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Should this be considered for stable?

Reviewed-by: Shakeel Butt <shakeelb@google.com>

[...]
>
>  static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4620df62f0ff..701106e1829c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -100,9 +100,12 @@ struct scan_control {
>         unsigned int may_swap:1;
>
>         /*
> -        * Cgroups are not reclaimed below their configured memory.low,
> -        * unless we threaten to OOM. If any cgroups are skipped due to
> -        * memory.low and nothing was reclaimed, go back for memory.low.
> +        * Cgroup memory below memory.low is protected as long as we
> +        * don't threaten to OOM. If any cgroup is reclaimed at
> +        * reduced force or passed over entirely due to its memory.low
> +        * setting (memcg_low_skipped), and nothing is reclaimed as a
> +        * result, then go back back for one more cycle that reclaims

*back
Andrew Morton Aug. 17, 2021, 7:14 p.m. UTC | #3
On Tue, 17 Aug 2021 14:05:06 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> We've noticed occasional OOM killing when memory.low settings are in
> effect for cgroups. This is unexpected and undesirable as memory.low
> is supposed to express non-OOMing memory priorities between cgroups.
> 
> The reason for this is proportional memory.low reclaim. When cgroups
> are below their memory.low threshold, reclaim passes them over in the
> first round, and then retries if it couldn't find pages anywhere else.
> But when cgroups are slighly above their memory.low setting, page scan
> force is scaled down and diminished in proportion to the overage, to
> the point where it can cause reclaim to fail as well - only in that
> case we currently don't retry, and instead trigger OOM.
> 
> To fix this, hook proportional reclaim into the same retry logic we
> have in place for when cgroups are skipped entirely. This way if
> reclaim fails and some cgroups were scanned with dimished pressure,
> we'll try another full-force cycle before giving up and OOMing.

Which kernel version(s) do you think need this?
Roman Gushchin Aug. 17, 2021, 7:45 p.m. UTC | #4
On Tue, Aug 17, 2021 at 02:05:06PM -0400, Johannes Weiner wrote:
> We've noticed occasional OOM killing when memory.low settings are in
> effect for cgroups. This is unexpected and undesirable as memory.low
> is supposed to express non-OOMing memory priorities between cgroups.
> 
> The reason for this is proportional memory.low reclaim. When cgroups
> are below their memory.low threshold, reclaim passes them over in the
> first round, and then retries if it couldn't find pages anywhere else.
> But when cgroups are slighly above their memory.low setting, page scan
> force is scaled down and diminished in proportion to the overage, to
> the point where it can cause reclaim to fail as well - only in that
> case we currently don't retry, and instead trigger OOM.
> 
> To fix this, hook proportional reclaim into the same retry logic we
> have in place for when cgroups are skipped entirely. This way if
> reclaim fails and some cgroups were scanned with dimished pressure,
> we'll try another full-force cycle before giving up and OOMing.
> 
> Reported-by: Leon Yang <lnyng@fb.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Roman Gushchin <guro@fb.com>

I guess it's stable material, so maybe adding:
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")

?


Thanks!
Johannes Weiner Aug. 18, 2021, 2:15 p.m. UTC | #5
On Tue, Aug 17, 2021 at 12:45:18PM -0700, Roman Gushchin wrote:
> On Tue, Aug 17, 2021 at 02:05:06PM -0400, Johannes Weiner wrote:
> > We've noticed occasional OOM killing when memory.low settings are in
> > effect for cgroups. This is unexpected and undesirable as memory.low
> > is supposed to express non-OOMing memory priorities between cgroups.
> > 
> > The reason for this is proportional memory.low reclaim. When cgroups
> > are below their memory.low threshold, reclaim passes them over in the
> > first round, and then retries if it couldn't find pages anywhere else.
> > But when cgroups are slighly above their memory.low setting, page scan
> > force is scaled down and diminished in proportion to the overage, to
> > the point where it can cause reclaim to fail as well - only in that
> > case we currently don't retry, and instead trigger OOM.
> > 
> > To fix this, hook proportional reclaim into the same retry logic we
> > have in place for when cgroups are skipped entirely. This way if
> > reclaim fails and some cgroups were scanned with dimished pressure,
> > we'll try another full-force cycle before giving up and OOMing.
> > 
> > Reported-by: Leon Yang <lnyng@fb.com>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Roman Gushchin <guro@fb.com>

Thank you.

> I guess it's stable material, so maybe adding:
> Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")

Yes, that Fixes makes sense. Plus:

Cc: <stable@vger.kernel.org> # 5.4+

I initially didn't tag it because the issue is over two years old and
we've had no other reports of this. But thinking about it, it's
probably more a lack of users rather than severity. At FB we only
noticed with a recent rollout of memory_recursiveprot
(8a931f801340c2be10552c7b5622d5f4852f3a36) because we didn't have
working memory.low configurations before that. But now that we do
notice, it's a problem worth fixing. So yes, stable makes sense.

Thanks.
Johannes Weiner Aug. 18, 2021, 2:16 p.m. UTC | #6
On Tue, Aug 17, 2021 at 12:10:16PM -0700, Shakeel Butt wrote:
> On Tue, Aug 17, 2021 at 11:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > We've noticed occasional OOM killing when memory.low settings are in
> > effect for cgroups. This is unexpected and undesirable as memory.low
> > is supposed to express non-OOMing memory priorities between cgroups.
> >
> > The reason for this is proportional memory.low reclaim. When cgroups
> > are below their memory.low threshold, reclaim passes them over in the
> > first round, and then retries if it couldn't find pages anywhere else.
> > But when cgroups are slighly above their memory.low setting, page scan
> 
> *slightly
> 
> > force is scaled down and diminished in proportion to the overage, to
> > the point where it can cause reclaim to fail as well - only in that
> > case we currently don't retry, and instead trigger OOM.
> >
> > To fix this, hook proportional reclaim into the same retry logic we
> > have in place for when cgroups are skipped entirely. This way if
> > reclaim fails and some cgroups were scanned with dimished pressure,
> 
> *diminished

Oops. Andrew, would you mind folding these into the checkpatch fixlet?

> > we'll try another full-force cycle before giving up and OOMing.
> >
> > Reported-by: Leon Yang <lnyng@fb.com>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Should this be considered for stable?

Yes, I think so after all. Please see my reply to Roman.

> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thanks Shakeel!
Chris Down Aug. 18, 2021, 8:18 p.m. UTC | #7
Johannes Weiner writes:
>We've noticed occasional OOM killing when memory.low settings are in
>effect for cgroups. This is unexpected and undesirable as memory.low
>is supposed to express non-OOMing memory priorities between cgroups.
>
>The reason for this is proportional memory.low reclaim. When cgroups
>are below their memory.low threshold, reclaim passes them over in the
>first round, and then retries if it couldn't find pages anywhere else.
>But when cgroups are slighly above their memory.low setting, page scan
>force is scaled down and diminished in proportion to the overage, to
>the point where it can cause reclaim to fail as well - only in that
>case we currently don't retry, and instead trigger OOM.
>
>To fix this, hook proportional reclaim into the same retry logic we
>have in place for when cgroups are skipped entirely. This way if
>reclaim fails and some cgroups were scanned with dimished pressure,
>we'll try another full-force cycle before giving up and OOMing.
>
>Reported-by: Leon Yang <lnyng@fb.com>
>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for tracking this down! Agreed that this looks like a good stable 
candidate.

Acked-by: Chris Down <chris@chrisdown.name>

>---
> include/linux/memcontrol.h | 29 +++++++++++++++--------------
> mm/vmscan.c                | 27 +++++++++++++++++++--------
> 2 files changed, 34 insertions(+), 22 deletions(-)
>
>diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>index bfe5c486f4ad..24797929d8a1 100644
>--- a/include/linux/memcontrol.h
>+++ b/include/linux/memcontrol.h
>@@ -612,12 +612,15 @@ static inline bool mem_cgroup_disabled(void)
> 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
> }
>
>-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
>-						  struct mem_cgroup *memcg,
>-						  bool in_low_reclaim)
>+static inline void mem_cgroup_protection(struct mem_cgroup *root,
>+					 struct mem_cgroup *memcg,
>+					 unsigned long *min,
>+					 unsigned long *low)
> {
>+	*min = *low = 0;
>+
> 	if (mem_cgroup_disabled())
>-		return 0;
>+		return;
>
> 	/*
> 	 * There is no reclaim protection applied to a targeted reclaim.
>@@ -653,13 +656,10 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
> 	 *
> 	 */
> 	if (root == memcg)
>-		return 0;
>-
>-	if (in_low_reclaim)
>-		return READ_ONCE(memcg->memory.emin);
>+		return;
>
>-	return max(READ_ONCE(memcg->memory.emin),
>-		   READ_ONCE(memcg->memory.elow));
>+	*min = READ_ONCE(memcg->memory.emin);
>+	*low = READ_ONCE(memcg->memory.elow);
> }
>
> void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>@@ -1147,11 +1147,12 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> {
> }
>
>-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
>-						  struct mem_cgroup *memcg,
>-						  bool in_low_reclaim)
>+static inline void mem_cgroup_protection(struct mem_cgroup *root,
>+					 struct mem_cgroup *memcg,
>+					 unsigned long *min,
>+					 unsigned long *low)
> {
>-	return 0;
>+	*min = *low = 0;
> }
>
> static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 4620df62f0ff..701106e1829c 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -100,9 +100,12 @@ struct scan_control {
> 	unsigned int may_swap:1;
>
> 	/*
>-	 * Cgroups are not reclaimed below their configured memory.low,
>-	 * unless we threaten to OOM. If any cgroups are skipped due to
>-	 * memory.low and nothing was reclaimed, go back for memory.low.
>+	 * Cgroup memory below memory.low is protected as long as we
>+	 * don't threaten to OOM. If any cgroup is reclaimed at
>+	 * reduced force or passed over entirely due to its memory.low
>+	 * setting (memcg_low_skipped), and nothing is reclaimed as a
>+	 * result, then go back back for one more cycle that reclaims
>+	 * the protected memory (memcg_low_reclaim) to avert OOM.
> 	 */
> 	unsigned int memcg_low_reclaim:1;
> 	unsigned int memcg_low_skipped:1;
>@@ -2537,15 +2540,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> 	for_each_evictable_lru(lru) {
> 		int file = is_file_lru(lru);
> 		unsigned long lruvec_size;
>+		unsigned long low, min;
> 		unsigned long scan;
>-		unsigned long protection;
>
> 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
>-		protection = mem_cgroup_protection(sc->target_mem_cgroup,
>-						   memcg,
>-						   sc->memcg_low_reclaim);
>+		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
>+				      &min, &low);
>
>-		if (protection) {
>+		if (min || low) {
> 			/*
> 			 * Scale a cgroup's reclaim pressure by proportioning
> 			 * its current usage to its memory.low or memory.min
>@@ -2576,6 +2578,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> 			 * hard protection.
> 			 */
> 			unsigned long cgroup_size = mem_cgroup_size(memcg);
>+			unsigned long protection;
>+
>+			/* memory.low scaling, make sure we retry before OOM */
>+			if (!sc->memcg_low_reclaim && low > min) {
>+				protection = low;
>+				sc->memcg_low_skipped = 1;
>+			} else {
>+				protection = min;
>+			}
>
> 			/* Avoid TOCTOU with earlier protection check */
> 			cgroup_size = max(cgroup_size, protection);
>-- 
>2.32.0
>
Michal Hocko Aug. 19, 2021, 3:01 p.m. UTC | #8
On Tue 17-08-21 14:05:06, Johannes Weiner wrote:
> We've noticed occasional OOM killing when memory.low settings are in
> effect for cgroups. This is unexpected and undesirable as memory.low
> is supposed to express non-OOMing memory priorities between cgroups.
> 
> The reason for this is proportional memory.low reclaim. When cgroups
> are below their memory.low threshold, reclaim passes them over in the
> first round, and then retries if it couldn't find pages anywhere else.
> But when cgroups are slighly above their memory.low setting, page scan
> force is scaled down and diminished in proportion to the overage, to
> the point where it can cause reclaim to fail as well - only in that
> case we currently don't retry, and instead trigger OOM.
> 
> To fix this, hook proportional reclaim into the same retry logic we
> have in place for when cgroups are skipped entirely. This way if
> reclaim fails and some cgroups were scanned with dimished pressure,
> we'll try another full-force cycle before giving up and OOMing.
> 
> Reported-by: Leon Yang <lnyng@fb.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

Although I have to say that the code is quite tricky and it deserves
more comments. See below.

[...]
> @@ -2576,6 +2578,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			 * hard protection.
>  			 */
>  			unsigned long cgroup_size = mem_cgroup_size(memcg);
> +			unsigned long protection;
> +
> +			/* memory.low scaling, make sure we retry before OOM */
> +			if (!sc->memcg_low_reclaim && low > min) {
> +				protection = low;
> +				sc->memcg_low_skipped = 1;
> +			} else {
> +				protection = min;
> +			}

Just by looking at this in isolation, one could really wonder how
this does not break the low memory protection altogether. The logic is
spread over 3 different places.

Would something like the following be more understandable?

			/*
			 * Low limit protected memcgs are already excluded at
			 * a higher level (shrink_node_memcgs) but scaling
			 * down the reclaim target can result in reclaim
			 * failure and premature OOM. We do not have a full
			 * picture here so we cannot really judge the
			 * situation, but we pro-actively flag this scenario
			 * and let do_try_to_free_pages retry if
			 * there is no progress.
			 */
>  
>  			/* Avoid TOCTOU with earlier protection check */
>  			cgroup_size = max(cgroup_size, protection);
> -- 
> 2.32.0
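
For reference, the retry that the patch hooks into lives in
do_try_to_free_pages(); an abridged sketch of its shape in kernels of this
era (details may vary by version):

	if (sc->memcg_low_skipped) {
		/* No progress, but cgroups were skipped or scanned at
		 * reduced force due to memory.low. Don't OOM yet:
		 * restart one cycle at full force that is allowed to
		 * reclaim the protected memory. */
		sc->priority = initial_priority;
		sc->memcg_low_reclaim = 1;
		sc->memcg_low_skipped = 0;
		goto retry;
	}
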
Johannes Weiner Aug. 19, 2021, 8:38 p.m. UTC | #9
On Thu, Aug 19, 2021 at 05:01:38PM +0200, Michal Hocko wrote:
> On Tue 17-08-21 14:05:06, Johannes Weiner wrote:
> > We've noticed occasional OOM killing when memory.low settings are in
> > effect for cgroups. This is unexpected and undesirable as memory.low
> > is supposed to express non-OOMing memory priorities between cgroups.
> > 
> > The reason for this is proportional memory.low reclaim. When cgroups
> > are below their memory.low threshold, reclaim passes them over in the
> > first round, and then retries if it couldn't find pages anywhere else.
> > But when cgroups are slighly above their memory.low setting, page scan
> > force is scaled down and diminished in proportion to the overage, to
> > the point where it can cause reclaim to fail as well - only in that
> > case we currently don't retry, and instead trigger OOM.
> > 
> > To fix this, hook proportional reclaim into the same retry logic we
> > have in place for when cgroups are skipped entirely. This way if
> > reclaim fails and some cgroups were scanned with dimished pressure,
> > we'll try another full-force cycle before giving up and OOMing.
> > 
> > Reported-by: Leon Yang <lnyng@fb.com>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks

> 
> Although I have to say that the code is quite tricky and it deserves
> more comments. See below.
> 
> [...]
> > @@ -2576,6 +2578,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  			 * hard protection.
> >  			 */
> >  			unsigned long cgroup_size = mem_cgroup_size(memcg);
> > +			unsigned long protection;
> > +
> > +			/* memory.low scaling, make sure we retry before OOM */
> > +			if (!sc->memcg_low_reclaim && low > min) {
> > +				protection = low;
> > +				sc->memcg_low_skipped = 1;
> > +			} else {
> > +				protection = min;
> > +			}
> 
> Just by looking at this in isolation, one could really wonder how
> this does not break the low memory protection altogether.

You're right, it's a bit too terse.

> The logic is spread over 3 different places.
> 
> Would something like the following be more understandable?
> 
> 			/*
> 			 * Low limit protected memcgs are already excluded at
> 			 * a higher level (shrink_node_memcgs) but scaling
> 			 * down the reclaim target can result in reclaim
> 			 * failure and premature OOM. We do not have a full
> 			 * picture here so we cannot really judge the
> 			 * situation, but we pro-actively flag this scenario
> 			 * and let do_try_to_free_pages retry if
> 			 * there is no progress.
> 			 */

I've been drafting around with this, but it seems to say the same
thing as the comment I put into struct scan_control already:

	/*
	 * Cgroup memory below memory.low is protected as long as we
	 * don't threaten to OOM. If any cgroup is reclaimed at
	 * reduced force or passed over entirely due to its memory.low
	 * setting (memcg_low_skipped), and nothing is reclaimed as a
	 * result, then go back back for one more cycle that reclaims
	 * the protected memory (memcg_low_reclaim) to avert OOM.
	 */

How about a brief version of this with a pointer to the original?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 701106e1829c..c32d686719d5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2580,7 +2580,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			unsigned long cgroup_size = mem_cgroup_size(memcg);
 			unsigned long protection;
 
-			/* memory.low scaling, make sure we retry before OOM */
+			/*
+			 * Soft protection must not cause reclaim failure. Let
+			 * the upper level know if we skipped pages during the
+			 * first pass, so it can retry if necessary. See the
+			 * struct scan_control definition of those flags.
+			 */
 			if (!sc->memcg_low_reclaim && low > min) {
 				protection = low;
 				sc->memcg_low_skipped = 1;
@@ -2853,16 +2858,16 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		if (mem_cgroup_below_min(memcg)) {
 			/*
-			 * Hard protection.
-			 * If there is no reclaimable memory, OOM.
+			 * Hard protection. Always respected. If there is not
+			 * enough reclaimable memory elsewhere, it's an OOM.
 			 */
 			continue;
 		} else if (mem_cgroup_below_low(memcg)) {
 			/*
-			 * Soft protection.
-			 * Respect the protection only as long as
-			 * there is an unprotected supply
-			 * of reclaimable memory from other cgroups.
+			 * Soft protection must not cause reclaim failure. Let
+			 * the upper level know if we skipped pages during the
+			 * first pass, so it can retry if necessary. See the
+			 * struct scan_control definition of those flags.
 			 */
 			if (!sc->memcg_low_reclaim) {
 				sc->memcg_low_skipped = 1;
Michal Hocko Aug. 20, 2021, 3:44 p.m. UTC | #10
On Thu 19-08-21 16:38:59, Johannes Weiner wrote:
> On Thu, Aug 19, 2021 at 05:01:38PM +0200, Michal Hocko wrote:
[...]
> > The logic is spread over 3 different places.
> > 
> > Would something like the following be more understandable?
> > 
> > 			/*
> > 			 * Low limit protected memcgs are already excluded at
> > 			 * a higher level (shrink_node_memcgs) but scaling
> > 			 * down the reclaim target can result in reclaim
> > 			 * failure and premature OOM. We do not have a full
> > 			 * picture here so we cannot really judge the
> > 			 * situation, but we pro-actively flag this scenario
> > 			 * and let do_try_to_free_pages retry if
> > 			 * there is no progress.
> > 			 */
> 
> I've been drafting around with this, but it seems to say the same
> thing as the comment I put into struct scan_control already:
> 
> 	/*
> 	 * Cgroup memory below memory.low is protected as long as we
> 	 * don't threaten to OOM. If any cgroup is reclaimed at
> 	 * reduced force or passed over entirely due to its memory.low
> 	 * setting (memcg_low_skipped), and nothing is reclaimed as a
> 	 * result, then go back back for one more cycle that reclaims
> 	 * the protected memory (memcg_low_reclaim) to avert OOM.
> 	 */
> 
> How about a brief version of this with a pointer to the original?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 701106e1829c..c32d686719d5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2580,7 +2580,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			unsigned long cgroup_size = mem_cgroup_size(memcg);
>  			unsigned long protection;
>  
> -			/* memory.low scaling, make sure we retry before OOM */
> +			/*
> +			 * Soft protection must not cause reclaim failure. Let
> +			 * the upper level know if we skipped pages during the
> +			 * first pass, so it can retry if necessary. See the
> +			 * struct scan_control definition of those flags.
> +			 */
>  			if (!sc->memcg_low_reclaim && low > min) {
>  				protection = low;
>  				sc->memcg_low_skipped = 1;
> @@ -2853,16 +2858,16 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  
>  		if (mem_cgroup_below_min(memcg)) {
>  			/*
> -			 * Hard protection.
> -			 * If there is no reclaimable memory, OOM.
> +			 * Hard protection. Always respected. If there is not
> +			 * enough reclaimable memory elsewhere, it's an OOM.
>  			 */
>  			continue;
>  		} else if (mem_cgroup_below_low(memcg)) {
>  			/*
> -			 * Soft protection.
> -			 * Respect the protection only as long as
> -			 * there is an unprotected supply
> -			 * of reclaimable memory from other cgroups.
> +			 * Soft protection must not cause reclaim failure. Let
> +			 * the upper level know if we skipped pages during the
> +			 * first pass, so it can retry if necessary. See the
> +			 * struct scan_control definition of those flags.
>  			 */
>  			if (!sc->memcg_low_reclaim) {
>  				sc->memcg_low_skipped = 1;

Yes, this makes the situation more explicit. I still see some advantage
in being explicit about those other layers, as it makes the code easier
to follow, but I will certainly not insist.

Andrew has already sent your original patch to Linus so this will need
to go as a separate patch. For that
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
Michal Koutný Aug. 23, 2021, 4:09 p.m. UTC | #11
Hello

(and sorry for a belated reply).

On Tue, Aug 17, 2021 at 02:05:06PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> @@ -2576,6 +2578,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> [...]
> +			/* memory.low scaling, make sure we retry before OOM */
> +			if (!sc->memcg_low_reclaim && low > min) {
> +				protection = low;
> +				sc->memcg_low_skipped = 1;

IIUC, this won't result in a memory.events:low increment, although the
effect is similar (a (partial) breach of memory.low protection) and the
signal to the user is comparable (overcommitted memory.low).

Admittedly, this patch's behavior adheres to the current documentation
(Documentation/admin-guide/cgroup-v2.rst):

> The number of times the cgroup is reclaimed due to high memory
> pressure even though its usage is under the low boundary,

however, that definition might no longer be the most useful indicator.
Is it worth including these partial breaches in memory.events:low?

Regards,
Michal
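
For context, the only place a MEMCG_LOW event is recorded is the full-skip
path in shrink_node_memcgs(); abridged from the kernel of this era, the new
reduced-force path in get_scan_count() never reaches it:

	} else if (mem_cgroup_below_low(memcg)) {
		/* The first pass skips the cgroup entirely; only the
		 * retry pass, which actually reclaims the protected
		 * memory, counts a LOW event. */
		if (!sc->memcg_low_reclaim) {
			sc->memcg_low_skipped = 1;
			continue;
		}
		memcg_memory_event(memcg, MEMCG_LOW);
	}
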
Johannes Weiner Aug. 23, 2021, 5:48 p.m. UTC | #12
Hi Michal,

On Mon, Aug 23, 2021 at 06:09:29PM +0200, Michal Koutný wrote:
> Hello
> 
> (and sorry for a belated reply).

It's never too late, thanks for taking a look.

> On Tue, Aug 17, 2021 at 02:05:06PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > @@ -2576,6 +2578,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > [...]
> > +			/* memory.low scaling, make sure we retry before OOM */
> > +			if (!sc->memcg_low_reclaim && low > min) {
> > +				protection = low;
> > +				sc->memcg_low_skipped = 1;
> 
> IIUC, this won't result in a memory.events:low increment, although the
> effect is similar (a (partial) breach of memory.low protection) and the
> signal to the user is comparable (overcommitted memory.low).

Good observation. I think you're right, we should probably count such
partial breaches as LOW events as well.

Note that this isn't new behavior. My patch merely moved this part
from mem_cgroup_protection():

-       if (in_low_reclaim)
-               return READ_ONCE(memcg->memory.emin);

Even before, if we retried due to just one (possibly insignificant)
cgroup below low, we'd ignore proportional reclaim and partially
breach ALL protected cgroups, while only counting a low event for the
one group that is usage < low.
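
Concretely, the branch removed by the patch (visible in the hunk above)
read as follows; the numbers in the comment are an invented example:

	/* Pre-patch mem_cgroup_protection(), abridged. On the retry
	 * pass (in_low_reclaim == true), protection collapsed to emin
	 * for every cgroup. E.g. a cgroup at usage 600M, low 512M,
	 * min 0 that was never below its own low also lost all of its
	 * scaling, while only the cgroup that tripped the retry had
	 * counted a LOW event. */
	if (in_low_reclaim)
		return READ_ONCE(memcg->memory.emin);

	return max(READ_ONCE(memcg->memory.emin),
		   READ_ONCE(memcg->memory.elow));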

> Admittedly, this patch's behavior adheres to the current documentation
> (Documentation/admin-guide/cgroup-v2.rst):
> 
> > The number of times the cgroup is reclaimed due to high memory
> > pressure even though its usage is under the low boundary,
> 
> however, that definition might no longer be the most useful indicator.
> Is it worth including these partial breaches in memory.events:low?

I think it is. How about:

"The number of times the cgroup's memory.low-protected memory was
reclaimed in order to avoid OOM during high memory pressure."

And adding a MEMCG_LOW event to partial breaches. BTW, the comment
block above this code is also out-of-date, because it says we're
honoring memory.low on the retries, but that's not the case.

I'll prepare a follow-up patch for these 3 things as well as the more
verbose comment that Michal Hocko asked for on the retry logic.
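
A hypothetical sketch of what counting partial breaches in get_scan_count()
could look like; this is not the actual follow-up patch, which may differ:

	/* memory.low scaling, make sure we retry before OOM */
	if (!sc->memcg_low_reclaim && low > min) {
		protection = low;
		sc->memcg_low_skipped = 1;
	} else {
		protection = min;
		/* hypothetical: the retry pass is now eating into
		 * memory.low protection; count it as a LOW event */
		if (sc->memcg_low_reclaim && low > min)
			memcg_memory_event(memcg, MEMCG_LOW);
	}
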
Michal Koutný Aug. 24, 2021, 1:01 p.m. UTC | #13
On Mon, Aug 23, 2021 at 01:48:43PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Note that this isn't new behavior.

Understood, there may be a difference between:
a) a cgroup where the protected reserve was detected (this changed),
b) a cgroup where the protected memory is reclaimed.

> "The number of times the cgroup's memory.low-protected memory was
> reclaimed in order to avoid OOM during high memory pressure."

Yes, this is what I meant (i.e. events for the case b) above).

Michal

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bfe5c486f4ad..24797929d8a1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -612,12 +612,15 @@  static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
-						  struct mem_cgroup *memcg,
-						  bool in_low_reclaim)
+static inline void mem_cgroup_protection(struct mem_cgroup *root,
+					 struct mem_cgroup *memcg,
+					 unsigned long *min,
+					 unsigned long *low)
 {
+	*min = *low = 0;
+
 	if (mem_cgroup_disabled())
-		return 0;
+		return;
 
 	/*
 	 * There is no reclaim protection applied to a targeted reclaim.
@@ -653,13 +656,10 @@  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
 	 *
 	 */
 	if (root == memcg)
-		return 0;
-
-	if (in_low_reclaim)
-		return READ_ONCE(memcg->memory.emin);
+		return;
 
-	return max(READ_ONCE(memcg->memory.emin),
-		   READ_ONCE(memcg->memory.elow));
+	*min = READ_ONCE(memcg->memory.emin);
+	*low = READ_ONCE(memcg->memory.elow);
 }
 
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
@@ -1147,11 +1147,12 @@  static inline void memcg_memory_event_mm(struct mm_struct *mm,
 {
 }
 
-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
-						  struct mem_cgroup *memcg,
-						  bool in_low_reclaim)
+static inline void mem_cgroup_protection(struct mem_cgroup *root,
+					 struct mem_cgroup *memcg,
+					 unsigned long *min,
+					 unsigned long *low)
 {
-	return 0;
+	*min = *low = 0;
 }
 
 static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..701106e1829c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -100,9 +100,12 @@  struct scan_control {
 	unsigned int may_swap:1;
 
 	/*
-	 * Cgroups are not reclaimed below their configured memory.low,
-	 * unless we threaten to OOM. If any cgroups are skipped due to
-	 * memory.low and nothing was reclaimed, go back for memory.low.
+	 * Cgroup memory below memory.low is protected as long as we
+	 * don't threaten to OOM. If any cgroup is reclaimed at
+	 * reduced force or passed over entirely due to its memory.low
+	 * setting (memcg_low_skipped), and nothing is reclaimed as a
+	 * result, then go back back for one more cycle that reclaims
+	 * the protected memory (memcg_low_reclaim) to avert OOM.
 	 */
 	unsigned int memcg_low_reclaim:1;
 	unsigned int memcg_low_skipped:1;
@@ -2537,15 +2540,14 @@  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	for_each_evictable_lru(lru) {
 		int file = is_file_lru(lru);
 		unsigned long lruvec_size;
+		unsigned long low, min;
 		unsigned long scan;
-		unsigned long protection;
 
 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
-		protection = mem_cgroup_protection(sc->target_mem_cgroup,
-						   memcg,
-						   sc->memcg_low_reclaim);
+		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
+				      &min, &low);
 
-		if (protection) {
+		if (min || low) {
 			/*
 			 * Scale a cgroup's reclaim pressure by proportioning
 			 * its current usage to its memory.low or memory.min
@@ -2576,6 +2578,15 @@  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			 * hard protection.
 			 */
 			unsigned long cgroup_size = mem_cgroup_size(memcg);
+			unsigned long protection;
+
+			/* memory.low scaling, make sure we retry before OOM */
+			if (!sc->memcg_low_reclaim && low > min) {
+				protection = low;
+				sc->memcg_low_skipped = 1;
+			} else {
+				protection = min;
+			}
 
 			/* Avoid TOCTOU with earlier protection check */
 			cgroup_size = max(cgroup_size, protection);