memcg: effective memory.high reclaim for remote charging
diff mbox series

Message ID 20200507163301.229070-1-shakeelb@google.com
State New
Headers show
Series
  • memcg: effective memory.high reclaim for remote charging
Related show

Commit Message

Shakeel Butt May 7, 2020, 4:33 p.m. UTC
Currently the reclaim of excessive usage over memory.high is scheduled
to run on returning to the userland. The main reason behind this
approach was simplicity i.e. always reclaim with GFP_KERNEL context.
However the underlying assumptions behind this approach are: the current
task shares the memcg hierarchy with the given memcg and the memcg of
the current task most probably will not change on return to userland.

With the remote charging, the first assumption breaks and it allows the
usage to grow way beyond the memory.high as the reclaim and the
throttling becomes ineffective.

This patch forces the synchronous reclaim and potentially throttling for
the callers with context that allows blocking. For unblockable callers
or whose synch high reclaim is still not successful, a high reclaim is
scheduled either to return-to-userland if current task shares the
hierarchy with the given memcg or to system work queue.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 63 +++++++++++++++++++++++++++++--------------------
 1 file changed, 37 insertions(+), 26 deletions(-)

Comments

Michal Hocko May 7, 2020, 4:46 p.m. UTC | #1
On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
[...]
> @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  				schedule_work(&memcg->high_work);
>  				break;
>  			}
> -			current->memcg_nr_pages_over_high += batch;
> -			set_notify_resume(current);
> +
> +			if (gfpflags_allow_blocking(gfp_mask))
> +				reclaim_over_high(memcg, gfp_mask, batch);
> +
> +			if (page_counter_read(&memcg->memory) <=
> +			    READ_ONCE(memcg->high))
> +				break;

I am half way to a long weekend so bear with me. Shouldn't this be continue? The
parent memcg might be still in excess even the child got reclaimed,
right?

> +			/*
> +			 * The above reclaim might not be able to do much. Punt
> +			 * the high reclaim to return to userland if the current
> +			 * task shares the hierarchy.
> +			 */
> +			if (current->mm && mm_match_cgroup(current->mm, memcg)) {
> +				current->memcg_nr_pages_over_high += batch;
> +				set_notify_resume(current);
> +			} else
> +				schedule_work(&memcg->high_work);
>  			break;
>  		}
>  	} while ((memcg = parent_mem_cgroup(memcg)));
> -- 
> 2.26.2.526.g744177e7f7-goog
>
Shakeel Butt May 7, 2020, 5 p.m. UTC | #2
On Thu, May 7, 2020 at 9:47 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> [...]
> > @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >                               schedule_work(&memcg->high_work);
> >                               break;
> >                       }
> > -                     current->memcg_nr_pages_over_high += batch;
> > -                     set_notify_resume(current);
> > +
> > +                     if (gfpflags_allow_blocking(gfp_mask))
> > +                             reclaim_over_high(memcg, gfp_mask, batch);
> > +
> > +                     if (page_counter_read(&memcg->memory) <=
> > +                         READ_ONCE(memcg->high))
> > +                             break;
>
> I am half way to a long weekend so bear with me. Shouldn't this be continue? The
> parent memcg might be still in excess even the child got reclaimed,
> right?
>

The reclaim_high() actually already does this walk up to the root and
reclaim from ones who are still over their high limit. Though having
'continue' here is correct too.

> > +                     /*
> > +                      * The above reclaim might not be able to do much. Punt
> > +                      * the high reclaim to return to userland if the current
> > +                      * task shares the hierarchy.
> > +                      */
> > +                     if (current->mm && mm_match_cgroup(current->mm, memcg)) {
> > +                             current->memcg_nr_pages_over_high += batch;
> > +                             set_notify_resume(current);
> > +                     } else
> > +                             schedule_work(&memcg->high_work);
> >                       break;
> >               }
> >       } while ((memcg = parent_mem_cgroup(memcg)));
> > --
> > 2.26.2.526.g744177e7f7-goog
> >
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko May 7, 2020, 5:49 p.m. UTC | #3
On Thu 07-05-20 10:00:07, Shakeel Butt wrote:
> On Thu, May 7, 2020 at 9:47 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> > [...]
> > > @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >                               schedule_work(&memcg->high_work);
> > >                               break;
> > >                       }
> > > -                     current->memcg_nr_pages_over_high += batch;
> > > -                     set_notify_resume(current);
> > > +
> > > +                     if (gfpflags_allow_blocking(gfp_mask))
> > > +                             reclaim_over_high(memcg, gfp_mask, batch);
> > > +
> > > +                     if (page_counter_read(&memcg->memory) <=
> > > +                         READ_ONCE(memcg->high))
> > > +                             break;
> >
> > I am half way to a long weekend so bear with me. Shouldn't this be continue? The
> > parent memcg might be still in excess even the child got reclaimed,
> > right?
> >
> 
> The reclaim_high() actually already does this walk up to the root and
> reclaim from ones who are still over their high limit. Though having
> 'continue' here is correct too.

Ohh, right. As I've said weekend brain. I will have a proper look next
week. This just hit my eyes.
Michal Hocko May 11, 2020, 10:07 a.m. UTC | #4
On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> Currently the reclaim of excessive usage over memory.high is scheduled
> to run on returning to the userland. The main reason behind this
> approach was simplicity i.e. always reclaim with GFP_KERNEL context.
> However the underlying assumptions behind this approach are: the current
> task shares the memcg hierarchy with the given memcg and the memcg of
> the current task most probably will not change on return to userland.
> 
> With the remote charging, the first assumption breaks and it allows the
> usage to grow way beyond the memory.high as the reclaim and the
> throttling becomes ineffective.
> 
> This patch forces the synchronous reclaim and potentially throttling for
> the callers with context that allows blocking. For unblockable callers
> or whose synch high reclaim is still not successful, a high reclaim is
> scheduled either to return-to-userland if current task shares the
> hierarchy with the given memcg or to system work queue.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

I would just make the early break a bit more clear.

[...]
> @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  				schedule_work(&memcg->high_work);
>  				break;
>  			}
> -			current->memcg_nr_pages_over_high += batch;
> -			set_notify_resume(current);
> +
> +			if (gfpflags_allow_blocking(gfp_mask))
> +				reclaim_over_high(memcg, gfp_mask, batch);
> +

			/*
			 * reclaim_over_high reclaims parents up the
			 * hierarchy so we can break out early here.
			 */
> +			if (page_counter_read(&memcg->memory) <=
> +			    READ_ONCE(memcg->high))
> +				break;
> +			/*
> +			 * The above reclaim might not be able to do much. Punt
> +			 * the high reclaim to return to userland if the current
> +			 * task shares the hierarchy.
> +			 */
> +			if (current->mm && mm_match_cgroup(current->mm, memcg)) {
> +				current->memcg_nr_pages_over_high += batch;
> +				set_notify_resume(current);
> +			} else
> +				schedule_work(&memcg->high_work);
>  			break;
>  		}
>  	} while ((memcg = parent_mem_cgroup(memcg)));
> -- 
> 2.26.2.526.g744177e7f7-goog
>
Johannes Weiner May 11, 2020, 3:56 p.m. UTC | #5
On Thu, May 07, 2020 at 10:00:07AM -0700, Shakeel Butt wrote:
> On Thu, May 7, 2020 at 9:47 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> > [...]
> > > @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >                               schedule_work(&memcg->high_work);
> > >                               break;
> > >                       }
> > > -                     current->memcg_nr_pages_over_high += batch;
> > > -                     set_notify_resume(current);
> > > +
> > > +                     if (gfpflags_allow_blocking(gfp_mask))
> > > +                             reclaim_over_high(memcg, gfp_mask, batch);
> > > +
> > > +                     if (page_counter_read(&memcg->memory) <=
> > > +                         READ_ONCE(memcg->high))
> > > +                             break;
> >
> > I am half way to a long weekend so bear with me. Shouldn't this be continue? The
> > parent memcg might be still in excess even the child got reclaimed,
> > right?
> >
> 
> The reclaim_high() actually already does this walk up to the root and
> reclaim from ones who are still over their high limit. Though having
> 'continue' here is correct too.

If reclaim was weak and failed to bring the child back in line, we
still do set_notify_resume(). We should do that for ancestors too.

But it seems we keep adding hierarchy walks and it's getting somewhat
convoluted: page_counter does it, then we check high overage
recursively, and now we add the call to reclaim which itself is a walk
up the ancestor line.

Can we hitchhike on the page_counter_try_charge() walk, which already
has the concept of identifying counters with overage? Rename the @fail
to @limited and return the first counter that is in excess of its high
as well, even when the function succeeds?

Then we could ditch the entire high checking loop here and simply
replace it with

done_restock:
	...

	if (*limited) {
		if (gfpflags_allow_blocking())
			reclaim_over_high(memcg_from_counter(limited));
		/* Reclaim may not be able to do much, ... */
		set_notify_resume(); // or schedule_work()
	};

In the long-term, the best thing might be to integrate memory.high
reclaim with the regular reclaim that try_charge() is already
doing. Especially the part where it retries several times - we
currently give up on memory.high unnecessarily early. Make
page_counter_try_charge() fail on high and max equally, and after
several reclaim cycles, instead of invoking the OOM killer, inject the
penalty sleep and force the charges. OOM killing and throttling is
supposed to be the only difference between the two, anyway, and yet
the code diverges far more than that for no apparent reason.

But I also appreciate that this is a cleanup beyond the scope of this
patch here, so it's up to you how far you want to take it.
Shakeel Butt May 11, 2020, 9:44 p.m. UTC | #6
On Mon, May 11, 2020 at 8:57 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, May 07, 2020 at 10:00:07AM -0700, Shakeel Butt wrote:
> > On Thu, May 7, 2020 at 9:47 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> > > [...]
> > > > @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >                               schedule_work(&memcg->high_work);
> > > >                               break;
> > > >                       }
> > > > -                     current->memcg_nr_pages_over_high += batch;
> > > > -                     set_notify_resume(current);
> > > > +
> > > > +                     if (gfpflags_allow_blocking(gfp_mask))
> > > > +                             reclaim_over_high(memcg, gfp_mask, batch);
> > > > +
> > > > +                     if (page_counter_read(&memcg->memory) <=
> > > > +                         READ_ONCE(memcg->high))
> > > > +                             break;
> > >
> > > I am half way to a long weekend so bear with me. Shouldn't this be continue? The
> > > parent memcg might be still in excess even the child got reclaimed,
> > > right?
> > >
> >
> > The reclaim_high() actually already does this walk up to the root and
> > reclaim from ones who are still over their high limit. Though having
> > 'continue' here is correct too.
>
> If reclaim was weak and failed to bring the child back in line, we
> still do set_notify_resume(). We should do that for ancestors too.
>
> But it seems we keep adding hierarchy walks and it's getting somewhat
> convoluted: page_counter does it, then we check high overage
> recursively, and now we add the call to reclaim which itself is a walk
> up the ancestor line.
>
> Can we hitchhike on the page_counter_try_charge() walk, which already
> has the concept of identifying counters with overage? Rename the @fail
> to @limited and return the first counter that is in excess of its high
> as well, even when the function succeeds?
>
> Then we could ditch the entire high checking loop here and simply
> replace it with
>
> done_restock:
>         ...
>
>         if (*limited) {
>                 if (gfpflags_allow_blocking())
>                         reclaim_over_high(memcg_from_counter(limited));
>                 /* Reclaim may not be able to do much, ... */
>                 set_notify_resume(); // or schedule_work()
>         };
>

I will try to code the above and will give a shot to the following
long-term suggestion as well.

> In the long-term, the best thing might be to integrate memory.high
> reclaim with the regular reclaim that try_charge() is already
> doing. Especially the part where it retries several times - we
> currently give up on memory.high unnecessarily early. Make
> page_counter_try_charge() fail on high and max equally, and after
> several reclaim cycles, instead of invoking the OOM killer, inject the
> penalty sleep and force the charges. OOM killing and throttling is
> supposed to be the only difference between the two, anyway, and yet
> the code diverges far more than that for no apparent reason.
>
> But I also appreciate that this is a cleanup beyond the scope of this
> patch here, so it's up to you how far you want to take it.

Thanks,
Shakeel

Patch
diff mbox series

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 317dbbaac603..7abb762f26cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2387,23 +2387,13 @@  static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
 }
 
-/*
- * Scheduled by try_charge() to be executed from the userland return path
- * and reclaims memory over the high limit.
- */
-void mem_cgroup_handle_over_high(void)
+static void reclaim_over_high(struct mem_cgroup *memcg, gfp_t gfp_mask,
+			      unsigned long nr_pages)
 {
 	unsigned long penalty_jiffies;
 	unsigned long pflags;
-	unsigned int nr_pages = current->memcg_nr_pages_over_high;
-	struct mem_cgroup *memcg;
 
-	if (likely(!nr_pages))
-		return;
-
-	memcg = get_mem_cgroup_from_mm(current->mm);
-	reclaim_high(memcg, nr_pages, GFP_KERNEL);
-	current->memcg_nr_pages_over_high = 0;
+	reclaim_high(memcg, nr_pages, gfp_mask);
 
 	/*
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
@@ -2418,7 +2408,7 @@  void mem_cgroup_handle_over_high(void)
 	 * been aggressively reclaimed enough yet.
 	 */
 	if (penalty_jiffies <= HZ / 100)
-		goto out;
+		return;
 
 	/*
 	 * If we exit early, we're guaranteed to die (since
@@ -2428,8 +2418,23 @@  void mem_cgroup_handle_over_high(void)
 	psi_memstall_enter(&pflags);
 	schedule_timeout_killable(penalty_jiffies);
 	psi_memstall_leave(&pflags);
+}
 
-out:
+/*
+ * Scheduled by try_charge() to be executed from the userland return path
+ * and reclaims memory over the high limit.
+ */
+void mem_cgroup_handle_over_high(void)
+{
+	unsigned int nr_pages = current->memcg_nr_pages_over_high;
+	struct mem_cgroup *memcg;
+
+	if (likely(!nr_pages))
+		return;
+
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	reclaim_over_high(memcg, GFP_KERNEL, nr_pages);
+	current->memcg_nr_pages_over_high = 0;
 	css_put(&memcg->css);
 }
 
@@ -2584,15 +2589,6 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 
-	/*
-	 * If the hierarchy is above the normal consumption range, schedule
-	 * reclaim on returning to userland.  We can perform reclaim here
-	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
-	 * GFP_KERNEL can consistently be used during reclaim.  @memcg is
-	 * not recorded as it most likely matches current's and won't
-	 * change in the meantime.  As high limit is checked again before
-	 * reclaim, the cost of mismatch is negligible.
-	 */
 	do {
 		if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
 			/* Don't bother a random interrupted task */
@@ -2600,8 +2596,23 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				schedule_work(&memcg->high_work);
 				break;
 			}
-			current->memcg_nr_pages_over_high += batch;
-			set_notify_resume(current);
+
+			if (gfpflags_allow_blocking(gfp_mask))
+				reclaim_over_high(memcg, gfp_mask, batch);
+
+			if (page_counter_read(&memcg->memory) <=
+			    READ_ONCE(memcg->high))
+				break;
+			/*
+			 * The above reclaim might not be able to do much. Punt
+			 * the high reclaim to return to userland if the current
+			 * task shares the hierarchy.
+			 */
+			if (current->mm && mm_match_cgroup(current->mm, memcg)) {
+				current->memcg_nr_pages_over_high += batch;
+				set_notify_resume(current);
+			} else
+				schedule_work(&memcg->high_work);
 			break;
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));