| Message ID | 20210112163011.127833-1-hannes@cmpxchg.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | mm: memcontrol: prevent starvation when writing memory.high |
On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote: > When a value is written to a cgroup's memory.high control file, the > write() context first tries to reclaim the cgroup to size before > putting the limit in place for the workload. Concurrent charges from > the workload can keep such a write() looping in reclaim indefinitely. > > In the past, a write to memory.high would first put the limit in place > for the workload, then do targeted reclaim until the new limit has > been met - similar to how we do it for memory.max. This wasn't prone > to the described starvation issue. However, this sequence could cause > excessive latencies in the workload, when allocating threads could be > put into long penalty sleeps on the sudden memory.high overage created > by the write(), before that had a chance to work it off. > > Now that memory_high_write() performs reclaim before enforcing the new > limit, reflect that the cgroup may well fail to converge due to > concurrent workload activity. Bail out of the loop after a few tries. > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > Cc: <stable@vger.kernel.org> # 5.8+ > Reported-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > mm/memcontrol.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 605f671203ef..63a8d47c1cd3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > for (;;) { > unsigned long nr_pages = page_counter_read(&memcg->memory); > - unsigned long reclaimed; > > if (nr_pages <= high) > break; > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > continue; > } > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, true); > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > + GFP_KERNEL, true); > > - if (!reclaimed && !nr_retries--) > + if (!nr_retries--) Shouldn't it be (!reclaimed || !nr_retries) instead? If reclaimed == 0, it probably doesn't make much sense to retry. Otherwise the patch looks good to me. Thanks!
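For readers skimming the collapsed diff above, here is a condensed sketch of what the memory_high_write() reclaim loop looks like with the patch applied. It is reconstructed from the quoted hunks plus a reading of the surrounding upstream function; the signal_pending() check, drain_all_stock() call, and the MAX_RECLAIM_RETRIES initializer are assumptions from that context rather than part of the quoted diff, so treat this as an illustration, not the literal source:

```c
	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
	bool drained = false;

	for (;;) {
		unsigned long nr_pages = page_counter_read(&memcg->memory);

		if (nr_pages <= high)
			break;

		if (signal_pending(current))
			break;

		if (!drained) {
			/* Flush per-cpu charge caches once before reclaiming. */
			drain_all_stock(memcg);
			drained = true;
			continue;
		}

		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
					     GFP_KERNEL, true);

		/* With the patch, every pass consumes the retry budget, so
		 * concurrent charges cannot keep the writer looping forever. */
		if (!nr_retries--)
			break;
	}

	/* Only after the loop is the new limit enforced on the workload
	 * (the ordering introduced by 536d3bf261a2). */
	page_counter_set_high(&memcg->memory, high);
	memcg_wb_domain_size_changed(memcg);
```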
On Tue, Jan 12, 2021 at 9:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > When a value is written to a cgroup's memory.high control file, the > write() context first tries to reclaim the cgroup to size before > putting the limit in place for the workload. Concurrent charges from > the workload can keep such a write() looping in reclaim indefinitely. > Is this observed on real workload? > > In the past, a write to memory.high would first put the limit in place > for the workload, then do targeted reclaim until the new limit has > been met - similar to how we do it for memory.max. This wasn't prone > to the described starvation issue. However, this sequence could cause > excessive latencies in the workload, when allocating threads could be > put into long penalty sleeps on the sudden memory.high overage created > by the write(), before that had a chance to work it off. > > Now that memory_high_write() performs reclaim before enforcing the new > limit, reflect that the cgroup may well fail to converge due to > concurrent workload activity. Bail out of the loop after a few tries. > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > Cc: <stable@vger.kernel.org> # 5.8+ > Reported-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > mm/memcontrol.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 605f671203ef..63a8d47c1cd3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > for (;;) { > unsigned long nr_pages = page_counter_read(&memcg->memory); > - unsigned long reclaimed; > > if (nr_pages <= high) > break; > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > continue; > } > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, true); > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > + GFP_KERNEL, true); > > - if (!reclaimed && !nr_retries--) Any particular reason to remove !reclaimed? > + if (!nr_retries--) > break; > } > > -- > 2.30.0 >
On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote: > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote: > > When a value is written to a cgroup's memory.high control file, the > > write() context first tries to reclaim the cgroup to size before > > putting the limit in place for the workload. Concurrent charges from > > the workload can keep such a write() looping in reclaim indefinitely. > > > > In the past, a write to memory.high would first put the limit in place > > for the workload, then do targeted reclaim until the new limit has > > been met - similar to how we do it for memory.max. This wasn't prone > > to the described starvation issue. However, this sequence could cause > > excessive latencies in the workload, when allocating threads could be > > put into long penalty sleeps on the sudden memory.high overage created > > by the write(), before that had a chance to work it off. > > > > Now that memory_high_write() performs reclaim before enforcing the new > > limit, reflect that the cgroup may well fail to converge due to > > concurrent workload activity. Bail out of the loop after a few tries. > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > Cc: <stable@vger.kernel.org> # 5.8+ > > Reported-by: Tejun Heo <tj@kernel.org> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > --- > > mm/memcontrol.c | 7 +++---- > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 605f671203ef..63a8d47c1cd3 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > > for (;;) { > > unsigned long nr_pages = page_counter_read(&memcg->memory); > > - unsigned long reclaimed; > > > > if (nr_pages <= high) > > break; > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > continue; > > } > > > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > - GFP_KERNEL, true); > > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > + GFP_KERNEL, true); > > > > - if (!reclaimed && !nr_retries--) > > + if (!nr_retries--) > > Shouldn't it be (!reclaimed || !nr_retries) instead? > > If reclaimed == 0, it probably doesn't make much sense to retry. We usually allow nr_retries worth of no-progress reclaim cycles to make up for intermittent reclaim failures. The difference to OOMs/memory.max is that we don't want to loop indefinitely on forward progress, but we should allow the usual number of no-progress loops.
On Tue, Jan 12, 2021 at 10:59:58AM -0800, Shakeel Butt wrote: > On Tue, Jan 12, 2021 at 9:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > When a value is written to a cgroup's memory.high control file, the > > write() context first tries to reclaim the cgroup to size before > > putting the limit in place for the workload. Concurrent charges from > > the workload can keep such a write() looping in reclaim indefinitely. > > > > Is this observed on real workload? Yes. On several production hosts running a particularly aggressive workload, we've observed writers to memory.high getting stuck for minutes while consuming a significant amount of CPU. > Any particular reason to remove !reclaimed? Its purpose so far was to allow successful reclaim to continue indefinitely, while restricting no-progress loops to 'nr_retries'. Without the first part, it doesn't really matter whether reclaim is making progress or not: we do a maximum of 'nr_retries' loops until the cgroup size meets the new limit, then exit one way or another.
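To spell out the distinction being drawn here, the two exit conditions from the quoted hunk behave roughly as follows (same code as the diff, with explanatory comments added):

```c
		/* Before the patch: the && short-circuits whenever reclaim made
		 * progress, so the retry budget is only spent on no-progress
		 * passes and a steadily-charging workload can keep the writer
		 * looping in reclaim indefinitely. */
		if (!reclaimed && !nr_retries--)
			break;

		/* After the patch: every pass spends the budget, so the loop
		 * ends after at most nr_retries passes whether or not each
		 * pass reclaimed anything. */
		if (!nr_retries--)
			break;
```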
On Tue, Jan 12, 2021 at 02:45:43PM -0500, Johannes Weiner wrote: > On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote: > > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote: > > > When a value is written to a cgroup's memory.high control file, the > > > write() context first tries to reclaim the cgroup to size before > > > putting the limit in place for the workload. Concurrent charges from > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > In the past, a write to memory.high would first put the limit in place > > > for the workload, then do targeted reclaim until the new limit has > > > been met - similar to how we do it for memory.max. This wasn't prone > > > to the described starvation issue. However, this sequence could cause > > > excessive latencies in the workload, when allocating threads could be > > > put into long penalty sleeps on the sudden memory.high overage created > > > by the write(), before that had a chance to work it off. > > > > > > Now that memory_high_write() performs reclaim before enforcing the new > > > limit, reflect that the cgroup may well fail to converge due to > > > concurrent workload activity. Bail out of the loop after a few tries. > > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > > Cc: <stable@vger.kernel.org> # 5.8+ > > > Reported-by: Tejun Heo <tj@kernel.org> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > > --- > > > mm/memcontrol.c | 7 +++---- > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 605f671203ef..63a8d47c1cd3 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > > > > for (;;) { > > > unsigned long nr_pages = page_counter_read(&memcg->memory); > > > - unsigned long reclaimed; > > > > > > if (nr_pages <= high) > > > break; > > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > continue; > > > } > > > > > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > > - GFP_KERNEL, true); > > > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > > + GFP_KERNEL, true); > > > > > > - if (!reclaimed && !nr_retries--) > > > + if (!nr_retries--) > > > > Shouldn't it be (!reclaimed || !nr_retries) instead? > > > > If reclaimed == 0, it probably doesn't make much sense to retry. > > We usually allow nr_retries worth of no-progress reclaim cycles to > make up for intermittent reclaim failures. > > The difference to OOMs/memory.max is that we don't want to loop > indefinitely on forward progress, but we should allow the usual number > of no-progress loops. Re memory.max: trying really hard makes sense because we are OOMing otherwise. With memory.high such an idea is questionable: if were not able to reclaim a single page from the first attempt, it's unlikely that we can reclaim many from repeating 16 times. My concern here is that we can see CPU regressions in some cases when there is no reclaimable memory. Do you think we can win something by trying harder? If so, it's worth mentioning in the commit log. Because it's really a separate change to what's described in the log, to some extent it's a move into an opposite direction. Thanks!
On Tue, Jan 12, 2021 at 11:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Jan 12, 2021 at 10:59:58AM -0800, Shakeel Butt wrote: > > On Tue, Jan 12, 2021 at 9:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > When a value is written to a cgroup's memory.high control file, the > > > write() context first tries to reclaim the cgroup to size before > > > putting the limit in place for the workload. Concurrent charges from > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > > Is this observed on real workload? > > Yes. > > On several production hosts running a particularly aggressive > workload, we've observed writers to memory.high getting stuck for > minutes while consuming significant amount of CPU. > Good to add this in the commit message or at least mentioning that it happened in production. > > Any particular reason to remove !reclaimed? > > It's purpose so far was to allow successful reclaim to continue > indefinitely, while restricting no-progress loops to 'nr_retries'. > > Without the first part, it doesn't really matter whether reclaim is > making progress or not: we do a maximum of 'nr_retries' loops until > the cgroup size meets the new limit, then exit one way or another. Does it make sense to add this in the commit message as well? I am fine with either way. For the patch: Reviewed-by: Shakeel Butt <shakeelb@google.com>
On Tue, Jan 12, 2021 at 12:12:37PM -0800, Roman Gushchin wrote: > On Tue, Jan 12, 2021 at 02:45:43PM -0500, Johannes Weiner wrote: > > On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote: > > > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote: > > > > When a value is written to a cgroup's memory.high control file, the > > > > write() context first tries to reclaim the cgroup to size before > > > > putting the limit in place for the workload. Concurrent charges from > > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > > > In the past, a write to memory.high would first put the limit in place > > > > for the workload, then do targeted reclaim until the new limit has > > > > been met - similar to how we do it for memory.max. This wasn't prone > > > > to the described starvation issue. However, this sequence could cause > > > > excessive latencies in the workload, when allocating threads could be > > > > put into long penalty sleeps on the sudden memory.high overage created > > > > by the write(), before that had a chance to work it off. > > > > > > > > Now that memory_high_write() performs reclaim before enforcing the new > > > > limit, reflect that the cgroup may well fail to converge due to > > > > concurrent workload activity. Bail out of the loop after a few tries. > > > > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > > > Cc: <stable@vger.kernel.org> # 5.8+ > > > > Reported-by: Tejun Heo <tj@kernel.org> > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > > > --- > > > > mm/memcontrol.c | 7 +++---- > > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > index 605f671203ef..63a8d47c1cd3 100644 > > > > --- a/mm/memcontrol.c > > > > +++ b/mm/memcontrol.c > > > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > > > > > > for (;;) { > > > > unsigned long nr_pages = page_counter_read(&memcg->memory); > > > > - unsigned long reclaimed; > > > > > > > > if (nr_pages <= high) > > > > break; > > > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > > continue; > > > > } > > > > > > > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > > > - GFP_KERNEL, true); > > > > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > > > + GFP_KERNEL, true); > > > > > > > > - if (!reclaimed && !nr_retries--) > > > > + if (!nr_retries--) > > > > > > Shouldn't it be (!reclaimed || !nr_retries) instead? > > > > > > If reclaimed == 0, it probably doesn't make much sense to retry. > > > > We usually allow nr_retries worth of no-progress reclaim cycles to > > make up for intermittent reclaim failures. > > > > The difference to OOMs/memory.max is that we don't want to loop > > indefinitely on forward progress, but we should allow the usual number > > of no-progress loops. > > Re memory.max: trying really hard makes sense because we are OOMing otherwise. > With memory.high such an idea is questionable: if were not able to reclaim > a single page from the first attempt, it's unlikely that we can reclaim many > from repeating 16 times. > > My concern here is that we can see CPU regressions in some cases when there is > no reclaimable memory. Do you think we can win something by trying harder? > If so, it's worth mentioning in the commit log. 
> Because it's really a separate > change to what's described in the log, to some extent it's a move into an opposite > direction.
Hm, I'm confused what change you are referring to. Current upstream allows:
a. unlimited progress loops
b. 16 no-progress loops
My patch is fixing the issue resulting from the unlimited progress loops in a). This is described in the changelog. You seem to be advocating for an unrelated change to the no-progress loops condition in b). Am I missing something?
On Tue, Jan 12, 2021 at 04:11:27PM -0500, Johannes Weiner wrote: > On Tue, Jan 12, 2021 at 12:12:37PM -0800, Roman Gushchin wrote: > > On Tue, Jan 12, 2021 at 02:45:43PM -0500, Johannes Weiner wrote: > > > On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote: > > > > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote: > > > > > When a value is written to a cgroup's memory.high control file, the > > > > > write() context first tries to reclaim the cgroup to size before > > > > > putting the limit in place for the workload. Concurrent charges from > > > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > > > > > In the past, a write to memory.high would first put the limit in place > > > > > for the workload, then do targeted reclaim until the new limit has > > > > > been met - similar to how we do it for memory.max. This wasn't prone > > > > > to the described starvation issue. However, this sequence could cause > > > > > excessive latencies in the workload, when allocating threads could be > > > > > put into long penalty sleeps on the sudden memory.high overage created > > > > > by the write(), before that had a chance to work it off. > > > > > > > > > > Now that memory_high_write() performs reclaim before enforcing the new > > > > > limit, reflect that the cgroup may well fail to converge due to > > > > > concurrent workload activity. Bail out of the loop after a few tries. > > > > > > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > > > > Cc: <stable@vger.kernel.org> # 5.8+ > > > > > Reported-by: Tejun Heo <tj@kernel.org> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > --- > > > > > mm/memcontrol.c | 7 +++---- > > > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > > index 605f671203ef..63a8d47c1cd3 100644 > > > > > --- a/mm/memcontrol.c > > > > > +++ b/mm/memcontrol.c > > > > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > > > > > > > > for (;;) { > > > > > unsigned long nr_pages = page_counter_read(&memcg->memory); > > > > > - unsigned long reclaimed; > > > > > > > > > > if (nr_pages <= high) > > > > > break; > > > > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > > > continue; > > > > > } > > > > > > > > > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > > > > - GFP_KERNEL, true); > > > > > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > > > > + GFP_KERNEL, true); > > > > > > > > > > - if (!reclaimed && !nr_retries--) > > > > > + if (!nr_retries--) > > > > > > > > Shouldn't it be (!reclaimed || !nr_retries) instead? > > > > > > > > If reclaimed == 0, it probably doesn't make much sense to retry. > > > > > > We usually allow nr_retries worth of no-progress reclaim cycles to > > > make up for intermittent reclaim failures. > > > > > > The difference to OOMs/memory.max is that we don't want to loop > > > indefinitely on forward progress, but we should allow the usual number > > > of no-progress loops. > > > > Re memory.max: trying really hard makes sense because we are OOMing otherwise. > > With memory.high such an idea is questionable: if were not able to reclaim > > a single page from the first attempt, it's unlikely that we can reclaim many > > from repeating 16 times. 
> > > > My concern here is that we can see CPU regressions in some cases when there is > > no reclaimable memory. Do you think we can win something by trying harder? > > If so, it's worth mentioning in the commit log. Because it's really a separate > > change to what's described in the log, to some extent it's a move into an opposite > > direction. > > Hm, I'm confused what change you are referring to. > > Current upstream allows: > > a. unlimited progress loops > b. 16 no-progress loops > > My patch is fixing the issue resulting from the unlimited progress > loops in a). This is described in the changelog. > > You seem to be advocating for an unrelated change to the no-progress > loops condition in b). Fair enough. But still the question remains: what are we gaining by trying again after not being able to reclaim a single page? If you want, it can be done separately, but it looks like a good idea to me to bail out if we can't reclaim a single page. Thanks!
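For completeness, the stricter bail-out Roman is arguing for would read roughly like this; note it would also require keeping the 'reclaimed' variable that the patch removes (a sketch of the suggestion, not a posted patch):

```c
		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
							 GFP_KERNEL, true);

		/* Stop on the first pass that reclaims nothing, as well as
		 * when the overall retry budget runs out. */
		if (!reclaimed || !nr_retries--)
			break;
```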
On Tue 12-01-21 11:30:11, Johannes Weiner wrote: > When a value is written to a cgroup's memory.high control file, the > write() context first tries to reclaim the cgroup to size before > putting the limit in place for the workload. Concurrent charges from > the workload can keep such a write() looping in reclaim indefinitely. > > In the past, a write to memory.high would first put the limit in place > for the workload, then do targeted reclaim until the new limit has > been met - similar to how we do it for memory.max. This wasn't prone > to the described starvation issue. However, this sequence could cause > excessive latencies in the workload, when allocating threads could be > put into long penalty sleeps on the sudden memory.high overage created > by the write(), before that had a chance to work it off. > > Now that memory_high_write() performs reclaim before enforcing the new > limit, reflect that the cgroup may well fail to converge due to > concurrent workload activity. Bail out of the loop after a few tries. I can see that you have provided some more details in follow up replies but I do not see any explicit argument why an excessive time for writer is an actual problem. Could you be more specific? If the writer is time sensitive then there is a trivial way to workaround that and kill it by a signal (timeout 30s echo ....). Btw. this behavior has been considered http://lkml.kernel.org/r/20200710122917.GB3022@dhcp22.suse.cz/ " With this change the reclaim here might be just playing never ending catch up. On the plus side a break out from the reclaim loop would just enforce the limit so if the operation takes too long then the reclaim burden will move over to consumers eventually. So I do not see any real danger. " > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > Cc: <stable@vger.kernel.org> # 5.8+ Why is this worth backporting to stable? The behavior is different but I do not think any of them is harmful. > Reported-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> I am not against the patch. The existing interface doesn't provide any meaningful feedback to the userspace anyway. User would have to re check to see the result of the operation. So how hard we try is really an implementation detail. > --- > mm/memcontrol.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 605f671203ef..63a8d47c1cd3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > for (;;) { > unsigned long nr_pages = page_counter_read(&memcg->memory); > - unsigned long reclaimed; > > if (nr_pages <= high) > break; > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > continue; > } > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, true); > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > + GFP_KERNEL, true); > > - if (!reclaimed && !nr_retries--) > + if (!nr_retries--) > break; > } > > -- > 2.30.0 >
On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote: > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, true); > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > + GFP_KERNEL, true); Although I was also initially confused by throwing 'reclaimed' info away, the patch makes sense to me given the reasoning. It is Reviewed-by: Michal Koutný <mkoutny@suse.com> As for the discussed unsuccessful retries, I'd keep it a separate change. Regards, Michal
On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote: > When a value is written to a cgroup's memory.high control file, the > write() context first tries to reclaim the cgroup to size before > putting the limit in place for the workload. Concurrent charges from > the workload can keep such a write() looping in reclaim indefinitely. > > In the past, a write to memory.high would first put the limit in place > for the workload, then do targeted reclaim until the new limit has > been met - similar to how we do it for memory.max. This wasn't prone > to the described starvation issue. However, this sequence could cause > excessive latencies in the workload, when allocating threads could be > put into long penalty sleeps on the sudden memory.high overage created > by the write(), before that had a chance to work it off. > > Now that memory_high_write() performs reclaim before enforcing the new > limit, reflect that the cgroup may well fail to converge due to > concurrent workload activity. Bail out of the loop after a few tries. > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > Cc: <stable@vger.kernel.org> # 5.8+ > Reported-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > mm/memcontrol.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 605f671203ef..63a8d47c1cd3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > for (;;) { > unsigned long nr_pages = page_counter_read(&memcg->memory); > - unsigned long reclaimed; > > if (nr_pages <= high) > break; > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > continue; > } > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, true); > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > + GFP_KERNEL, true); > > - if (!reclaimed && !nr_retries--) > + if (!nr_retries--) > break; > } > > -- > 2.30.0 > Acked-by: Roman Gushchin <guro@fb.com> Thanks!
On Tue, Jan 12, 2021 at 01:45:12PM -0800, Roman Gushchin wrote: > On Tue, Jan 12, 2021 at 04:11:27PM -0500, Johannes Weiner wrote: > > On Tue, Jan 12, 2021 at 12:12:37PM -0800, Roman Gushchin wrote: > > > On Tue, Jan 12, 2021 at 02:45:43PM -0500, Johannes Weiner wrote: > > > > On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote: > > > > > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote: > > > > > > When a value is written to a cgroup's memory.high control file, the > > > > > > write() context first tries to reclaim the cgroup to size before > > > > > > putting the limit in place for the workload. Concurrent charges from > > > > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > > > > > > > In the past, a write to memory.high would first put the limit in place > > > > > > for the workload, then do targeted reclaim until the new limit has > > > > > > been met - similar to how we do it for memory.max. This wasn't prone > > > > > > to the described starvation issue. However, this sequence could cause > > > > > > excessive latencies in the workload, when allocating threads could be > > > > > > put into long penalty sleeps on the sudden memory.high overage created > > > > > > by the write(), before that had a chance to work it off. > > > > > > > > > > > > Now that memory_high_write() performs reclaim before enforcing the new > > > > > > limit, reflect that the cgroup may well fail to converge due to > > > > > > concurrent workload activity. Bail out of the loop after a few tries. > > > > > > > > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > > > > > Cc: <stable@vger.kernel.org> # 5.8+ > > > > > > Reported-by: Tejun Heo <tj@kernel.org> > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > --- > > > > > > mm/memcontrol.c | 7 +++---- > > > > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > > > index 605f671203ef..63a8d47c1cd3 100644 > > > > > > --- a/mm/memcontrol.c > > > > > > +++ b/mm/memcontrol.c > > > > > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > > > > > > > > > > for (;;) { > > > > > > unsigned long nr_pages = page_counter_read(&memcg->memory); > > > > > > - unsigned long reclaimed; > > > > > > > > > > > > if (nr_pages <= high) > > > > > > break; > > > > > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > > > > > > continue; > > > > > > } > > > > > > > > > > > > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > > > > > - GFP_KERNEL, true); > > > > > > + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > > > > > > + GFP_KERNEL, true); > > > > > > > > > > > > - if (!reclaimed && !nr_retries--) > > > > > > + if (!nr_retries--) > > > > > > > > > > Shouldn't it be (!reclaimed || !nr_retries) instead? > > > > > > > > > > If reclaimed == 0, it probably doesn't make much sense to retry. > > > > > > > > We usually allow nr_retries worth of no-progress reclaim cycles to > > > > make up for intermittent reclaim failures. > > > > > > > > The difference to OOMs/memory.max is that we don't want to loop > > > > indefinitely on forward progress, but we should allow the usual number > > > > of no-progress loops. > > > > > > Re memory.max: trying really hard makes sense because we are OOMing otherwise. 
> > > With memory.high such an idea is questionable: if were not able to reclaim > > > a single page from the first attempt, it's unlikely that we can reclaim many > > > from repeating 16 times. > > > > > > My concern here is that we can see CPU regressions in some cases when there is > > > no reclaimable memory. Do you think we can win something by trying harder? > > > If so, it's worth mentioning in the commit log. Because it's really a separate > > > change to what's described in the log, to some extent it's a move into an opposite > > > direction. > > > > Hm, I'm confused what change you are referring to. > > > > Current upstream allows: > > > > a. unlimited progress loops > > b. 16 no-progress loops > > > > My patch is fixing the issue resulting from the unlimited progress > > loops in a). This is described in the changelog. > > > > You seem to be advocating for an unrelated change to the no-progress > > loops condition in b). > > Fair enough. > > But still the question remains: what are we gaining by trying again after not > being able to reclaim a single page? If you want, it can be done separately, > but it looks like a good idea to me to bail out if we can't reclaim a single > page. You lost me there. If memory.max retries before declaring oom, why shouldn't memory.high retry before returning to userspace? If there is intermittent reclaim failure, then returning from the memory.high write before the limit is enforced in such rare situations saves very little, but makes the user visible behavior unpredictable and inconsistent. It's extra code and a special case that needs to be documented.
On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote: > On Tue 12-01-21 11:30:11, Johannes Weiner wrote: > > When a value is written to a cgroup's memory.high control file, the > > write() context first tries to reclaim the cgroup to size before > > putting the limit in place for the workload. Concurrent charges from > > the workload can keep such a write() looping in reclaim indefinitely. > > > > In the past, a write to memory.high would first put the limit in place > > for the workload, then do targeted reclaim until the new limit has > > been met - similar to how we do it for memory.max. This wasn't prone > > to the described starvation issue. However, this sequence could cause > > excessive latencies in the workload, when allocating threads could be > > put into long penalty sleeps on the sudden memory.high overage created > > by the write(), before that had a chance to work it off. > > > > Now that memory_high_write() performs reclaim before enforcing the new > > limit, reflect that the cgroup may well fail to converge due to > > concurrent workload activity. Bail out of the loop after a few tries. > > I can see that you have provided some more details in follow up replies > but I do not see any explicit argument why an excessive time for writer > is an actual problem. Could you be more specific? Our writer isn't necessarily time sensitive, but there is a difference between a) the write taking a few seconds to reclaim down the requested delta and b) the writer essentially turning into kswapd for the workload and busy-spinning inside the kernel indefinitely. We've seen the writer stuck in this function for minutes, long after the requested delta has been reclaimed, consuming alarming amounts of CPU cycles - CPU time that should really be accounted to the workload, not the system software performing the write. Obviously, we could work around it using timeouts and signals. In fact, we may have to until the new kernel is deployed everywhere. But this is the definition of an interface change breaking userspace, so I'm a bit surprised by your laid-back response. > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > Cc: <stable@vger.kernel.org> # 5.8+ > > Why is this worth backporting to stable? The behavior is different but I > do not think any of them is harmful. The referenced patch changed user-visible behavior in a way that is causing real production problems for us. From stable-kernel-rules: - It must fix a real bug that bothers people (not a, "This could be a problem..." type thing). > > Reported-by: Tejun Heo <tj@kernel.org> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > I am not against the patch. The existing interface doesn't provide any > meaningful feedback to the userspace anyway. User would have to re check > to see the result of the operation. So how hard we try is really an > implementation detail. Yeah, I wish it was a bit more consistent from an interface POV. Btw, if you have noticed, Roman's patch to enforce memcg->high *after* trying to reclaim went into the tree at the same exact time as Chris's series "mm, memcg: reclaim harder before high throttling" (commit b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap. 
Chris's patch changes memory.high reclaim on the allocation side from

    reclaim once, sleep if there is still overage

to

    reclaim the overage as long as you make forward progress;
    sleep after 16 no-progress loops if there is still overage

Roman's patch describes a problem where allocating threads go to sleep when memory.high is lowered by a wider step. This is exceedingly unlikely after Chris's change.

Because after Chris's change, memory.high is reclaimed on the allocation side as aggressively as memory.max. The only difference is that upon failure, one sleeps and the other OOMs.

If Roman's issue were present after Chris's change, then we'd also see premature OOM kills when memory.max is lowered by a large step. And I have never seen that happening.

So I suggest instead of my fix here, we revert Roman's patch instead, as it should no longer be needed. Thoughts?
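The allocation-side behavior after Chris's change, as characterized above, can be modeled roughly like this. This is an illustrative sketch of the described policy, not the actual mem_cgroup_handle_over_high() code; the helpers usage_over_high() and overage() are stand-ins for the real accounting:

```c
	/* Rough model of allocation-side memory.high handling after commit
	 * b3ff92916af3, as described in this mail. */
	unsigned int nr_retries = MAX_RECLAIM_RETRIES;

	while (usage_over_high(memcg)) {
		unsigned long reclaimed;

		reclaimed = reclaim_high(memcg, overage(memcg), GFP_KERNEL);
		if (reclaimed)
			continue;	/* forward progress: keep reclaiming */
		if (!nr_retries--)
			break;		/* give up after 16 no-progress passes */
	}
	/* Any remaining overage is paid for with a throttling sleep here,
	 * whereas memory.max would invoke the OOM killer at this point. */
```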
On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote: > On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote: > > On Tue 12-01-21 11:30:11, Johannes Weiner wrote: > > > When a value is written to a cgroup's memory.high control file, the > > > write() context first tries to reclaim the cgroup to size before > > > putting the limit in place for the workload. Concurrent charges from > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > In the past, a write to memory.high would first put the limit in place > > > for the workload, then do targeted reclaim until the new limit has > > > been met - similar to how we do it for memory.max. This wasn't prone > > > to the described starvation issue. However, this sequence could cause > > > excessive latencies in the workload, when allocating threads could be > > > put into long penalty sleeps on the sudden memory.high overage created > > > by the write(), before that had a chance to work it off. > > > > > > Now that memory_high_write() performs reclaim before enforcing the new > > > limit, reflect that the cgroup may well fail to converge due to > > > concurrent workload activity. Bail out of the loop after a few tries. > > > > I can see that you have provided some more details in follow up replies > > but I do not see any explicit argument why an excessive time for writer > > is an actual problem. Could you be more specific? > > Our writer isn't necessarily time sensitive, but there is a difference > between a) the write taking a few seconds to reclaim down the > requested delta and b) the writer essentially turning into kswapd for > the workload and busy-spinning inside the kernel indefinitely. > > We've seen the writer stuck in this function for minutes, long after > the requested delta has been reclaimed, consuming alarming amounts of > CPU cycles - CPU time that should really be accounted to the workload, > not the system software performing the write. > > Obviously, we could work around it using timeouts and signals. In > fact, we may have to until the new kernel is deployed everywhere. But > this is the definition of an interface change breaking userspace, so > I'm a bit surprised by your laid-back response. > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > > Cc: <stable@vger.kernel.org> # 5.8+ > > > > Why is this worth backporting to stable? The behavior is different but I > > do not think any of them is harmful. > > The referenced patch changed user-visible behavior in a way that is > causing real production problems for us. From stable-kernel-rules: > > - It must fix a real bug that bothers people (not a, "This could be a > problem..." type thing). > > > > Reported-by: Tejun Heo <tj@kernel.org> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > > > I am not against the patch. The existing interface doesn't provide any > > meaningful feedback to the userspace anyway. User would have to re check > > to see the result of the operation. So how hard we try is really an > > implementation detail. > > Yeah, I wish it was a bit more consistent from an interface POV. > > Btw, if you have noticed, Roman's patch to enforce memcg->high *after* > trying to reclaim went into the tree at the same exact time as Chris's > series "mm, memcg: reclaim harder before high throttling" (commit > b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap. 
> > Chris's patch changes memory.high reclaim on the allocation side from > > reclaim once, sleep if there is still overage > > to > > reclaim the overage as long as you make forward progress; > sleep after 16 no-progress loops if there is still overage > > Roman's patch describes a problem where allocating threads go to sleep > when memory.high is lowered by a wider step. This is exceedingly > unlikely after Chris's change. > > Because after Chris's change, memory.high is reclaimed on the > allocation side as aggressively as memory.max. The only difference is > that upon failure, one sleeps and the other OOMs. > > If Roman's issue were present after Chris's change, then we'd also see > premature OOM kills when memory.max is lowered by a large step. And I > have never seen that happening. > > So I suggest instead of my fix here, we revert Roman's patch instead, > as it should no longer be needed. Thoughts? Chris's patch was merged way earlier than mine into the kernel tree which was used when I observed the problem in the production. So likely it was there. I think it makes sense to try to reclaim memory first before putting all processes in the cgroup into reclaim mode. Even without artificial delays it creates some latency and btw doesn't make the reclaim process more efficient. What about something like this? Make one optimistic iteration without setting the high value and then fallback to the old behavior. @@ -6317,15 +6317,15 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, continue; } - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, true); + try_to_free_mem_cgroup_pages(memcg, nr_pages - high, GFP_KERNEL, + true); - if (!reclaimed && !nr_retries--) + if (nr_retries == MAX_RECLAIM_RETRIES) + page_counter_set_high(&memcg->memory, high); + if (!nr_retries--) break; } - page_counter_set_high(&memcg->memory, high); - memcg_wb_domain_size_changed(memcg); return nbytes;
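Since the inline diff above lost its line breaks, here is the gist of the proposed loop as I read it — a reconstruction, not Roman's patch verbatim. The key points: page_counter_set_high() moves inside the loop and fires after the first reclaim pass, the call after the loop goes away, and memcg_wb_domain_size_changed() appears to remain:

```c
	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
	bool drained = false;

	for (;;) {
		unsigned long nr_pages = page_counter_read(&memcg->memory);

		if (nr_pages <= high)
			break;

		if (signal_pending(current))
			break;

		if (!drained) {
			drain_all_stock(memcg);
			drained = true;
			continue;
		}

		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
					     GFP_KERNEL, true);

		/* The first pass ran without the new limit in place (the
		 * "one optimistic iteration"); enforce the limit afterwards
		 * so concurrent charges start paying for their own overage. */
		if (nr_retries == MAX_RECLAIM_RETRIES)
			page_counter_set_high(&memcg->memory, high);

		if (!nr_retries--)
			break;
	}

	memcg_wb_domain_size_changed(memcg);
	return nbytes;
```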
On Fri, Jan 15, 2021 at 09:03:41AM -0800, Roman Gushchin wrote: > On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote: > > On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote: > > > On Tue 12-01-21 11:30:11, Johannes Weiner wrote: > > > > When a value is written to a cgroup's memory.high control file, the > > > > write() context first tries to reclaim the cgroup to size before > > > > putting the limit in place for the workload. Concurrent charges from > > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > > > In the past, a write to memory.high would first put the limit in place > > > > for the workload, then do targeted reclaim until the new limit has > > > > been met - similar to how we do it for memory.max. This wasn't prone > > > > to the described starvation issue. However, this sequence could cause > > > > excessive latencies in the workload, when allocating threads could be > > > > put into long penalty sleeps on the sudden memory.high overage created > > > > by the write(), before that had a chance to work it off. > > > > > > > > Now that memory_high_write() performs reclaim before enforcing the new > > > > limit, reflect that the cgroup may well fail to converge due to > > > > concurrent workload activity. Bail out of the loop after a few tries. > > > > > > I can see that you have provided some more details in follow up replies > > > but I do not see any explicit argument why an excessive time for writer > > > is an actual problem. Could you be more specific? > > > > Our writer isn't necessarily time sensitive, but there is a difference > > between a) the write taking a few seconds to reclaim down the > > requested delta and b) the writer essentially turning into kswapd for > > the workload and busy-spinning inside the kernel indefinitely. > > > > We've seen the writer stuck in this function for minutes, long after > > the requested delta has been reclaimed, consuming alarming amounts of > > CPU cycles - CPU time that should really be accounted to the workload, > > not the system software performing the write. > > > > Obviously, we could work around it using timeouts and signals. In > > fact, we may have to until the new kernel is deployed everywhere. But > > this is the definition of an interface change breaking userspace, so > > I'm a bit surprised by your laid-back response. > > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > > > Cc: <stable@vger.kernel.org> # 5.8+ > > > > > > Why is this worth backporting to stable? The behavior is different but I > > > do not think any of them is harmful. > > > > The referenced patch changed user-visible behavior in a way that is > > causing real production problems for us. From stable-kernel-rules: > > > > - It must fix a real bug that bothers people (not a, "This could be a > > problem..." type thing). > > > > > > Reported-by: Tejun Heo <tj@kernel.org> > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > I am not against the patch. The existing interface doesn't provide any > > > meaningful feedback to the userspace anyway. User would have to re check > > > to see the result of the operation. So how hard we try is really an > > > implementation detail. > > > > Yeah, I wish it was a bit more consistent from an interface POV. 
> > > > Btw, if you have noticed, Roman's patch to enforce memcg->high *after* > > trying to reclaim went into the tree at the same exact time as Chris's > > series "mm, memcg: reclaim harder before high throttling" (commit > > b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap. > > > > Chris's patch changes memory.high reclaim on the allocation side from > > > > reclaim once, sleep if there is still overage > > > > to > > > > reclaim the overage as long as you make forward progress; > > sleep after 16 no-progress loops if there is still overage > > > > Roman's patch describes a problem where allocating threads go to sleep > > when memory.high is lowered by a wider step. This is exceedingly > > unlikely after Chris's change. > > > > Because after Chris's change, memory.high is reclaimed on the > > allocation side as aggressively as memory.max. The only difference is > > that upon failure, one sleeps and the other OOMs. > > > > If Roman's issue were present after Chris's change, then we'd also see > > premature OOM kills when memory.max is lowered by a large step. And I > > have never seen that happening. > > > > So I suggest instead of my fix here, we revert Roman's patch instead, > > as it should no longer be needed. Thoughts? > > Chris's patch was merged way earlier than mine into the kernel tree which > was used when I observed the problem in the production. So likely it was there. Chris's patch was in the tree earlier, but the first release containing it was tagged a day before you put in your change, so I doubt it was on the production system where you observed the issue. As per above, it'd be very surprising to see premature sleeps when lowering memory.high, when allocation-side reclaim keeps going until the cgroup meets the definition of OOM. > I think it makes sense to try to reclaim memory first before putting > all processes in the cgroup into reclaim mode. Even without artificial delays > it creates some latency and btw doesn't make the reclaim process more efficient. It's not obvious that this is a practical problem. It certainly isn't for memory.max, and there should be a good reason why the two should be different aside from the documented OOM vs sleep behavior. Absent any concrete evidence to the contrary, I'd make the behavior the same again for those two.
On Fri, Jan 15, 2021 at 03:55:36PM -0500, Johannes Weiner wrote: > On Fri, Jan 15, 2021 at 09:03:41AM -0800, Roman Gushchin wrote: > > On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote: > > > On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote: > > > > On Tue 12-01-21 11:30:11, Johannes Weiner wrote: > > > > > When a value is written to a cgroup's memory.high control file, the > > > > > write() context first tries to reclaim the cgroup to size before > > > > > putting the limit in place for the workload. Concurrent charges from > > > > > the workload can keep such a write() looping in reclaim indefinitely. > > > > > > > > > > In the past, a write to memory.high would first put the limit in place > > > > > for the workload, then do targeted reclaim until the new limit has > > > > > been met - similar to how we do it for memory.max. This wasn't prone > > > > > to the described starvation issue. However, this sequence could cause > > > > > excessive latencies in the workload, when allocating threads could be > > > > > put into long penalty sleeps on the sudden memory.high overage created > > > > > by the write(), before that had a chance to work it off. > > > > > > > > > > Now that memory_high_write() performs reclaim before enforcing the new > > > > > limit, reflect that the cgroup may well fail to converge due to > > > > > concurrent workload activity. Bail out of the loop after a few tries. > > > > > > > > I can see that you have provided some more details in follow up replies > > > > but I do not see any explicit argument why an excessive time for writer > > > > is an actual problem. Could you be more specific? > > > > > > Our writer isn't necessarily time sensitive, but there is a difference > > > between a) the write taking a few seconds to reclaim down the > > > requested delta and b) the writer essentially turning into kswapd for > > > the workload and busy-spinning inside the kernel indefinitely. > > > > > > We've seen the writer stuck in this function for minutes, long after > > > the requested delta has been reclaimed, consuming alarming amounts of > > > CPU cycles - CPU time that should really be accounted to the workload, > > > not the system software performing the write. > > > > > > Obviously, we could work around it using timeouts and signals. In > > > fact, we may have to until the new kernel is deployed everywhere. But > > > this is the definition of an interface change breaking userspace, so > > > I'm a bit surprised by your laid-back response. > > > > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > > > > > Cc: <stable@vger.kernel.org> # 5.8+ > > > > > > > > Why is this worth backporting to stable? The behavior is different but I > > > > do not think any of them is harmful. > > > > > > The referenced patch changed user-visible behavior in a way that is > > > causing real production problems for us. From stable-kernel-rules: > > > > > > - It must fix a real bug that bothers people (not a, "This could be a > > > problem..." type thing). > > > > > > > > Reported-by: Tejun Heo <tj@kernel.org> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > > > I am not against the patch. The existing interface doesn't provide any > > > > meaningful feedback to the userspace anyway. User would have to re check > > > > to see the result of the operation. So how hard we try is really an > > > > implementation detail. > > > > > > Yeah, I wish it was a bit more consistent from an interface POV. 
> > > > > > Btw, if you have noticed, Roman's patch to enforce memcg->high *after* > > > trying to reclaim went into the tree at the same exact time as Chris's > > > series "mm, memcg: reclaim harder before high throttling" (commit > > > b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap. > > > > > > Chris's patch changes memory.high reclaim on the allocation side from > > > > > > reclaim once, sleep if there is still overage > > > > > > to > > > > > > reclaim the overage as long as you make forward progress; > > > sleep after 16 no-progress loops if there is still overage > > > > > > Roman's patch describes a problem where allocating threads go to sleep > > > when memory.high is lowered by a wider step. This is exceedingly > > > unlikely after Chris's change. > > > > > > Because after Chris's change, memory.high is reclaimed on the > > > allocation side as aggressively as memory.max. The only difference is > > > that upon failure, one sleeps and the other OOMs. > > > > > > If Roman's issue were present after Chris's change, then we'd also see > > > premature OOM kills when memory.max is lowered by a large step. And I > > > have never seen that happening. > > > > > > So I suggest instead of my fix here, we revert Roman's patch instead, > > > as it should no longer be needed. Thoughts? > > > > Chris's patch was merged way earlier than mine into the kernel tree which > > was used when I observed the problem in the production. So likely it was there. > > Chris's patch was in the tree earlier, but the first release > containing it was tagged a day before you put in your change, so I > doubt it was on the production system where you observed the issue. > > As per above, it'd be very surprising to see premature sleeps when > lowering memory.high, when allocation-side reclaim keeps going until > the cgroup meets the definition of OOM. > > > I think it makes sense to try to reclaim memory first before putting > > all processes in the cgroup into reclaim mode. Even without artificial delays > > it creates some latency and btw doesn't make the reclaim process more efficient. > > It's not obvious that this is a practical problem. It certainly isn't > for memory.max, Because memory.max is usually not adjusted dynamically? > and there should be a good reason why the two should > be different aside from the documented OOM vs sleep behavior. Maybe we have different examples in our heads, but mine is a cgroup with a significant amount of relatively cold pagecache and a multi-threaded workload. Now somebody wants to tighten memory.high. Why would we put all threads into a direct reclaim? I don't see a good reason. Memory.max is different because nobody (hopefully) is adjusting it dynamically to be just a little bit bigger than the workingset. Most likely it's set once that after the creation of a cgroup. So such a problem just doesn't exist.
On Fri 15-01-21 11:20:50, Johannes Weiner wrote:
> On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> > On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > > When a value is written to a cgroup's memory.high control file, the
> > > write() context first tries to reclaim the cgroup to size before
> > > putting the limit in place for the workload. Concurrent charges from
> > > the workload can keep such a write() looping in reclaim indefinitely.
> > >
> > > In the past, a write to memory.high would first put the limit in place
> > > for the workload, then do targeted reclaim until the new limit has
> > > been met - similar to how we do it for memory.max. This wasn't prone
> > > to the described starvation issue. However, this sequence could cause
> > > excessive latencies in the workload, when allocating threads could be
> > > put into long penalty sleeps on the sudden memory.high overage created
> > > by the write(), before that had a chance to work it off.
> > >
> > > Now that memory_high_write() performs reclaim before enforcing the new
> > > limit, reflect that the cgroup may well fail to converge due to
> > > concurrent workload activity. Bail out of the loop after a few tries.
> >
> > I can see that you have provided some more details in follow up replies
> > but I do not see any explicit argument why an excessive time for writer
> > is an actual problem. Could you be more specific?
>
> Our writer isn't necessarily time sensitive, but there is a difference
> between a) the write taking a few seconds to reclaim down the
> requested delta and b) the writer essentially turning into kswapd for
> the workload and busy-spinning inside the kernel indefinitely.
>
> We've seen the writer stuck in this function for minutes, long after
> the requested delta has been reclaimed, consuming alarming amounts of
> CPU cycles - CPU time that should really be accounted to the workload,
> not the system software performing the write.

OK, this is an important detail. So the context which is doing the work
doesn't belong to the target memcg? If that is the case then I do understand
why you consider it a problem. In general I would recommend running
operations like this one in scope of the affected cgroup. But maybe that is
not really an option in your setup. Anyway, this is important information to
have in the changelog.

> Obviously, we could work around it using timeouts and signals. In
> fact, we may have to until the new kernel is deployed everywhere. But
> this is the definition of an interface change breaking userspace, so
> I'm a bit surprised by your laid-back response.

Well, I was basing my feedback on the available information in the changelog.
It is quite clear that somebody has to pay for the work. Moving as much of
the work to the writer makes sense as long as the context is running in the
same cgroup so the work gets accounted properly. If this assumption doesn't
match reality then we have to re-evaluate our priorities here.

> > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > Cc: <stable@vger.kernel.org> # 5.8+
> >
> > Why is this worth backporting to stable? The behavior is different but I
> > do not think any of them is harmful.
>
> The referenced patch changed user-visible behavior in a way that is
> causing real production problems for us. From stable-kernel-rules:
>
> - It must fix a real bug that bothers people (not a, "This could be a
>   problem..." type thing).
>
> > > Reported-by: Tejun Heo <tj@kernel.org>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > I am not against the patch. The existing interface doesn't provide any
> > meaningful feedback to the userspace anyway. User would have to re check
> > to see the result of the operation. So how hard we try is really an
> > implementation detail.
>
> Yeah, I wish it was a bit more consistent from an interface POV.
>
> Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
> trying to reclaim went into the tree at the same exact time as Chris's
> series "mm, memcg: reclaim harder before high throttling" (commit
> b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.
>
> Chris's patch changes memory.high reclaim on the allocation side from
>
> reclaim once, sleep if there is still overage
>
> to
>
> reclaim the overage as long as you make forward progress;
> sleep after 16 no-progress loops if there is still overage
>
> Roman's patch describes a problem where allocating threads go to sleep
> when memory.high is lowered by a wider step. This is exceedingly
> unlikely after Chris's change.
>
> Because after Chris's change, memory.high is reclaimed on the
> allocation side as aggressively as memory.max. The only difference is
> that upon failure, one sleeps and the other OOMs.
>
> If Roman's issue were present after Chris's change, then we'd also see
> premature OOM kills when memory.max is lowered by a large step. And I
> have never seen that happening.

This should be something quite easy to double check, right?

> So I suggest instead of my fix here, we revert Roman's patch instead,
> as it should no longer be needed. Thoughts?

Reverting 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering
memory.high") would certainly help to throttle producers, but it still doesn't
solve the underlying problem that a lot of work could be done in a context
which lives outside of the memcg, right? The effect would be much smaller and
it shouldn't be effectively unbounded, but it is still something we should
think about.

That being said, going with the revert sounds like a slightly better approach
to me.
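A minimal sketch of the suggestion above to run such an operation in scope of
the affected cgroup, so the reclaim work is accounted to the workload rather
than to the system software doing the write. The cgroup path and the new
limit are made-up examples, and this assumes cgroup2 with the memory
controller enabled.

#include <stdio.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s", val);
	return fclose(f);
}

int main(void)
{
	char pid[32];

	snprintf(pid, sizeof(pid), "%d", getpid());

	/* Move this process into the workload's cgroup first... */
	if (write_str("/sys/fs/cgroup/workload/cgroup.procs", pid))
		return 1;

	/* ...then tighten the limit from inside it. */
	return write_str("/sys/fs/cgroup/workload/memory.high", "4G") ? 1 : 0;
}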
On Fri, Jan 15, 2021 at 01:27:23PM -0800, Roman Gushchin wrote:
> On Fri, Jan 15, 2021 at 03:55:36PM -0500, Johannes Weiner wrote:
> > On Fri, Jan 15, 2021 at 09:03:41AM -0800, Roman Gushchin wrote:
> > > On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote:
> > > > On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> > > > > On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > > > > > When a value is written to a cgroup's memory.high control file, the
> > > > > > write() context first tries to reclaim the cgroup to size before
> > > > > > putting the limit in place for the workload. Concurrent charges from
> > > > > > the workload can keep such a write() looping in reclaim indefinitely.
> > > > > >
> > > > > > In the past, a write to memory.high would first put the limit in place
> > > > > > for the workload, then do targeted reclaim until the new limit has
> > > > > > been met - similar to how we do it for memory.max. This wasn't prone
> > > > > > to the described starvation issue. However, this sequence could cause
> > > > > > excessive latencies in the workload, when allocating threads could be
> > > > > > put into long penalty sleeps on the sudden memory.high overage created
> > > > > > by the write(), before that had a chance to work it off.
> > > > > >
> > > > > > Now that memory_high_write() performs reclaim before enforcing the new
> > > > > > limit, reflect that the cgroup may well fail to converge due to
> > > > > > concurrent workload activity. Bail out of the loop after a few tries.
> > > > >
> > > > > I can see that you have provided some more details in follow up replies
> > > > > but I do not see any explicit argument why an excessive time for writer
> > > > > is an actual problem. Could you be more specific?
> > > >
> > > > Our writer isn't necessarily time sensitive, but there is a difference
> > > > between a) the write taking a few seconds to reclaim down the
> > > > requested delta and b) the writer essentially turning into kswapd for
> > > > the workload and busy-spinning inside the kernel indefinitely.
> > > >
> > > > We've seen the writer stuck in this function for minutes, long after
> > > > the requested delta has been reclaimed, consuming alarming amounts of
> > > > CPU cycles - CPU time that should really be accounted to the workload,
> > > > not the system software performing the write.
> > > >
> > > > Obviously, we could work around it using timeouts and signals. In
> > > > fact, we may have to until the new kernel is deployed everywhere. But
> > > > this is the definition of an interface change breaking userspace, so
> > > > I'm a bit surprised by your laid-back response.
> > > >
> > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > > > > Cc: <stable@vger.kernel.org> # 5.8+
> > > > >
> > > > > Why is this worth backporting to stable? The behavior is different but I
> > > > > do not think any of them is harmful.
> > > >
> > > > The referenced patch changed user-visible behavior in a way that is
> > > > causing real production problems for us. From stable-kernel-rules:
> > > >
> > > > - It must fix a real bug that bothers people (not a, "This could be a
> > > >   problem..." type thing).
> > > >
> > > > > > Reported-by: Tejun Heo <tj@kernel.org>
> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > >
> > > > > I am not against the patch. The existing interface doesn't provide any
> > > > > meaningful feedback to the userspace anyway. User would have to re check
> > > > > to see the result of the operation. So how hard we try is really an
> > > > > implementation detail.
> > > >
> > > > Yeah, I wish it was a bit more consistent from an interface POV.
> > > >
> > > > Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
> > > > trying to reclaim went into the tree at the same exact time as Chris's
> > > > series "mm, memcg: reclaim harder before high throttling" (commit
> > > > b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.
> > > >
> > > > Chris's patch changes memory.high reclaim on the allocation side from
> > > >
> > > > reclaim once, sleep if there is still overage
> > > >
> > > > to
> > > >
> > > > reclaim the overage as long as you make forward progress;
> > > > sleep after 16 no-progress loops if there is still overage
> > > >
> > > > Roman's patch describes a problem where allocating threads go to sleep
> > > > when memory.high is lowered by a wider step. This is exceedingly
> > > > unlikely after Chris's change.
> > > >
> > > > Because after Chris's change, memory.high is reclaimed on the
> > > > allocation side as aggressively as memory.max. The only difference is
> > > > that upon failure, one sleeps and the other OOMs.
> > > >
> > > > If Roman's issue were present after Chris's change, then we'd also see
> > > > premature OOM kills when memory.max is lowered by a large step. And I
> > > > have never seen that happening.
> > > >
> > > > So I suggest instead of my fix here, we revert Roman's patch instead,
> > > > as it should no longer be needed. Thoughts?
> > >
> > > Chris's patch was merged way earlier than mine into the kernel tree which
> > > was used when I observed the problem in the production. So likely it was there.
> >
> > Chris's patch was in the tree earlier, but the first release
> > containing it was tagged a day before you put in your change, so I
> > doubt it was on the production system where you observed the issue.
> >
> > As per above, it'd be very surprising to see premature sleeps when
> > lowering memory.high, when allocation-side reclaim keeps going until
> > the cgroup meets the definition of OOM.
> >
> > > I think it makes sense to try to reclaim memory first before putting
> > > all processes in the cgroup into reclaim mode. Even without artificial delays
> > > it creates some latency and btw doesn't make the reclaim process more efficient.
> >
> > It's not obvious that this is a practical problem. It certainly isn't
> > for memory.max,
>
> Because memory.max is usually not adjusted dynamically?
>
> > and there should be a good reason why the two should
> > be different aside from the documented OOM vs sleep behavior.
>
> Maybe we have different examples in our heads, but mine is a cgroup
> with a significant amount of relatively cold pagecache and a multi-threaded
> workload. Now somebody wants to tighten memory.high. Why would we put all
> threads into a direct reclaim? I don't see a good reason.

But how is that different than a multi-threaded workload simply running up
against a static limit? As soon as one event puts the cgroup over the limit,
all concurrent allocations will form a thundering herd of reclaimers. I don't
see much of a difference between a burst of allocations exceeding the limit
from below and the limit being lowered from above.

Sure, the requested limit delta is usually bigger than a single allocation.
However, memory.high enforcement is batched, and quite a few threads could
contribute to the overage and be marked to enter direct reclaim before the
first page is actually scanned. Also, while we do adjust memory.high
dynamically, we do so no more than once every 5-10 seconds, whereas a
multi-threaded workload may breach its limit and provoke thundering herds
many times a second.

So I just cannot imagine the sequence of events here would make a practical
difference for the aspect of the thundering herd (whereas it was VERY obvious
with the premature sleep issue on the allocation side). But the thundering
herd may start as soon as you lower the limit, whether you reclaim before or
after. It may already be occurring against the higher limit before you even
enter write().

Allocation latency is a real concern, I agree with you there. But that's IMO
more to do with how cgroup limits are enforced in general. That's why there
have been multiple discussions and patch submissions around async reclaim.

But the fact is, the change of sequence is causing real problems in
production, and with the premature-throttling-on-large-delta bug out of the
way, the justification for keeping it is somewhat nebulous.

So unless we can clearly show from production that the sequence still matters
after Chris's patch, I'd say we go with the revert and take the small risk of
getting data to the contrary. As opposed to accruing additional complexity
and special-cased code on a hunch that we then may carry unnecessarily for a
very long time - until somebody later on, under the pressure of keeping all
this stuff maintainable, takes the risk of chopping it again with much less
of the original context and stakeholders available.
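To illustrate the point about batching: a simplified sketch, not verbatim
kernel code, of why a burst of allocations and a freshly lowered limit look
alike to memory.high enforcement. Charging is done without inline reclaim,
and every allocating thread that sees the counter above memory.high is
flagged for reclaim before any page has been scanned, so the thundering herd
forms either way. Field and function names below are simplified stand-ins.

static void charge_sketch(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	/* Charge the pages without reclaiming inline. */
	page_counter_charge(&memcg->memory, nr_pages);

	/*
	 * If the cgroup is now above memory.high - whether charges pushed it
	 * up or the limit was just lowered - record the overage in the task...
	 */
	if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->memory.high))
		current->memcg_nr_pages_over_high += nr_pages;

	/*
	 * ...and the actual reclaim/throttling happens later, on the way back
	 * to userspace, where all flagged threads pile in at once.
	 */
}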
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 605f671203ef..63a8d47c1cd3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
-		unsigned long reclaimed;
 
 		if (nr_pages <= high)
 			break;
@@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 			continue;
 		}
 
-		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-							 GFP_KERNEL, true);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
+					     GFP_KERNEL, true);
 
-		if (!reclaimed && !nr_retries--)
+		if (!nr_retries--)
 			break;
 	}
When a value is written to a cgroup's memory.high control file, the
write() context first tries to reclaim the cgroup to size before
putting the limit in place for the workload. Concurrent charges from
the workload can keep such a write() looping in reclaim indefinitely.

In the past, a write to memory.high would first put the limit in place
for the workload, then do targeted reclaim until the new limit has
been met - similar to how we do it for memory.max. This wasn't prone
to the described starvation issue. However, this sequence could cause
excessive latencies in the workload, when allocating threads could be
put into long penalty sleeps on the sudden memory.high overage created
by the write(), before that had a chance to work it off.

Now that memory_high_write() performs reclaim before enforcing the new
limit, reflect that the cgroup may well fail to converge due to
concurrent workload activity. Bail out of the loop after a few tries.

Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
Cc: <stable@vger.kernel.org> # 5.8+
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)
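For context, with this patch applied the reclaim loop in memory_high_write()
reads roughly as follows. This is a reconstruction from the hunks above plus
surrounding context in mm/memcontrol.c; the signal and drain checks are not
part of this diff.

	for (;;) {
		unsigned long nr_pages = page_counter_read(&memcg->memory);

		if (nr_pages <= high)
			break;

		if (signal_pending(current))
			break;

		if (!drained) {
			/* Flush per-cpu charge caches once before reclaiming. */
			drain_all_stock(memcg);
			drained = true;
			continue;
		}

		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
					     GFP_KERNEL, true);

		/*
		 * Concurrent charges can keep the cgroup above the new limit
		 * indefinitely; give up after a bounded number of attempts
		 * instead of looping forever.
		 */
		if (!nr_retries--)
			break;
	}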