mm: memcontrol: prevent starvation when writing memory.high

Message ID	20210112163011.127833-1-hannes@cmpxchg.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=wefk=GP=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D202422B2B From: Johannes Weiner <hannes@cmpxchg.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org>, Roman Gushchin <guro@fb.com>, Michal Hocko <mhocko@suse.com>, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH] mm: memcontrol: prevent starvation when writing memory.high Date: Tue, 12 Jan 2021 11:30:11 -0500 Message-Id: <20210112163011.127833-1-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: memcontrol: prevent starvation when writing memory.high \| expand mm: memcontrol: prevent starvation when writing memory.high

Johannes Weiner Jan. 12, 2021, 4:30 p.m. UTC

When a value is written to a cgroup's memory.high control file, the
write() context first tries to reclaim the cgroup to size before
putting the limit in place for the workload. Concurrent charges from
the workload can keep such a write() looping in reclaim indefinitely.

In the past, a write to memory.high would first put the limit in place
for the workload, then do targeted reclaim until the new limit has
been met - similar to how we do it for memory.max. This wasn't prone
to the described starvation issue. However, this sequence could cause
excessive latencies in the workload, when allocating threads could be
put into long penalty sleeps on the sudden memory.high overage created
by the write(), before that had a chance to work it off.

Now that memory_high_write() performs reclaim before enforcing the new
limit, reflect that the cgroup may well fail to converge due to
concurrent workload activity. Bail out of the loop after a few tries.

Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
Cc: <stable@vger.kernel.org> # 5.8+
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Roman Gushchin Jan. 12, 2021, 5:03 p.m. UTC | #1

On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote:
> When a value is written to a cgroup's memory.high control file, the
> write() context first tries to reclaim the cgroup to size before
> putting the limit in place for the workload. Concurrent charges from
> the workload can keep such a write() looping in reclaim indefinitely.
> 
> In the past, a write to memory.high would first put the limit in place
> for the workload, then do targeted reclaim until the new limit has
> been met - similar to how we do it for memory.max. This wasn't prone
> to the described starvation issue. However, this sequence could cause
> excessive latencies in the workload, when allocating threads could be
> put into long penalty sleeps on the sudden memory.high overage created
> by the write(), before that had a chance to work it off.
> 
> Now that memory_high_write() performs reclaim before enforcing the new
> limit, reflect that the cgroup may well fail to converge due to
> concurrent workload activity. Bail out of the loop after a few tries.
> 
> Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> Cc: <stable@vger.kernel.org> # 5.8+
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 605f671203ef..63a8d47c1cd3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  
>  	for (;;) {
>  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> -		unsigned long reclaimed;
>  
>  		if (nr_pages <= high)
>  			break;
> @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  			continue;
>  		}
>  
> -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -							 GFP_KERNEL, true);
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +					     GFP_KERNEL, true);
>  
> -		if (!reclaimed && !nr_retries--)
> +		if (!nr_retries--)

Shouldn't it be (!reclaimed || !nr_retries) instead?

If reclaimed == 0, it probably doesn't make much sense to retry.

Otherwise the patch looks good to me.

Thanks!

Shakeel Butt Jan. 12, 2021, 6:59 p.m. UTC | #2

On Tue, Jan 12, 2021 at 9:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> When a value is written to a cgroup's memory.high control file, the
> write() context first tries to reclaim the cgroup to size before
> putting the limit in place for the workload. Concurrent charges from
> the workload can keep such a write() looping in reclaim indefinitely.
>

Is this observed on real workload?

>
> In the past, a write to memory.high would first put the limit in place
> for the workload, then do targeted reclaim until the new limit has
> been met - similar to how we do it for memory.max. This wasn't prone
> to the described starvation issue. However, this sequence could cause
> excessive latencies in the workload, when allocating threads could be
> put into long penalty sleeps on the sudden memory.high overage created
> by the write(), before that had a chance to work it off.
>
> Now that memory_high_write() performs reclaim before enforcing the new
> limit, reflect that the cgroup may well fail to converge due to
> concurrent workload activity. Bail out of the loop after a few tries.
>
> Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> Cc: <stable@vger.kernel.org> # 5.8+
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 605f671203ef..63a8d47c1cd3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>
>         for (;;) {
>                 unsigned long nr_pages = page_counter_read(&memcg->memory);
> -               unsigned long reclaimed;
>
>                 if (nr_pages <= high)
>                         break;
> @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>                         continue;
>                 }
>
> -               reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -                                                        GFP_KERNEL, true);
> +               try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +                                            GFP_KERNEL, true);
>
> -               if (!reclaimed && !nr_retries--)

Any particular reason to remove !reclaimed?

> +               if (!nr_retries--)
>                         break;
>         }
>
> --
> 2.30.0
>

Johannes Weiner Jan. 12, 2021, 7:45 p.m. UTC | #3

On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote:
> On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote:
> > When a value is written to a cgroup's memory.high control file, the
> > write() context first tries to reclaim the cgroup to size before
> > putting the limit in place for the workload. Concurrent charges from
> > the workload can keep such a write() looping in reclaim indefinitely.
> > 
> > In the past, a write to memory.high would first put the limit in place
> > for the workload, then do targeted reclaim until the new limit has
> > been met - similar to how we do it for memory.max. This wasn't prone
> > to the described starvation issue. However, this sequence could cause
> > excessive latencies in the workload, when allocating threads could be
> > put into long penalty sleeps on the sudden memory.high overage created
> > by the write(), before that had a chance to work it off.
> > 
> > Now that memory_high_write() performs reclaim before enforcing the new
> > limit, reflect that the cgroup may well fail to converge due to
> > concurrent workload activity. Bail out of the loop after a few tries.
> > 
> > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > Cc: <stable@vger.kernel.org> # 5.8+
> > Reported-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/memcontrol.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 605f671203ef..63a8d47c1cd3 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> >  
> >  	for (;;) {
> >  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> > -		unsigned long reclaimed;
> >  
> >  		if (nr_pages <= high)
> >  			break;
> > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> >  			continue;
> >  		}
> >  
> > -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > -							 GFP_KERNEL, true);
> > +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > +					     GFP_KERNEL, true);
> >  
> > -		if (!reclaimed && !nr_retries--)
> > +		if (!nr_retries--)
> 
> Shouldn't it be (!reclaimed || !nr_retries) instead?
> 
> If reclaimed == 0, it probably doesn't make much sense to retry.

We usually allow nr_retries worth of no-progress reclaim cycles to
make up for intermittent reclaim failures.

The difference to OOMs/memory.max is that we don't want to loop
indefinitely on forward progress, but we should allow the usual number
of no-progress loops.

Johannes Weiner Jan. 12, 2021, 7:53 p.m. UTC | #4

On Tue, Jan 12, 2021 at 10:59:58AM -0800, Shakeel Butt wrote:
> On Tue, Jan 12, 2021 at 9:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > When a value is written to a cgroup's memory.high control file, the
> > write() context first tries to reclaim the cgroup to size before
> > putting the limit in place for the workload. Concurrent charges from
> > the workload can keep such a write() looping in reclaim indefinitely.
> >
> 
> Is this observed on real workload?

Yes.

On several production hosts running a particularly aggressive
workload, we've observed writers to memory.high getting stuck for
minutes while consuming significant amount of CPU.

> Any particular reason to remove !reclaimed?

It's purpose so far was to allow successful reclaim to continue
indefinitely, while restricting no-progress loops to 'nr_retries'.

Without the first part, it doesn't really matter whether reclaim is
making progress or not: we do a maximum of 'nr_retries' loops until
the cgroup size meets the new limit, then exit one way or another.

Roman Gushchin Jan. 12, 2021, 8:12 p.m. UTC | #5

On Tue, Jan 12, 2021 at 02:45:43PM -0500, Johannes Weiner wrote:
> On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote:
> > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote:
> > > When a value is written to a cgroup's memory.high control file, the
> > > write() context first tries to reclaim the cgroup to size before
> > > putting the limit in place for the workload. Concurrent charges from
> > > the workload can keep such a write() looping in reclaim indefinitely.
> > > 
> > > In the past, a write to memory.high would first put the limit in place
> > > for the workload, then do targeted reclaim until the new limit has
> > > been met - similar to how we do it for memory.max. This wasn't prone
> > > to the described starvation issue. However, this sequence could cause
> > > excessive latencies in the workload, when allocating threads could be
> > > put into long penalty sleeps on the sudden memory.high overage created
> > > by the write(), before that had a chance to work it off.
> > > 
> > > Now that memory_high_write() performs reclaim before enforcing the new
> > > limit, reflect that the cgroup may well fail to converge due to
> > > concurrent workload activity. Bail out of the loop after a few tries.
> > > 
> > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > Cc: <stable@vger.kernel.org> # 5.8+
> > > Reported-by: Tejun Heo <tj@kernel.org>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  mm/memcontrol.c | 7 +++----
> > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 605f671203ef..63a8d47c1cd3 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > >  
> > >  	for (;;) {
> > >  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> > > -		unsigned long reclaimed;
> > >  
> > >  		if (nr_pages <= high)
> > >  			break;
> > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > >  			continue;
> > >  		}
> > >  
> > > -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > > -							 GFP_KERNEL, true);
> > > +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > > +					     GFP_KERNEL, true);
> > >  
> > > -		if (!reclaimed && !nr_retries--)
> > > +		if (!nr_retries--)
> > 
> > Shouldn't it be (!reclaimed || !nr_retries) instead?
> > 
> > If reclaimed == 0, it probably doesn't make much sense to retry.
> 
> We usually allow nr_retries worth of no-progress reclaim cycles to
> make up for intermittent reclaim failures.
> 
> The difference to OOMs/memory.max is that we don't want to loop
> indefinitely on forward progress, but we should allow the usual number
> of no-progress loops.

Re memory.max: trying really hard makes sense because we are OOMing otherwise.
With memory.high such an idea is questionable: if were not able to reclaim
a single page from the first attempt, it's unlikely that we can reclaim many
from repeating 16 times.

My concern here is that we can see CPU regressions in some cases when there is
no reclaimable memory. Do you think we can win something by trying harder?
If so, it's worth mentioning in the commit log. Because it's really a separate
change to what's described in the log, to some extent it's a move into an opposite
direction.

Thanks!

Shakeel Butt Jan. 12, 2021, 8:28 p.m. UTC | #6

On Tue, Jan 12, 2021 at 11:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Jan 12, 2021 at 10:59:58AM -0800, Shakeel Butt wrote:
> > On Tue, Jan 12, 2021 at 9:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > When a value is written to a cgroup's memory.high control file, the
> > > write() context first tries to reclaim the cgroup to size before
> > > putting the limit in place for the workload. Concurrent charges from
> > > the workload can keep such a write() looping in reclaim indefinitely.
> > >
> >
> > Is this observed on real workload?
>
> Yes.
>
> On several production hosts running a particularly aggressive
> workload, we've observed writers to memory.high getting stuck for
> minutes while consuming significant amount of CPU.
>

Good to add this in the commit message or at least mentioning that it
happened in production.

> > Any particular reason to remove !reclaimed?
>
> It's purpose so far was to allow successful reclaim to continue
> indefinitely, while restricting no-progress loops to 'nr_retries'.
>
> Without the first part, it doesn't really matter whether reclaim is
> making progress or not: we do a maximum of 'nr_retries' loops until
> the cgroup size meets the new limit, then exit one way or another.

Does it make sense to add this in the commit message as well? I am
fine with either way.

For the patch:
Reviewed-by: Shakeel Butt <shakeelb@google.com>

Johannes Weiner Jan. 12, 2021, 9:11 p.m. UTC | #7

On Tue, Jan 12, 2021 at 12:12:37PM -0800, Roman Gushchin wrote:
> On Tue, Jan 12, 2021 at 02:45:43PM -0500, Johannes Weiner wrote:
> > On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote:
> > > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote:
> > > > When a value is written to a cgroup's memory.high control file, the
> > > > write() context first tries to reclaim the cgroup to size before
> > > > putting the limit in place for the workload. Concurrent charges from
> > > > the workload can keep such a write() looping in reclaim indefinitely.
> > > > 
> > > > In the past, a write to memory.high would first put the limit in place
> > > > for the workload, then do targeted reclaim until the new limit has
> > > > been met - similar to how we do it for memory.max. This wasn't prone
> > > > to the described starvation issue. However, this sequence could cause
> > > > excessive latencies in the workload, when allocating threads could be
> > > > put into long penalty sleeps on the sudden memory.high overage created
> > > > by the write(), before that had a chance to work it off.
> > > > 
> > > > Now that memory_high_write() performs reclaim before enforcing the new
> > > > limit, reflect that the cgroup may well fail to converge due to
> > > > concurrent workload activity. Bail out of the loop after a few tries.
> > > > 
> > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > > Cc: <stable@vger.kernel.org> # 5.8+
> > > > Reported-by: Tejun Heo <tj@kernel.org>
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > ---
> > > >  mm/memcontrol.c | 7 +++----
> > > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 605f671203ef..63a8d47c1cd3 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > > >  
> > > >  	for (;;) {
> > > >  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> > > > -		unsigned long reclaimed;
> > > >  
> > > >  		if (nr_pages <= high)
> > > >  			break;
> > > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > > >  			continue;
> > > >  		}
> > > >  
> > > > -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > > > -							 GFP_KERNEL, true);
> > > > +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > > > +					     GFP_KERNEL, true);
> > > >  
> > > > -		if (!reclaimed && !nr_retries--)
> > > > +		if (!nr_retries--)
> > > 
> > > Shouldn't it be (!reclaimed || !nr_retries) instead?
> > > 
> > > If reclaimed == 0, it probably doesn't make much sense to retry.
> > 
> > We usually allow nr_retries worth of no-progress reclaim cycles to
> > make up for intermittent reclaim failures.
> > 
> > The difference to OOMs/memory.max is that we don't want to loop
> > indefinitely on forward progress, but we should allow the usual number
> > of no-progress loops.
> 
> Re memory.max: trying really hard makes sense because we are OOMing otherwise.
> With memory.high such an idea is questionable: if were not able to reclaim
> a single page from the first attempt, it's unlikely that we can reclaim many
> from repeating 16 times.
> 
> My concern here is that we can see CPU regressions in some cases when there is
> no reclaimable memory. Do you think we can win something by trying harder?
> If so, it's worth mentioning in the commit log. Because it's really a separate
> change to what's described in the log, to some extent it's a move into an opposite
> direction.

Hm, I'm confused what change you are referring to.

Current upstream allows:

    a. unlimited progress loops
    b. 16 no-progress loops

My patch is fixing the issue resulting from the unlimited progress
loops in a). This is described in the changelog.

You seem to be advocating for an unrelated change to the no-progress
loops condition in b).

Am I missing something?

Roman Gushchin Jan. 12, 2021, 9:45 p.m. UTC | #8

On Tue, Jan 12, 2021 at 04:11:27PM -0500, Johannes Weiner wrote:
> On Tue, Jan 12, 2021 at 12:12:37PM -0800, Roman Gushchin wrote:
> > On Tue, Jan 12, 2021 at 02:45:43PM -0500, Johannes Weiner wrote:
> > > On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote:
> > > > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote:
> > > > > When a value is written to a cgroup's memory.high control file, the
> > > > > write() context first tries to reclaim the cgroup to size before
> > > > > putting the limit in place for the workload. Concurrent charges from
> > > > > the workload can keep such a write() looping in reclaim indefinitely.
> > > > > 
> > > > > In the past, a write to memory.high would first put the limit in place
> > > > > for the workload, then do targeted reclaim until the new limit has
> > > > > been met - similar to how we do it for memory.max. This wasn't prone
> > > > > to the described starvation issue. However, this sequence could cause
> > > > > excessive latencies in the workload, when allocating threads could be
> > > > > put into long penalty sleeps on the sudden memory.high overage created
> > > > > by the write(), before that had a chance to work it off.
> > > > > 
> > > > > Now that memory_high_write() performs reclaim before enforcing the new
> > > > > limit, reflect that the cgroup may well fail to converge due to
> > > > > concurrent workload activity. Bail out of the loop after a few tries.
> > > > > 
> > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > > > Cc: <stable@vger.kernel.org> # 5.8+
> > > > > Reported-by: Tejun Heo <tj@kernel.org>
> > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > ---
> > > > >  mm/memcontrol.c | 7 +++----
> > > > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index 605f671203ef..63a8d47c1cd3 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > > > >  
> > > > >  	for (;;) {
> > > > >  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> > > > > -		unsigned long reclaimed;
> > > > >  
> > > > >  		if (nr_pages <= high)
> > > > >  			break;
> > > > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > > > >  			continue;
> > > > >  		}
> > > > >  
> > > > > -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > > > > -							 GFP_KERNEL, true);
> > > > > +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > > > > +					     GFP_KERNEL, true);
> > > > >  
> > > > > -		if (!reclaimed && !nr_retries--)
> > > > > +		if (!nr_retries--)
> > > > 
> > > > Shouldn't it be (!reclaimed || !nr_retries) instead?
> > > > 
> > > > If reclaimed == 0, it probably doesn't make much sense to retry.
> > > 
> > > We usually allow nr_retries worth of no-progress reclaim cycles to
> > > make up for intermittent reclaim failures.
> > > 
> > > The difference to OOMs/memory.max is that we don't want to loop
> > > indefinitely on forward progress, but we should allow the usual number
> > > of no-progress loops.
> > 
> > Re memory.max: trying really hard makes sense because we are OOMing otherwise.
> > With memory.high such an idea is questionable: if were not able to reclaim
> > a single page from the first attempt, it's unlikely that we can reclaim many
> > from repeating 16 times.
> > 
> > My concern here is that we can see CPU regressions in some cases when there is
> > no reclaimable memory. Do you think we can win something by trying harder?
> > If so, it's worth mentioning in the commit log. Because it's really a separate
> > change to what's described in the log, to some extent it's a move into an opposite
> > direction.
> 
> Hm, I'm confused what change you are referring to.
> 
> Current upstream allows:
> 
>     a. unlimited progress loops
>     b. 16 no-progress loops
> 
> My patch is fixing the issue resulting from the unlimited progress
> loops in a). This is described in the changelog.
> 
> You seem to be advocating for an unrelated change to the no-progress
> loops condition in b).

Fair enough.

But still the question remains: what are we gaining by trying again after not
being able to reclaim a single page? If you want, it can be done separately,
but it looks like a good idea to me to bail out if we can't reclaim a single
page.

Thanks!

Michal Hocko Jan. 13, 2021, 2:46 p.m. UTC | #9

On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> When a value is written to a cgroup's memory.high control file, the
> write() context first tries to reclaim the cgroup to size before
> putting the limit in place for the workload. Concurrent charges from
> the workload can keep such a write() looping in reclaim indefinitely.
> 
> In the past, a write to memory.high would first put the limit in place
> for the workload, then do targeted reclaim until the new limit has
> been met - similar to how we do it for memory.max. This wasn't prone
> to the described starvation issue. However, this sequence could cause
> excessive latencies in the workload, when allocating threads could be
> put into long penalty sleeps on the sudden memory.high overage created
> by the write(), before that had a chance to work it off.
> 
> Now that memory_high_write() performs reclaim before enforcing the new
> limit, reflect that the cgroup may well fail to converge due to
> concurrent workload activity. Bail out of the loop after a few tries.

I can see that you have provided some more details in follow up replies
but I do not see any explicit argument why an excessive time for writer
is an actual problem. Could you be more specific?

If the writer is time sensitive then there is a trivial way to
workaround that and kill it by a signal (timeout 30s echo ....).

Btw. this behavior has been considered http://lkml.kernel.org/r/20200710122917.GB3022@dhcp22.suse.cz/
"
With this change
the reclaim here might be just playing never ending catch up. On the
plus side a break out from the reclaim loop would just enforce the limit
so if the operation takes too long then the reclaim burden will move
over to consumers eventually. So I do not see any real danger.
"

> Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> Cc: <stable@vger.kernel.org> # 5.8+

Why is this worth backporting to stable? The behavior is different but I
do not think any of them is harmful.

> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I am not against the patch. The existing interface doesn't provide any
meaningful feedback to the userspace anyway. User would have to re check
to see the result of the operation. So how hard we try is really an
implementation detail.

> ---
>  mm/memcontrol.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 605f671203ef..63a8d47c1cd3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  
>  	for (;;) {
>  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> -		unsigned long reclaimed;
>  
>  		if (nr_pages <= high)
>  			break;
> @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  			continue;
>  		}
>  
> -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -							 GFP_KERNEL, true);
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +					     GFP_KERNEL, true);
>  
> -		if (!reclaimed && !nr_retries--)
> +		if (!nr_retries--)
>  			break;
>  	}
>  
> -- 
> 2.30.0
>

Michal Koutný Jan. 13, 2021, 5:25 p.m. UTC | #10

On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -							 GFP_KERNEL, true);
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +					     GFP_KERNEL, true);
Although I was also initially confused by throwing 'reclaimed' info
away, the patch makes sense to me given the reasoning.

It is
Reviewed-by: Michal Koutný <mkoutny@suse.com>

As for the discussed unsuccessful retries, I'd keep it a separate change.

Regards,
Michal

Roman Gushchin Jan. 13, 2021, 6:06 p.m. UTC | #11

On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote:
> When a value is written to a cgroup's memory.high control file, the
> write() context first tries to reclaim the cgroup to size before
> putting the limit in place for the workload. Concurrent charges from
> the workload can keep such a write() looping in reclaim indefinitely.
> 
> In the past, a write to memory.high would first put the limit in place
> for the workload, then do targeted reclaim until the new limit has
> been met - similar to how we do it for memory.max. This wasn't prone
> to the described starvation issue. However, this sequence could cause
> excessive latencies in the workload, when allocating threads could be
> put into long penalty sleeps on the sudden memory.high overage created
> by the write(), before that had a chance to work it off.
> 
> Now that memory_high_write() performs reclaim before enforcing the new
> limit, reflect that the cgroup may well fail to converge due to
> concurrent workload activity. Bail out of the loop after a few tries.
> 
> Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> Cc: <stable@vger.kernel.org> # 5.8+
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 605f671203ef..63a8d47c1cd3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  
>  	for (;;) {
>  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> -		unsigned long reclaimed;
>  
>  		if (nr_pages <= high)
>  			break;
> @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  			continue;
>  		}
>  
> -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -							 GFP_KERNEL, true);
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +					     GFP_KERNEL, true);
>  
> -		if (!reclaimed && !nr_retries--)
> +		if (!nr_retries--)
>  			break;
>  	}
>  
> -- 
> 2.30.0
> 

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

Johannes Weiner Jan. 15, 2021, 3:34 p.m. UTC | #12

On Tue, Jan 12, 2021 at 01:45:12PM -0800, Roman Gushchin wrote:
> On Tue, Jan 12, 2021 at 04:11:27PM -0500, Johannes Weiner wrote:
> > On Tue, Jan 12, 2021 at 12:12:37PM -0800, Roman Gushchin wrote:
> > > On Tue, Jan 12, 2021 at 02:45:43PM -0500, Johannes Weiner wrote:
> > > > On Tue, Jan 12, 2021 at 09:03:22AM -0800, Roman Gushchin wrote:
> > > > > On Tue, Jan 12, 2021 at 11:30:11AM -0500, Johannes Weiner wrote:
> > > > > > When a value is written to a cgroup's memory.high control file, the
> > > > > > write() context first tries to reclaim the cgroup to size before
> > > > > > putting the limit in place for the workload. Concurrent charges from
> > > > > > the workload can keep such a write() looping in reclaim indefinitely.
> > > > > > 
> > > > > > In the past, a write to memory.high would first put the limit in place
> > > > > > for the workload, then do targeted reclaim until the new limit has
> > > > > > been met - similar to how we do it for memory.max. This wasn't prone
> > > > > > to the described starvation issue. However, this sequence could cause
> > > > > > excessive latencies in the workload, when allocating threads could be
> > > > > > put into long penalty sleeps on the sudden memory.high overage created
> > > > > > by the write(), before that had a chance to work it off.
> > > > > > 
> > > > > > Now that memory_high_write() performs reclaim before enforcing the new
> > > > > > limit, reflect that the cgroup may well fail to converge due to
> > > > > > concurrent workload activity. Bail out of the loop after a few tries.
> > > > > > 
> > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > > > > Cc: <stable@vger.kernel.org> # 5.8+
> > > > > > Reported-by: Tejun Heo <tj@kernel.org>
> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > ---
> > > > > >  mm/memcontrol.c | 7 +++----
> > > > > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 605f671203ef..63a8d47c1cd3 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -6275,7 +6275,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > > > > >  
> > > > > >  	for (;;) {
> > > > > >  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> > > > > > -		unsigned long reclaimed;
> > > > > >  
> > > > > >  		if (nr_pages <= high)
> > > > > >  			break;
> > > > > > @@ -6289,10 +6288,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > > > > >  			continue;
> > > > > >  		}
> > > > > >  
> > > > > > -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > > > > > -							 GFP_KERNEL, true);
> > > > > > +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > > > > > +					     GFP_KERNEL, true);
> > > > > >  
> > > > > > -		if (!reclaimed && !nr_retries--)
> > > > > > +		if (!nr_retries--)
> > > > > 
> > > > > Shouldn't it be (!reclaimed || !nr_retries) instead?
> > > > > 
> > > > > If reclaimed == 0, it probably doesn't make much sense to retry.
> > > > 
> > > > We usually allow nr_retries worth of no-progress reclaim cycles to
> > > > make up for intermittent reclaim failures.
> > > > 
> > > > The difference to OOMs/memory.max is that we don't want to loop
> > > > indefinitely on forward progress, but we should allow the usual number
> > > > of no-progress loops.
> > > 
> > > Re memory.max: trying really hard makes sense because we are OOMing otherwise.
> > > With memory.high such an idea is questionable: if were not able to reclaim
> > > a single page from the first attempt, it's unlikely that we can reclaim many
> > > from repeating 16 times.
> > > 
> > > My concern here is that we can see CPU regressions in some cases when there is
> > > no reclaimable memory. Do you think we can win something by trying harder?
> > > If so, it's worth mentioning in the commit log. Because it's really a separate
> > > change to what's described in the log, to some extent it's a move into an opposite
> > > direction.
> > 
> > Hm, I'm confused what change you are referring to.
> > 
> > Current upstream allows:
> > 
> >     a. unlimited progress loops
> >     b. 16 no-progress loops
> > 
> > My patch is fixing the issue resulting from the unlimited progress
> > loops in a). This is described in the changelog.
> > 
> > You seem to be advocating for an unrelated change to the no-progress
> > loops condition in b).
> 
> Fair enough.
> 
> But still the question remains: what are we gaining by trying again after not
> being able to reclaim a single page? If you want, it can be done separately,
> but it looks like a good idea to me to bail out if we can't reclaim a single
> page.

You lost me there.

If memory.max retries before declaring oom, why shouldn't memory.high
retry before returning to userspace? If there is intermittent reclaim
failure, then returning from the memory.high write before the limit is
enforced in such rare situations saves very little, but makes the user
visible behavior unpredictable and inconsistent. It's extra code and a
special case that needs to be documented.

Johannes Weiner Jan. 15, 2021, 4:20 p.m. UTC | #13

On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > When a value is written to a cgroup's memory.high control file, the
> > write() context first tries to reclaim the cgroup to size before
> > putting the limit in place for the workload. Concurrent charges from
> > the workload can keep such a write() looping in reclaim indefinitely.
> > 
> > In the past, a write to memory.high would first put the limit in place
> > for the workload, then do targeted reclaim until the new limit has
> > been met - similar to how we do it for memory.max. This wasn't prone
> > to the described starvation issue. However, this sequence could cause
> > excessive latencies in the workload, when allocating threads could be
> > put into long penalty sleeps on the sudden memory.high overage created
> > by the write(), before that had a chance to work it off.
> > 
> > Now that memory_high_write() performs reclaim before enforcing the new
> > limit, reflect that the cgroup may well fail to converge due to
> > concurrent workload activity. Bail out of the loop after a few tries.
> 
> I can see that you have provided some more details in follow up replies
> but I do not see any explicit argument why an excessive time for writer
> is an actual problem. Could you be more specific?

Our writer isn't necessarily time sensitive, but there is a difference
between a) the write taking a few seconds to reclaim down the
requested delta and b) the writer essentially turning into kswapd for
the workload and busy-spinning inside the kernel indefinitely.

We've seen the writer stuck in this function for minutes, long after
the requested delta has been reclaimed, consuming alarming amounts of
CPU cycles - CPU time that should really be accounted to the workload,
not the system software performing the write.

Obviously, we could work around it using timeouts and signals. In
fact, we may have to until the new kernel is deployed everywhere. But
this is the definition of an interface change breaking userspace, so
I'm a bit surprised by your laid-back response.

> > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > Cc: <stable@vger.kernel.org> # 5.8+
> 
> Why is this worth backporting to stable? The behavior is different but I
> do not think any of them is harmful.

The referenced patch changed user-visible behavior in a way that is
causing real production problems for us. From stable-kernel-rules:

 - It must fix a real bug that bothers people (not a, "This could be a
   problem..." type thing).

> > Reported-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> I am not against the patch. The existing interface doesn't provide any
> meaningful feedback to the userspace anyway. User would have to re check
> to see the result of the operation. So how hard we try is really an
> implementation detail.

Yeah, I wish it was a bit more consistent from an interface POV.

Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
trying to reclaim went into the tree at the same exact time as Chris's
series "mm, memcg: reclaim harder before high throttling" (commit
b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.

Chris's patch changes memory.high reclaim on the allocation side from

	reclaim once, sleep if there is still overage

to

	reclaim the overage as long as you make forward progress;
	sleep after 16 no-progress loops if there is still overage

Roman's patch describes a problem where allocating threads go to sleep
when memory.high is lowered by a wider step. This is exceedingly
unlikely after Chris's change.

Because after Chris's change, memory.high is reclaimed on the
allocation side as aggressively as memory.max. The only difference is
that upon failure, one sleeps and the other OOMs.

If Roman's issue were present after Chris's change, then we'd also see
premature OOM kills when memory.max is lowered by a large step. And I
have never seen that happening.

So I suggest instead of my fix here, we revert Roman's patch instead,
as it should no longer be needed. Thoughts?

Roman Gushchin Jan. 15, 2021, 5:03 p.m. UTC | #14

On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote:
> On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> > On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > > When a value is written to a cgroup's memory.high control file, the
> > > write() context first tries to reclaim the cgroup to size before
> > > putting the limit in place for the workload. Concurrent charges from
> > > the workload can keep such a write() looping in reclaim indefinitely.
> > > 
> > > In the past, a write to memory.high would first put the limit in place
> > > for the workload, then do targeted reclaim until the new limit has
> > > been met - similar to how we do it for memory.max. This wasn't prone
> > > to the described starvation issue. However, this sequence could cause
> > > excessive latencies in the workload, when allocating threads could be
> > > put into long penalty sleeps on the sudden memory.high overage created
> > > by the write(), before that had a chance to work it off.
> > > 
> > > Now that memory_high_write() performs reclaim before enforcing the new
> > > limit, reflect that the cgroup may well fail to converge due to
> > > concurrent workload activity. Bail out of the loop after a few tries.
> > 
> > I can see that you have provided some more details in follow up replies
> > but I do not see any explicit argument why an excessive time for writer
> > is an actual problem. Could you be more specific?
> 
> Our writer isn't necessarily time sensitive, but there is a difference
> between a) the write taking a few seconds to reclaim down the
> requested delta and b) the writer essentially turning into kswapd for
> the workload and busy-spinning inside the kernel indefinitely.
> 
> We've seen the writer stuck in this function for minutes, long after
> the requested delta has been reclaimed, consuming alarming amounts of
> CPU cycles - CPU time that should really be accounted to the workload,
> not the system software performing the write.
> 
> Obviously, we could work around it using timeouts and signals. In
> fact, we may have to until the new kernel is deployed everywhere. But
> this is the definition of an interface change breaking userspace, so
> I'm a bit surprised by your laid-back response.
> 
> > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > Cc: <stable@vger.kernel.org> # 5.8+
> > 
> > Why is this worth backporting to stable? The behavior is different but I
> > do not think any of them is harmful.
> 
> The referenced patch changed user-visible behavior in a way that is
> causing real production problems for us. From stable-kernel-rules:
> 
>  - It must fix a real bug that bothers people (not a, "This could be a
>    problem..." type thing).
> 
> > > Reported-by: Tejun Heo <tj@kernel.org>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > I am not against the patch. The existing interface doesn't provide any
> > meaningful feedback to the userspace anyway. User would have to re check
> > to see the result of the operation. So how hard we try is really an
> > implementation detail.
> 
> Yeah, I wish it was a bit more consistent from an interface POV.
> 
> Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
> trying to reclaim went into the tree at the same exact time as Chris's
> series "mm, memcg: reclaim harder before high throttling" (commit
> b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.
> 
> Chris's patch changes memory.high reclaim on the allocation side from
> 
> 	reclaim once, sleep if there is still overage
> 
> to
> 
> 	reclaim the overage as long as you make forward progress;
> 	sleep after 16 no-progress loops if there is still overage
> 
> Roman's patch describes a problem where allocating threads go to sleep
> when memory.high is lowered by a wider step. This is exceedingly
> unlikely after Chris's change.
> 
> Because after Chris's change, memory.high is reclaimed on the
> allocation side as aggressively as memory.max. The only difference is
> that upon failure, one sleeps and the other OOMs.
> 
> If Roman's issue were present after Chris's change, then we'd also see
> premature OOM kills when memory.max is lowered by a large step. And I
> have never seen that happening.
> 
> So I suggest instead of my fix here, we revert Roman's patch instead,
> as it should no longer be needed. Thoughts?

Chris's patch was merged way earlier than mine into the kernel tree which
was used when I observed the problem in the production. So likely it was there.

I think it makes sense to try to reclaim memory first before putting
all processes in the cgroup into reclaim mode. Even without artificial delays
it creates some latency and btw doesn't make the reclaim process more efficient.

What about something like this? Make one optimistic iteration without setting
the high value and then fallback to the old behavior.


@@ -6317,15 +6317,15 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 			continue;
 		}
 
-		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-							 GFP_KERNEL, true);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages - high, GFP_KERNEL,
+					     true);
 
-		if (!reclaimed && !nr_retries--)
+		if (nr_retries == MAX_RECLAIM_RETRIES)
+			page_counter_set_high(&memcg->memory, high);
+		if (!nr_retries--)
 			break;
 	}
 
-	page_counter_set_high(&memcg->memory, high);
-
 	memcg_wb_domain_size_changed(memcg);
 
 	return nbytes;

Johannes Weiner Jan. 15, 2021, 8:55 p.m. UTC | #15

On Fri, Jan 15, 2021 at 09:03:41AM -0800, Roman Gushchin wrote:
> On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote:
> > On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> > > On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > > > When a value is written to a cgroup's memory.high control file, the
> > > > write() context first tries to reclaim the cgroup to size before
> > > > putting the limit in place for the workload. Concurrent charges from
> > > > the workload can keep such a write() looping in reclaim indefinitely.
> > > > 
> > > > In the past, a write to memory.high would first put the limit in place
> > > > for the workload, then do targeted reclaim until the new limit has
> > > > been met - similar to how we do it for memory.max. This wasn't prone
> > > > to the described starvation issue. However, this sequence could cause
> > > > excessive latencies in the workload, when allocating threads could be
> > > > put into long penalty sleeps on the sudden memory.high overage created
> > > > by the write(), before that had a chance to work it off.
> > > > 
> > > > Now that memory_high_write() performs reclaim before enforcing the new
> > > > limit, reflect that the cgroup may well fail to converge due to
> > > > concurrent workload activity. Bail out of the loop after a few tries.
> > > 
> > > I can see that you have provided some more details in follow up replies
> > > but I do not see any explicit argument why an excessive time for writer
> > > is an actual problem. Could you be more specific?
> > 
> > Our writer isn't necessarily time sensitive, but there is a difference
> > between a) the write taking a few seconds to reclaim down the
> > requested delta and b) the writer essentially turning into kswapd for
> > the workload and busy-spinning inside the kernel indefinitely.
> > 
> > We've seen the writer stuck in this function for minutes, long after
> > the requested delta has been reclaimed, consuming alarming amounts of
> > CPU cycles - CPU time that should really be accounted to the workload,
> > not the system software performing the write.
> > 
> > Obviously, we could work around it using timeouts and signals. In
> > fact, we may have to until the new kernel is deployed everywhere. But
> > this is the definition of an interface change breaking userspace, so
> > I'm a bit surprised by your laid-back response.
> > 
> > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > > Cc: <stable@vger.kernel.org> # 5.8+
> > > 
> > > Why is this worth backporting to stable? The behavior is different but I
> > > do not think any of them is harmful.
> > 
> > The referenced patch changed user-visible behavior in a way that is
> > causing real production problems for us. From stable-kernel-rules:
> > 
> >  - It must fix a real bug that bothers people (not a, "This could be a
> >    problem..." type thing).
> > 
> > > > Reported-by: Tejun Heo <tj@kernel.org>
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > 
> > > I am not against the patch. The existing interface doesn't provide any
> > > meaningful feedback to the userspace anyway. User would have to re check
> > > to see the result of the operation. So how hard we try is really an
> > > implementation detail.
> > 
> > Yeah, I wish it was a bit more consistent from an interface POV.
> > 
> > Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
> > trying to reclaim went into the tree at the same exact time as Chris's
> > series "mm, memcg: reclaim harder before high throttling" (commit
> > b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.
> > 
> > Chris's patch changes memory.high reclaim on the allocation side from
> > 
> > 	reclaim once, sleep if there is still overage
> > 
> > to
> > 
> > 	reclaim the overage as long as you make forward progress;
> > 	sleep after 16 no-progress loops if there is still overage
> > 
> > Roman's patch describes a problem where allocating threads go to sleep
> > when memory.high is lowered by a wider step. This is exceedingly
> > unlikely after Chris's change.
> > 
> > Because after Chris's change, memory.high is reclaimed on the
> > allocation side as aggressively as memory.max. The only difference is
> > that upon failure, one sleeps and the other OOMs.
> > 
> > If Roman's issue were present after Chris's change, then we'd also see
> > premature OOM kills when memory.max is lowered by a large step. And I
> > have never seen that happening.
> > 
> > So I suggest instead of my fix here, we revert Roman's patch instead,
> > as it should no longer be needed. Thoughts?
> 
> Chris's patch was merged way earlier than mine into the kernel tree which
> was used when I observed the problem in the production. So likely it was there.

Chris's patch was in the tree earlier, but the first release
containing it was tagged a day before you put in your change, so I
doubt it was on the production system where you observed the issue.

As per above, it'd be very surprising to see premature sleeps when
lowering memory.high, when allocation-side reclaim keeps going until
the cgroup meets the definition of OOM.

> I think it makes sense to try to reclaim memory first before putting
> all processes in the cgroup into reclaim mode. Even without artificial delays
> it creates some latency and btw doesn't make the reclaim process more efficient.

It's not obvious that this is a practical problem. It certainly isn't
for memory.max, and there should be a good reason why the two should
be different aside from the documented OOM vs sleep behavior.

Absent any concrete evidence to the contrary, I'd make the behavior
the same again for those two.

Roman Gushchin Jan. 15, 2021, 9:27 p.m. UTC | #16

On Fri, Jan 15, 2021 at 03:55:36PM -0500, Johannes Weiner wrote:
> On Fri, Jan 15, 2021 at 09:03:41AM -0800, Roman Gushchin wrote:
> > On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote:
> > > On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> > > > On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > > > > When a value is written to a cgroup's memory.high control file, the
> > > > > write() context first tries to reclaim the cgroup to size before
> > > > > putting the limit in place for the workload. Concurrent charges from
> > > > > the workload can keep such a write() looping in reclaim indefinitely.
> > > > > 
> > > > > In the past, a write to memory.high would first put the limit in place
> > > > > for the workload, then do targeted reclaim until the new limit has
> > > > > been met - similar to how we do it for memory.max. This wasn't prone
> > > > > to the described starvation issue. However, this sequence could cause
> > > > > excessive latencies in the workload, when allocating threads could be
> > > > > put into long penalty sleeps on the sudden memory.high overage created
> > > > > by the write(), before that had a chance to work it off.
> > > > > 
> > > > > Now that memory_high_write() performs reclaim before enforcing the new
> > > > > limit, reflect that the cgroup may well fail to converge due to
> > > > > concurrent workload activity. Bail out of the loop after a few tries.
> > > > 
> > > > I can see that you have provided some more details in follow up replies
> > > > but I do not see any explicit argument why an excessive time for writer
> > > > is an actual problem. Could you be more specific?
> > > 
> > > Our writer isn't necessarily time sensitive, but there is a difference
> > > between a) the write taking a few seconds to reclaim down the
> > > requested delta and b) the writer essentially turning into kswapd for
> > > the workload and busy-spinning inside the kernel indefinitely.
> > > 
> > > We've seen the writer stuck in this function for minutes, long after
> > > the requested delta has been reclaimed, consuming alarming amounts of
> > > CPU cycles - CPU time that should really be accounted to the workload,
> > > not the system software performing the write.
> > > 
> > > Obviously, we could work around it using timeouts and signals. In
> > > fact, we may have to until the new kernel is deployed everywhere. But
> > > this is the definition of an interface change breaking userspace, so
> > > I'm a bit surprised by your laid-back response.
> > > 
> > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > > > Cc: <stable@vger.kernel.org> # 5.8+
> > > > 
> > > > Why is this worth backporting to stable? The behavior is different but I
> > > > do not think any of them is harmful.
> > > 
> > > The referenced patch changed user-visible behavior in a way that is
> > > causing real production problems for us. From stable-kernel-rules:
> > > 
> > >  - It must fix a real bug that bothers people (not a, "This could be a
> > >    problem..." type thing).
> > > 
> > > > > Reported-by: Tejun Heo <tj@kernel.org>
> > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > 
> > > > I am not against the patch. The existing interface doesn't provide any
> > > > meaningful feedback to the userspace anyway. User would have to re check
> > > > to see the result of the operation. So how hard we try is really an
> > > > implementation detail.
> > > 
> > > Yeah, I wish it was a bit more consistent from an interface POV.
> > > 
> > > Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
> > > trying to reclaim went into the tree at the same exact time as Chris's
> > > series "mm, memcg: reclaim harder before high throttling" (commit
> > > b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.
> > > 
> > > Chris's patch changes memory.high reclaim on the allocation side from
> > > 
> > > 	reclaim once, sleep if there is still overage
> > > 
> > > to
> > > 
> > > 	reclaim the overage as long as you make forward progress;
> > > 	sleep after 16 no-progress loops if there is still overage
> > > 
> > > Roman's patch describes a problem where allocating threads go to sleep
> > > when memory.high is lowered by a wider step. This is exceedingly
> > > unlikely after Chris's change.
> > > 
> > > Because after Chris's change, memory.high is reclaimed on the
> > > allocation side as aggressively as memory.max. The only difference is
> > > that upon failure, one sleeps and the other OOMs.
> > > 
> > > If Roman's issue were present after Chris's change, then we'd also see
> > > premature OOM kills when memory.max is lowered by a large step. And I
> > > have never seen that happening.
> > > 
> > > So I suggest instead of my fix here, we revert Roman's patch instead,
> > > as it should no longer be needed. Thoughts?
> > 
> > Chris's patch was merged way earlier than mine into the kernel tree which
> > was used when I observed the problem in the production. So likely it was there.
> 
> Chris's patch was in the tree earlier, but the first release
> containing it was tagged a day before you put in your change, so I
> doubt it was on the production system where you observed the issue.
> 
> As per above, it'd be very surprising to see premature sleeps when
> lowering memory.high, when allocation-side reclaim keeps going until
> the cgroup meets the definition of OOM.
> 
> > I think it makes sense to try to reclaim memory first before putting
> > all processes in the cgroup into reclaim mode. Even without artificial delays
> > it creates some latency and btw doesn't make the reclaim process more efficient.
> 
> It's not obvious that this is a practical problem. It certainly isn't
> for memory.max,

Because memory.max is usually not adjusted dynamically?

> and there should be a good reason why the two should
> be different aside from the documented OOM vs sleep behavior.

Maybe we have different examples in our heads, but mine is a cgroup
with a significant amount of relatively cold pagecache and a multi-threaded
workload. Now somebody wants to tighten memory.high. Why would we put all
threads into a direct reclaim? I don't see a good reason.

Memory.max is different because nobody (hopefully) is adjusting it dynamically
to be just a little bit bigger than the workingset. Most likely it's set
once that after the creation of a cgroup. So such a problem just doesn't exist.

Michal Hocko Jan. 18, 2021, 1:12 p.m. UTC | #17

On Fri 15-01-21 11:20:50, Johannes Weiner wrote:
> On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> > On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > > When a value is written to a cgroup's memory.high control file, the
> > > write() context first tries to reclaim the cgroup to size before
> > > putting the limit in place for the workload. Concurrent charges from
> > > the workload can keep such a write() looping in reclaim indefinitely.
> > > 
> > > In the past, a write to memory.high would first put the limit in place
> > > for the workload, then do targeted reclaim until the new limit has
> > > been met - similar to how we do it for memory.max. This wasn't prone
> > > to the described starvation issue. However, this sequence could cause
> > > excessive latencies in the workload, when allocating threads could be
> > > put into long penalty sleeps on the sudden memory.high overage created
> > > by the write(), before that had a chance to work it off.
> > > 
> > > Now that memory_high_write() performs reclaim before enforcing the new
> > > limit, reflect that the cgroup may well fail to converge due to
> > > concurrent workload activity. Bail out of the loop after a few tries.
> > 
> > I can see that you have provided some more details in follow up replies
> > but I do not see any explicit argument why an excessive time for writer
> > is an actual problem. Could you be more specific?
> 
> Our writer isn't necessarily time sensitive, but there is a difference
> between a) the write taking a few seconds to reclaim down the
> requested delta and b) the writer essentially turning into kswapd for
> the workload and busy-spinning inside the kernel indefinitely.
> 
> We've seen the writer stuck in this function for minutes, long after
> the requested delta has been reclaimed, consuming alarming amounts of
> CPU cycles - CPU time that should really be accounted to the workload,
> not the system software performing the write.

OK, this is an important detail. So the context which is doing the work
doesn't belong to the target memcg? If that is the case then I do
understand why you consider it a problem. In general I would recommend
running operations like this one in scope of the affected cgroup. But
maybe that is not really an option in your setup.

Anyway this is an important information to have in the changelog.

> Obviously, we could work around it using timeouts and signals. In
> fact, we may have to until the new kernel is deployed everywhere. But
> this is the definition of an interface change breaking userspace, so
> I'm a bit surprised by your laid-back response.

Well, I was basing my feedback on the available information in the
changelog. It is quite clear that somebody has to pay for the work.
Moving as much of the work to the writer makes sense as long as the
context is runing in the same cgroup so the work gets accounted
properly. If this assumption doesn't match the reality then we have to
re-evaluate our priorities here.

> > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > Cc: <stable@vger.kernel.org> # 5.8+
> > 
> > Why is this worth backporting to stable? The behavior is different but I
> > do not think any of them is harmful.
> 
> The referenced patch changed user-visible behavior in a way that is
> causing real production problems for us. From stable-kernel-rules:
> 
>  - It must fix a real bug that bothers people (not a, "This could be a
>    problem..." type thing).
> 
> > > Reported-by: Tejun Heo <tj@kernel.org>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > I am not against the patch. The existing interface doesn't provide any
> > meaningful feedback to the userspace anyway. User would have to re check
> > to see the result of the operation. So how hard we try is really an
> > implementation detail.
> 
> Yeah, I wish it was a bit more consistent from an interface POV.
> 
> Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
> trying to reclaim went into the tree at the same exact time as Chris's
> series "mm, memcg: reclaim harder before high throttling" (commit
> b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.
> 
> Chris's patch changes memory.high reclaim on the allocation side from
> 
> 	reclaim once, sleep if there is still overage
> 
> to
> 
> 	reclaim the overage as long as you make forward progress;
> 	sleep after 16 no-progress loops if there is still overage
> 
> Roman's patch describes a problem where allocating threads go to sleep
> when memory.high is lowered by a wider step. This is exceedingly
> unlikely after Chris's change.
> 
> Because after Chris's change, memory.high is reclaimed on the
> allocation side as aggressively as memory.max. The only difference is
> that upon failure, one sleeps and the other OOMs.
> 
> If Roman's issue were present after Chris's change, then we'd also see
> premature OOM kills when memory.max is lowered by a large step. And I
> have never seen that happening.

This should be something quite easy to double check right?

> So I suggest instead of my fix here, we revert Roman's patch instead,
> as it should no longer be needed. Thoughts?

Reverting 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when
lowering memory.high") would certainly help to throttle producers but it
still doesn't solve the underlying problem that a lot of work could be
done in a context which lives outside of the memcg, right? The effect
would be much smaller and it shouldn't be effectivelly unbounded but
still something we should think about.

That being said going with the revert sounds like a slightly better
approach to me.

Johannes Weiner Jan. 19, 2021, 4:47 p.m. UTC | #18

On Fri, Jan 15, 2021 at 01:27:23PM -0800, Roman Gushchin wrote:
> On Fri, Jan 15, 2021 at 03:55:36PM -0500, Johannes Weiner wrote:
> > On Fri, Jan 15, 2021 at 09:03:41AM -0800, Roman Gushchin wrote:
> > > On Fri, Jan 15, 2021 at 11:20:50AM -0500, Johannes Weiner wrote:
> > > > On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> > > > > On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > > > > > When a value is written to a cgroup's memory.high control file, the
> > > > > > write() context first tries to reclaim the cgroup to size before
> > > > > > putting the limit in place for the workload. Concurrent charges from
> > > > > > the workload can keep such a write() looping in reclaim indefinitely.
> > > > > > 
> > > > > > In the past, a write to memory.high would first put the limit in place
> > > > > > for the workload, then do targeted reclaim until the new limit has
> > > > > > been met - similar to how we do it for memory.max. This wasn't prone
> > > > > > to the described starvation issue. However, this sequence could cause
> > > > > > excessive latencies in the workload, when allocating threads could be
> > > > > > put into long penalty sleeps on the sudden memory.high overage created
> > > > > > by the write(), before that had a chance to work it off.
> > > > > > 
> > > > > > Now that memory_high_write() performs reclaim before enforcing the new
> > > > > > limit, reflect that the cgroup may well fail to converge due to
> > > > > > concurrent workload activity. Bail out of the loop after a few tries.
> > > > > 
> > > > > I can see that you have provided some more details in follow up replies
> > > > > but I do not see any explicit argument why an excessive time for writer
> > > > > is an actual problem. Could you be more specific?
> > > > 
> > > > Our writer isn't necessarily time sensitive, but there is a difference
> > > > between a) the write taking a few seconds to reclaim down the
> > > > requested delta and b) the writer essentially turning into kswapd for
> > > > the workload and busy-spinning inside the kernel indefinitely.
> > > > 
> > > > We've seen the writer stuck in this function for minutes, long after
> > > > the requested delta has been reclaimed, consuming alarming amounts of
> > > > CPU cycles - CPU time that should really be accounted to the workload,
> > > > not the system software performing the write.
> > > > 
> > > > Obviously, we could work around it using timeouts and signals. In
> > > > fact, we may have to until the new kernel is deployed everywhere. But
> > > > this is the definition of an interface change breaking userspace, so
> > > > I'm a bit surprised by your laid-back response.
> > > > 
> > > > > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > > > > > Cc: <stable@vger.kernel.org> # 5.8+
> > > > > 
> > > > > Why is this worth backporting to stable? The behavior is different but I
> > > > > do not think any of them is harmful.
> > > > 
> > > > The referenced patch changed user-visible behavior in a way that is
> > > > causing real production problems for us. From stable-kernel-rules:
> > > > 
> > > >  - It must fix a real bug that bothers people (not a, "This could be a
> > > >    problem..." type thing).
> > > > 
> > > > > > Reported-by: Tejun Heo <tj@kernel.org>
> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > 
> > > > > I am not against the patch. The existing interface doesn't provide any
> > > > > meaningful feedback to the userspace anyway. User would have to re check
> > > > > to see the result of the operation. So how hard we try is really an
> > > > > implementation detail.
> > > > 
> > > > Yeah, I wish it was a bit more consistent from an interface POV.
> > > > 
> > > > Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
> > > > trying to reclaim went into the tree at the same exact time as Chris's
> > > > series "mm, memcg: reclaim harder before high throttling" (commit
> > > > b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.
> > > > 
> > > > Chris's patch changes memory.high reclaim on the allocation side from
> > > > 
> > > > 	reclaim once, sleep if there is still overage
> > > > 
> > > > to
> > > > 
> > > > 	reclaim the overage as long as you make forward progress;
> > > > 	sleep after 16 no-progress loops if there is still overage
> > > > 
> > > > Roman's patch describes a problem where allocating threads go to sleep
> > > > when memory.high is lowered by a wider step. This is exceedingly
> > > > unlikely after Chris's change.
> > > > 
> > > > Because after Chris's change, memory.high is reclaimed on the
> > > > allocation side as aggressively as memory.max. The only difference is
> > > > that upon failure, one sleeps and the other OOMs.
> > > > 
> > > > If Roman's issue were present after Chris's change, then we'd also see
> > > > premature OOM kills when memory.max is lowered by a large step. And I
> > > > have never seen that happening.
> > > > 
> > > > So I suggest instead of my fix here, we revert Roman's patch instead,
> > > > as it should no longer be needed. Thoughts?
> > > 
> > > Chris's patch was merged way earlier than mine into the kernel tree which
> > > was used when I observed the problem in the production. So likely it was there.
> > 
> > Chris's patch was in the tree earlier, but the first release
> > containing it was tagged a day before you put in your change, so I
> > doubt it was on the production system where you observed the issue.
> > 
> > As per above, it'd be very surprising to see premature sleeps when
> > lowering memory.high, when allocation-side reclaim keeps going until
> > the cgroup meets the definition of OOM.
> > 
> > > I think it makes sense to try to reclaim memory first before putting
> > > all processes in the cgroup into reclaim mode. Even without artificial delays
> > > it creates some latency and btw doesn't make the reclaim process more efficient.
> > 
> > It's not obvious that this is a practical problem. It certainly isn't
> > for memory.max,
> 
> Because memory.max is usually not adjusted dynamically?
> 
> > and there should be a good reason why the two should
> > be different aside from the documented OOM vs sleep behavior.
> 
> Maybe we have different examples in our heads, but mine is a cgroup
> with a significant amount of relatively cold pagecache and a multi-threaded
> workload. Now somebody wants to tighten memory.high. Why would we put all
> threads into a direct reclaim? I don't see a good reason.

But how is that different than a multi-threaded workload simply
running up against a static limit?

As soon as one event puts the cgroup over the limit, all concurrent
allocations will form a thundering herd of reclaimers. I don't see
much of a difference between a burst of allocations exceeding the
limit from below and the limit being lowered from above.

Sure, the requested limit delta is usually bigger than a single
allocation. However, memory.high enforcement is batched, and quite a
few threads could contribute to the overage and be marked to enter
direct reclaim before the first page is actually scanned. Also, while
we do adjust memory.high dynamically, we do so no more than once every
5-10 seconds, whereas a multi-threaded workload may breach its limit
and provoke thundering herds many times a second.

So I just cannot imagine the sequence of events here would make a
practical difference for the aspect of the thundering herd (whereas it
was VERY obvious with the premature sleep issue on the allocation
side). But the thundering herd may start as soon as you lower the
limit, whether you reclaim before or after. It may already be
occurring against the higher limit before you even enter write().

Allocation latency is a real concern, I agree with you there. But
that's IMO more to do with how cgroup limits are enforced in
general. That's why there have been multiple discussions and patch
submissions around async reclaim.

But the fact is, the change of sequence is causing real problems in
production, and with the premature-throttling-on-large-delta-bug out
of the way, the justification for keeping it are somewhat nebulous.

So unless we can clearly show from production that the sequence still
matters after Chris's patch, I'd say we go with the revert and take
the small risk of getting data to the contrary.

As opposed to accruing additional complexity and special-cased code on
a hunch, that we then may carry unnecessarily for a very long time -
until somebody later on, under the pressure of keeping all this stuff
maintainable, takes the risk of chopping it again with much less of
the original context and stake holders available.

mm: memcontrol: prevent starvation when writing memory.high

Commit Message

Comments

Patch