
mm: memcontrol: avoid workload stalls when lowering memory.high

Message ID 20200709194718.189231-1-guro@fb.com (mailing list archive)
State New, archived
Series mm: memcontrol: avoid workload stalls when lowering memory.high

Commit Message

Roman Gushchin July 9, 2020, 7:47 p.m. UTC
The memory.high limit is implemented in such a way that the kernel
penalizes all threads which are allocating memory over the limit.
Forcing all threads into synchronous reclaim and adding some
artificial delays slows down memory consumption and potentially
gives userspace oom handlers/resource control agents some time to
react.

It works nicely if the memory usage is hitting the limit from below;
however, it works sub-optimally if a user adjusts memory.high to a
value way below the current memory usage. It basically forces all
workload threads (doing any memory allocations) into synchronous
reclaim and sleep. This makes the workload completely unresponsive
for a long period of time and can also lead to system-wide contention
on lru locks. It can happen even if the workload is not actually
tight on memory and has, for example, a ton of cold pagecache.

In the current implementation, writing to memory.high causes an
atomic update of the page counter's high value, followed by an
attempt to reclaim enough memory to fit under the new limit. To fix
the problem described above, all we need is to change the order of
execution: try to push the memory usage under the limit first, and
only then set the new high limit.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Domas Mituzas <domas@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Michal Hocko July 10, 2020, 12:29 p.m. UTC | #1
On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> [...]

Shakeel, would this help with your pro-active reclaim usecase? It
would require resetting the high limit right after the reclaim
returns, which is quite ugly, but it would at least not require a
completely new interface. You would simply do
	high = current - to_reclaim
	echo $high > memory.high
	echo infinity > memory.high # To prevent direct reclaim
				    # allocation stalls

The primary reason to set the high limit in advance was to catch
potential runaways more effectively, because they would just get
throttled while memory_high_write() is reclaiming. With this change,
the reclaim here might just be playing never-ending catch-up. On the
plus side, breaking out of the reclaim loop would still enforce the
limit, so if the operation takes too long, the reclaim burden will
eventually move over to the consumers. So I do not see any real
danger.
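The set-then-reset sequence above can be wrapped into a small helper;
the cgroup path and the amount to reclaim are illustrative
assumptions, and note that cgroup v2 spells the unlimited value "max":

```shell
#!/bin/sh
# Sketch of the set-then-reset sequence suggested above. The cgroup
# directory and the amount to reclaim are illustrative assumptions.

reclaim_once() {
    cgroup_dir=$1      # e.g. /sys/fs/cgroup/workload (assumption)
    to_reclaim=$2      # bytes to try to reclaim

    current=$(cat "$cgroup_dir/memory.current")
    high=$((current - to_reclaim))
    [ "$high" -lt 0 ] && high=0

    # Lowering memory.high pushes usage down toward the new value...
    echo "$high" > "$cgroup_dir/memory.high"
    # ...and resetting it right away avoids leaving the workload
    # throttled in direct reclaim on subsequent allocations. cgroup v2
    # spells the unlimited value "max".
    echo max > "$cgroup_dir/memory.high"

    echo "$high"       # report the transient target
}
```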

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reported-by: Domas Mituzas <domas@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Chris Down <chris@chrisdown.name>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b8424aa56e14..4b71feee7c42 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6203,8 +6203,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  	if (err)
>  		return err;
>  
> -	page_counter_set_high(&memcg->memory, high);
> -
>  	for (;;) {
>  		unsigned long nr_pages = page_counter_read(&memcg->memory);
>  		unsigned long reclaimed;
> @@ -6228,6 +6226,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  			break;
>  	}
>  
> +	page_counter_set_high(&memcg->memory, high);
> +
>  	return nbytes;
>  }
>  
> -- 
> 2.26.2
Shakeel Butt July 10, 2020, 2:12 p.m. UTC | #2
On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > [...]
>
> Shakeel, would this help with your pro-active reclaim usecase? It
> would require resetting the high limit right after the reclaim
> returns, which is quite ugly, but it would at least not require a
> completely new interface. You would simply do
>         high = current - to_reclaim
>         echo $high > memory.high
>         echo infinity > memory.high # To prevent direct reclaim
>                                     # allocation stalls
>

This will reduce the chance of stalls but the interface is still
non-delegatable i.e. applications can not change their own memory.high
for the use-cases like application controlled proactive reclaim and
uswapd.

One more ugly fix would be to add one more layer of cgroup and have
the application use the memory.high of that layer to fulfill such
use-cases.

I think providing a new interface would allow us to have a much
cleaner solution than to settle on a bunch of ugly hacks.

> [...]

This patch seems reasonable on its own.

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Roman Gushchin July 10, 2020, 6:42 p.m. UTC | #3
On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > [...]
> >
> > [...]
> 
> This will reduce the chance of stalls but the interface is still
> non-delegatable i.e. applications can not change their own memory.high
> for the use-cases like application controlled proactive reclaim and
> uswapd.

Can you, please, elaborate a bit more on this? I didn't understand
why.

> 
> [...]
> 
> This patch seems reasonable on its own.
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thank you!
Shakeel Butt July 10, 2020, 7:19 p.m. UTC | #4
On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > [...]
> > >
> > > [...]
> >
> > This will reduce the chance of stalls but the interface is still
> > non-delegatable i.e. applications can not change their own memory.high
> > for the use-cases like application controlled proactive reclaim and
> > uswapd.
>
> Can you, please, elaborate a bit more on this? I didn't understand
> why.
>

Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
don't think so otherwise any job on a system can change their
memory.high and can adversely impact the isolation and memory
scheduling of the system.

Next we have to agree that there are valid use-cases to allow
applications to reclaim from their cgroups and I think uswapd and
proactive reclaim are valid use-cases. Let's suppose memory.high is
the only way to trigger reclaim but the application can not write to
their top level memory.high, so, it has to create a dummy cgroup of
which it has write access to memory.high and has to move itself to
that dummy cgroup to use memory.high to trigger reclaim for
uswapd/proactive-reclaim.
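Concretely, the dummy-cgroup workaround described above might look
like the following sketch; the cgroup path, child name, and reclaim
amount are illustrative assumptions, not part of the patch:

```shell
#!/bin/sh
# Sketch of the dummy-cgroup workaround: the job cannot write its own
# top-level memory.high, so it creates a child cgroup it does have
# write access to, moves itself there, and drives reclaim through the
# child's memory.high. All names and sizes are illustrative.

dummy_cgroup_reclaim() {
    base=$1            # delegated job cgroup dir, e.g. /sys/fs/cgroup/job
    to_reclaim=$2      # bytes to try to reclaim

    mkdir -p "$base/self"                  # the dummy child cgroup
    echo $$ > "$base/self/cgroup.procs"    # migrate this process into it

    current=$(cat "$base/self/memory.current")
    target=$((current - to_reclaim))
    [ "$target" -lt 0 ] && target=0

    echo "$target" > "$base/self/memory.high"  # lowering it triggers reclaim
    echo max > "$base/self/memory.high"        # then release the throttle

    echo "$target"     # report the transient target
}
```

With the reordering from the patch in this thread, the write to
memory.high only returns after reclaim has been attempted, which makes
the immediate reset less likely to race with workload stalls.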
Roman Gushchin July 10, 2020, 7:41 p.m. UTC | #5
On Fri, Jul 10, 2020 at 12:19:37PM -0700, Shakeel Butt wrote:
> On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > [...]
> > > >
> > > > [...]
> > >
> > > This will reduce the chance of stalls but the interface is still
> > > non-delegatable i.e. applications can not change their own memory.high
> > > for the use-cases like application controlled proactive reclaim and
> > > uswapd.
> >
> > Can you, please, elaborate a bit more on this? I didn't understand
> > why.
> >
> 
> Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
> don't think so otherwise any job on a system can change their
> memory.high and can adversely impact the isolation and memory
> scheduling of the system.
> 
> Next we have to agree that there are valid use-cases to allow
> applications to reclaim from their cgroups and I think uswapd and
> proactive reclaim are valid use-cases. Let's suppose memory.high is
> the only way to trigger reclaim but the application can not write to
> their top level memory.high, so, it has to create a dummy cgroup of
> which it has write access to memory.high and has to move itself to
> that dummy cgroup to use memory.high to trigger reclaim for
> uswapd/proactive-reclaim.

Got it, good point. I tend to agree that memory.high is not enough.
I'll think a little bit more about how the new interface should look.

Thank you!
Michal Hocko July 14, 2020, 8:41 a.m. UTC | #6
On Fri 10-07-20 12:19:37, Shakeel Butt wrote:
> On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > [...]
> > > >
> > > > [...]
> > >
> > > This will reduce the chance of stalls but the interface is still
> > > non-delegatable i.e. applications can not change their own memory.high
> > > for the use-cases like application controlled proactive reclaim and
> > > uswapd.
> >
> > Can you, please, elaborate a bit more on this? I didn't understand
> > why.
> >
> 
> Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
> don't think so otherwise any job on a system can change their
> memory.high and can adversely impact the isolation and memory
> scheduling of the system.

Is this really the case? There should always be a parent cgroup that
overrides the setting. Also you can always set the hard limit if you do
not want to add another layer of cgroup in the hierarchy before
delegation. Or am I missing something?
Shakeel Butt July 14, 2020, 3:32 p.m. UTC | #7
On Tue, Jul 14, 2020 at 1:41 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 10-07-20 12:19:37, Shakeel Butt wrote:
> > On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > > [...]
> > > > >
> > > > > [...]
> > > >
> > > > This will reduce the chance of stalls but the interface is still
> > > > non-delegatable i.e. applications can not change their own memory.high
> > > > for the use-cases like application controlled proactive reclaim and
> > > > uswapd.
> > >
> > > Can you, please, elaborate a bit more on this? I didn't understand
> > > why.
> > >
> >
> > Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
> > don't think so otherwise any job on a system can change their
> > memory.high and can adversely impact the isolation and memory
> > scheduling of the system.
>
> Is this really the case? There should always be a parent cgroup that
> overrides the setting.

Can you explain a bit more? I don't see any requirement of having a
layer of cgroup between the root and the job cgroup. Internally we
schedule jobs as top-level cgroups. There do exist jobs which are a
combination of other jobs, and there we do use an additional layer of
cgroup (similar to pods running multiple containers in Kubernetes).
Surely we can add a layer for all the jobs, but it comes with an
overhead, and at scale that overhead is not negligible.

> Also you can always set the hard limit if you do
> not want to add another layer of cgroup in the hierarchy before
> delegation. Or am I missing something?
>

Yes, we can set memory.max, though it has different oom semantics and
is not really a replacement for memory.high.
Johannes Weiner July 14, 2020, 3:38 p.m. UTC | #8
On Fri, Jul 10, 2020 at 12:19:37PM -0700, Shakeel Butt wrote:
> On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > [...]
> > > >
> > > > [...]
> > >
> > > This will reduce the chance of stalls but the interface is still
> > > non-delegatable i.e. applications can not change their own memory.high
> > > for the use-cases like application controlled proactive reclaim and
> > > uswapd.
> >
> > Can you, please, elaborate a bit more on this? I didn't understand
> > why.
> >
> 
> Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
> don't think so otherwise any job on a system can change their
> memory.high and can adversely impact the isolation and memory
> scheduling of the system.
> 
> Next we have to agree that there are valid use-cases to allow
> applications to reclaim from their cgroups and I think uswapd and
> proactive reclaim are valid use-cases. Let's suppose memory.high is
> the only way to trigger reclaim but the application can not write to
> their top level memory.high, so, it has to create a dummy cgroup of
> which it has write access to memory.high and has to move itself to
> that dummy cgroup to use memory.high to trigger reclaim for
> uswapd/proactive-reclaim.

For what it's worth, for proactive reclaim driven by userspace, we're
currently carrying a hacky memory.high.tmp in our private tree. It
takes a limit and a timeout, so that in case the daemon crashes during
a dip in memory consumption no unsafe limits are left behind.

We haven't upstreamed it because it's not clear yet how exactly the
interface should look. The userspace daemon is still
evolving. But I think we're going to need *some form* of a dedicated
knob to make this operation safe.

As far as permissions to self-pressurize go - I'm curious how you make
that safe? How do you keep the reclaim daemon from accidentally
putting excessive pressure on its own cgroup that may interfere with
the very act of backing off the limit again?

The way we do this right now is having the reclaimer daemon in a
dedicated top-level cgroup with memory.min protection.

This works well because we have a comprehensive cgroup setup anyway
and need to protect this daemon (it's oomd - the proactive reclaimer,
senpai, is a plugin) for other reasons as well. But it's probably a
royal pain to use if you don't have all of that infrastructure.

One possible idea to make this simpler would be to have a limit knob
that has a psi/pressure blowout valve. This way you could specify your
tolerances for paging and what constitutes "cold" memory, and the
limit unsets itself when pressure moves into harmful territory. This
would make it safe to use when the reclaimer becomes unresponsive or
dies altogether, which makes it safe to use from within the
cgroup. And being separate from max and high means we can delegate it.
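As a purely illustrative sketch of the "blowout valve" logic in
userspace (the proposed knob itself would live in the kernel): a
reclaimer could hold memory.high lowered and release it as soon as the
cgroup's PSI "some avg10" crosses a tolerance. The paths, parsing, and
threshold below are assumptions, not an existing interface:

```shell
#!/bin/sh
# Userspace approximation of the pressure "blowout valve" idea above:
# if the cgroup's PSI "some avg10" exceeds a tolerance, reset
# memory.high to "max" so the limit cannot keep hurting the workload.
# The threshold and paths are illustrative assumptions, not a kernel API.

check_and_release() {
    cgroup_dir=$1
    tolerance=$2       # avg10 percentage, e.g. 10.00

    # memory.pressure's "some" line looks like:
    #   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    over=$(awk -v t="$tolerance" '/^some/ {
        split($2, kv, "=")
        if (kv[2] + 0 > t + 0) print 1; else print 0
    }' "$cgroup_dir/memory.pressure")

    if [ "$over" = "1" ]; then
        echo max > "$cgroup_dir/memory.high"   # blow the valve
        echo released
    else
        echo holding
    fi
}
```

A daemon would call this periodically between reclaim steps, so that
even if the reclaimer later stalls, the limit has already been lifted
once pressure became harmful.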
Michal Hocko July 14, 2020, 3:50 p.m. UTC | #9
On Tue 14-07-20 08:32:09, Shakeel Butt wrote:
> On Tue, Jul 14, 2020 at 1:41 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 10-07-20 12:19:37, Shakeel Butt wrote:
> > > On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > >
> > > > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > > > [...]
> > > > > >
> > > > > > Shakeel would this help with your pro-active reclaim usecase? It would
> > > > > > require to reset the high limit right after the reclaim returns which is
> > > > > > quite ugly but it would at least not require a completely new interface.
> > > > > > You would simply do
> > > > > >         high = current - to_reclaim
> > > > > >         echo $high > memory.high
> > > > > >         echo infinity > memory.high # To prevent direct reclaim
> > > > > >                                     # allocation stalls
> > > > > >
> > > > >
> > > > > This will reduce the chance of stalls but the interface is still
> > > > > non-delegatable i.e. applications can not change their own memory.high
> > > > > for the use-cases like application controlled proactive reclaim and
> > > > > uswapd.
> > > >
> > > > Can you, please, elaborate a bit more on this? I didn't understand
> > > > why.
> > > >
> > >
> > > Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
> > > don't think so otherwise any job on a system can change their
> > > memory.high and can adversely impact the isolation and memory
> > > scheduling of the system.
> >
> > Is this really the case? There should always be a parent cgroup that
> > overrides the setting.
> 
> Can you explain a bit more? I don't see any requirement of having a
> layer of cgroup between root and the job cgroup. Internally we
> schedule jobs as top level cgroups. There do exist jobs which are a
> combination of other jobs and there we do use an additional layer of
> cgroup (similar to pods running multiple containers in kubernetes).
> Surely we can add a layer for all the jobs but it comes with an
> overhead and at scale that overhead is not negligible.

What I had in mind is that if you want to delegate, then you have the
option to add a layer where you predefine restrictions/guarantees so
that the delegated cgroup under that hierarchy cannot run away. So
configuring the high limit in a delegated cgroup should be reasonably safe.

> > Also you can always set the hard limit if you do
> > not want to add another layer of cgroup in the hierarchy before
> > delegation. Or am I missing something?
> >
> 
> Yes, we can set memory.max though it has different oom semantics and
> not really a replacement for memory.high.

Right, but you can define a safe cap this way and leave the high
watermark for the delegated cgroup.
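
A sketch of what that could look like (hypothetical paths and user; the
cap value is arbitrary):

```shell
# Admin-controlled layer: a hard cap the delegated cgroup can never
# exceed, regardless of what it does to its own memory.high.
mkdir -p /sys/fs/cgroup/job
echo $((8 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/job/memory.max   # 8G safety cap

# Delegate the subtree to the job's user; the job may now tune its own
# memory.high (e.g. for proactive reclaim), bounded by the cap above.
chown -R jobuser /sys/fs/cgroup/job
```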
Shakeel Butt July 14, 2020, 5:06 p.m. UTC | #10
On Tue, Jul 14, 2020 at 8:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Jul 10, 2020 at 12:19:37PM -0700, Shakeel Butt wrote:
> > On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > > Memory.high limit is implemented in a way such that the kernel
> > > > > > penalizes all threads which are allocating a memory over the limit.
> > > > > > Forcing all threads into the synchronous reclaim and adding some
> > > > > > artificial delays allows to slow down the memory consumption and
> > > > > > potentially give some time for userspace oom handlers/resource control
> > > > > > agents to react.
> > > > > >
> > > > > > It works nicely if the memory usage is hitting the limit from below,
> > > > > > however it works sub-optimal if a user adjusts memory.high to a value
> > > > > > way below the current memory usage. It basically forces all workload
> > > > > > threads (doing any memory allocations) into the synchronous reclaim
> > > > > > and sleep. This makes the workload completely unresponsive for
> > > > > > a long period of time and can also lead to a system-wide contention on
> > > > > > lru locks. It can happen even if the workload is not actually tight on
> > > > > > memory and has, for example, a ton of cold pagecache.
> > > > > >
> > > > > > In the current implementation writing to memory.high causes an atomic
> > > > > > update of page counter's high value followed by an attempt to reclaim
> > > > > > enough memory to fit into the new limit. To fix the problem described
> > > > > > above, all we need is to change the order of execution: try to push
> > > > > > the memory usage under the limit first, and only then set the new
> > > > > > high limit.
> > > > >
> > > > > Shakeel would this help with your pro-active reclaim usecase? It would
> > > > > require to reset the high limit right after the reclaim returns which is
> > > > > quite ugly but it would at least not require a completely new interface.
> > > > > You would simply do
> > > > >         high = current - to_reclaim
> > > > >         echo $high > memory.high
> > > > >         echo infinity > memory.high # To prevent direct reclaim
> > > > >                                     # allocation stalls
> > > > >
> > > >
> > > > This will reduce the chance of stalls but the interface is still
> > > > non-delegatable i.e. applications can not change their own memory.high
> > > > for the use-cases like application controlled proactive reclaim and
> > > > uswapd.
> > >
> > > Can you, please, elaborate a bit more on this? I didn't understand
> > > why.
> > >
> >
> > Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
> > don't think so otherwise any job on a system can change their
> > memory.high and can adversely impact the isolation and memory
> > scheduling of the system.
> >
> > Next we have to agree that there are valid use-cases to allow
> > applications to reclaim from their cgroups and I think uswapd and
> > proactive reclaim are valid use-cases. Let's suppose memory.high is
> > the only way to trigger reclaim but the application can not write to
> > their top level memory.high, so, it has to create a dummy cgroup of
> > which it has write access to memory.high and has to move itself to
> > that dummy cgroup to use memory.high to trigger reclaim for
> > uswapd/proactive-reclaim.
>
> For what it's worth, for proactive reclaim driven by userspace, we're
> currently carrying a hacky memory.high.tmp in our private tree. It
> takes a limit and a timeout, so that in case the daemon crashes during
> a dip in memory consumption no unsafe limits are left behind.
>
> We haven't upstreamed it because it's not clear yet how exactly the
> interface should look like. The userspace daemon is still
> evolving. But I think we're going to need *some form* of a dedicated
> knob to make this operation safe.
>

We have an internal interface, memory.try_to_free_pages, to which we
echo the number of bytes we want to reclaim from that memcg. No
manipulation of limits, and no need to restore them.
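
For illustration, usage of that interface would look something like the
following (the file name and byte-count semantics are as described
above; this knob is internal and not part of mainline):

```shell
# One-shot: ask the kernel to reclaim ~256M from this memcg,
# without touching memory.high or memory.max.
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/job/memory.try_to_free_pages
```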

> As far as permissions to self-pressurize go - I'm curious how you make
> that safe? How do you keep the reclaim daemon from accidentally
> putting excessive pressure on its own cgroup that may interfere with
> the very act of backing off the limit again?
>

That's a good and important question. Internally we don't have the
uswapd use-case (self-pressure), and for proactive reclaim we do
something similar to your setup. The jobs have a dedicated
sub-container to run the reclaimer binary, and the actual workload runs
in the sibling container. The reclaimer keeps pressuring the sibling
(not the job's root container) to reclaim proactively.

Yang Shi was looking into per-memcg kswapd, and an interface that would
benefit his use-case as well should be preferred IMO.
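
A sketch of the layout described above, combined with Michal's
`high = current - to_reclaim` sequence from earlier in the thread
(paths and the reclaim amount are illustrative):

```shell
# Hypothetical hierarchy: the reclaimer runs in a sibling cgroup and
# pressures the workload sibling, never itself.
#   job/
#   |-- workload/    <- the actual job
#   `-- reclaimer/   <- proactive reclaim binary, writes workload's memory.high
CG=/sys/fs/cgroup/job/workload
to_reclaim=$((128 * 1024 * 1024))                  # arbitrary example amount
current=$(cat "$CG/memory.current")
echo $((current - to_reclaim)) > "$CG/memory.high" # push usage down
echo max > "$CG/memory.high"                       # drop the limit again; cgroup v2
                                                   # spells "unlimited" as "max"
```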

> The way we do this right now is having the reclaimer daemon in a
> dedicated top-level cgroup with memory.min protection.
>
> This works well because we have a comprehensive cgroup setup anyway
> and need to protect this daemon (it's oomd - the proactive reclaimer,
> senpai, is a plugin) for other reasons as well. But it's probably a
> royal pain to use if you don't have all of that infrastructure.
>
> One possible idea to make this simpler would be to have a limit knob
> that has a psi/pressure blowout valve. This way you could specify your
> tolerances for paging and what constitutes "cold" memory, and the
> limit unsets itself when pressure moves into harmful territory. This
> would make it safe to use when the reclaimer becomes unresponsive or
> dies altogether, which makes it safe to use from within the
> cgroup. And being separate from max and high means we can delegate it.

I like this idea and agree with having a separate interface from max
and high. Though why do we want to think of this interface as a limit
interface? Are there additional benefits or use-cases which would
benefit from this semantic?
Johannes Weiner July 15, 2020, 4:54 p.m. UTC | #11
On Tue, Jul 14, 2020 at 10:06:32AM -0700, Shakeel Butt wrote:
> On Tue, Jul 14, 2020 at 8:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > The way we do this right now is having the reclaimer daemon in a
> > dedicated top-level cgroup with memory.min protection.
> >
> > This works well because we have a comprehensive cgroup setup anyway
> > and need to protect this daemon (it's oomd - the proactive reclaimer,
> > senpai, is a plugin) for other reasons as well. But it's probably a
> > royal pain to use if you don't have all of that infrastructure.
> >
> > One possible idea to make this simpler would be to have a limit knob
> > that has a psi/pressure blowout valve. This way you could specify your
> > tolerances for paging and what constitutes "cold" memory, and the
> > limit unsets itself when pressure moves into harmful territory. This
> > would make it safe to use when the reclaimer becomes unresponsive or
> > dies altogether, which makes it safe to use from within the
> > cgroup. And being separate from max and high means we can delegate it.
> 
> I like this idea and agree with having a separate interface from max
> and high. Though why do we want to think of this interface as a limit
> interface. Are there additional benefits or use-cases which can
> benefit from this semantic?

I'm not saying we have to.

But one benefit of having a limit rather than a reclaim command knob
is that you can prevent cache-polluting scans through file data from
unnecessarily exploding the memory footprint of the cgroup.

It may be useful to compile a list of applications and goals for such
a knob, i.e. the reasons we want to do proactive reclaim in the first
place.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8424aa56e14..4b71feee7c42 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6203,8 +6203,6 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	page_counter_set_high(&memcg->memory, high);
-
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
 		unsigned long reclaimed;
@@ -6228,6 +6226,8 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 			break;
 	}
 
+	page_counter_set_high(&memcg->memory, high);
+
 	return nbytes;
 }