diff mbox series

[mm,v2,3/3] mm: automatically penalize tasks with high swap use

Message ID 20200511225516.2431921-4-kuba@kernel.org (mailing list archive)
State New, archived
Headers show
Series memcg: Slow down swap allocation as the available space gets depleted | expand

Commit Message

Jakub Kicinski May 11, 2020, 10:55 p.m. UTC
Add a memory.swap.high knob, which can be used to protect the system
from SWAP exhaustion. The mechanism used for penelizing is similar
to memory.high penalty (sleep on return to user space), but with
a less steep slope.

That is not to say that the knob itself is equivalent to memory.high.
The objective is more to protect the system from potentially buggy
tasks consuming a lot of swap and impacting other tasks, or even
bringing the whole system to stand still with complete SWAP
exhaustion. Hopefully without the need to find per-task hard
limits.

Slowing misbehaving tasks down gradually allows user space oom
killers or other protection mechanisms to react. oomd and earlyoom
already do killing based on swap exhaustion, and memory.swap.high
protection will help implement such userspace oom policies more
reliably.

Use one counter for number of pages allocated under pressure
to save struct task space and avoid two separate hierarchy
walks on the hot path.

Use swap.high when deciding if swap is full.
Perform reclaim and count memory over high events.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
v2:
 - add docs,
 - improve commit message.
---
 Documentation/admin-guide/cgroup-v2.rst | 16 +++++
 include/linux/memcontrol.h              |  4 ++
 mm/memcontrol.c                         | 94 +++++++++++++++++++++++--
 3 files changed, 107 insertions(+), 7 deletions(-)

Comments

Michal Hocko May 12, 2020, 7:26 a.m. UTC | #1
On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> Add a memory.swap.high knob, which can be used to protect the system
> from SWAP exhaustion. The mechanism used for penelizing is similar
> to memory.high penalty (sleep on return to user space), but with
> a less steep slope.
> 
> That is not to say that the knob itself is equivalent to memory.high.
> The objective is more to protect the system from potentially buggy
> tasks consuming a lot of swap and impacting other tasks, or even
> bringing the whole system to stand still with complete SWAP
> exhaustion. Hopefully without the need to find per-task hard
> limits.
> 
> Slowing misbehaving tasks down gradually allows user space oom
> killers or other protection mechanisms to react. oomd and earlyoom
> already do killing based on swap exhaustion, and memory.swap.high
> protection will help implement such userspace oom policies more
> reliably.

Thanks for adding more information about the usecase and motivation.
> 
> Use one counter for number of pages allocated under pressure
> to save struct task space and avoid two separate hierarchy
> walks on the hot path.
> 
> Use swap.high when deciding if swap is full.

Please be more specific why.

> Perform reclaim and count memory over high events.

Please expand on this and explain how this is working and why the
semantic is subtly different from MEMCG_HIGH. I suspect the reason
is that there is no reclaim for the swap so you are only emitting an
event on the memcg which is actually throttled. This is in line with
memory.high but the difference is that we do reclaim each memcg subtree
in the high limit excess. That means that the counter tells us how many
times the specific memcg was in excess which would be impossible with
your implementation.

I would also suggest to explain or ideally even separate the swap
penalty scaling logic to a seprate patch. What kind of data it is based
on?
 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> --
> v2:
>  - add docs,
>  - improve commit message.
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 16 +++++
>  include/linux/memcontrol.h              |  4 ++
>  mm/memcontrol.c                         | 94 +++++++++++++++++++++++--
>  3 files changed, 107 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 5f12f203822e..c60226daa193 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back.
>  	The total amount of swap currently being used by the cgroup
>  	and its descendants.
>  
> +  memory.swap.high
> +	A read-write single value file which exists on non-root
> +	cgroups.  The default is "max".
> +
> +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> +	this limit, all its further allocations will be throttled to
> +	allow userspace to implement custom out-of-memory procedures.
> +
> +	This limit marks a point of no return for the cgroup. It is NOT
> +	designed to manage the amount of swapping a workload does
> +	during regular operation. Compare to memory.swap.max, which
> +	prohibits swapping past a set amount, but lets the cgroup
> +	continue unimpeded as long as other memory can be reclaimed.
> +
> +	Healthy workloads are not expected to reach this limit.
> +
>    memory.swap.max
>  	A read-write single value file which exists on non-root
>  	cgroups.  The default is "max".
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b478a4e83297..882bda952a5c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -45,6 +45,7 @@ enum memcg_memory_event {
>  	MEMCG_MAX,
>  	MEMCG_OOM,
>  	MEMCG_OOM_KILL,
> +	MEMCG_SWAP_HIGH,
>  	MEMCG_SWAP_MAX,
>  	MEMCG_SWAP_FAIL,
>  	MEMCG_NR_MEMORY_EVENTS,
> @@ -212,6 +213,9 @@ struct mem_cgroup {
>  	/* Upper bound of normal memory consumption range */
>  	unsigned long high;
>  
> +	/* Upper bound of swap consumption range */
> +	unsigned long swap_high;
> +
>  	/* Range enforcement for interrupt charges */
>  	struct work_struct high_work;
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 66dd87bb9e0f..a3d13b30e3d6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2353,12 +2353,34 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
>  	return max_overage;
>  }
>  
> +static u64 swap_find_max_overage(struct mem_cgroup *memcg)
> +{
> +	u64 overage, max_overage = 0;
> +	struct mem_cgroup *max_cg;
> +
> +	do {
> +		overage = calculate_overage(page_counter_read(&memcg->swap),
> +					    READ_ONCE(memcg->swap_high));
> +		if (overage > max_overage) {
> +			max_overage = overage;
> +			max_cg = memcg;
> +		}
> +	} while ((memcg = parent_mem_cgroup(memcg)) &&
> +		 !mem_cgroup_is_root(memcg));
> +
> +	if (max_overage)
> +		memcg_memory_event(max_cg, MEMCG_SWAP_HIGH);
> +
> +	return max_overage;
> +}
> +
>  /*
>   * Get the number of jiffies that we should penalise a mischievous cgroup which
>   * is exceeding its memory.high by checking both it and its ancestors.
>   */
>  static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
>  					  unsigned int nr_pages,
> +					  unsigned char cost_shift,
>  					  u64 max_overage)
>  {
>  	unsigned long penalty_jiffies;
> @@ -2366,6 +2388,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
>  	if (!max_overage)
>  		return 0;
>  
> +	if (cost_shift)
> +		max_overage >>= cost_shift;
> +
>  	/*
>  	 * We use overage compared to memory.high to calculate the number of
>  	 * jiffies to sleep (penalty_jiffies). Ideally this value should be
> @@ -2411,9 +2436,16 @@ void mem_cgroup_handle_over_high(void)
>  	 * memory.high is breached and reclaim is unable to keep up. Throttle
>  	 * allocators proactively to slow down excessive growth.
>  	 */
> -	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
> +	penalty_jiffies = calculate_high_delay(memcg, nr_pages, 0,
>  					       mem_find_max_overage(memcg));
>  
> +	/*
> +	 * Make the swap curve more gradual, swap can be considered "cheaper",
> +	 * and is allocated in larger chunks. We want the delays to be gradual.
> +	 */
> +	penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2,
> +						swap_find_max_overage(memcg));
> +
>  	/*
>  	 * Clamp the max delay per usermode return so as to still keep the
>  	 * application moving forwards and also permit diagnostics, albeit
> @@ -2604,12 +2636,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * reclaim, the cost of mismatch is negligible.
>  	 */
>  	do {
> -		if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
> -			/* Don't bother a random interrupted task */
> -			if (in_interrupt()) {
> +		bool mem_high, swap_high;
> +
> +		mem_high = page_counter_read(&memcg->memory) >
> +			READ_ONCE(memcg->high);
> +		swap_high = page_counter_read(&memcg->swap) >
> +			READ_ONCE(memcg->swap_high);
> +
> +		/* Don't bother a random interrupted task */
> +		if (in_interrupt()) {
> +			if (mem_high) {
>  				schedule_work(&memcg->high_work);
>  				break;
>  			}
> +			continue;
> +		}
> +
> +		if (mem_high || swap_high) {
>  			current->memcg_nr_pages_over_high += batch;
>  			set_notify_resume(current);
>  			break;
> @@ -5076,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  
>  	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
>  	memcg->soft_limit = PAGE_COUNTER_MAX;
> +	WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX);
>  	if (parent) {
>  		memcg->swappiness = mem_cgroup_swappiness(parent);
>  		memcg->oom_kill_disable = parent->oom_kill_disable;
> @@ -5229,6 +5273,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  	page_counter_set_low(&memcg->memory, 0);
>  	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
>  	memcg->soft_limit = PAGE_COUNTER_MAX;
> +	WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX);
>  	memcg_wb_domain_size_changed(memcg);
>  }
>  
> @@ -7136,10 +7181,13 @@ bool mem_cgroup_swap_full(struct page *page)
>  	if (!memcg)
>  		return false;
>  
> -	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
> -		if (page_counter_read(&memcg->swap) * 2 >=
> -		    READ_ONCE(memcg->swap.max))
> +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +		unsigned long usage = page_counter_read(&memcg->swap);
> +
> +		if (usage * 2 >= READ_ONCE(memcg->swap_high) ||
> +		    usage * 2 >= READ_ONCE(memcg->swap.max))
>  			return true;
> +	}
>  
>  	return false;
>  }
> @@ -7169,6 +7217,30 @@ static u64 swap_current_read(struct cgroup_subsys_state *css,
>  	return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
>  }
>  
> +static int swap_high_show(struct seq_file *m, void *v)
> +{
> +	unsigned long high = READ_ONCE(mem_cgroup_from_seq(m)->swap_high);
> +
> +	return seq_puts_memcg_tunable(m, high);
> +}
> +
> +static ssize_t swap_high_write(struct kernfs_open_file *of,
> +			       char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long high;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "max", &high);
> +	if (err)
> +		return err;
> +
> +	WRITE_ONCE(memcg->swap_high, high);
> +
> +	return nbytes;
> +}
> +
>  static int swap_max_show(struct seq_file *m, void *v)
>  {
>  	return seq_puts_memcg_tunable(m,
> @@ -7196,6 +7268,8 @@ static int swap_events_show(struct seq_file *m, void *v)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
>  
> +	seq_printf(m, "high %lu\n",
> +		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH]));
>  	seq_printf(m, "max %lu\n",
>  		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX]));
>  	seq_printf(m, "fail %lu\n",
> @@ -7210,6 +7284,12 @@ static struct cftype swap_files[] = {
>  		.flags = CFTYPE_NOT_ON_ROOT,
>  		.read_u64 = swap_current_read,
>  	},
> +	{
> +		.name = "swap.high",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = swap_high_show,
> +		.write = swap_high_write,
> +	},
>  	{
>  		.name = "swap.max",
>  		.flags = CFTYPE_NOT_ON_ROOT,
> -- 
> 2.25.4
Jakub Kicinski May 12, 2020, 5:55 p.m. UTC | #2
On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:
> On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> > Use swap.high when deciding if swap is full.  
> 
> Please be more specific why.

How about:

    Use swap.high when deciding if swap is full to influence ongoing
    swap reclaim in a best effort manner.

> > Perform reclaim and count memory over high events.  
> 
> Please expand on this and explain how this is working and why the
> semantic is subtly different from MEMCG_HIGH. I suspect the reason
> is that there is no reclaim for the swap so you are only emitting an
> event on the memcg which is actually throttled. This is in line with
> memory.high but the difference is that we do reclaim each memcg subtree
> in the high limit excess. That means that the counter tells us how many
> times the specific memcg was in excess which would be impossible with
> your implementation.

Right, with memory all cgroups over high get penalized with the extra
reclaim work. For swap we just have the delay, so the event is
associated with the worst offender, anything lower didn't really matter.

But it's easy enough to change if you prefer. Otherwise I'll just add
this to the commit message:

  Count swap over high events. Note that unlike memory over high events
  we only count them for the worst offender. This is because the
  delay penalties for both swap and memory over high are not cumulative,
  i.e. we use the max delay.

> I would also suggest to explain or ideally even separate the swap
> penalty scaling logic to a seprate patch. What kind of data it is based
> on?

It's a hard thing to get production data for since, as we mentioned we
don't expect the limit to be hit. It was more of a process of
experimentation and finding a gradual slope that "felt right"...

Is there a more scientific process we can follow here? We want the
delay to be small at first for a first few pages and then grow to make
sure we stop the task from going too much over high. The square
function works pretty well IMHO.
Michal Hocko May 13, 2020, 8:32 a.m. UTC | #3
On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:
> > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> > > Use swap.high when deciding if swap is full.  
> > 
> > Please be more specific why.
> 
> How about:
> 
>     Use swap.high when deciding if swap is full to influence ongoing
>     swap reclaim in a best effort manner.

This is still way too vague. The crux is why should we treat hard and
high swap limit the same for mem_cgroup_swap_full purpose. Please note
that I am not saying this is wrong. I am asking for a more detailed
explanation mostly because I would bet that somebody stumbles over this
sooner or later.

mem_cgroup_swap_full is an odd predicate. It doesn't really want to tell
that the swap is really full. I haven't studied the original intention
but it is more in line of mem_cgroup_swap_under_pressure based on the
current usage to (attempt) scale swap cache size.

> > > Perform reclaim and count memory over high events.  
> > 
> > Please expand on this and explain how this is working and why the
> > semantic is subtly different from MEMCG_HIGH. I suspect the reason
> > is that there is no reclaim for the swap so you are only emitting an
> > event on the memcg which is actually throttled. This is in line with
> > memory.high but the difference is that we do reclaim each memcg subtree
> > in the high limit excess. That means that the counter tells us how many
> > times the specific memcg was in excess which would be impossible with
> > your implementation.
> 
> Right, with memory all cgroups over high get penalized with the extra
> reclaim work. For swap we just have the delay, so the event is
> associated with the worst offender, anything lower didn't really matter.
> 
> But it's easy enough to change if you prefer. Otherwise I'll just add
> this to the commit message:
> 
>   Count swap over high events. Note that unlike memory over high events
>   we only count them for the worst offender. This is because the
>   delay penalties for both swap and memory over high are not cumulative,
>   i.e. we use the max delay.

Well, memory high penalty is in fact cumulative, because the reclaim
would happen for each memcg subtree up the hierarchy. Sure the
additional throttling is not cumulative but that is not really that
important because the exact amount of throttling is an implementation detail.
The swap high is an odd one here because we do not reclaim swap so the
cumulative effect of that is 0 and there is only the additional
throttling happening. I suspect that your current implementation is
exposing an internal implementation to the userspace but considering how
the current memory high event is documented
          high
                The number of times processes of the cgroup are
                throttled and routed to perform direct memory reclaim
                because the high memory boundary was exceeded.  For a
                cgroup whose memory usage is capped by the high limit
                rather than global memory pressure, this event's
                occurrences are expected.

it talks about throttling rather than excess (like max) so I am not
really sure. I believe that it would be much better if both events were
more explicit about counting an excess and a throttling is just a side
effect of that situation.

I do not expect that we will have any form of the swap reclaim anytime
soon (if ever) but I fail to see why to creat a small little trap like
this now.

> > I would also suggest to explain or ideally even separate the swap
> > penalty scaling logic to a seprate patch. What kind of data it is based
> > on?
> 
> It's a hard thing to get production data for since, as we mentioned we
> don't expect the limit to be hit. It was more of a process of
> experimentation and finding a gradual slope that "felt right"...
> 
> Is there a more scientific process we can follow here? We want the
> delay to be small at first for a first few pages and then grow to make
> sure we stop the task from going too much over high. The square
> function works pretty well IMHO.

If there is no data to showing this to be an improvement then I would
just not add an additional scaling factor. Why? Mostly because once we
have it there it would be extremely hard to change. MM is full of these
little heuristics that are copied over because nobody dares to touch
them. If a different scaling is really needed it can always be added
later with some data to back that.
Michal Hocko May 13, 2020, 8:38 a.m. UTC | #4
On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 5f12f203822e..c60226daa193 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back.
>  	The total amount of swap currently being used by the cgroup
>  	and its descendants.
>  
> +  memory.swap.high
> +	A read-write single value file which exists on non-root
> +	cgroups.  The default is "max".
> +
> +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> +	this limit, all its further allocations will be throttled to
> +	allow userspace to implement custom out-of-memory procedures.
> +
> +	This limit marks a point of no return for the cgroup. It is NOT
> +	designed to manage the amount of swapping a workload does
> +	during regular operation. Compare to memory.swap.max, which
> +	prohibits swapping past a set amount, but lets the cgroup
> +	continue unimpeded as long as other memory can be reclaimed.
> +
> +	Healthy workloads are not expected to reach this limit.
> +

Btw. I forgot to mention that before but you should also add a
documentation for the swap high event to this file.
Jakub Kicinski May 13, 2020, 6:36 p.m. UTC | #5
On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:  
> > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:  
> > > > Use swap.high when deciding if swap is full.    
> > > 
> > > Please be more specific why.  
> > 
> > How about:
> > 
> >     Use swap.high when deciding if swap is full to influence ongoing
> >     swap reclaim in a best effort manner.  
> 
> This is still way too vague. The crux is why should we treat hard and
> high swap limit the same for mem_cgroup_swap_full purpose. Please
> note that I am not saying this is wrong. I am asking for a more
> detailed explanation mostly because I would bet that somebody
> stumbles over this sooner or later.

Stumbles in what way? Isn't it expected for the kernel to take
reasonable precautions to avoid hitting limits?  We expect the
application which breaches swap.high to get terminated by user 
space OOM, kernel best be careful about approaching that limit, 
no?

> mem_cgroup_swap_full is an odd predicate. It doesn't really want to
> tell that the swap is really full. I haven't studied the original
> intention but it is more in line of mem_cgroup_swap_under_pressure
> based on the current usage to (attempt) scale swap cache size.

Perhaps Johannes has some experience here?  The 50% means full 
heuristic predates git :(

> > > > Perform reclaim and count memory over high events.    
> > > 
> > > Please expand on this and explain how this is working and why the
> > > semantic is subtly different from MEMCG_HIGH. I suspect the reason
> > > is that there is no reclaim for the swap so you are only emitting
> > > an event on the memcg which is actually throttled. This is in
> > > line with memory.high but the difference is that we do reclaim
> > > each memcg subtree in the high limit excess. That means that the
> > > counter tells us how many times the specific memcg was in excess
> > > which would be impossible with your implementation.  
> > 
> > Right, with memory all cgroups over high get penalized with the
> > extra reclaim work. For swap we just have the delay, so the event is
> > associated with the worst offender, anything lower didn't really
> > matter.
> > 
> > But it's easy enough to change if you prefer. Otherwise I'll just
> > add this to the commit message:
> > 
> >   Count swap over high events. Note that unlike memory over high
> > events we only count them for the worst offender. This is because
> > the delay penalties for both swap and memory over high are not
> > cumulative, i.e. we use the max delay.  
> 
> Well, memory high penalty is in fact cumulative, because the reclaim
> would happen for each memcg subtree up the hierarchy. Sure the
> additional throttling is not cumulative but that is not really that
> important because the exact amount of throttling is an implementation
> detail. The swap high is an odd one here because we do not reclaim
> swap so the cumulative effect of that is 0 and there is only the
> additional throttling happening. I suspect that your current
> implementation is exposing an internal implementation to the
> userspace but considering how the current memory high event is
> documented high
>                 The number of times processes of the cgroup are
>                 throttled and routed to perform direct memory reclaim
>                 because the high memory boundary was exceeded.  For a
>                 cgroup whose memory usage is capped by the high limit
>                 rather than global memory pressure, this event's
>                 occurrences are expected.
> 
> it talks about throttling rather than excess (like max) so I am not
> really sure. I believe that it would be much better if both events
> were more explicit about counting an excess and a throttling is just
> a side effect of that situation.
> 
> I do not expect that we will have any form of the swap reclaim anytime
> soon (if ever) but I fail to see why to creat a small little trap like
> this now.

Right, let me adjust then.

> > > I would also suggest to explain or ideally even separate the swap
> > > penalty scaling logic to a seprate patch. What kind of data it is
> > > based on?  
> > 
> > It's a hard thing to get production data for since, as we mentioned
> > we don't expect the limit to be hit. It was more of a process of
> > experimentation and finding a gradual slope that "felt right"...
> > 
> > Is there a more scientific process we can follow here? We want the
> > delay to be small at first for a first few pages and then grow to
> > make sure we stop the task from going too much over high. The square
> > function works pretty well IMHO.  
> 
> If there is no data to showing this to be an improvement then I would
> just not add an additional scaling factor. Why? Mostly because once we
> have it there it would be extremely hard to change. MM is full of
> these little heuristics that are copied over because nobody dares to
> touch them. If a different scaling is really needed it can always be
> added later with some data to back that.

Oh, I misunderstood the question, you were asking about the scaling
factor.. The allocation of swap is in larger batches, according to 
my tests, example below (AR - after reclaim, swap overage changes 
after memory reclaim). 
                                    mem overage AR
     swap pages over_high AR        |    swap overage AR
 swap pages over at call.   \       |    |      . mem sleep
   mem pages over_high.  \   \      |    |     /  . swap sleep
                       v  v   v     v    v    v  v
 [   73.360533] sleep (32/10->67) [-35|13379] 0+253
 [   73.631291] sleep (32/ 3->54) [-18|13430] 0+205
 [   73.851629] sleep (32/22->35) [-20|13443] 0+133
 [   74.021396] sleep (32/ 3->60) [-29|13500] 0+230
 [   74.263355] sleep (32/28->79) [-44|13551] 0+306
 [   74.585689] sleep (32/29->91) [-17|13627] 0+355
 [   74.958675] sleep (32/27->79) [-31|13679] 0+311
 [   75.293021] sleep (32/29->86) [ -9|13750] 0+344
 [   75.654218] sleep (32/22->72) [-24|13800] 0+290
 [   75.962467] sleep (32/22->73) [-39|13865] 0+296

That's for a process slowly leaking memory. Swap gets over the high by
about 2.5x MEMCG_CHARGE_BATCH on average. Hence to keep the same slope
I was trying to scale it back.

But you make a fair point, someone more knowledgeable can add the
heuristic later if it's really needed.
Michal Hocko May 14, 2020, 7:42 a.m. UTC | #6
On Wed 13-05-20 11:36:23, Jakub Kicinski wrote:
> On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:  
> > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:  
> > > > > Use swap.high when deciding if swap is full.    
> > > > 
> > > > Please be more specific why.  
> > > 
> > > How about:
> > > 
> > >     Use swap.high when deciding if swap is full to influence ongoing
> > >     swap reclaim in a best effort manner.  
> > 
> > This is still way too vague. The crux is why should we treat hard and
> > high swap limit the same for mem_cgroup_swap_full purpose. Please
> > note that I am not saying this is wrong. I am asking for a more
> > detailed explanation mostly because I would bet that somebody
> > stumbles over this sooner or later.
> 
> Stumbles in what way?

Reading the code and trying to understand why this particular decision
has been made. Because it might be surprising that the hard and high
limits are treated same here.

> Isn't it expected for the kernel to take reasonable precautions to
> avoid hitting limits?

Isn't the throttling itself the precautious? How does the swap cache
and its control via mem_cgroup_swap_full interact here. See? This is
what I am asking to have explained in the changelog.

[...]
> > > > I would also suggest to explain or ideally even separate the swap
> > > > penalty scaling logic to a seprate patch. What kind of data it is
> > > > based on?  
> > > 
> > > It's a hard thing to get production data for since, as we mentioned
> > > we don't expect the limit to be hit. It was more of a process of
> > > experimentation and finding a gradual slope that "felt right"...
> > > 
> > > Is there a more scientific process we can follow here? We want the
> > > delay to be small at first for a first few pages and then grow to
> > > make sure we stop the task from going too much over high. The square
> > > function works pretty well IMHO.  
> > 
> > If there is no data to showing this to be an improvement then I would
> > just not add an additional scaling factor. Why? Mostly because once we
> > have it there it would be extremely hard to change. MM is full of
> > these little heuristics that are copied over because nobody dares to
> > touch them. If a different scaling is really needed it can always be
> > added later with some data to back that.
> 
> Oh, I misunderstood the question, you were asking about the scaling
> factor.. The allocation of swap is in larger batches, according to 
> my tests, example below (AR - after reclaim, swap overage changes 
> after memory reclaim). 
>                                     mem overage AR
>      swap pages over_high AR        |    swap overage AR
>  swap pages over at call.   \       |    |      . mem sleep
>    mem pages over_high.  \   \      |    |     /  . swap sleep
>                        v  v   v     v    v    v  v
>  [   73.360533] sleep (32/10->67) [-35|13379] 0+253
>  [   73.631291] sleep (32/ 3->54) [-18|13430] 0+205
>  [   73.851629] sleep (32/22->35) [-20|13443] 0+133
>  [   74.021396] sleep (32/ 3->60) [-29|13500] 0+230
>  [   74.263355] sleep (32/28->79) [-44|13551] 0+306
>  [   74.585689] sleep (32/29->91) [-17|13627] 0+355
>  [   74.958675] sleep (32/27->79) [-31|13679] 0+311
>  [   75.293021] sleep (32/29->86) [ -9|13750] 0+344
>  [   75.654218] sleep (32/22->72) [-24|13800] 0+290
>  [   75.962467] sleep (32/22->73) [-39|13865] 0+296
> 
> That's for a process slowly leaking memory. Swap gets over the high by
> about 2.5x MEMCG_CHARGE_BATCH on average. Hence to keep the same slope
> I was trying to scale it back.
> 
> But you make a fair point, someone more knowledgeable can add the
> heuristic later if it's really needed.

Or just make it a separate patch with all that information. This would
allow anybody touching that code in the future to understand the initial
motivation.

I am still not sure this scaling is a good fit in general (e.g. how does
it work with THP swapping?) though but this can be discussed separately
at least.
Johannes Weiner May 14, 2020, 8:21 p.m. UTC | #7
On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> On Wed 13-05-20 11:36:23, Jakub Kicinski wrote:
> > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:  
> > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:  
> > > > > > Use swap.high when deciding if swap is full.    
> > > > > 
> > > > > Please be more specific why.  
> > > > 
> > > > How about:
> > > > 
> > > >     Use swap.high when deciding if swap is full to influence ongoing
> > > >     swap reclaim in a best effort manner.  
> > > 
> > > This is still way too vague. The crux is why should we treat hard and
> > > high swap limit the same for mem_cgroup_swap_full purpose. Please
> > > note that I am not saying this is wrong. I am asking for a more
> > > detailed explanation mostly because I would bet that somebody
> > > stumbles over this sooner or later.
> > 
> > Stumbles in what way?
> 
> Reading the code and trying to understand why this particular decision
> has been made. Because it might be surprising that the hard and high
> limits are treated same here.

I don't quite understand the controversy.

The idea behind "swap full" is that as long as the workload has plenty
of swap space available and it's not changing its memory contents, it
makes sense to generously hold on to copies of data in the swap
device, even after the swapin. A later reclaim cycle can drop the page
without any IO. Trading disk space for IO.

But the only two ways to reclaim a swap slot is when they're faulted
in and the references go away, or by scanning the virtual address space
like swapoff does - which is very expensive (one could argue it's too
expensive even for swapoff, it's often more practical to just reboot).

So at some point in the fill level, we have to start freeing up swap
slots on fault/swapin. Otherwise we could eventually run out of swap
slots while they're filled with copies of data that is also in RAM.

We don't want to OOM a workload because its available swap space is
filled with redundant cache.

That applies to physical swap limits, swap.max, and naturally also to
swap.high which is a limit to implement userspace OOM for swap space
exhaustion.

> > Isn't it expected for the kernel to take reasonable precautions to
> > avoid hitting limits?
> 
> Isn't the throttling itself the precautious? How does the swap cache
> and its control via mem_cgroup_swap_full interact here. See? This is
> what I am asking to have explained in the changelog.

It sounds like we need better documentation of what vm_swap_full() and
friends are there for. It should have been obvious why swap.high - a
limit on available swap space - hooks into it.
Michal Hocko May 15, 2020, 7:14 a.m. UTC | #8
On Thu 14-05-20 16:21:30, Johannes Weiner wrote:
> On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > On Wed 13-05-20 11:36:23, Jakub Kicinski wrote:
> > > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> > > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:  
> > > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:  
> > > > > > > Use swap.high when deciding if swap is full.    
> > > > > > 
> > > > > > Please be more specific why.  
> > > > > 
> > > > > How about:
> > > > > 
> > > > >     Use swap.high when deciding if swap is full to influence ongoing
> > > > >     swap reclaim in a best effort manner.  
> > > > 
> > > > This is still way too vague. The crux is why should we treat hard and
> > > > high swap limit the same for mem_cgroup_swap_full purpose. Please
> > > > note that I am not saying this is wrong. I am asking for a more
> > > > detailed explanation mostly because I would bet that somebody
> > > > stumbles over this sooner or later.
> > > 
> > > Stumbles in what way?
> > 
> > Reading the code and trying to understand why this particular decision
> > has been made. Because it might be surprising that the hard and high
> > limits are treated same here.
> 
> I don't quite understand the controversy.

I do not think there is any controversy. All I am asking for is a
clarification because this is non-intuitive.
 
> The idea behind "swap full" is that as long as the workload has plenty
> of swap space available and it's not changing its memory contents, it
> makes sense to generously hold on to copies of data in the swap
> device, even after the swapin. A later reclaim cycle can drop the page
> without any IO. Trading disk space for IO.
> 
> But the only two ways to reclaim a swap slot is when they're faulted
> in and the references go away, or by scanning the virtual address space
> like swapoff does - which is very expensive (one could argue it's too
> expensive even for swapoff, it's often more practical to just reboot).
> 
> So at some point in the fill level, we have to start freeing up swap
> slots on fault/swapin. Otherwise we could eventually run out of swap
> slots while they're filled with copies of data that is also in RAM.
> 
> We don't want to OOM a workload because its available swap space is
> filled with redundant cache.

Thanks this is a useful summary.
 
> That applies to physical swap limits, swap.max, and naturally also to
> swap.high which is a limit to implement userspace OOM for swap space
> exhaustion.
> 
> > > Isn't it expected for the kernel to take reasonable precautions to
> > > avoid hitting limits?
> > 
> > Isn't the throttling itself the precautious? How does the swap cache
> > and its control via mem_cgroup_swap_full interact here. See? This is
> > what I am asking to have explained in the changelog.
> 
> It sounds like we need better documentation of what vm_swap_full() and
> friends are there for. It should have been obvious why swap.high - a
> limit on available swap space - hooks into it.

Agreed. The primary source for a confusion is the naming here. Because
vm_swap_full doesn't really try to tell that the swap is full. It merely
tries to tell that it is getting full and so duplicated data should be
dropped.
diff mbox series

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 5f12f203822e..c60226daa193 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1374,6 +1374,22 @@  PAGE_SIZE multiple when read back.
 	The total amount of swap currently being used by the cgroup
 	and its descendants.
 
+  memory.swap.high
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Swap usage throttle limit.  If a cgroup's swap usage exceeds
+	this limit, all its further allocations will be throttled to
+	allow userspace to implement custom out-of-memory procedures.
+
+	This limit marks a point of no return for the cgroup. It is NOT
+	designed to manage the amount of swapping a workload does
+	during regular operation. Compare to memory.swap.max, which
+	prohibits swapping past a set amount, but lets the cgroup
+	continue unimpeded as long as other memory can be reclaimed.
+
+	Healthy workloads are not expected to reach this limit.
+
   memory.swap.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b478a4e83297..882bda952a5c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -45,6 +45,7 @@  enum memcg_memory_event {
 	MEMCG_MAX,
 	MEMCG_OOM,
 	MEMCG_OOM_KILL,
+	MEMCG_SWAP_HIGH,
 	MEMCG_SWAP_MAX,
 	MEMCG_SWAP_FAIL,
 	MEMCG_NR_MEMORY_EVENTS,
@@ -212,6 +213,9 @@  struct mem_cgroup {
 	/* Upper bound of normal memory consumption range */
 	unsigned long high;
 
+	/* Upper bound of swap consumption range */
+	unsigned long swap_high;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 66dd87bb9e0f..a3d13b30e3d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2353,12 +2353,34 @@  static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 	return max_overage;
 }
 
+static u64 swap_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
+	struct mem_cgroup *max_cg;
+
+	do {
+		overage = calculate_overage(page_counter_read(&memcg->swap),
+					    READ_ONCE(memcg->swap_high));
+		if (overage > max_overage) {
+			max_overage = overage;
+			max_cg = memcg;
+		}
+	} while ((memcg = parent_mem_cgroup(memcg)) &&
+		 !mem_cgroup_is_root(memcg));
+
+	if (max_overage)
+		memcg_memory_event(max_cg, MEMCG_SWAP_HIGH);
+
+	return max_overage;
+}
+
 /*
  * Get the number of jiffies that we should penalise a mischievous cgroup which
  * is exceeding its memory.high by checking both it and its ancestors.
  */
 static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 					  unsigned int nr_pages,
+					  unsigned char cost_shift,
 					  u64 max_overage)
 {
 	unsigned long penalty_jiffies;
@@ -2366,6 +2388,9 @@  static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	if (!max_overage)
 		return 0;
 
+	if (cost_shift)
+		max_overage >>= cost_shift;
+
 	/*
 	 * We use overage compared to memory.high to calculate the number of
 	 * jiffies to sleep (penalty_jiffies). Ideally this value should be
@@ -2411,9 +2436,16 @@  void mem_cgroup_handle_over_high(void)
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
 	 */
-	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
+	penalty_jiffies = calculate_high_delay(memcg, nr_pages, 0,
 					       mem_find_max_overage(memcg));
 
+	/*
+	 * Make the swap curve more gradual, swap can be considered "cheaper",
+	 * and is allocated in larger chunks. We want the delays to be gradual.
+	 */
+	penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2,
+						swap_find_max_overage(memcg));
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -2604,12 +2636,23 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
-			/* Don't bother a random interrupted task */
-			if (in_interrupt()) {
+		bool mem_high, swap_high;
+
+		mem_high = page_counter_read(&memcg->memory) >
+			READ_ONCE(memcg->high);
+		swap_high = page_counter_read(&memcg->swap) >
+			READ_ONCE(memcg->swap_high);
+
+		/* Don't bother a random interrupted task */
+		if (in_interrupt()) {
+			if (mem_high) {
 				schedule_work(&memcg->high_work);
 				break;
 			}
+			continue;
+		}
+
+		if (mem_high || swap_high) {
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
 			break;
@@ -5076,6 +5119,7 @@  mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX);
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
@@ -5229,6 +5273,7 @@  static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_set_low(&memcg->memory, 0);
 	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX);
 	memcg_wb_domain_size_changed(memcg);
 }
 
@@ -7136,10 +7181,13 @@  bool mem_cgroup_swap_full(struct page *page)
 	if (!memcg)
 		return false;
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
-		if (page_counter_read(&memcg->swap) * 2 >=
-		    READ_ONCE(memcg->swap.max))
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		unsigned long usage = page_counter_read(&memcg->swap);
+
+		if (usage * 2 >= READ_ONCE(memcg->swap_high) ||
+		    usage * 2 >= READ_ONCE(memcg->swap.max))
 			return true;
+	}
 
 	return false;
 }
@@ -7169,6 +7217,30 @@  static u64 swap_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
 }
 
+static int swap_high_show(struct seq_file *m, void *v)
+{
+	unsigned long high = READ_ONCE(mem_cgroup_from_seq(m)->swap_high);
+
+	return seq_puts_memcg_tunable(m, high);
+}
+
+static ssize_t swap_high_write(struct kernfs_open_file *of,
+			       char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	WRITE_ONCE(memcg->swap_high, high);
+
+	return nbytes;
+}
+
 static int swap_max_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m,
@@ -7196,6 +7268,8 @@  static int swap_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
+	seq_printf(m, "high %lu\n",
+		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH]));
 	seq_printf(m, "max %lu\n",
 		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX]));
 	seq_printf(m, "fail %lu\n",
@@ -7210,6 +7284,12 @@  static struct cftype swap_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = swap_current_read,
 	},
+	{
+		.name = "swap.high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_high_show,
+		.write = swap_high_write,
+	},
 	{
 		.name = "swap.max",
 		.flags = CFTYPE_NOT_ON_ROOT,