
[v2,4/4] memcg: synchronously enforce memory.high for large overcharges

Message ID 20220211064917.2028469-5-shakeelb@google.com (mailing list archive)
State New
Series memcg: robust enforcement of memory.high

Commit Message

Shakeel Butt Feb. 11, 2022, 6:49 a.m. UTC
The high limit is used to throttle a workload without invoking the
oom-killer. Recently we tried to use the high limit to right-size our
internal workloads, more specifically to dynamically adjust the limits
of a workload without letting it get oom-killed. However, due to a
limitation in the implementation of high limit enforcement, we observed
that the mechanism fails for some real workloads.

The high limit is enforced on return to userspace, i.e. the kernel lets
the usage go over the limit, and when execution returns to userspace,
high reclaim is triggered and the process can get throttled as well.
However, this mechanism fails for workloads which do large allocations
in a single kernel entry, e.g. applications that mlock() a large chunk
of memory in a single syscall. Such applications bypass the high limit
and can trigger the oom-killer.

To make high limit enforcement more robust, this patch makes the
enforcement synchronous, but only if the accumulated overcharge becomes
larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
throttled on the return-to-userspace path; only the extreme allocations
which accumulate a large amount of overcharge without returning to
userspace will be throttled synchronously. The value MEMCG_CHARGE_BATCH
is a bit arbitrary, but most other places in the memcg codebase use this
constant, so for now this patch uses the same one.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
Changes since v1:
- Based on Roman's comment, simplify the sync enforcement and only
  target the extreme cases.

 mm/memcontrol.c | 5 +++++
 1 file changed, 5 insertions(+)
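
For illustration, a minimal userspace reproducer of the mlock() scenario
described above might look like the following sketch; the 4 GiB size, the
cgroup setup and the MAP_NORESERVE hint are assumptions for the example,
not part of the patch:

/*
 * Hypothetical reproducer: run inside a cgroup whose memory.high is set
 * well below 4 GiB (and memory.max well above it). The whole range is
 * populated and charged inside the single mlock() syscall, before the
 * task ever returns to userspace.
 */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t sz = 4UL << 30;	/* 4 GiB, assumed to exceed memory.high */
	void *buf;

	/* Reserve address space only; nothing is charged yet. */
	buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * A single kernel entry charges all the pages, so the deferred
	 * return-to-userspace enforcement of memory.high cannot throttle
	 * the task mid-way and the usage can run into memory.max.
	 */
	if (mlock(buf, sz)) {
		perror("mlock");
		return 1;
	}
	return 0;
}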

Comments

Chris Down Feb. 11, 2022, 12:13 p.m. UTC | #1
Shakeel Butt writes:
>The high limit is used to throttle a workload without invoking the
>oom-killer. Recently we tried to use the high limit to right-size our
>internal workloads, more specifically to dynamically adjust the limits
>of a workload without letting it get oom-killed. However, due to a
>limitation in the implementation of high limit enforcement, we observed
>that the mechanism fails for some real workloads.
>
>The high limit is enforced on return to userspace, i.e. the kernel lets
>the usage go over the limit, and when execution returns to userspace,
>high reclaim is triggered and the process can get throttled as well.
>However, this mechanism fails for workloads which do large allocations
>in a single kernel entry, e.g. applications that mlock() a large chunk
>of memory in a single syscall. Such applications bypass the high limit
>and can trigger the oom-killer.
>
>To make high limit enforcement more robust, this patch makes the
>enforcement synchronous, but only if the accumulated overcharge becomes
>larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
>throttled on the return-to-userspace path; only the extreme allocations
>which accumulate a large amount of overcharge without returning to
>userspace will be throttled synchronously. The value MEMCG_CHARGE_BATCH
>is a bit arbitrary, but most other places in the memcg codebase use this
>constant, so for now this patch uses the same one.

Note that mem_cgroup_handle_over_high() has its own allocator throttling grace 
period, where it bails out if the penalty to apply is less than 10ms. The 
reclaim will still happen, though. So throttling might not happen even for 
roughly MEMCG_CHARGE_BATCH-sized allocations, depending on the overall size of 
the cgroup and its protection.
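
For reference, the grace period Chris mentions corresponds to logic along
these lines inside mem_cgroup_handle_over_high() (a paraphrased excerpt of
mm/memcontrol.c from this era, with the surrounding code elided; details
may differ):

	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
					       mem_find_max_overage(memcg));
	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
						swap_find_max_overage(memcg));

	/* Clamp the per-kernel-exit delay so the task keeps making progress. */
	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);

	/*
	 * Skip the sleep entirely if the accumulated penalty is below
	 * ~10ms (HZ / 100); the reclaim above has already happened, only
	 * the throttling is waived.
	 */
	if (penalty_jiffies <= HZ / 100)
		goto out;

	psi_memstall_enter(&pflags);
	schedule_timeout_killable(penalty_jiffies);
	psi_memstall_leave(&pflags);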

>Signed-off-by: Shakeel Butt <shakeelb@google.com>
>---
>Changes since v1:
>- Based on Roman's comment, simplify the sync enforcement and only
>  target the extreme cases.
>
> mm/memcontrol.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>index 292b0b99a2c7..0da4be4798e7 100644
>--- a/mm/memcontrol.c
>+++ b/mm/memcontrol.c
>@@ -2703,6 +2703,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> 		}
> 	} while ((memcg = parent_mem_cgroup(memcg)));
>
>+	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
>+	    !(current->flags & PF_MEMALLOC) &&
>+	    gfpflags_allow_blocking(gfp_mask)) {
>+		mem_cgroup_handle_over_high();

Thanks, I was going to comment on v1 that I prefer to keep the implementation 
of mem_cgroup_handle_over_high if possible since we know that the mechanism has 
been safe in production over the past few years.

One question I have is about throttling. It looks like this new 
mem_cgroup_handle_over_high callsite may mean that throttling is invoked more 
than once on a misbehaving workload that's failing to reclaim since the 
throttling could be invoked both here and in return to userspace, right? That 
might not be a problem, but we should think about the implications of that, 
especially in relation to MEMCG_MAX_HIGH_DELAY_JIFFIES.

Maybe we should record if throttling happened previously and avoid doing it 
again for this entry into kernelspace? Not certain that's the right answer, but 
we should think about what the new semantics should be.

>+	}
> 	return 0;
> }
>
>-- 
>2.35.1.265.g69c8d7142f-goog
>
Shakeel Butt Feb. 11, 2022, 8:36 p.m. UTC | #2
On Fri, Feb 11, 2022 at 4:13 AM Chris Down <chris@chrisdown.name> wrote:
>
[...]
> >To make high limit enforcement more robust, this patch makes the
> >enforcement synchronous, but only if the accumulated overcharge becomes
> >larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
> >throttled on the return-to-userspace path; only the extreme allocations
> >which accumulate a large amount of overcharge without returning to
> >userspace will be throttled synchronously. The value MEMCG_CHARGE_BATCH
> >is a bit arbitrary, but most other places in the memcg codebase use this
> >constant, so for now this patch uses the same one.
>
> Note that mem_cgroup_handle_over_high() has its own allocator throttling grace
> period, where it bails out if the penalty to apply is less than 10ms. The
> reclaim will still happen, though. So throttling might not happen even for
> roughly MEMCG_CHARGE_BATCH-sized allocations, depending on the overall size of
> the cgroup and its protection.
>

Here by throttling, I meant both reclaim and
schedule_timeout_killable(). I don't want to mention low-level details
which might change in the future.

[...]
>
> Thanks, I was going to comment on v1 that I prefer to keep the implementation
> of mem_cgroup_handle_over_high if possible since we know that the mechanism has
> been safe in production over the past few years.
>
> One question I have is about throttling. It looks like this new
> mem_cgroup_handle_over_high callsite may mean that throttling is invoked more
> than once on a misbehaving workload that's failing to reclaim since the
> throttling could be invoked both here and in return to userspace, right? That
> might not be a problem, but we should think about the implications of that,
> especially in relation to MEMCG_MAX_HIGH_DELAY_JIFFIES.
>

Please note that mem_cgroup_handle_over_high() clears
memcg_nr_pages_over_high, and if on the return-to-userspace path
mem_cgroup_handle_over_high() finds that memcg_nr_pages_over_high is
non-zero, it means the task has accumulated further charges over the
high limit after a possibly synchronous mem_cgroup_handle_over_high()
call.
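
In other words, the counter is snapshotted and cleared at the top of the
handler, so each invocation only acts on the overcharge accumulated since
the previous one; roughly (paraphrased from mm/memcontrol.c, most of the
body elided):

void mem_cgroup_handle_over_high(void)
{
	unsigned int nr_pages = current->memcg_nr_pages_over_high;
	struct mem_cgroup *memcg;

	if (likely(!nr_pages))
		return;

	memcg = get_mem_cgroup_from_mm(current->mm);
	/* Reset the counter: later charges accumulate from zero again. */
	current->memcg_nr_pages_over_high = 0;

	/* ... reclaim nr_pages, possibly throttle, then css_put() ... */
}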

> Maybe we should record if throttling happened previously and avoid doing it
> again for this entry into kernelspace? Not certain that's the right answer, but
> we should think about what the new semantics should be.

For now, I will keep this as is, and will add a comment in the code and
a mention in the commit message about it. I will wait for others to
comment before sending the next version. Thanks for taking a look.
Shakeel Butt Feb. 15, 2022, 6:50 p.m. UTC | #3
On Thu, Feb 10, 2022 at 10:49 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> The high limit is used to throttle a workload without invoking the
> oom-killer. Recently we tried to use the high limit to right-size our
> internal workloads, more specifically to dynamically adjust the limits
> of a workload without letting it get oom-killed. However, due to a
> limitation in the implementation of high limit enforcement, we observed
> that the mechanism fails for some real workloads.
>
> The high limit is enforced on return to userspace, i.e. the kernel lets
> the usage go over the limit, and when execution returns to userspace,
> high reclaim is triggered and the process can get throttled as well.
> However, this mechanism fails for workloads which do large allocations
> in a single kernel entry, e.g. applications that mlock() a large chunk
> of memory in a single syscall. Such applications bypass the high limit
> and can trigger the oom-killer.
>
> To make high limit enforcement more robust, this patch makes the
> enforcement synchronous, but only if the accumulated overcharge becomes
> larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
> throttled on the return-to-userspace path; only the extreme allocations
> which accumulate a large amount of overcharge without returning to
> userspace will be throttled synchronously. The value MEMCG_CHARGE_BATCH
> is a bit arbitrary, but most other places in the memcg codebase use this
> constant, so for now this patch uses the same one.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Any comments or concerns on this patch? Otherwise I would ask Andrew
to add this series into the mm tree.

> ---
> Changes since v1:
> - Based on Roman's comment, simplify the sync enforcement and only
>   target the extreme cases.
>
>  mm/memcontrol.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 292b0b99a2c7..0da4be4798e7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2703,6 +2703,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>                 }
>         } while ((memcg = parent_mem_cgroup(memcg)));
>
> +       if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
> +           !(current->flags & PF_MEMALLOC) &&
> +           gfpflags_allow_blocking(gfp_mask)) {
> +               mem_cgroup_handle_over_high();
> +       }
>         return 0;
>  }
>
> --
> 2.35.1.265.g69c8d7142f-goog
>
Roman Gushchin Feb. 15, 2022, 11:27 p.m. UTC | #4
On Thu, Feb 10, 2022 at 10:49:17PM -0800, Shakeel Butt wrote:
> The high limit is used to throttle a workload without invoking the
> oom-killer. Recently we tried to use the high limit to right-size our
> internal workloads, more specifically to dynamically adjust the limits
> of a workload without letting it get oom-killed. However, due to a
> limitation in the implementation of high limit enforcement, we observed
> that the mechanism fails for some real workloads.
>
> The high limit is enforced on return to userspace, i.e. the kernel lets
> the usage go over the limit, and when execution returns to userspace,
> high reclaim is triggered and the process can get throttled as well.
> However, this mechanism fails for workloads which do large allocations
> in a single kernel entry, e.g. applications that mlock() a large chunk
> of memory in a single syscall. Such applications bypass the high limit
> and can trigger the oom-killer.
>
> To make high limit enforcement more robust, this patch makes the
> enforcement synchronous, but only if the accumulated overcharge becomes
> larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
> throttled on the return-to-userspace path; only the extreme allocations
> which accumulate a large amount of overcharge without returning to
> userspace will be throttled synchronously. The value MEMCG_CHARGE_BATCH
> is a bit arbitrary, but most other places in the memcg codebase use this
> constant, so for now this patch uses the same one.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
> Changes since v1:
> - Based on Roman's comment, simplify the sync enforcement and only
>   target the extreme cases.

Reviewed-by: Roman Gushchin <guro@fb.com>

This version indeed looks more safe to me.

Thanks!
Chris Down Feb. 16, 2022, 1:12 p.m. UTC | #5
Shakeel Butt writes:
>> Thanks, I was going to comment on v1 that I prefer to keep the implementation
>> of mem_cgroup_handle_over_high if possible since we know that the mechanism has
>> been safe in production over the past few years.
>>
>> One question I have is about throttling. It looks like this new
>> mem_cgroup_handle_over_high callsite may mean that throttling is invoked more
>> than once on a misbehaving workload that's failing to reclaim since the
>> throttling could be invoked both here and in return to userspace, right? That
>> might not be a problem, but we should think about the implications of that,
>> especially in relation to MEMCG_MAX_HIGH_DELAY_JIFFIES.
>>
>
>Please note that mem_cgroup_handle_over_high() clears
>memcg_nr_pages_over_high, and if on the return-to-userspace path
>mem_cgroup_handle_over_high() finds that memcg_nr_pages_over_high is
>non-zero, it means the task has accumulated further charges over the
>high limit after a possibly synchronous mem_cgroup_handle_over_high()
>call.

Oh sure, my point was only that MEMCG_MAX_HIGH_DELAY_JIFFIES is there to more
reliably ensure we return to userspace at some point in the near future, to
allow the task another chance at good behaviour instead of being immediately
whacked by whatever is monitoring PSI -- for example, in the case where we
have a daemon which monitors its own PSI contributions and will make a
proactive attempt to free structures in userspace.

That said, the throttling here still isn't unbounded, and it's not likely that 
anyone doing such large allocations after already exceeding memory.high is 
being a good citizen, so I think the patch makes sense as long as the change is 
understood and documented internally.

Thanks!

Acked-by: Chris Down <chris@chrisdown.name>

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 292b0b99a2c7..0da4be4798e7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2703,6 +2703,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
 
+	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
+	    !(current->flags & PF_MEMALLOC) &&
+	    gfpflags_allow_blocking(gfp_mask)) {
+		mem_cgroup_handle_over_high();
+	}
 	return 0;
 }