[RFC] mm: don't raise MEMCG_OOM event due to failed high-order allocation

Message ID 20180910215622.4428-1-guro@fb.com (mailing list archive)
State New, archived

Commit Message

Roman Gushchin Sept. 10, 2018, 9:56 p.m. UTC
The memcg OOM killer is never invoked due to a failed high-order
allocation; however, the MEMCG_OOM event can still be raised easily.

Under some memory pressure it can happen easily because of a
concurrent allocation. Let's look at try_charge(). Even if we were
able to reclaim enough memory, this check can fail due to a race
with another allocation:

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
        goto retry;

For regular pages the following condition will save us from triggering
the OOM:

   if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
       goto retry;

But for high-order allocations this condition will intentionally fail.
The reason is that we'll likely fall back to regular pages anyway,
so it's ok and even preferred to return ENOMEM.

In this case the idea of raising the MEMCG_OOM event looks dubious.

Fix this by moving the raising of MEMCG_OOM into mem_cgroup_oom(), after
the allocation order check, so that the event won't be raised for
high-order allocations.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
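
The control flow described in the commit message can be modelled in userspace. The following is only a simplified sketch, not the kernel code: model_charge(), struct charge_result and enum event_site are hypothetical names introduced for illustration, and the real try_charge() has further steps (retry counter, GFP flag checks, fatal signals) that are left out. It assumes PAGE_ALLOC_COSTLY_ORDER is 3, as in mainline.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* assumed mainline value */

/* Hypothetical switch: where the MEMCG_OOM event is raised. */
enum event_site {
	EVENT_IN_TRY_CHARGE,		/* before the patch */
	EVENT_IN_MEM_CGROUP_OOM,	/* after the patch */
};

struct charge_result {
	bool retried;		/* the charge loop would "goto retry" */
	bool oom_event;		/* MEMCG_OOM would be raised */
	bool oom_skipped;	/* mem_cgroup_oom() would return OOM_SKIPPED */
};

/*
 * Model one pass through the tail of try_charge() after reclaim.
 * margin:       pages still available under the limit
 * nr_reclaimed: pages reclaimed during this pass
 */
static struct charge_result model_charge(unsigned int order,
					 unsigned long margin,
					 unsigned long nr_reclaimed,
					 enum event_site site)
{
	struct charge_result res = { false, false, false };
	unsigned long nr_pages;

	assert(order < 8 * sizeof(unsigned long));
	nr_pages = 1UL << order;

	/* Enough room appeared (unless a concurrent charge raced us). */
	if (margin >= nr_pages) {
		res.retried = true;
		return res;
	}

	/* Reclaim made progress and the request is not costly: retry. */
	if (nr_reclaimed && nr_pages <= (1UL << PAGE_ALLOC_COSTLY_ORDER)) {
		res.retried = true;
		return res;
	}

	/* Old placement: the event fires before the order check. */
	if (site == EVENT_IN_TRY_CHARGE)
		res.oom_event = true;

	/* Start of mem_cgroup_oom(): costly orders never reach the killer. */
	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		res.oom_skipped = true;
		return res;
	}

	/* New placement: the event fires only past the order check. */
	if (site == EVENT_IN_MEM_CGROUP_OOM)
		res.oom_event = true;

	return res;
}
```

For an order-4 request that loses the margin race after a successful reclaim, the model reports an event with the old placement and none with the new one, while order-0 requests behave the same in both cases.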

Comments

David Rientjes Sept. 11, 2018, 12:40 a.m. UTC | #1
On Mon, 10 Sep 2018, Roman Gushchin wrote:

> The memcg OOM killer is never invoked due to a failed high-order
> allocation, however the MEMCG_OOM event can be easily raised.
> 
> Under some memory pressure it can happen easily because of a
> concurrent allocation. Let's look at try_charge(). Even if we were
> able to reclaim enough memory, this check can fail due to a race
> with another allocation:
> 
>     if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>         goto retry;
> 
> For regular pages the following condition will save us from triggering
> the OOM:
> 
>    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
>        goto retry;
> 
> But for high-order allocation this condition will intentionally fail.
> The reason behind is that we'll likely fall to regular pages anyway,
> so it's ok and even preferred to return ENOMEM.
> 
> In this case the idea of raising the MEMCG_OOM event looks dubious.
> 
> Fix this by moving MEMCG_OOM raising to  mem_cgroup_oom() after
> allocation order check, so that the event won't be raised for high
> order allocations.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>

Acked-by: David Rientjes <rientjes@google.com>

Michal Hocko Sept. 11, 2018, 12:11 p.m. UTC | #2
On Mon 10-09-18 14:56:22, Roman Gushchin wrote:
> The memcg OOM killer is never invoked due to a failed high-order
> allocation, however the MEMCG_OOM event can be easily raised.
> 
> Under some memory pressure it can happen easily because of a
> concurrent allocation. Let's look at try_charge(). Even if we were
> able to reclaim enough memory, this check can fail due to a race
> with another allocation:
> 
>     if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>         goto retry;
> 
> For regular pages the following condition will save us from triggering
> the OOM:
> 
>    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
>        goto retry;
> 
> But for high-order allocation this condition will intentionally fail.
> The reason behind is that we'll likely fall to regular pages anyway,
> so it's ok and even preferred to return ENOMEM.
> 
> In this case the idea of raising the MEMCG_OOM event looks dubious.

Why is this a problem though? IIRC this event was deliberately placed
outside of the oom path because we wanted to count allocation failures
and this is also documented that way

          oom
                The number of time the cgroup's memory usage was
                reached the limit and allocation was about to fail.

                Depending on context result could be invocation of OOM
                killer and retrying allocation or failing a

One could argue that we do not apply the same logic to GFP_NOWAIT
requests but in general I would like to see a good reason to change
the behavior and if it is really the right thing to do then we need to
update the documentation as well.

> Fix this by moving MEMCG_OOM raising to  mem_cgroup_oom() after
> allocation order check, so that the event won't be raised for high
> order allocations.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fcec9b39e2a3..103ca3c31c04 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1669,6 +1669,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  	if (order > PAGE_ALLOC_COSTLY_ORDER)
>  		return OOM_SKIPPED;
>  
> +	memcg_memory_event(memcg, MEMCG_OOM);
> +
>  	/*
>  	 * We are in the middle of the charge context here, so we
>  	 * don't want to block when potentially sitting on a callstack
> @@ -2250,8 +2252,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (fatal_signal_pending(current))
>  		goto force;
>  
> -	memcg_memory_event(mem_over_limit, MEMCG_OOM);
> -
>  	/*
>  	 * keep retrying as long as the memcg oom killer is able to make
>  	 * a forward progress or bypass the charge if the oom killer
> -- 
> 2.17.1

Peter Enderborg Sept. 11, 2018, 12:41 p.m. UTC | #3
On 09/11/2018 02:11 PM, Michal Hocko wrote:
> Why is this a problem though? IIRC this event was deliberately placed
> outside of the oom path because we wanted to count allocation failures
> and this is also documented that way
>
>           oom
>                 The number of time the cgroup's memory usage was
>                 reached the limit and allocation was about to fail.
>
>                 Depending on context result could be invocation of OOM
>                 killer and retrying allocation or failing a
>
> One could argue that we do not apply the same logic to GFP_NOWAIT
> requests but in general I would like to see a good reason to change
> the behavior and if it is really the right thing to do then we need to
> update the documentation as well.
>

Why not introduce a MEMCG_ALLOC_FAIL into memcg_memory_event?

Johannes Weiner Sept. 11, 2018, 12:43 p.m. UTC | #4
On Mon, Sep 10, 2018 at 02:56:22PM -0700, Roman Gushchin wrote:
> The memcg OOM killer is never invoked due to a failed high-order
> allocation, however the MEMCG_OOM event can be easily raised.

Wasn't the same also true for kernel allocations until recently? We'd
signal MEMCG_OOM and then return -ENOMEM.

> Under some memory pressure it can happen easily because of a
> concurrent allocation. Let's look at try_charge(). Even if we were
> able to reclaim enough memory, this check can fail due to a race
> with another allocation:
> 
>     if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>         goto retry;
> 
> For regular pages the following condition will save us from triggering
> the OOM:
> 
>    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
>        goto retry;
> 
> But for high-order allocation this condition will intentionally fail.
> The reason behind is that we'll likely fall to regular pages anyway,
> so it's ok and even preferred to return ENOMEM.

These seem to be more implementation details than anything else.

Personally, I'm confused by the difference between the "oom" and
"oom_kill" events, and I don't understand when you would be interested
in one and when in the other. The difference again seems to be mostly
implementation details.

But the definition of "oom"/MEMCG_OOM in cgroup-v2.rst applies to the
situation of failing higher-order allocations. I'm not per se against
changing the semantics here, as I don't think they are great. But can
you please start out with rewriting the definition in a way that shows
the practical difference for users?

The original idea behind MEMCG_OOM was to signal when reclaim had
failed and we defer to the oom killer. The oom killer may or may not
kill anything, which is the case for higher order allocations, but
that doesn't change the out-of-memory situation that has occurred.

Konstantin added the OOM_KILL events to count actual kills. It seems
to me that this has much more practical applications than the more
theoretical OOM, since users care more about kills and not necessarily
about "reclaim failed (but I might have been able to handle it with
retries and fallback allocations, and so there isn't an actual issue)".

Is there a good reason for keeping OOM now that we have OOM_KILL?

Michal Hocko Sept. 11, 2018, 12:47 p.m. UTC | #5
On Tue 11-09-18 14:41:04, peter enderborg wrote:
> On 09/11/2018 02:11 PM, Michal Hocko wrote:
> > Why is this a problem though? IIRC this event was deliberately placed
> > outside of the oom path because we wanted to count allocation failures
> > and this is also documented that way
> >
> >           oom
> >                 The number of time the cgroup's memory usage was
> >                 reached the limit and allocation was about to fail.
> >
> >                 Depending on context result could be invocation of OOM
> >                 killer and retrying allocation or failing a
> >
> > One could argue that we do not apply the same logic to GFP_NOWAIT
> > requests but in general I would like to see a good reason to change
> > the behavior and if it is really the right thing to do then we need to
> > update the documentation as well.
> >
> 
> Why not introduce a MEMCG_ALLOC_FAIL in to memcg_memory_event?

My understanding is that this is what the oom event is about.

Roman Gushchin Sept. 11, 2018, 3:27 p.m. UTC | #6
On Tue, Sep 11, 2018 at 02:11:41PM +0200, Michal Hocko wrote:
> On Mon 10-09-18 14:56:22, Roman Gushchin wrote:
> > The memcg OOM killer is never invoked due to a failed high-order
> > allocation, however the MEMCG_OOM event can be easily raised.
> > 
> > Under some memory pressure it can happen easily because of a
> > concurrent allocation. Let's look at try_charge(). Even if we were
> > able to reclaim enough memory, this check can fail due to a race
> > with another allocation:
> > 
> >     if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> >         goto retry;
> > 
> > For regular pages the following condition will save us from triggering
> > the OOM:
> > 
> >    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
> >        goto retry;
> > 
> > But for high-order allocation this condition will intentionally fail.
> > The reason behind is that we'll likely fall to regular pages anyway,
> > so it's ok and even preferred to return ENOMEM.
> > 
> > In this case the idea of raising the MEMCG_OOM event looks dubious.
> 
> Why is this a problem though? IIRC this event was deliberately placed
> outside of the oom path because we wanted to count allocation failures
> and this is also documented that way
> 
>           oom
>                 The number of time the cgroup's memory usage was
>                 reached the limit and allocation was about to fail.
> 
>                 Depending on context result could be invocation of OOM
>                 killer and retrying allocation or failing a
> 
> One could argue that we do not apply the same logic to GFP_NOWAIT
> requests but in general I would like to see a good reason to change
> the behavior and if it is really the right thing to do then we need to
> update the documentation as well.

Right, the current behavior matches the documentation, because the description
of the event is broad enough. My point is that the current behavior is not
useful in my corner case.

Let me explain my case in detail: I've got a report about sporadic memcg oom
kills on some hosts with plenty of pagecache and low memory pressure.
You'll probably agree that raising an OOM signal in this case looks strange.

It's natural for cgroup memory usage to be around the memory.max border, and
I've explained in the commit message how an attempt to charge a high-order
allocation can fail in this case, even if there is no real memory pressure
in the cgroup.

Thanks!

Roman Gushchin Sept. 11, 2018, 3:34 p.m. UTC | #7
On Tue, Sep 11, 2018 at 02:41:04PM +0200, peter enderborg wrote:
> On 09/11/2018 02:11 PM, Michal Hocko wrote:
> > Why is this a problem though? IIRC this event was deliberately placed
> > outside of the oom path because we wanted to count allocation failures
> > and this is also documented that way
> >
> >           oom
> >                 The number of time the cgroup's memory usage was
> >                 reached the limit and allocation was about to fail.
> >
> >                 Depending on context result could be invocation of OOM
> >                 killer and retrying allocation or failing a
> >
> > One could argue that we do not apply the same logic to GFP_NOWAIT
> > requests but in general I would like to see a good reason to change
> > the behavior and if it is really the right thing to do then we need to
> > update the documentation as well.
> >
> 
> Why not introduce a MEMCG_ALLOC_FAIL in to memcg_memory_event?

memory.events contains events which are useful (actionable) for userspace.
E.g. a memory.high event may signal that the high limit has been reached and
the workload is being slowed down by being forced into direct reclaim.

A kernel allocation failure is not a userspace problem, so it's not actionable.
I'd say memory.stat would be a good place for such a counter.

Thanks!

Roman Gushchin Sept. 11, 2018, 3:47 p.m. UTC | #8
On Tue, Sep 11, 2018 at 08:43:03AM -0400, Johannes Weiner wrote:
> On Mon, Sep 10, 2018 at 02:56:22PM -0700, Roman Gushchin wrote:
> > The memcg OOM killer is never invoked due to a failed high-order
> > allocation, however the MEMCG_OOM event can be easily raised.
> 
> Wasn't the same also true for kernel allocations until recently? We'd
> signal MEMCG_OOM and then return -ENOMEM.

Well, assuming that it's normal for a cgroup to have its memory usage
around the memory.max border, that sounds strange.

> 
> > Under some memory pressure it can happen easily because of a
> > concurrent allocation. Let's look at try_charge(). Even if we were
> > able to reclaim enough memory, this check can fail due to a race
> > with another allocation:
> > 
> >     if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> >         goto retry;
> > 
> > For regular pages the following condition will save us from triggering
> > the OOM:
> > 
> >    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
> >        goto retry;
> > 
> > But for high-order allocation this condition will intentionally fail.
> > The reason behind is that we'll likely fall to regular pages anyway,
> > so it's ok and even preferred to return ENOMEM.
> 
> These seem to be more implementation details than anything else.
> 
> Personally, I'm confused by the difference between the "oom" and
> "oom_kill" events, and I don't understand when you would be interested
> in one and when in the other. The difference again seems to be mostly
> implementation details.
> 
> But the definition of "oom"/MEMCG_OOM in cgroup-v2.rst applies to the
> situation of failing higher-order allocations. I'm not per-se against
> changing the semantics here, as I don't think they are great. But can
> you please start out with rewriting the definition in a way that shows
> the practical difference for users?
> 
> The original idea behind MEMCG_OOM was to signal when reclaim had
> failed and we defer to the oom killer. The oom killer may or may not
> kill anything, which is the case for higher order allocations, but
> that doesn't change the out-of-memory situation that has occurred.
> 
> Konstantin added the OOM_KILL events to count actual kills. It seems
> to me that this has much more practical applications than the more
> theoretical OOM, since users care more about kills and not necessarily
> about "reclaim failed (but i might have been able to handle it with
> retries and fallback allocations, and so there isn't an actual issue".
> 
> Is there a good reason for keeping OOM now that we have OOM_KILL?

I totally agree that oom_kill is more useful, and I did propose converting
the existing oom counter to oom_kill semantics back when Konstantin's patch
was discussed. So I'm not arguing here that having two counters is really
useful; I've expressed the opposite opinion from the start.

However, I'm not sure whether it's too late to remove the oom event.
But if it is too late, let's make it less confusing.

The definition of the oom event in the docs is quite broad, so both the current
behavior and the proposed change fit it. So it's not a semantics change
at all, purely an implementation detail.

Let's agree that the oom event should not indicate a "random" allocation
failure, but one caused by high memory pressure. Otherwise it's really
an alloc_failure counter, which has to be moved to memory.stat.

Thanks!

Michal Hocko Sept. 12, 2018, 12:35 p.m. UTC | #9
On Tue 11-09-18 08:27:30, Roman Gushchin wrote:
> On Tue, Sep 11, 2018 at 02:11:41PM +0200, Michal Hocko wrote:
> > On Mon 10-09-18 14:56:22, Roman Gushchin wrote:
> > > The memcg OOM killer is never invoked due to a failed high-order
> > > allocation, however the MEMCG_OOM event can be easily raised.
> > > 
> > > Under some memory pressure it can happen easily because of a
> > > concurrent allocation. Let's look at try_charge(). Even if we were
> > > able to reclaim enough memory, this check can fail due to a race
> > > with another allocation:
> > > 
> > >     if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> > >         goto retry;
> > > 
> > > For regular pages the following condition will save us from triggering
> > > the OOM:
> > > 
> > >    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
> > >        goto retry;
> > > 
> > > But for high-order allocation this condition will intentionally fail.
> > > The reason behind is that we'll likely fall to regular pages anyway,
> > > so it's ok and even preferred to return ENOMEM.
> > > 
> > > In this case the idea of raising the MEMCG_OOM event looks dubious.
> > 
> > Why is this a problem though? IIRC this event was deliberately placed
> > outside of the oom path because we wanted to count allocation failures
> > and this is also documented that way
> > 
> >           oom
> >                 The number of time the cgroup's memory usage was
> >                 reached the limit and allocation was about to fail.
> > 
> >                 Depending on context result could be invocation of OOM
> >                 killer and retrying allocation or failing a
> > 
> > One could argue that we do not apply the same logic to GFP_NOWAIT
> > requests but in general I would like to see a good reason to change
> > the behavior and if it is really the right thing to do then we need to
> > update the documentation as well.
> 
> Right, the current behavior matches the documentation, because the description
> of the event is broad enough. My point is that the current behavior is not
> useful in my corner case.
> 
> Let me explain my case in details: I've got a report about sporadic memcg oom
> kills on some hosts with plenty of pagecache and low memory pressure.
> You'll probably agree, that raising OOM signal in this case looks strange.

I am not sure I follow. So you see both OOM_KILL and OOM events and the
user misinterprets OOM ones?

My understanding was that the OOM event should tell the admin that the limit
should be increased in order to allow more charges. Without OOM_KILL
events it means that those failed charges have some sort of fallback,
so it is not a critical condition for the workload yet. It is something to
watch for, though, in case of performance degradation or potential misbehavior.

Whether this is how the event is used, I dunno. Anyway, if you want to
just move the event and make it closer to OOM_KILL then I strongly
suspect the event is losing its relevance.
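
The interpretation above (oom without oom_kill means failed charges that had a fallback) can be expressed as a small userspace check. This is a minimal sketch under assumptions: it takes the contents of a cgroup v2 memory.events file as a string in the usual "key value" per-line form, and parse_memory_events(), classify_events() and the EV_* names are hypothetical helpers, not an existing API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct memcg_events {
	unsigned long long oom;
	unsigned long long oom_kill;
};

/* Hypothetical classification of what happened between two snapshots. */
enum ev_class {
	EV_QUIET,		/* neither counter moved */
	EV_OOM_WITHOUT_KILL,	/* charges failed, nothing was killed */
	EV_OOM_KILL,		/* the OOM killer actually killed a task */
};

/*
 * Parse "key value" lines as found in memory.events.
 * Returns 0 if both "oom" and "oom_kill" were found, -1 otherwise.
 */
static int parse_memory_events(const char *text, struct memcg_events *out)
{
	const char *line = text;
	int found = 0;

	assert(text && out);

	while (*line) {
		char key[32];
		unsigned long long val;

		if (sscanf(line, "%31s %llu", key, &val) == 2) {
			if (strcmp(key, "oom") == 0) {
				out->oom = val;
				found |= 1;
			} else if (strcmp(key, "oom_kill") == 0) {
				out->oom_kill = val;
				found |= 2;
			}
		}

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}

	return found == 3 ? 0 : -1;
}

/* Compare two snapshots; a kill takes precedence over a plain oom event. */
static enum ev_class classify_events(const struct memcg_events *prev,
				     const struct memcg_events *cur)
{
	if (cur->oom_kill > prev->oom_kill)
		return EV_OOM_KILL;
	if (cur->oom > prev->oom)
		return EV_OOM_WITHOUT_KILL;
	return EV_QUIET;
}
```

A monitoring tool could read memory.events periodically and treat EV_OOM_WITHOUT_KILL as a hint to watch the cgroup rather than as a failure; whether that hint is meaningful for costly-order allocations is exactly what this thread is debating.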

Roman Gushchin Sept. 12, 2018, 4:25 p.m. UTC | #10
On Wed, Sep 12, 2018 at 02:35:34PM +0200, Michal Hocko wrote:
> On Tue 11-09-18 08:27:30, Roman Gushchin wrote:
> > On Tue, Sep 11, 2018 at 02:11:41PM +0200, Michal Hocko wrote:
> > > On Mon 10-09-18 14:56:22, Roman Gushchin wrote:
> > > > The memcg OOM killer is never invoked due to a failed high-order
> > > > allocation, however the MEMCG_OOM event can be easily raised.
> > > > 
> > > > Under some memory pressure it can happen easily because of a
> > > > concurrent allocation. Let's look at try_charge(). Even if we were
> > > > able to reclaim enough memory, this check can fail due to a race
> > > > with another allocation:
> > > > 
> > > >     if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> > > >         goto retry;
> > > > 
> > > > For regular pages the following condition will save us from triggering
> > > > the OOM:
> > > > 
> > > >    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
> > > >        goto retry;
> > > > 
> > > > But for high-order allocation this condition will intentionally fail.
> > > > The reason behind is that we'll likely fall to regular pages anyway,
> > > > so it's ok and even preferred to return ENOMEM.
> > > > 
> > > > In this case the idea of raising the MEMCG_OOM event looks dubious.
> > > 
> > > Why is this a problem though? IIRC this event was deliberately placed
> > > outside of the oom path because we wanted to count allocation failures
> > > and this is also documented that way
> > > 
> > >           oom
> > >                 The number of time the cgroup's memory usage was
> > >                 reached the limit and allocation was about to fail.
> > > 
> > >                 Depending on context result could be invocation of OOM
> > >                 killer and retrying allocation or failing a
> > > 
> > > One could argue that we do not apply the same logic to GFP_NOWAIT
> > > requests but in general I would like to see a good reason to change
> > > the behavior and if it is really the right thing to do then we need to
> > > update the documentation as well.
> > 
> > Right, the current behavior matches the documentation, because the description
> > of the event is broad enough. My point is that the current behavior is not
> > useful in my corner case.
> > 
> > Let me explain my case in details: I've got a report about sporadic memcg oom
> > kills on some hosts with plenty of pagecache and low memory pressure.
> > You'll probably agree, that raising OOM signal in this case looks strange.
> 
> I am not sure I follow. So you see both OOM_KILL and OOM events and the
> user misinterprets OOM ones?

No, I see sporadic OOMs without OOM_KILLs in cgroups with plenty of pagecache
and low memory pressure. It's not a pre-OOM condition at all.

> 
> My understanding was that OOM event should tell admin that the limit
> should be increased in order to allow more charges. Without OOM_KILL
> events it means that those failed charges have some sort of fallback
> so it is not critical condition for the workload yet. Something to watch
> for though in case of perf. degradation or potential misbehavior.

Right, something like "there is a shortage of memory which will likely
lead to OOM soon". It's not my case.

> 
> Whether this is how the event is used, I dunno. Anyway, if you want to
> just move the event and make it closer to OOM_KILL then I strongly
> suspect the event is losing its relevance.

I agree here (about losing relevance), but don't think it's a reason
to generate misleading events.

Thanks!
Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcec9b39e2a3..103ca3c31c04 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1669,6 +1669,8 @@  static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		return OOM_SKIPPED;
 
+	memcg_memory_event(memcg, MEMCG_OOM);
+
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
@@ -2250,8 +2252,6 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (fatal_signal_pending(current))
 		goto force;
 
-	memcg_memory_event(mem_over_limit, MEMCG_OOM);
-
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
 	 * a forward progress or bypass the charge if the oom killer