[v3,2/4] mm/oom: handle remote ooms

Message ID	20211111234203.1824138-3-almasrymina@google.com (mailing list archive)
State	New
Headers	show Return-Path: <SRS0=7n0b=P6=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 2C6E36108B Date: Thu, 11 Nov 2021 15:42:01 -0800 In-Reply-To: <20211111234203.1824138-1-almasrymina@google.com> Message-Id: <20211111234203.1824138-3-almasrymina@google.com> Mime-Version: 1.0 References: <20211111234203.1824138-1-almasrymina@google.com> Subject: [PATCH v3 2/4] mm/oom: handle remote ooms From: Mina Almasry <almasrymina@google.com> Cc: Mina Almasry <almasrymina@google.com>, Michal Hocko <mhocko@suse.com>, "Theodore Ts'o" <tytso@mit.edu>, Greg Thelen <gthelen@google.com>, Shakeel Butt <shakeelb@google.com>, Andrew Morton <akpm@linux-foundation.org>, Hugh Dickins <hughd@google.com>, Roman Gushchin <guro@fb.com>, Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>, Vladimir Davydov <vdavydov.dev@gmail.com>, Muchun Song <songmuchun@bytedance.com>, riel@surriel.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[v3,1/4] mm/shmem: support deterministic charging of tmpfs \| expand [v3,1/4] mm/shmem: support deterministic charging of tmpfs [v3,2/4] mm/oom: handle remote ooms [v3,3/4] mm, shmem: add tmpfs memcg= option documentation [v3,4/4] mm, shmem, selftests: add tmpfs memcg= mount option tests

Mina Almasry Nov. 11, 2021, 11:42 p.m. UTC

On remote ooms (OOMs due to remote charging), the oom-killer will attempt
to find a task to kill in the memcg under oom, if the oom-killer
is unable to find one, the oom-killer should simply return ENOMEM to the
allocating process.

If we're in pagefault path and we're unable to return ENOMEM to the
allocating process, we instead kill the allocating process.

Signed-off-by: Mina Almasry <almasrymina@google.com>

Cc: Michal Hocko <mhocko@suse.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Greg Thelen <gthelen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
CC: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: riel@surriel.com
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cgroups@vger.kernel.org

---

Changes in v3:
- Fixed build failures/warnings Reported-by: kernel test robot <lkp@intel.com>

Changes in v2:
- Moved the remote oom handling as Roman requested.
- Used mem_cgroup_from_task(current) instead of grabbing the memcg from
current->mm

---
 include/linux/memcontrol.h | 16 ++++++++++++++++
 mm/memcontrol.c            | 29 +++++++++++++++++++++++++++++
 mm/oom_kill.c              | 22 ++++++++++++++++++++++
 3 files changed, 67 insertions(+)

--
2.34.0.rc1.387.gb447b232ab-goog

Michal Hocko Nov. 12, 2021, 7:51 a.m. UTC | #1

On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> to find a task to kill in the memcg under oom, if the oom-killer
> is unable to find one, the oom-killer should simply return ENOMEM to the
> allocating process.

This really begs for some justification.

> If we're in pagefault path and we're unable to return ENOMEM to the
> allocating process, we instead kill the allocating process.

Why do you handle those differently?

> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Greg Thelen <gthelen@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> CC: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: riel@surriel.com
> Cc: linux-mm@kvack.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: cgroups@vger.kernel.org

Mina Almasry Nov. 12, 2021, 8:12 a.m. UTC | #2

On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > to find a task to kill in the memcg under oom, if the oom-killer
> > is unable to find one, the oom-killer should simply return ENOMEM to the
> > allocating process.
>
> This really begs for some justification.
>

I'm thinking (and I can add to the commit message in v4) that we have
2 reasonable options when the oom-killer gets invoked and finds
nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
thinking returning ENOMEM allows the application to gracefully handle
the failure to remote charge and continue operation.

For example, in the network service use case that I mentioned in the
RFC proposal, it's beneficial for the network service to get an ENOMEM
and continue to service network requests for other clients running on
the machine, rather than get oom-killed when hitting the remote memcg
limit. But, this is not a hard requirement, the network service could
fork a process that does the remote charging to guard against the
remote charge bringing down the entire process.

> > If we're in pagefault path and we're unable to return ENOMEM to the
> > allocating process, we instead kill the allocating process.
>
> Why do you handle those differently?
>

I'm thinking (possibly incorrectly) it's beneficial to return ENOMEM
to the allocating task rather than killing it. I would love to return
ENOMEM in both these cases, but I can't return ENOMEM in the fault
path. The behavior I see is that the oom-killer gets invoked over and
over again looking to find something to kill and continually failing
to find something to kill and the pagefault never gets handled.

I could, however, kill the allocating task whether it's in the
pagefault path or not; it's not a hard requirement that I return
ENOMEM. If this is what you'd like to see in v4, please let me know,
but I do see some value in allowing some callers to gracefully handle
the ENOMEM.

> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Theodore Ts'o <tytso@mit.edu>
> > Cc: Greg Thelen <gthelen@google.com>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > CC: Roman Gushchin <guro@fb.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: Muchun Song <songmuchun@bytedance.com>
> > Cc: riel@surriel.com
> > Cc: linux-mm@kvack.org
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: cgroups@vger.kernel.org
> --
> Michal Hocko
> SUSE Labs

Michal Hocko Nov. 12, 2021, 8:36 a.m. UTC | #3

On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > to find a task to kill in the memcg under oom, if the oom-killer
> > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > allocating process.
> >
> > This really begs for some justification.
> >
> 
> I'm thinking (and I can add to the commit message in v4) that we have
> 2 reasonable options when the oom-killer gets invoked and finds
> nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> thinking returning ENOMEM allows the application to gracefully handle
> the failure to remote charge and continue operation.
> 
> For example, in the network service use case that I mentioned in the
> RFC proposal, it's beneficial for the network service to get an ENOMEM
> and continue to service network requests for other clients running on
> the machine, rather than get oom-killed when hitting the remote memcg
> limit. But, this is not a hard requirement, the network service could
> fork a process that does the remote charging to guard against the
> remote charge bringing down the entire process.

This all belongs to the changelog so that we can discuss all potential
implication and do not rely on any implicit assumptions. E.g. why does
it even make sense to kill a task in the origin cgroup?

> > > If we're in pagefault path and we're unable to return ENOMEM to the
> > > allocating process, we instead kill the allocating process.
> >
> > Why do you handle those differently?
> >
> 
> I'm thinking (possibly incorrectly) it's beneficial to return ENOMEM
> to the allocating task rather than killing it. I would love to return
> ENOMEM in both these cases, but I can't return ENOMEM in the fault
> path. The behavior I see is that the oom-killer gets invoked over and
> over again looking to find something to kill and continually failing
> to find something to kill and the pagefault never gets handled.

Just one remark. Until just very recently VM_FAULT_OOM (a result of
ENOMEM) would trigger the global OOM killer. This has changed by
60e2793d440a ("mm, oom: do not trigger out_of_memory from the #PF").
But you are right that you might just end up looping in the page fault
for ever. Is that bad though? The situation is fundamentaly
unresolveable at this stage. On the other hand the task is still
killable so the userspace can decide to terminate and break out of the
loop.

What is the best approach I am not quite sure. As I've said earlier this
is very likely going to open a can of worms and so it should be
evaluated very carefuly. For that, please make sure to describe your
thinking in details.
 
> I could, however, kill the allocating task whether it's in the
> pagefault path or not; it's not a hard requirement that I return
> ENOMEM. If this is what you'd like to see in v4, please let me know,
> but I do see some value in allowing some callers to gracefully handle
> the ENOMEM.
> 
> > > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > >
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Theodore Ts'o <tytso@mit.edu>
> > > Cc: Greg Thelen <gthelen@google.com>
> > > Cc: Shakeel Butt <shakeelb@google.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > CC: Roman Gushchin <guro@fb.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > > Cc: Muchun Song <songmuchun@bytedance.com>
> > > Cc: riel@surriel.com
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: cgroups@vger.kernel.org
> > --
> > Michal Hocko
> > SUSE Labs

Mina Almasry Nov. 12, 2021, 5:59 p.m. UTC | #4

On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> > On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > > to find a task to kill in the memcg under oom, if the oom-killer
> > > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > > allocating process.
> > >
> > > This really begs for some justification.
> > >
> >
> > I'm thinking (and I can add to the commit message in v4) that we have
> > 2 reasonable options when the oom-killer gets invoked and finds
> > nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> > thinking returning ENOMEM allows the application to gracefully handle
> > the failure to remote charge and continue operation.
> >
> > For example, in the network service use case that I mentioned in the
> > RFC proposal, it's beneficial for the network service to get an ENOMEM
> > and continue to service network requests for other clients running on
> > the machine, rather than get oom-killed when hitting the remote memcg
> > limit. But, this is not a hard requirement, the network service could
> > fork a process that does the remote charging to guard against the
> > remote charge bringing down the entire process.
>
> This all belongs to the changelog so that we can discuss all potential
> implication and do not rely on any implicit assumptions.

Understood. Maybe I'll wait to collect more feedback and upload v4
with a thorough explanation of the thought process.

> E.g. why does
> it even make sense to kill a task in the origin cgroup?
>

The behavior I saw returning ENOMEM for this edge case was that the
code was forever looping the pagefault, and I was (seemingly
incorrectly) under the impression that a suggestion to forever loop
the pagefault would be completely fundamentally unacceptable.

> > > > If we're in pagefault path and we're unable to return ENOMEM to the
> > > > allocating process, we instead kill the allocating process.
> > >
> > > Why do you handle those differently?
> > >
> >
> > I'm thinking (possibly incorrectly) it's beneficial to return ENOMEM
> > to the allocating task rather than killing it. I would love to return
> > ENOMEM in both these cases, but I can't return ENOMEM in the fault
> > path. The behavior I see is that the oom-killer gets invoked over and
> > over again looking to find something to kill and continually failing
> > to find something to kill and the pagefault never gets handled.
>
> Just one remark. Until just very recently VM_FAULT_OOM (a result of
> ENOMEM) would trigger the global OOM killer. This has changed by
> 60e2793d440a ("mm, oom: do not trigger out_of_memory from the #PF").
> But you are right that you might just end up looping in the page fault
> for ever. Is that bad though? The situation is fundamentaly
> unresolveable at this stage. On the other hand the task is still
> killable so the userspace can decide to terminate and break out of the
> loop.
>

I think what you're saying here makes a lot of sense and I think is a
workable approach, and maybe is slightly preferable to killing the
task IMO (and both are workable IMO). The pagefault can loop until
memory becomes available in the remote memcg, and the userspace can
decide to always terminate the process if desired or maybe handle the
issue more gracefully by freeing memory in the remote memcg somehow;
i.e. maybe we don't need the kernel to be heavy handed here and kill
the remote allocating task immediately.

> What is the best approach I am not quite sure. As I've said earlier this
> is very likely going to open a can of worms and so it should be
> evaluated very carefuly. For that, please make sure to describe your
> thinking in details.
>

OK, thanks for reviewing and the next iteration should include a
thorough explanation of my thinking.

> > I could, however, kill the allocating task whether it's in the
> > pagefault path or not; it's not a hard requirement that I return
> > ENOMEM. If this is what you'd like to see in v4, please let me know,
> > but I do see some value in allowing some callers to gracefully handle
> > the ENOMEM.
> >
> > > > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > >
> > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > Cc: Theodore Ts'o <tytso@mit.edu>
> > > > Cc: Greg Thelen <gthelen@google.com>
> > > > Cc: Shakeel Butt <shakeelb@google.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Hugh Dickins <hughd@google.com>
> > > > CC: Roman Gushchin <guro@fb.com>
> > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > Cc: Hugh Dickins <hughd@google.com>
> > > > Cc: Tejun Heo <tj@kernel.org>
> > > > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > > > Cc: Muchun Song <songmuchun@bytedance.com>
> > > > Cc: riel@surriel.com
> > > > Cc: linux-mm@kvack.org
> > > > Cc: linux-fsdevel@vger.kernel.org
> > > > Cc: cgroups@vger.kernel.org
> > > --
> > > Michal Hocko
> > > SUSE Labs
>
> --
> Michal Hocko
> SUSE Labs

Michal Hocko Nov. 15, 2021, 10:58 a.m. UTC | #5

On Fri 12-11-21 09:59:22, Mina Almasry wrote:
> On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> > > On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > > > to find a task to kill in the memcg under oom, if the oom-killer
> > > > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > > > allocating process.
> > > >
> > > > This really begs for some justification.
> > > >
> > >
> > > I'm thinking (and I can add to the commit message in v4) that we have
> > > 2 reasonable options when the oom-killer gets invoked and finds
> > > nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> > > thinking returning ENOMEM allows the application to gracefully handle
> > > the failure to remote charge and continue operation.
> > >
> > > For example, in the network service use case that I mentioned in the
> > > RFC proposal, it's beneficial for the network service to get an ENOMEM
> > > and continue to service network requests for other clients running on
> > > the machine, rather than get oom-killed when hitting the remote memcg
> > > limit. But, this is not a hard requirement, the network service could
> > > fork a process that does the remote charging to guard against the
> > > remote charge bringing down the entire process.
> >
> > This all belongs to the changelog so that we can discuss all potential
> > implication and do not rely on any implicit assumptions.
> 
> Understood. Maybe I'll wait to collect more feedback and upload v4
> with a thorough explanation of the thought process.
> 
> > E.g. why does
> > it even make sense to kill a task in the origin cgroup?
> >
> 
> The behavior I saw returning ENOMEM for this edge case was that the
> code was forever looping the pagefault, and I was (seemingly
> incorrectly) under the impression that a suggestion to forever loop
> the pagefault would be completely fundamentally unacceptable.

Well, I have to say I am not entirely sure what is the best way to
handle this situation. Another option would be to treat this similar to
ENOSPACE situation. This would result into SIGBUS IIRC.

The main problem with OOM killer is that it will not resolve the
underlying problem in most situations. Shmem files would likely stay
laying around and their charge along with them. Killing the allocating
task has problems on its own because this could be just a DoS vector by
other unrelated tasks sharing the shmem mount point without a gracefull
fallback. Retrying the page fault is hard to detect. SIGBUS might be
something that helps with the latest. The question is how to communicate
this requerement down to the memcg code to know that the memory reclaim
should happen (Should it? How hard we should try?) but do not invoke the
oom killer. The more I think about this the nastier this is.

Shakeel Butt Nov. 15, 2021, 5:32 p.m. UTC | #6

On Mon, Nov 15, 2021 at 2:58 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
> >
> > The behavior I saw returning ENOMEM for this edge case was that the
> > code was forever looping the pagefault, and I was (seemingly
> > incorrectly) under the impression that a suggestion to forever loop
> > the pagefault would be completely fundamentally unacceptable.
>
> Well, I have to say I am not entirely sure what is the best way to
> handle this situation. Another option would be to treat this similar to
> ENOSPACE situation. This would result into SIGBUS IIRC.
>
> The main problem with OOM killer is that it will not resolve the
> underlying problem in most situations. Shmem files would likely stay
> laying around and their charge along with them.

This and similar topics were discussed during LSFMM 2019
(https://lwn.net/Articles/787626/).

> Killing the allocating
> task has problems on its own because this could be just a DoS vector by
> other unrelated tasks sharing the shmem mount point without a gracefull
> fallback. Retrying the page fault is hard to detect. SIGBUS might be
> something that helps with the latest. The question is how to communicate
> this requerement down to the memcg code to know that the memory reclaim
> should happen (Should it? How hard we should try?) but do not invoke the
> oom killer. The more I think about this the nastier this is.
> --

IMHO we should punt the resolution to the userspace and keep the
kernel simple. This is an opt-in feature and the user is expected to
know and handle exceptional scenarios. The kernel just needs to tell
the userspace that this exceptional situation is happening somehow.

How about for remote ooms irrespective of page fault path or not, keep
the allocator looping but keep incrementing a new memcg event
MEMCG_OOM_NO_VICTIM? The userspace will get to know the situation
either through inotify or polling and can handle the situation by
either increasing the limit or by releasing the memory of the
monitored memcg.

thanks,
Shakeel

Mina Almasry Nov. 16, 2021, 12:58 a.m. UTC | #7

On Mon, Nov 15, 2021 at 2:58 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 12-11-21 09:59:22, Mina Almasry wrote:
> > On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> > > > On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > > > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > > > > to find a task to kill in the memcg under oom, if the oom-killer
> > > > > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > > > > allocating process.
> > > > >
> > > > > This really begs for some justification.
> > > > >
> > > >
> > > > I'm thinking (and I can add to the commit message in v4) that we have
> > > > 2 reasonable options when the oom-killer gets invoked and finds
> > > > nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> > > > thinking returning ENOMEM allows the application to gracefully handle
> > > > the failure to remote charge and continue operation.
> > > >
> > > > For example, in the network service use case that I mentioned in the
> > > > RFC proposal, it's beneficial for the network service to get an ENOMEM
> > > > and continue to service network requests for other clients running on
> > > > the machine, rather than get oom-killed when hitting the remote memcg
> > > > limit. But, this is not a hard requirement, the network service could
> > > > fork a process that does the remote charging to guard against the
> > > > remote charge bringing down the entire process.
> > >
> > > This all belongs to the changelog so that we can discuss all potential
> > > implication and do not rely on any implicit assumptions.
> >
> > Understood. Maybe I'll wait to collect more feedback and upload v4
> > with a thorough explanation of the thought process.
> >
> > > E.g. why does
> > > it even make sense to kill a task in the origin cgroup?
> > >
> >
> > The behavior I saw returning ENOMEM for this edge case was that the
> > code was forever looping the pagefault, and I was (seemingly
> > incorrectly) under the impression that a suggestion to forever loop
> > the pagefault would be completely fundamentally unacceptable.
>
> Well, I have to say I am not entirely sure what is the best way to
> handle this situation. Another option would be to treat this similar to
> ENOSPACE situation. This would result into SIGBUS IIRC.
>
> The main problem with OOM killer is that it will not resolve the
> underlying problem in most situations. Shmem files would likely stay
> laying around and their charge along with them. Killing the allocating
> task has problems on its own because this could be just a DoS vector by
> other unrelated tasks sharing the shmem mount point without a gracefull
> fallback. Retrying the page fault is hard to detect. SIGBUS might be
> something that helps with the latest. The question is how to communicate
> this requerement down to the memcg code to know that the memory reclaim
> should happen (Should it? How hard we should try?) but do not invoke the
> oom killer. The more I think about this the nastier this is.

So actually I thought the ENOSPC suggestion was interesting so I took
the liberty to prototype it. The changes required:

1. In out_of_memory() we return false if !oc->chosen &&
is_remote_oom(). This gets bubbled up to try_charge_memcg() as
mem_cgroup_oom() returning OOM_FAILED.
2. In try_charge_memcg(), if we get an OOM_FAILED we again check
is_remote_oom(), if it is a remote oom, return ENOSPC.
3. The calling code would return ENOSPC to the user in the no-fault
path, and SIGBUS the user in the fault path with no changes.

To be honest I think this is very workable, as is Shakeel's suggestion
of MEMCG_OOM_NO_VICTIM. Since this is an opt-in feature, we can
document the behavior and if the userspace doesn't want to get killed
they can catch the sigbus and handle it gracefully. If not, the
userspace just gets killed if we hit this edge case.

I may be missing something but AFAICT we don't have to "communicate
this requirement down to the memcg code" with this implementation. The
memcg code is aware the memcg is oom and will do the reclaim or
whatever before invoking the oom-killer. It's only when the oom-killer
can't find something to kill that we return ENOSPC or SIGBUS.

As always thank you very much for reviewing and providing feedback.

Michal Hocko Nov. 16, 2021, 9:28 a.m. UTC | #8

On Mon 15-11-21 16:58:19, Mina Almasry wrote:
> On Mon, Nov 15, 2021 at 2:58 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 12-11-21 09:59:22, Mina Almasry wrote:
> > > On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> > > > > On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > > > > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > > > > > to find a task to kill in the memcg under oom, if the oom-killer
> > > > > > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > > > > > allocating process.
> > > > > >
> > > > > > This really begs for some justification.
> > > > > >
> > > > >
> > > > > I'm thinking (and I can add to the commit message in v4) that we have
> > > > > 2 reasonable options when the oom-killer gets invoked and finds
> > > > > nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> > > > > thinking returning ENOMEM allows the application to gracefully handle
> > > > > the failure to remote charge and continue operation.
> > > > >
> > > > > For example, in the network service use case that I mentioned in the
> > > > > RFC proposal, it's beneficial for the network service to get an ENOMEM
> > > > > and continue to service network requests for other clients running on
> > > > > the machine, rather than get oom-killed when hitting the remote memcg
> > > > > limit. But, this is not a hard requirement, the network service could
> > > > > fork a process that does the remote charging to guard against the
> > > > > remote charge bringing down the entire process.
> > > >
> > > > This all belongs to the changelog so that we can discuss all potential
> > > > implication and do not rely on any implicit assumptions.
> > >
> > > Understood. Maybe I'll wait to collect more feedback and upload v4
> > > with a thorough explanation of the thought process.
> > >
> > > > E.g. why does
> > > > it even make sense to kill a task in the origin cgroup?
> > > >
> > >
> > > The behavior I saw returning ENOMEM for this edge case was that the
> > > code was forever looping the pagefault, and I was (seemingly
> > > incorrectly) under the impression that a suggestion to forever loop
> > > the pagefault would be completely fundamentally unacceptable.
> >
> > Well, I have to say I am not entirely sure what is the best way to
> > handle this situation. Another option would be to treat this similar to
> > ENOSPACE situation. This would result into SIGBUS IIRC.
> >
> > The main problem with OOM killer is that it will not resolve the
> > underlying problem in most situations. Shmem files would likely stay
> > laying around and their charge along with them. Killing the allocating
> > task has problems on its own because this could be just a DoS vector by
> > other unrelated tasks sharing the shmem mount point without a gracefull
> > fallback. Retrying the page fault is hard to detect. SIGBUS might be
> > something that helps with the latest. The question is how to communicate
> > this requerement down to the memcg code to know that the memory reclaim
> > should happen (Should it? How hard we should try?) but do not invoke the
> > oom killer. The more I think about this the nastier this is.
> 
> So actually I thought the ENOSPC suggestion was interesting so I took
> the liberty to prototype it. The changes required:
> 
> 1. In out_of_memory() we return false if !oc->chosen &&
> is_remote_oom(). This gets bubbled up to try_charge_memcg() as
> mem_cgroup_oom() returning OOM_FAILED.
> 2. In try_charge_memcg(), if we get an OOM_FAILED we again check
> is_remote_oom(), if it is a remote oom, return ENOSPC.
> 3. The calling code would return ENOSPC to the user in the no-fault
> path, and SIGBUS the user in the fault path with no changes.

I think this should be implemented at the caller side rather than
somehow hacked into the memcg core. It is the caller to know what to do.
The caller can use gfp flags to control the reclaim behavior.

> To be honest I think this is very workable, as is Shakeel's suggestion
> of MEMCG_OOM_NO_VICTIM. Since this is an opt-in feature, we can
> document the behavior and if the userspace doesn't want to get killed
> they can catch the sigbus and handle it gracefully. If not, the
> userspace just gets killed if we hit this edge case.

I am not sure about the MEMCG_OOM_NO_VICTIM approach. It sounds really
hackish to me. I will get back to Shakeel's email as time permits. The
primary problem I have with this, though, is that the kernel oom killer
cannot really do anything sensible if the limit is reached and there
is nothing reclaimable left in this case. The tmpfs backed memory will
simply stay around and there are no means to recover without userspace
intervention.

Michal Hocko Nov. 16, 2021, 9:39 a.m. UTC | #9

On Tue 16-11-21 10:28:25, Michal Hocko wrote:
> On Mon 15-11-21 16:58:19, Mina Almasry wrote:
[...]
> > To be honest I think this is very workable, as is Shakeel's suggestion
> > of MEMCG_OOM_NO_VICTIM. Since this is an opt-in feature, we can
> > document the behavior and if the userspace doesn't want to get killed
> > they can catch the sigbus and handle it gracefully. If not, the
> > userspace just gets killed if we hit this edge case.
> 
> I am not sure about the MEMCG_OOM_NO_VICTIM approach. It sounds really
> hackish to me. I will get back to Shakeel's email as time permits. The
> primary problem I have with this, though, is that the kernel oom killer
> cannot really do anything sensible if the limit is reached and there
> is nothing reclaimable left in this case. The tmpfs backed memory will
> simply stay around and there are no means to recover without userspace
> intervention.

And just a small clarification. Tmpfs is fundamentally problematic from
the OOM handling POV. The nuance here is that the OOM happens in a
different memcg and thus a different resource domain. If you kill a task
in the target memcg then you effectively DoS that workload. If you kill
the allocating task then it is DoSed by anybody allowed to write to that
shmem. All that without a graceful fallback.

I still have very hard time seeing how that can work reasonably except
for a very special case with a lot of other measures to ensure the
target memcg never hits the hard limit so the OOM simply is not a
problem.

Memory controller has always been used to enforce and balance memory
usage among resource domains and this goes against that principle.
I would be really curious what Johannes thinks about this.

Mina Almasry Nov. 16, 2021, 10:17 a.m. UTC | #10

On Tue, Nov 16, 2021 at 1:28 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 15-11-21 16:58:19, Mina Almasry wrote:
> > On Mon, Nov 15, 2021 at 2:58 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 12-11-21 09:59:22, Mina Almasry wrote:
> > > > On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> > > > > > On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > > > > > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > > > > > > to find a task to kill in the memcg under oom, if the oom-killer
> > > > > > > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > > > > > > allocating process.
> > > > > > >
> > > > > > > This really begs for some justification.
> > > > > > >
> > > > > >
> > > > > > I'm thinking (and I can add to the commit message in v4) that we have
> > > > > > 2 reasonable options when the oom-killer gets invoked and finds
> > > > > > nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> > > > > > thinking returning ENOMEM allows the application to gracefully handle
> > > > > > the failure to remote charge and continue operation.
> > > > > >
> > > > > > For example, in the network service use case that I mentioned in the
> > > > > > RFC proposal, it's beneficial for the network service to get an ENOMEM
> > > > > > and continue to service network requests for other clients running on
> > > > > > the machine, rather than get oom-killed when hitting the remote memcg
> > > > > > limit. But, this is not a hard requirement, the network service could
> > > > > > fork a process that does the remote charging to guard against the
> > > > > > remote charge bringing down the entire process.
> > > > >
> > > > > This all belongs to the changelog so that we can discuss all potential
> > > > > implication and do not rely on any implicit assumptions.
> > > >
> > > > Understood. Maybe I'll wait to collect more feedback and upload v4
> > > > with a thorough explanation of the thought process.
> > > >
> > > > > E.g. why does
> > > > > it even make sense to kill a task in the origin cgroup?
> > > > >
> > > >
> > > > The behavior I saw returning ENOMEM for this edge case was that the
> > > > code was forever looping the pagefault, and I was (seemingly
> > > > incorrectly) under the impression that a suggestion to forever loop
> > > > the pagefault would be completely fundamentally unacceptable.
> > >
> > > Well, I have to say I am not entirely sure what is the best way to
> > > handle this situation. Another option would be to treat this similar to
> > > ENOSPACE situation. This would result into SIGBUS IIRC.
> > >
> > > The main problem with OOM killer is that it will not resolve the
> > > underlying problem in most situations. Shmem files would likely stay
> > > laying around and their charge along with them. Killing the allocating
> > > task has problems on its own because this could be just a DoS vector by
> > > other unrelated tasks sharing the shmem mount point without a gracefull
> > > fallback. Retrying the page fault is hard to detect. SIGBUS might be
> > > something that helps with the latest. The question is how to communicate
> > > this requerement down to the memcg code to know that the memory reclaim
> > > should happen (Should it? How hard we should try?) but do not invoke the
> > > oom killer. The more I think about this the nastier this is.
> >
> > So actually I thought the ENOSPC suggestion was interesting so I took
> > the liberty to prototype it. The changes required:
> >
> > 1. In out_of_memory() we return false if !oc->chosen &&
> > is_remote_oom(). This gets bubbled up to try_charge_memcg() as
> > mem_cgroup_oom() returning OOM_FAILED.
> > 2. In try_charge_memcg(), if we get an OOM_FAILED we again check
> > is_remote_oom(), if it is a remote oom, return ENOSPC.
> > 3. The calling code would return ENOSPC to the user in the no-fault
> > path, and SIGBUS the user in the fault path with no changes.
>
> I think this should be implemented at the caller side rather than
> somehow hacked into the memcg core. It is the caller to know what to do.
> The caller can use gfp flags to control the reclaim behavior.
>

Hmm I'm a bit struggling to envision this.  So would it be acceptable
at the call sites where we doing a remote charge, such as
shmem_add_to_page_cache(), if we get ENOMEM from the
mem_cgroup_charge(), and we know we're doing a remote charge (because
current's memcg != the super block memcg), then we return ENOSPC from
shmem_add_to_page_cache()? I believe that will return ENOSPC to the
userspace in the non-pagefault path and SIGBUS in the pagefault path.
Or you had something else in mind?

> > To be honest I think this is very workable, as is Shakeel's suggestion
> > of MEMCG_OOM_NO_VICTIM. Since this is an opt-in feature, we can
> > document the behavior and if the userspace doesn't want to get killed
> > they can catch the sigbus and handle it gracefully. If not, the
> > userspace just gets killed if we hit this edge case.
>
> I am not sure about the MEMCG_OOM_NO_VICTIM approach. It sounds really
> hackish to me. I will get back to Shakeel's email as time permits. The
> primary problem I have with this, though, is that the kernel oom killer
> cannot really do anything sensible if the limit is reached and there
> is nothing reclaimable left in this case. The tmpfs backed memory will
> simply stay around and there are no means to recover without userspace
> intervention.
> --
> Michal Hocko
> SUSE Labs

On Tue, Nov 16, 2021 at 1:39 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 16-11-21 10:28:25, Michal Hocko wrote:
> > On Mon 15-11-21 16:58:19, Mina Almasry wrote:
> [...]
> > > To be honest I think this is very workable, as is Shakeel's suggestion
> > > of MEMCG_OOM_NO_VICTIM. Since this is an opt-in feature, we can
> > > document the behavior and if the userspace doesn't want to get killed
> > > they can catch the sigbus and handle it gracefully. If not, the
> > > userspace just gets killed if we hit this edge case.
> >
> > I am not sure about the MEMCG_OOM_NO_VICTIM approach. It sounds really
> > hackish to me. I will get back to Shakeel's email as time permits. The
> > primary problem I have with this, though, is that the kernel oom killer
> > cannot really do anything sensible if the limit is reached and there
> > is nothing reclaimable left in this case. The tmpfs backed memory will
> > simply stay around and there are no means to recover without userspace
> > intervention.
>
> And just a small clarification. Tmpfs is fundamentally problematic from
> the OOM handling POV. The nuance here is that the OOM happens in a
> different memcg and thus a different resource domain. If you kill a task
> in the target memcg then you effectively DoS that workload. If you kill
> the allocating task then it is DoSed by anybody allowed to write to that
> shmem. All that without a graceful fallback.

I don't know if this addresses your concern, but I'm limiting the
memcg= use to processes that can enter that memcg. Therefore they
would be able to allocate memory in that memcg anyway by entering it.
So if they wanted to intentionally DoS that memcg they can already do
it without this feature.

Michal Hocko Nov. 16, 2021, 11:29 a.m. UTC | #11

On Tue 16-11-21 02:17:09, Mina Almasry wrote:
> On Tue, Nov 16, 2021 at 1:28 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 15-11-21 16:58:19, Mina Almasry wrote:
> > > On Mon, Nov 15, 2021 at 2:58 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 12-11-21 09:59:22, Mina Almasry wrote:
> > > > > On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> > > > > > > On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > > > > > > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > > > > > > > to find a task to kill in the memcg under oom, if the oom-killer
> > > > > > > > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > > > > > > > allocating process.
> > > > > > > >
> > > > > > > > This really begs for some justification.
> > > > > > > >
> > > > > > >
> > > > > > > I'm thinking (and I can add to the commit message in v4) that we have
> > > > > > > 2 reasonable options when the oom-killer gets invoked and finds
> > > > > > > nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> > > > > > > thinking returning ENOMEM allows the application to gracefully handle
> > > > > > > the failure to remote charge and continue operation.
> > > > > > >
> > > > > > > For example, in the network service use case that I mentioned in the
> > > > > > > RFC proposal, it's beneficial for the network service to get an ENOMEM
> > > > > > > and continue to service network requests for other clients running on
> > > > > > > the machine, rather than get oom-killed when hitting the remote memcg
> > > > > > > limit. But, this is not a hard requirement, the network service could
> > > > > > > fork a process that does the remote charging to guard against the
> > > > > > > remote charge bringing down the entire process.
> > > > > >
> > > > > > This all belongs to the changelog so that we can discuss all potential
> > > > > > implication and do not rely on any implicit assumptions.
> > > > >
> > > > > Understood. Maybe I'll wait to collect more feedback and upload v4
> > > > > with a thorough explanation of the thought process.
> > > > >
> > > > > > E.g. why does
> > > > > > it even make sense to kill a task in the origin cgroup?
> > > > > >
> > > > >
> > > > > The behavior I saw returning ENOMEM for this edge case was that the
> > > > > code was forever looping the pagefault, and I was (seemingly
> > > > > incorrectly) under the impression that a suggestion to forever loop
> > > > > the pagefault would be completely fundamentally unacceptable.
> > > >
> > > > Well, I have to say I am not entirely sure what is the best way to
> > > > handle this situation. Another option would be to treat this similar to
> > > > ENOSPACE situation. This would result into SIGBUS IIRC.
> > > >
> > > > The main problem with OOM killer is that it will not resolve the
> > > > underlying problem in most situations. Shmem files would likely stay
> > > > laying around and their charge along with them. Killing the allocating
> > > > task has problems on its own because this could be just a DoS vector by
> > > > other unrelated tasks sharing the shmem mount point without a gracefull
> > > > fallback. Retrying the page fault is hard to detect. SIGBUS might be
> > > > something that helps with the latest. The question is how to communicate
> > > > this requerement down to the memcg code to know that the memory reclaim
> > > > should happen (Should it? How hard we should try?) but do not invoke the
> > > > oom killer. The more I think about this the nastier this is.
> > >
> > > So actually I thought the ENOSPC suggestion was interesting so I took
> > > the liberty to prototype it. The changes required:
> > >
> > > 1. In out_of_memory() we return false if !oc->chosen &&
> > > is_remote_oom(). This gets bubbled up to try_charge_memcg() as
> > > mem_cgroup_oom() returning OOM_FAILED.
> > > 2. In try_charge_memcg(), if we get an OOM_FAILED we again check
> > > is_remote_oom(), if it is a remote oom, return ENOSPC.
> > > 3. The calling code would return ENOSPC to the user in the no-fault
> > > path, and SIGBUS the user in the fault path with no changes.
> >
> > I think this should be implemented at the caller side rather than
> > somehow hacked into the memcg core. It is the caller to know what to do.
> > The caller can use gfp flags to control the reclaim behavior.
> >
> 
> Hmm I'm a bit struggling to envision this.  So would it be acceptable
> at the call sites where we doing a remote charge, such as
> shmem_add_to_page_cache(), if we get ENOMEM from the
> mem_cgroup_charge(), and we know we're doing a remote charge (because
> current's memcg != the super block memcg), then we return ENOSPC from
> shmem_add_to_page_cache()? I believe that will return ENOSPC to the
> userspace in the non-pagefault path and SIGBUS in the pagefault path.
> Or you had something else in mind?

Yes, exactly. I meant that all this special casing would be done at the
shmem layer as it knows how to communicate this usecase.

[...]

> > And just a small clarification. Tmpfs is fundamentally problematic from
> > the OOM handling POV. The nuance here is that the OOM happens in a
> > different memcg and thus a different resource domain. If you kill a task
> > in the target memcg then you effectively DoS that workload. If you kill
> > the allocating task then it is DoSed by anybody allowed to write to that
> > shmem. All that without a graceful fallback.
> 
> I don't know if this addresses your concern, but I'm limiting the
> memcg= use to processes that can enter that memcg. Therefore they
> would be able to allocate memory in that memcg anyway by entering it.
> So if they wanted to intentionally DoS that memcg they can already do
> it without this feature.

Can you elaborate some more? How do you enforce that the mount point
cannot be accessed by anybody outside of that constraint?

Mina Almasry Nov. 16, 2021, 9:27 p.m. UTC | #12

On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 16-11-21 02:17:09, Mina Almasry wrote:
> > On Tue, Nov 16, 2021 at 1:28 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 15-11-21 16:58:19, Mina Almasry wrote:
> > > > On Mon, Nov 15, 2021 at 2:58 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 12-11-21 09:59:22, Mina Almasry wrote:
> > > > > > On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> > > > > > > > On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > > > > > > > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > > > > > > > > to find a task to kill in the memcg under oom, if the oom-killer
> > > > > > > > > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > > > > > > > > allocating process.
> > > > > > > > >
> > > > > > > > > This really begs for some justification.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm thinking (and I can add to the commit message in v4) that we have
> > > > > > > > 2 reasonable options when the oom-killer gets invoked and finds
> > > > > > > > nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> > > > > > > > thinking returning ENOMEM allows the application to gracefully handle
> > > > > > > > the failure to remote charge and continue operation.
> > > > > > > >
> > > > > > > > For example, in the network service use case that I mentioned in the
> > > > > > > > RFC proposal, it's beneficial for the network service to get an ENOMEM
> > > > > > > > and continue to service network requests for other clients running on
> > > > > > > > the machine, rather than get oom-killed when hitting the remote memcg
> > > > > > > > limit. But, this is not a hard requirement, the network service could
> > > > > > > > fork a process that does the remote charging to guard against the
> > > > > > > > remote charge bringing down the entire process.
> > > > > > >
> > > > > > > This all belongs to the changelog so that we can discuss all potential
> > > > > > > implication and do not rely on any implicit assumptions.
> > > > > >
> > > > > > Understood. Maybe I'll wait to collect more feedback and upload v4
> > > > > > with a thorough explanation of the thought process.
> > > > > >
> > > > > > > E.g. why does
> > > > > > > it even make sense to kill a task in the origin cgroup?
> > > > > > >
> > > > > >
> > > > > > The behavior I saw returning ENOMEM for this edge case was that the
> > > > > > code was forever looping the pagefault, and I was (seemingly
> > > > > > incorrectly) under the impression that a suggestion to forever loop
> > > > > > the pagefault would be completely fundamentally unacceptable.
> > > > >
> > > > > Well, I have to say I am not entirely sure what is the best way to
> > > > > handle this situation. Another option would be to treat this similar to
> > > > > ENOSPACE situation. This would result into SIGBUS IIRC.
> > > > >
> > > > > The main problem with OOM killer is that it will not resolve the
> > > > > underlying problem in most situations. Shmem files would likely stay
> > > > > laying around and their charge along with them. Killing the allocating
> > > > > task has problems on its own because this could be just a DoS vector by
> > > > > other unrelated tasks sharing the shmem mount point without a gracefull
> > > > > fallback. Retrying the page fault is hard to detect. SIGBUS might be
> > > > > something that helps with the latest. The question is how to communicate
> > > > > this requerement down to the memcg code to know that the memory reclaim
> > > > > should happen (Should it? How hard we should try?) but do not invoke the
> > > > > oom killer. The more I think about this the nastier this is.
> > > >
> > > > So actually I thought the ENOSPC suggestion was interesting so I took
> > > > the liberty to prototype it. The changes required:
> > > >
> > > > 1. In out_of_memory() we return false if !oc->chosen &&
> > > > is_remote_oom(). This gets bubbled up to try_charge_memcg() as
> > > > mem_cgroup_oom() returning OOM_FAILED.
> > > > 2. In try_charge_memcg(), if we get an OOM_FAILED we again check
> > > > is_remote_oom(), if it is a remote oom, return ENOSPC.
> > > > 3. The calling code would return ENOSPC to the user in the no-fault
> > > > path, and SIGBUS the user in the fault path with no changes.
> > >
> > > I think this should be implemented at the caller side rather than
> > > somehow hacked into the memcg core. It is the caller to know what to do.
> > > The caller can use gfp flags to control the reclaim behavior.
> > >
> >
> > Hmm I'm a bit struggling to envision this.  So would it be acceptable
> > at the call sites where we doing a remote charge, such as
> > shmem_add_to_page_cache(), if we get ENOMEM from the
> > mem_cgroup_charge(), and we know we're doing a remote charge (because
> > current's memcg != the super block memcg), then we return ENOSPC from
> > shmem_add_to_page_cache()? I believe that will return ENOSPC to the
> > userspace in the non-pagefault path and SIGBUS in the pagefault path.
> > Or you had something else in mind?
>
> Yes, exactly. I meant that all this special casing would be done at the
> shmem layer as it knows how to communicate this usecase.
>

Awesome. The more I think of it I think the ENOSPC handling is perfect
for this use case, because it gives all users of the shared memory and
remote chargers a chance to gracefully handle the ENOSPC or the SIGBUS
when we hit the nothing to kill case. The only issue is finding a
clean implementation, and if the implementation I just proposed sounds
good to you then I see no issues and I'm happy to submit this in the
next version. Shakeel and others I would love to know what you think
either now or when I post the next version.

> [...]
>
> > > And just a small clarification. Tmpfs is fundamentally problematic from
> > > the OOM handling POV. The nuance here is that the OOM happens in a
> > > different memcg and thus a different resource domain. If you kill a task
> > > in the target memcg then you effectively DoS that workload. If you kill
> > > the allocating task then it is DoSed by anybody allowed to write to that
> > > shmem. All that without a graceful fallback.
> >
> > I don't know if this addresses your concern, but I'm limiting the
> > memcg= use to processes that can enter that memcg. Therefore they
> > would be able to allocate memory in that memcg anyway by entering it.
> > So if they wanted to intentionally DoS that memcg they can already do
> > it without this feature.
>
> Can you elaborate some more? How do you enforce that the mount point
> cannot be accessed by anybody outside of that constraint?

So if I'm a bad actor that wants to intentionally DoS random memcgs on
the system I can:

mount -t tmpfs -o memcg=/sys/fs/cgroup/unified/memcg-to-dos tmpfs /mnt/tmpfs
cat /dev/random > /mnt/tmpfs

That will reliably DoS the container. But we only allow you to mount
with memcg=/sys/fs/cgroup/unified/memcg-to-dos if you're able to enter
that memcg, so I can just do:

echo $$ > /sys/fs/cgroup/unified/memcg-to-dos/cgroup.procs
allocate_infinited_memory()

So we haven't added an attack vector really. More reasonably a sys
admin will set up a tmpfs mount with
memcg=/sys/fs/cgroup/unified/shared-memory-owner, and set the limit of
the shared-memory-owner to be big enough to handle the tasks running
in that memcg _and_ all the shared memory. The sys admin can also
limit the tmpfs with the size= option to limit the total size of the
shared memory. I think the sys admin could also set permissions on the
mount so only the users that share the memory can read/write, etc.

I'm sorry if this wasn't clear before and I'll take a good look at the
commit messages I'm writing and put as much info as possible in each.

As always thank you very much for your review and feedback.

> --
> Michal Hocko
> SUSE Labs

Shakeel Butt Nov. 16, 2021, 9:55 p.m. UTC | #13

On Tue, Nov 16, 2021 at 1:27 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Yes, exactly. I meant that all this special casing would be done at the
> > shmem layer as it knows how to communicate this usecase.
> >
>
> Awesome. The more I think of it I think the ENOSPC handling is perfect
> for this use case, because it gives all users of the shared memory and
> remote chargers a chance to gracefully handle the ENOSPC or the SIGBUS
> when we hit the nothing to kill case. The only issue is finding a
> clean implementation, and if the implementation I just proposed sounds
> good to you then I see no issues and I'm happy to submit this in the
> next version. Shakeel and others I would love to know what you think
> either now or when I post the next version.
>

The direction seems reasonable to me. I would have more comments on
the actual code. At the high level I would prefer not to expose these
cases in the filesystem code (shmem or others) and instead be done in
a new memcg interface for filesystem users.

Michal Hocko Nov. 18, 2021, 8:47 a.m. UTC | #14

On Tue 16-11-21 13:27:34, Mina Almasry wrote:
> On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Can you elaborate some more? How do you enforce that the mount point
> > cannot be accessed by anybody outside of that constraint?
> 
> So if I'm a bad actor that wants to intentionally DoS random memcgs on
> the system I can:
> 
> mount -t tmpfs -o memcg=/sys/fs/cgroup/unified/memcg-to-dos tmpfs /mnt/tmpfs
> cat /dev/random > /mnt/tmpfs

If you can mount tmpfs then you do not need to fiddle with memcgs at
all. You just DoS the whole machine. That is not what I was asking
though.

My question was more towards a difference scenario. How do you
prevent random processes to _write_ to those mount points? User/group
permissions might be just too coarse to describe memcg relation. Without
memcg in place somebody could cause ENOSPC to the mount point users
and that is not great either but that should be recoverable to some
degree. With memcg configuration this would cause the memcg OOM which
would be harder to recover from because it affects all memcg charges in
that cgroup - not just that specific fs access. See what I mean? This is
a completely new failure mode. 

The only reasonable way would be to reduce the visibility of that mount
point. This is certainly possible but it seems rather awkward when it
should be accessible from multiple resource domains.

I cannot really shake off feeling that this is potentially adding more
problems than it solves.

Michal Hocko Nov. 18, 2021, 8:48 a.m. UTC | #15

On Tue 16-11-21 13:55:54, Shakeel Butt wrote:
> On Tue, Nov 16, 2021 at 1:27 PM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Yes, exactly. I meant that all this special casing would be done at the
> > > shmem layer as it knows how to communicate this usecase.
> > >
> >
> > Awesome. The more I think of it I think the ENOSPC handling is perfect
> > for this use case, because it gives all users of the shared memory and
> > remote chargers a chance to gracefully handle the ENOSPC or the SIGBUS
> > when we hit the nothing to kill case. The only issue is finding a
> > clean implementation, and if the implementation I just proposed sounds
> > good to you then I see no issues and I'm happy to submit this in the
> > next version. Shakeel and others I would love to know what you think
> > either now or when I post the next version.
> >
> 
> The direction seems reasonable to me. I would have more comments on
> the actual code. At the high level I would prefer not to expose these
> cases in the filesystem code (shmem or others) and instead be done in
> a new memcg interface for filesystem users.

A library like function in the memcg proper sounds good to me I just
want to avoid any special casing in the core of the memcg charging and
special casing there.

Mina Almasry Nov. 19, 2021, 10:32 p.m. UTC | #16

On Thu, Nov 18, 2021 at 12:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 16-11-21 13:27:34, Mina Almasry wrote:
> > On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Can you elaborate some more? How do you enforce that the mount point
> > > cannot be accessed by anybody outside of that constraint?
> >
> > So if I'm a bad actor that wants to intentionally DoS random memcgs on
> > the system I can:
> >
> > mount -t tmpfs -o memcg=/sys/fs/cgroup/unified/memcg-to-dos tmpfs /mnt/tmpfs
> > cat /dev/random > /mnt/tmpfs
>
> If you can mount tmpfs then you do not need to fiddle with memcgs at
> all. You just DoS the whole machine. That is not what I was asking
> though.
>
> My question was more towards a difference scenario. How do you
> prevent random processes to _write_ to those mount points? User/group
> permissions might be just too coarse to describe memcg relation. Without
> memcg in place somebody could cause ENOSPC to the mount point users
> and that is not great either but that should be recoverable to some
> degree. With memcg configuration this would cause the memcg OOM which
> would be harder to recover from because it affects all memcg charges in
> that cgroup - not just that specific fs access. See what I mean? This is
> a completely new failure mode.
>
> The only reasonable way would be to reduce the visibility of that mount
> point. This is certainly possible but it seems rather awkward when it
> should be accessible from multiple resource domains.
>

So the problem of preventing random processes from writing to a mount
point is a generic problem on machine configurations where you have
untrusted code running on the machine, which is a very common case.
For us we have any number of random workloads or VMs running on the
machine and it's critical to limit their credentials to exactly what
these workloads need. Because of this, regardless of whether the
filesystem is mounted with memcg= or not, the write/execute/read
permissions are only given to those that need access to the mount
point. If this is not done correctly, there are potentially even more
serious problems than causing OOMs or SIGBUSes to users of the mount
point.

Because this is a generic problem, it's addressed 'elsewhere'. I'm
honestly not extremely familiar but my rough understanding is that
there are linux filesystem permissions and user namespaces to address
this, and there are also higher level constructs like containerd which
which limits the visibility of jobs running on the system. My
understanding is that there are also sandboxes which go well beyond
limiting file access permissions.

To speak more concretely, for the 3 use cases I mention in the RFC
proposal (I'll attach that as cover letter in the next version):
1. For services running on the system, the shared tmpfs mount is only
visible and accessible (write/read) to the network service and its
client.
2. For large jobs with subprocesses that share memory like kubernetes,
the shared tmpfs is again only visible and accessible to the processes
in this job.
3. For filesystems that host shared libraries, it's a big no-no to
give anyone on the machine write permissions to the runtime AFAIU, so
I expect the mount point to be read-only.

Note that all these restrictions should and would be in place
regardless of whether the kernel supports the memcg= option or the
filesystem is mounted with memcg=. I'm not extremely familiar with the
implementation details on these restrictions, but I can grab them.

> I cannot really shake off feeling that this is potentially adding more
> problems than it solves.
> --
> Michal Hocko
> SUSE Labs

On Thu, Nov 18, 2021 at 12:48 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 16-11-21 13:55:54, Shakeel Butt wrote:
> > On Tue, Nov 16, 2021 at 1:27 PM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > Yes, exactly. I meant that all this special casing would be done at the
> > > > shmem layer as it knows how to communicate this usecase.
> > > >
> > >
> > > Awesome. The more I think of it I think the ENOSPC handling is perfect
> > > for this use case, because it gives all users of the shared memory and
> > > remote chargers a chance to gracefully handle the ENOSPC or the SIGBUS
> > > when we hit the nothing to kill case. The only issue is finding a
> > > clean implementation, and if the implementation I just proposed sounds
> > > good to you then I see no issues and I'm happy to submit this in the
> > > next version. Shakeel and others I would love to know what you think
> > > either now or when I post the next version.
> > >
> >
> > The direction seems reasonable to me. I would have more comments on
> > the actual code. At the high level I would prefer not to expose these
> > cases in the filesystem code (shmem or others) and instead be done in
> > a new memcg interface for filesystem users.
>
> A library like function in the memcg proper sounds good to me I just
> want to avoid any special casing in the core of the memcg charging and
> special casing there.
>

Yes, this is the implementation I'm working on and I'll submit another version.

> --
> Michal Hocko
> SUSE Labs

[v3,2/4] mm/oom: handle remote ooms

Commit Message

Comments

Patch