[v5,bpf-next,0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf

Message ID 20231214125033.4158-1-laoar.shao@gmail.com (mailing list archive)

Message

Yafang Shao Dec. 14, 2023, 12:50 p.m. UTC
Background
==========

In our containerized environment, we've identified unexpected OOM events
where the OOM-killer terminates tasks even though plenty of free memory
remains elsewhere on the system. The anomaly traces back to tasks within
a container using mbind(2) to bind their memory to a specific NUMA node.
When that node's memory is exhausted, the OOM-killer, selecting victims
by oom_score, kills tasks indiscriminately.
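
For illustration, the binding in question takes only a few lines of
userspace C. The following is a hedged sketch, not code from the
incident report: it assumes libnuma's <numaif.h> wrappers (build with
-lnuma) and a node 0 to bind to.

  /* mpol-bind.c - sketch of the unprivileged binding described above */
  #include <numaif.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
          unsigned long nodemask = 1;     /* node 0 only */

          /* Any task may do this today; no capability is required. */
          if (set_mempolicy(MPOL_BIND, &nodemask,
                            sizeof(nodemask) * 8 /* bits in the mask */))
                  perror("set_mempolicy");

          /* Anonymous allocations must now come from node 0. Once
           * node 0 is exhausted, allocations head for the OOM-killer
           * even if every other node still has free memory. */
          char *p = malloc(1UL << 30);
          if (p)
                  p[0] = 1;               /* fault in the first page */
          return 0;
  }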

The Challenge 
=============

In a containerized environment, independent memory binding by a user can
lead to unexpected system issues or disrupt tasks run by other users on
the same server. If a user genuinely requires memory binding, we
allocate dedicated servers to them via kubelet deployment.

Today, however, users can bind their memory to specific nodes entirely
on their own, without explicit agreement or authorization from our end.
We need a reliable way to prevent this behavior.

Proposed Solution
=================

- Capability
  Currently, any task can perform MPOL_BIND without specific capabilities.
  Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
  may have unintended consequences. Capabilities, being broad, might grant
  unnecessary privileges. We should explore alternatives to prevent
  unexpected side effects.

- LSM
  Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
  to disable MPOL_BIND. This approach is more flexible and allows
  fine-grained control without unintended consequences. A sample LSM BPF
  program is included, demonstrating a practical implementation for a
  production environment (see the sketch following this list).

- seccomp
  seccomp is relatively heavyweight, making it less suitable for
  enabling in our production environment:
  - Both kubelet and containers need adaptation to support it.
  - Dynamically altering security policies for individual containers
    without interrupting their operations isn't straightforward.
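
A minimal sketch of such an LSM BPF program is shown below. The
"set_mempolicy" hook name follows patch #3 of this series, but the
argument list used here is an assumption; the selftest added in patch
#5 is the authoritative example.

  // SPDX-License-Identifier: GPL-2.0
  /* Sketch only: deny MPOL_BIND, allow every other policy mode. */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define MPOL_BIND 2   /* from include/uapi/linux/mempolicy.h */
  #define EPERM     1

  SEC("lsm/set_mempolicy")
  int BPF_PROG(deny_mpol_bind, unsigned long mode)
  {
          /* Returning a negative errno rejects the policy change. */
          if (mode == MPOL_BIND)
                  return -EPERM;
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";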

Future Considerations
=====================

In addition, there's room for improvement in the OOM-killer for cases
involving CONSTRAINT_MEMORY_POLICY: it would be better to prefer a
victim that has actually allocated memory on the constrained NUMA node.
A search on lore turned up a proposal[0] on this topic, although
consensus seems elusive at this point. Either way, that work is beyond
the scope of the current patchset.

[0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/

Changes:
- v4 -> v5:
  - Revise the commit log in patch #5. (KP)
- v3 -> v4: https://lwn.net/Articles/954126/
  - Drop the changes around security_task_movememory (Serge) 
- RFC v2 -> v3: https://lwn.net/Articles/953526/
  - Add MPOL_F_NUMA_BALANCING man-page (Ying)
  - Fix bpf selftests error reported by bot+bpf-ci
- RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
  - Refine the commit log to avoid misleading
  - Use one common lsm hook instead and add comment for it
  - Add selinux implementation
  - Other improvements in mempolicy
- RFC v1: https://lwn.net/Articles/951188/

Yafang Shao (5):
  mm, doc: Add doc for MPOL_F_NUMA_BALANCING
  mm: mempolicy: Revise comment regarding mempolicy mode flags
  mm, security: Add lsm hook for memory policy adjustment
  security: selinux: Implement set_mempolicy hook
  selftests/bpf: Add selftests for set_mempolicy with a lsm prog

 .../admin-guide/mm/numa_memory_policy.rst          | 27 +++++++
 include/linux/lsm_hook_defs.h                      |  3 +
 include/linux/security.h                           |  9 +++
 include/uapi/linux/mempolicy.h                     |  2 +-
 mm/mempolicy.c                                     |  8 +++
 security/security.c                                | 13 ++++
 security/selinux/hooks.c                           |  8 +++
 security/selinux/include/classmap.h                |  2 +-
 .../selftests/bpf/prog_tests/set_mempolicy.c       | 84 ++++++++++++++++++++++
 .../selftests/bpf/progs/test_set_mempolicy.c       | 28 ++++++++
 10 files changed, 182 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c

Comments

Paul Moore Dec. 23, 2023, 12:16 a.m. UTC | #1
On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> [...]

In your original patchset there was a lot of good discussion about
ways to solve, or mitigate, this problem using existing mechanisms;
while you disputed many (all?) of those suggestions, I felt that they
still had merit over your objections.  I also don't believe the
SELinux implementation of the set_mempolicy hook fits with the
existing SELinux philosophy of access control via type enforcement;
outside of some checks on executable memory and low memory ranges,
SELinux doesn't currently enforce policy on memory ranges like this;
SELinux focuses more on tasks being able to access data/resources on
the system.

My current opinion is that you should pursue some of the mitigations
that have already been mentioned, including seccomp and/or a better
NUMA workload configuration.  I would also encourage you to pursue the
OOM improvement you briefly described.  All of those seem like better
options than this new LSM/SELinux hook.
Yafang Shao Dec. 24, 2023, 3:35 a.m. UTC | #2
On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@paul-moore.com> wrote:
>
> On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > [...]
>
> In your original patchset there was a lot of good discussion about
> ways to solve, or mitigate, this problem using existing mechanisms;
> while you disputed many (all?) of those suggestions, I felt that they
> still had merit over your objections.

JFYI. The initial patchset presents three suggestions:
- Disabling CONFIG_NUMA, proposed by Michal:
  By default, tasks on a server allocate memory from their local
memory node initially. Disabling CONFIG_NUMA could potentially lead to
a performance hit.

- Adjusting NUMA workload configuration, also from Michal:
  This adjustment has been successfully implemented on some dedicated
clusters, as mentioned in the commit log. However, applying this
change universally across a large fleet of servers might result in
significant wastage of physical memory.

- Implementing seccomp, suggested by Ondrej and Casey:
  As indicated in the commit log, altering the security policy
dynamically without interrupting a running container isn't
straightforward. Implementing seccomp requires the introduction of an
eBPF-based seccomp, which constitutes a substantial change.
  [ The seccomp maintainer has been added to this mail thread for
further discussion. ]


> I also don't believe the
> SELinux implementation of the set_mempolicy hook fits with the
> existing SELinux philosophy of access control via type enforcement;
> outside of some checks on executable memory and low memory ranges,
> SELinux doesn't currently enforce policy on memory ranges like this;
> SELinux focuses more on tasks being able to access data/resources on
> the system.
>
> My current opinion is that you should pursue some of the mitigations
> that have already been mentioned, including seccomp and/or a better
> NUMA workload configuration.  I would also encourage you to pursue the
> OOM improvement you briefly described.  All of those seem like better
> options than this new LSM/SELinux hook.

Using the OOM solution should not be our primary approach. Whenever
possible, we should prioritize alternative solutions to prevent
encountering the OOM situation.
Paul Moore Dec. 24, 2023, 7:44 p.m. UTC | #3
On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@paul-moore.com> wrote:
> > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > [...]
> >
> > In your original patchset there was a lot of good discussion about
> > ways to solve, or mitigate, this problem using existing mechanisms;
> > while you disputed many (all?) of those suggestions, I felt that they
> > still had merit over your objections.
>
> JFYI. The initial patchset presents three suggestions:
> - Disabling CONFIG_NUMA, proposed by Michal:
>   By default, tasks on a server allocate memory from their local
> memory node initially. Disabling CONFIG_NUMA could potentially lead to
> a performance hit.
>
> - Adjusting NUMA workload configuration, also from Michal:
>   This adjustment has been successfully implemented on some dedicated
> clusters, as mentioned in the commit log. However, applying this
> change universally across a large fleet of servers might result in
> significant wastage of physical memory.
>
> - Implementing seccomp, suggested by Ondrej and Casey:
>   As indicated in the commit log, altering the security policy
> dynamically without interrupting a running container isn't
> straightforward. Implementing seccomp requires the introduction of an
> eBPF-based seccomp, which constitutes a substantial change.
>   [ The seccomp maintainer has been added to this mail thread for
> further discussion. ]

The seccomp filter runs cBPF (classic BPF) and not eBPF; there are a
number of sandboxing tools designed to make this easier to use,
including systemd, and if you need to augment your existing
application there are libraries available to make this easier.
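
For reference, a hedged sketch of the kind of classic-BPF filter Paul
describes, denying set_mempolicy(2), appears below. It is illustrative
only, not this series' code: the syscall number is resolved via
<sys/syscall.h>, and a real deployment would also filter mbind(2) and
related calls.

  #include <errno.h>
  #include <linux/filter.h>
  #include <linux/seccomp.h>
  #include <stddef.h>
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>

  int main(void)
  {
          struct sock_filter filter[] = {
                  /* Load the syscall number from seccomp_data. */
                  BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                           offsetof(struct seccomp_data, nr)),
                  /* set_mempolicy(2) -> fail with EPERM. */
                  BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                           __NR_set_mempolicy, 0, 1),
                  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
                  /* Everything else is allowed. */
                  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
          };
          struct sock_fprog prog = {
                  .len = sizeof(filter) / sizeof(filter[0]),
                  .filter = filter,
          };

          /* Required to install a filter without privileges. */
          if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                  perror("prctl(NO_NEW_PRIVS)");
          if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
                  perror("prctl(SECCOMP)");
          return 0;
  }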

> > I also don't believe the
> > SELinux implementation of the set_mempolicy hook fits with the
> > existing SELinux philosophy of access control via type enforcement;
> > outside of some checks on executable memory and low memory ranges,
> > SELinux doesn't currently enforce policy on memory ranges like this;
> > SELinux focuses more on tasks being able to access data/resources on
> > the system.
> >
> > My current opinion is that you should pursue some of the mitigations
> > that have already been mentioned, including seccomp and/or a better
> > NUMA workload configuration.  I would also encourage you to pursue the
> > OOM improvement you briefly described.  All of those seem like better
> > options than this new LSM/SELinux hook.
>
> Using the OOM solution should not be our primary approach. Whenever
> possible, we should prioritize alternative solutions to prevent
> encountering the OOM situation.

It's a good thing that there exist other options.
Yafang Shao Dec. 25, 2023, 3:12 a.m. UTC | #4
On Mon, Dec 25, 2023 at 3:44 AM Paul Moore <paul@paul-moore.com> wrote:
>
> On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@paul-moore.com> wrote:
> > > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > [...]
> > >
> > > In your original patchset there was a lot of good discussion about
> > > ways to solve, or mitigate, this problem using existing mechanisms;
> > > while you disputed many (all?) of those suggestions, I felt that they
> > > still had merit over your objections.
> >
> > JFYI. The initial patchset presents three suggestions:
> > - Disabling CONFIG_NUMA, proposed by Michal:
> >   By default, tasks on a server allocate memory from their local
> > memory node initially. Disabling CONFIG_NUMA could potentially lead to
> > a performance hit.
> >
> > - Adjusting NUMA workload configuration, also from Michal:
> >   This adjustment has been successfully implemented on some dedicated
> > clusters, as mentioned in the commit log. However, applying this
> > change universally across a large fleet of servers might result in
> > significant wastage of physical memory.
> >
> > - Implementing seccomp, suggested by Ondrej and Casey:
> >   As indicated in the commit log, altering the security policy
> > dynamically without interrupting a running container isn't
> > straightforward. Implementing seccomp requires the introduction of an
> > eBPF-based seccomp, which constitutes a substantial change.
> >   [ The seccomp maintainer has been added to this mail thread for
> > further discussion. ]
>
> The seccomp filter runs cBPF (classic BPF) and not eBPF; there are a
> number of sandboxing tools designed to make this easier to use,
> including systemd, and if you need to augment your existing
> application there are libraries available to make this easier.

Let's walk through how cBPF-based seccomp operates with runc [0], the
container runtime we use:

1. Create a seccomp profile at /path/to/seccomp/profile.json.
2. Start a container with that filter rule:
    docker run --rm \
             -it \
             --security-opt seccomp=/path/to/seccomp/profile.json \
             hello-world

However, modifying or removing the seccomp filter requires stopping
the running container and repeating the steps above. That interruption
isn't acceptable for us.

The inability to dynamically alter a cBPF seccomp filter arises from
the kernel lacking a way to unload a seccomp filter once it is attached
to a task. In other words, cBPF-based seccomp cannot be dynamically
attached to and detached from tasks. Please correct me if my
understanding is incorrect.

[0]. https://docs.docker.com/engine/security/seccomp/

>
> > > I also don't believe the
> > > SELinux implementation of the set_mempolicy hook fits with the
> > > existing SELinux philosophy of access control via type enforcement;
> > > outside of some checks on executable memory and low memory ranges,
> > > SELinux doesn't currently enforce policy on memory ranges like this;
> > > SELinux focuses more on tasks being able to access data/resources on
> > > the system.
> > >
> > > My current opinion is that you should pursue some of the mitigations
> > > that have already been mentioned, including seccomp and/or a better
> > > NUMA workload configuration.  I would also encourage you to pursue the
> > > OOM improvement you briefly described.  All of those seem like better
> > > options than this new LSM/SELinux hook.
> >
> > Using the OOM solution should not be our primary approach. Whenever
> > possible, we should prioritize alternative solutions to prevent
> > encountering the OOM situation.
>
> It's a good thing that there exist other options.

Absolutely, let's explore the alternative options first.
Yafang Shao Jan. 10, 2024, 6:06 a.m. UTC | #5
On Mon, Dec 25, 2023 at 11:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]

Paul,

Do you have any additional comments or further suggestions?
Paul Moore Jan. 10, 2024, 2:28 p.m. UTC | #6
On Wed, Jan 10, 2024 at 1:07 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> Paul,
>
> Do you have any additional comments or further suggestions?

No, I'm still comfortable with my original comments and stand by them.
Yafang Shao Jan. 10, 2024, 3:56 p.m. UTC | #7
On Wed, Jan 10, 2024 at 10:28 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Wed, Jan 10, 2024 at 1:07 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > Paul,
> >
> > Do you have any additional comments or further suggestions?
>
> No, I'm still comfortable with my original comments and stand by them.

I understand your perspective, but it seems I'll have to propose an
eBPF-based seccomp as the next step.
Paul Moore Jan. 10, 2024, 4:14 p.m. UTC | #8
On Wed, Jan 10, 2024 at 10:56 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> On Wed, Jan 10, 2024 at 10:28 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Wed, Jan 10, 2024 at 1:07 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > Paul,
> > >
> > > Do you have any additional comments or further suggestions?
> >
> > No, I'm still comfortable with my original comments and stand by them.
>
> I understand your perspective, but it seems I have to propose an
> eBPF-based seccomp in the next step.

You likely already know this, but just in case, eBPF-based seccomp has
been proposed many times in the past and has been rejected.  I don't
want to dissuade you from doing so again, but I suspect that this use
case will not be compelling enough to be successful.