Message ID | 20220815162028.926858-1-fred@cloudflare.com (mailing list archive) |
---|---|
Series | Introduce security_create_user_ns() |
On Mon, Aug 15, 2022 at 12:20 PM Frederick Lawler <fred@cloudflare.com> wrote: > > While user namespaces do not make the kernel more vulnerable, they are however > used to initiate exploits. Some users do not want to block namespace creation > for the entirety of the system, which some distributions provide. Instead, we > needed a way to have some applications be blocked, and others allowed. This is > not possible with those tools. Managing hierarchies also did not fit our case > because we're determining which tasks are allowed based on their attributes. > > While exploring a solution, we first leveraged the LSM cred_prepare hook > because that is the closest hook to prevent a call to create_user_ns(). > > The calls look something like this: > > cred = prepare_creds() > security_prepare_creds() > call_int_hook(cred_prepare, ... > if (cred) > create_user_ns(cred) > > We noticed that error codes were not propagated from this hook and > introduced a patch [1] to propagate those errors. > > The discussion notes that security_prepare_creds() is not appropriate for > MAC policies, and instead the hook is meant for LSM authors to prepare > credentials for mutation. [2] > > Additionally, cred_prepare hook is not without problems. Handling the clone3 > case is a bit more tricky due to the user space pointer passed to it. This > makes checking the syscall subject to a possible TOCTTOU attack. > > Ultimately, we concluded that a better course of action is to introduce > a new security hook for LSM authors. [3] > > This patch set first introduces a new security_create_user_ns() function > and userns_create LSM hook, then marks the hook as sleepable in BPF. The > following patches after include a BPF test and a patch for an SELinux > implementation. > > We want to encourage use of user namespaces, and also cater the needs > of users/administrators to observe and/or control access. 
There is no > expectation of an impact on user space applications because access control > is opt-in, and users wishing to observe within a LSM context > > > Links: > 1. https://lore.kernel.org/all/20220608150942.776446-1-fred@cloudflare.com/ > 2. https://lore.kernel.org/all/87y1xzyhub.fsf@email.froward.int.ebiederm.org/ > 3. https://lore.kernel.org/all/9fe9cd9f-1ded-a179-8ded-5fde8960a586@cloudflare.com/ > > Past discussions: > V4: https://lore.kernel.org/all/20220801180146.1157914-1-fred@cloudflare.com/ > V3: https://lore.kernel.org/all/20220721172808.585539-1-fred@cloudflare.com/ > V2: https://lore.kernel.org/all/20220707223228.1940249-1-fred@cloudflare.com/ > V1: https://lore.kernel.org/all/20220621233939.993579-1-fred@cloudflare.com/ > > Changes since v4: > - Update commit description > - Update cover letter > Changes since v3: > - Explicitly set CAP_SYS_ADMIN to test namespace is created given > permission > - Simplify BPF test to use sleepable hook only > - Prefer unshare() over clone() for tests > Changes since v2: > - Rename create_user_ns hook to userns_create > - Use user_namespace as an object opposed to a generic namespace object > - s/domB_t/domA_t in commit message > Changes since v1: > - Add selftests/bpf: Add tests verifying bpf lsm create_user_ns hook patch > - Add selinux: Implement create_user_ns hook patch > - Change function signature of security_create_user_ns() to only take > struct cred > - Move security_create_user_ns() call after id mapping check in > create_user_ns() > - Update documentation to reflect changes > > Frederick Lawler (4): > security, lsm: Introduce security_create_user_ns() > bpf-lsm: Make bpf_lsm_userns_create() sleepable > selftests/bpf: Add tests verifying bpf lsm userns_create hook > selinux: Implement userns_create hook > > include/linux/lsm_hook_defs.h | 1 + > include/linux/lsm_hooks.h | 4 + > include/linux/security.h | 6 ++ > kernel/bpf/bpf_lsm.c | 1 + > kernel/user_namespace.c | 5 + > security/security.c | 5 + > 
security/selinux/hooks.c | 9 ++ > security/selinux/include/classmap.h | 2 + > .../selftests/bpf/prog_tests/deny_namespace.c | 102 ++++++++++++++++++ > .../selftests/bpf/progs/test_deny_namespace.c | 33 ++++++ > 10 files changed, 168 insertions(+) > create mode 100644 tools/testing/selftests/bpf/prog_tests/deny_namespace.c > create mode 100644 tools/testing/selftests/bpf/progs/test_deny_namespace.c I just merged this into the lsm/next tree, thanks for seeing this through Frederick, and thank you to everyone who took the time to review the patches and add their tags. git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git next
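The call path described in the cover letter can be sketched in user-space C. This is a minimal model under stated assumptions, not the kernel code from the patchset: `struct cred` is reduced to two illustrative flags, the `hooks` array and `register_userns_create_hook()` are hypothetical stand-ins for LSM registration, and the loop in `security_create_user_ns()` only models the "first non-zero return wins" behaviour of `call_int_hook()`. The example policy `deny_without_cap_sys_admin()` is also an assumption, picked because the changelog mentions CAP_SYS_ADMIN in the tests.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct cred (assumption: only the
 * two properties this sketch needs). */
struct cred {
	bool has_cap_sys_admin;	/* models CAP_SYS_ADMIN in the effective set */
	bool ids_mapped;	/* models the id mapping check passing */
};

/* A userns_create hook returns 0 to allow or a negative errno to deny. */
typedef int (*userns_create_hook)(const struct cred *cred);

#define MAX_HOOKS 4
static userns_create_hook hooks[MAX_HOOKS];	/* hypothetical hook table */
static size_t nr_hooks;

/* Hypothetical registration helper; real LSMs register differently. */
static int register_userns_create_hook(userns_create_hook fn)
{
	if (nr_hooks == MAX_HOOKS)
		return -ENOMEM;
	hooks[nr_hooks++] = fn;
	return 0;
}

/* Models call_int_hook(userns_create, ...): stop at the first non-zero
 * return value; with no hooks registered the default is to allow. */
static int security_create_user_ns(const struct cred *cred)
{
	for (size_t i = 0; i < nr_hooks; i++) {
		int rc = hooks[i](cred);

		if (rc != 0)
			return rc;
	}
	return 0;
}

/* Models create_user_ns(): the hook runs after the id mapping check and
 * its error code is propagated to the caller. */
static int create_user_ns(const struct cred *new)
{
	if (!new->ids_mapped)
		return -EPERM;
	return security_create_user_ns(new);
}

/* Example policy (assumption): deny unless the task has CAP_SYS_ADMIN. */
static int deny_without_cap_sys_admin(const struct cred *cred)
{
	return cred->has_cap_sys_admin ? 0 : -EPERM;
}
```

With no hook registered, `create_user_ns()` in this model succeeds for any credential that passes the id mapping check, which matches the cover letter's point that access control is opt-in. Once `deny_without_cap_sys_admin` is registered, the same call returns `-EPERM` for a credential without the capability.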
> I just merged this into the lsm/next tree, thanks for seeing this
> through Frederick, and thank you to everyone who took the time to
> review the patches and add their tags.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git next

Paul, Frederick

I repeat my NACK, in part because I am being ignored and in part because the hook does not make technical sense.

Linus I want you to know that this has been put in the lsm tree against my explicit and clear objections.

My request to talk about the actual problems that are being addressed has been completely ignored.

I have been a bit slow in dealing with this conversation because I am very much sick and not on top of my game, but that is no excuse to steamroll over me, instead of addressing my concerns.

This is an irresponsible way of adding an access control to user namespace creation. This is a linux-api and manpages level kind of change, as this is a semantic change visible to userspace. Instead that concern has been brushed off as a different return code to userspace.

For observability this is a terrible LSM interface because there is no pair with user namespace destruction, nor is there any ability for the LSM to allocate any state to track the user namespace. As there is no patch actually calling audit or anything else, observability does not appear to be a driving factor of this new interface.

The common scenarios I am aware of for using the user namespace are:
- Creating a container.
- Using the user namespace to sandbox your application like chrome does.
- Running an exploit.

Returning an error code in the first 2 scenarios will create a userspace regression as either userspace will run less securely or it won't work at all.

Returning an error code in the third scenario when someone is trying to exploit your machine is equally foolish as you are giving the exploit the chance to continue running. The application should be killed instead.

Further adding a random failure mode to user namespace creation if it is used at all will just encourage userspace to use a setuid application to perform the namespace creation instead. Creating a less secure system overall.

If the concern is to reduce the attack surface everything this proposed hook can do is already possible with the security_capable security hook.

So Paul, Frederick please drop this. I can't see what this new hook is good for except creating regressions in existing userspace code. I am not willing to support such a hook in code that I maintain.

Eric
On Wed, Aug 17, 2022 at 11:08 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > I just merged this into the lsm/next tree, thanks for seeing this
> > through Frederick, and thank you to everyone who took the time to
> > review the patches and add their tags.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git next
>
> Paul, Frederick
>
> I repeat my NACK, in part because I am being ignored and in part
> because the hook does not make technical sense.
>
> Linus I want you to know that this has been put in the lsm tree against
> my explicit and clear objections.

Eric, we are disagreeing with you, not ignoring you; that's an important distinction.

This is the fifth iteration of the patchset, or the sixth (?) if you count Frederick's earlier attempts using the credential hooks, and with each revision multiple people have tried to work with you to find a mutually agreeable solution to the use cases presented by Frederick and others. In the end of the v4 discussion it was my opinion that you kept moving the goalposts in an effort to prevent any additional hooks/controls/etc. to the user namespace code which is why I made the decision to merge the code into the lsm/next branch against your wishes. Multiple people have come out in support of this functionality, and you remain the only one opposed to the change; normally a maintainer's objection would be enough to block the change, but it is my opinion that Eric is acting in bad faith.

At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing I was planning to send it to Linus during the next merge window with commentary on the contentiousness of the patchset, including Eric's NACK. I'm personally very disappointed that it has come to this, but I'm at a loss of how to work with you (Eric) to find a solution; this is the only path forward that I can see at this point. Others have expressed their agreement with this approach, both on-list and privately.

If anyone other than Eric or myself has a different view of the situation, *please* add your comments now. I believe I've done a fair job of summarizing things, but everyone has a bias and I'm definitely no exception.

Finally, I'm going to refrain from rehashing the same arguments over again in this revision of the patchset, instead I'll just provide links to the previous drafts in case anyone wants to spend an hour or two:

Revision v1
https://lore.kernel.org/linux-security-module/20220621233939.993579-1-fred@cloudflare.com/

Revision v2
https://lore.kernel.org/linux-security-module/20220707223228.1940249-1-fred@cloudflare.com/

Revision v3
https://lore.kernel.org/linux-security-module/20220721172808.585539-1-fred@cloudflare.com/

Revision v4
https://lore.kernel.org/linux-security-module/20220801180146.1157914-1-fred@cloudflare.com/

--
paul-moore.com
Paul Moore <paul@paul-moore.com> writes:

> At the end of the v4 patchset I suggested merging this into lsm/next
> so it could get a full -rc cycle in linux-next, assuming no issues
> were uncovered during testing

What in the world can be uncovered in linux-next for code that has no in tree users?

That is one of my largest problems. I want to talk about the users and the use cases and I don't get dialog. Nor do I get "hey, look back there, you missed it".

Since you don't want to rehash this, I will just repeat my conclusion that the patchset appears to introduce an ineffective defense that will achieve nothing in the defense of the kernel, and so all it will achieve is a code maintenance burden and occasional breakage for legitimate users of the user namespace.

Further the process is broken. You are changing the semantics of an operation with the introduction of a security hook. That needs a man page and discussion on linux-api, and in general all of the scrutiny we give to new and changed system calls, as this change fundamentally changes the semantics of creating a user namespace.

Skipping that part of the process is not simply disagreeing; that is being irresponsible.

Eric
On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > Paul Moore <paul@paul-moore.com> writes: > > > At the end of the v4 patchset I suggested merging this into lsm/next > > so it could get a full -rc cycle in linux-next, assuming no issues > > were uncovered during testing > > What in the world can be uncovered in linux-next for code that has no in > tree users. The patchset provides both BPF LSM and SELinux implementations of the hooks along with a BPF LSM test under tools/testing/selftests/bpf/. If no one beats me to it, I plan to work on adding a test to the selinux-testsuite as soon as I'm done dealing with other urgent LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I run these tests multiple times a week (multiple times a day sometimes) against the -rcX kernels with the lsm/next, selinux/next, and audit/next branches applied on top. I know others do similar things.
Paul Moore <paul@paul-moore.com> writes: > On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> Paul Moore <paul@paul-moore.com> writes: >> >> > At the end of the v4 patchset I suggested merging this into lsm/next >> > so it could get a full -rc cycle in linux-next, assuming no issues >> > were uncovered during testing >> >> What in the world can be uncovered in linux-next for code that has no in >> tree users. > > The patchset provides both BPF LSM and SELinux implementations of the > hooks along with a BPF LSM test under tools/testing/selftests/bpf/. > If no one beats me to it, I plan to work on adding a test to the > selinux-testsuite as soon as I'm done dealing with other urgent > LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I > run these tests multiple times a week (multiple times a day sometimes) > against the -rcX kernels with the lsm/next, selinux/next, and > audit/next branches applied on top. I know others do similar things. A layer of hooks that leaves all of the logic to userspace is not an in-tree user for purposes of understanding the logic of the code. The reason why I implemented user namespaces is so that all of linux's neat features could be exposed to non-root userspace processes, in a way that doesn't break suid root processes. The access control you are adding to user namespaces looks to take that away. It looks to remove the whole point of user namespaces. So without any mention of how people intend to use this feature, without any code that uses this hook to implement semantics. Without any talk about how this semantic change is reasonable. I strenuously object. Eric
On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > Paul Moore <paul@paul-moore.com> writes: > > On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Paul Moore <paul@paul-moore.com> writes: > >> > >> > At the end of the v4 patchset I suggested merging this into lsm/next > >> > so it could get a full -rc cycle in linux-next, assuming no issues > >> > were uncovered during testing > >> > >> What in the world can be uncovered in linux-next for code that has no in > >> tree users. > > > > The patchset provides both BPF LSM and SELinux implementations of the > > hooks along with a BPF LSM test under tools/testing/selftests/bpf/. > > If no one beats me to it, I plan to work on adding a test to the > > selinux-testsuite as soon as I'm done dealing with other urgent > > LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I > > run these tests multiple times a week (multiple times a day sometimes) > > against the -rcX kernels with the lsm/next, selinux/next, and > > audit/next branches applied on top. I know others do similar things. > > A layer of hooks that leaves all of the logic to userspace is not an > in-tree user for purposes of understanding the logic of the code. The BPF LSM selftests which are part of this patchset live in-tree. The SELinux hook implementation is completely in-tree with the subject/verb/object relationship clearly described by the code itself. After all, the selinux_userns_create() function consists of only two lines, one of which is an assignment. Yes, it is true that the SELinux policy lives outside the kernel, but that is because there is no singular SELinux policy for everyone. From a practical perspective, the SELinux policy is really just a configuration file used to setup the kernel at runtime; it is not significantly different than an iptables script, /etc/sysctl.conf, or any of the other myriad of configuration files used to configure the kernel during boot.
Paul Moore <paul@paul-moore.com> writes: > On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> Paul Moore <paul@paul-moore.com> writes: >> > On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> Paul Moore <paul@paul-moore.com> writes: >> >> >> >> > At the end of the v4 patchset I suggested merging this into lsm/next >> >> > so it could get a full -rc cycle in linux-next, assuming no issues >> >> > were uncovered during testing >> >> >> >> What in the world can be uncovered in linux-next for code that has no in >> >> tree users. >> > >> > The patchset provides both BPF LSM and SELinux implementations of the >> > hooks along with a BPF LSM test under tools/testing/selftests/bpf/. >> > If no one beats me to it, I plan to work on adding a test to the >> > selinux-testsuite as soon as I'm done dealing with other urgent >> > LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I >> > run these tests multiple times a week (multiple times a day sometimes) >> > against the -rcX kernels with the lsm/next, selinux/next, and >> > audit/next branches applied on top. I know others do similar things. >> >> A layer of hooks that leaves all of the logic to userspace is not an >> in-tree user for purposes of understanding the logic of the code. > > The BPF LSM selftests which are part of this patchset live in-tree. > The SELinux hook implementation is completely in-tree with the > subject/verb/object relationship clearly described by the code itself. > After all, the selinux_userns_create() function consists of only two > lines, one of which is an assignment. Yes, it is true that the > SELinux policy lives outside the kernel, but that is because there is > no singular SELinux policy for everyone. 
From a practical > perspective, the SELinux policy is really just a configuration file > used to setup the kernel at runtime; it is not significantly different > than an iptables script, /etc/sysctl.conf, or any of the other myriad > of configuration files used to configure the kernel during boot. I object to adding the new system configuration knob. Especially when I don't see people explaining why such a knob is a good idea. What is userspace going to do with this new feature that makes it worth maintaining in the kernel? That is always the conversation we have when adding new features, and that is exactly the conversation that has not happened here. Adding a layer of indirection should not exempt a new feature from needing to justify itself. Eric
On Wed, Aug 17, 2022 at 5:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > I object to adding the new system configuration knob. > > Especially when I don't see people explaining why such a knob is a good > idea. What is userspace going to do with this new feature that makes it > worth maintaining in the kernel? From https://lore.kernel.org/all/CAEiveUdPhEPAk7Y0ZXjPsD=Vb5hn453CHzS9aG-tkyRa8bf_eg@mail.gmail.com/ "We have valid use cases not specifically related to the attack surface, but go into the middle from bpf observability to enforcement. As we want to track namespace creation, changes, nesting and per task creds context depending on the nature of the workload." -Djalal Harouni From https://lore.kernel.org/linux-security-module/CALrw=nGT0kcHh4wyBwUF-Q8+v8DgnyEJM55vfmABwfU67EQn=g@mail.gmail.com/ "[W]e do want to embrace user namespaces in our code and some of our workloads already depend on it. Hence we didn't agree to Debian's approach of just having a global sysctl. But there is "our code" and there is "third party" code, which might not even be open source due to various reasons. And while the path exists for that code to do something bad - we want to block it." -Ignat Korchagin From https://lore.kernel.org/linux-security-module/CAHC9VhSKmqn5wxF3BZ67Z+-CV7sZzdnO+JODq48rZJ4WAe8ULA@mail.gmail.com/ "I've heard you talk about bugs being the only reason why people would want to ever block user namespaces, but I think we've all seen use cases now where it goes beyond that. However, even if it didn't, the need to build high confidence/assurance systems where big chunks of functionality can be disabled based on a security policy is a very real use case, and this patchset would help enable that." 
-Paul Moore (with apologies for self-quoting) From https://lore.kernel.org/linux-security-module/CAHC9VhRSCXCM51xpOT95G_WVi=UQ44gNV=uvvG23p8wn16uYSA@mail.gmail.com/ "One of the selling points of the BPF LSM is that it allows for various different ways of reporting and logging beyond audit. However, even if it was limited to just audit I believe that provides some useful justification as auditing fork()/clone() isn't quite the same and could be difficult to do at scale in some configurations." -Paul Moore (my apologies again) From https://lore.kernel.org/linux-security-module/20220722082159.jgvw7jgds3qwfyqk@wittgenstein/ "Nice and straightforward." -Christian Brauner
Hi, Please remove me from this list and stop harassing me. Jonathan Moore -----Original Message----- From: Paul Moore <paul@paul-moore.com> Sent: Wednesday, August 17, 2022 5:51 PM To: Eric W. Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Frederick Lawler <fred@cloudflare.com>; kpsingh@kernel.org; revest@chromium.org; jackmanb@chromium.org; ast@kernel.org; daniel@iogearbox.net; andrii@kernel.org; kafai@fb.com; songliubraving@fb.com; yhs@fb.com; john.fastabend@gmail.com; jmorris@namei.org; serge@hallyn.com; stephen.smalley.work@gmail.com; eparis@parisplace.org; shuah@kernel.org; brauner@kernel.org; casey@schaufler-ca.com; bpf@vger.kernel.org; linux-security-module@vger.kernel.org; selinux@vger.kernel.org; linux-kselftest@vger.kernel.org; linux-kernel@vger.kernel.org; netdev@vger.kernel.org; kernel-team@cloudflare.com; cgzones@googlemail.com; karl@bigbadwolfsecurity.com; tixxdz@gmail.com Subject: Re: [PATCH v5 0/4] Introduce security_create_user_ns()
On Wed, Aug 17, 2022 at 04:24:28PM -0500, Eric W. Biederman wrote: > Paul Moore <paul@paul-moore.com> writes: > > > On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Paul Moore <paul@paul-moore.com> writes: > >> > On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > >> >> Paul Moore <paul@paul-moore.com> writes: > >> >> > >> >> > At the end of the v4 patchset I suggested merging this into lsm/next > >> >> > so it could get a full -rc cycle in linux-next, assuming no issues > >> >> > were uncovered during testing > >> >> > >> >> What in the world can be uncovered in linux-next for code that has no in > >> >> tree users. > >> > > >> > The patchset provides both BPF LSM and SELinux implementations of the > >> > hooks along with a BPF LSM test under tools/testing/selftests/bpf/. > >> > If no one beats me to it, I plan to work on adding a test to the > >> > selinux-testsuite as soon as I'm done dealing with other urgent > >> > LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I > >> > run these tests multiple times a week (multiple times a day sometimes) > >> > against the -rcX kernels with the lsm/next, selinux/next, and > >> > audit/next branches applied on top. I know others do similar things. > >> > >> A layer of hooks that leaves all of the logic to userspace is not an > >> in-tree user for purposes of understanding the logic of the code. > > > > The BPF LSM selftests which are part of this patchset live in-tree. > > The SELinux hook implementation is completely in-tree with the > > subject/verb/object relationship clearly described by the code itself. > > After all, the selinux_userns_create() function consists of only two > > lines, one of which is an assignment. Yes, it is true that the > > SELinux policy lives outside the kernel, but that is because there is > > no singular SELinux policy for everyone. 
From a practical > > perspective, the SELinux policy is really just a configuration file > > used to setup the kernel at runtime; it is not significantly different > > than an iptables script, /etc/sysctl.conf, or any of the other myriad > > of configuration files used to configure the kernel during boot. > > I object to adding the new system configuration knob. I do strongly sympathize with Eric's points. It will be very easy, once user namespace creation has been further restricted in some distros, to say "well see this stuff is silly" and go back to simply requiring root to create all containers and namespaces, which is generally quite a bit easier anyway. And then, of course, give everyone root so they can start containers. As Eric said, | Further adding a random failure mode to user namespace creation if it is | used at all will just encourage userspace to use a setuid application to | perform the namespace creation instead. Creating a less secure system | overall. However, I'm also looking at e.g. CVE-2022-2588 and CVE-2022-2586, and yes there are two issues which do require discussion (three if you count reportability, which is mainly a tool in guarding against the others). The first is, indeed, configuration knobs. There are tools, including chrome, which use user namespaces to make things better. The hope is that more and more tools will do so. The second is damage control. When an 0day has been announced, things change. You can say "well the bug was there all along", but it is different when every lazy ne'erdowell can pick an exploit off a mailing list and use it against a product for which spinning a new version with a new kernel and getting customers to update is probably a months-long endeavor. Some of these products do in fact require namespaces (user and otherwise) as part of their function.
And - to my chagrin - I suspect most of them create user namespaces as the root user, before possibly processing untrusted user input, so unprivileged_userns_clone isn't a good fit. SELinux (and LSMs in general) do in fact seem like a useful place to add some configuration, because they tend to assign different domains to tasks with different purposes and trust levels. But another such place is the init system / service manager. And in most cases these days, this will use cgroups to collect tasks of certain types. So I wonder (this is ALMOST ENTIRELY thinking out loud, not thought through sufficiently) whether we should be setting a cgroup.nslock or somesuch. Of course, kernel livepatch is another potentially useful mitigation. Currently that's not possible for everyone. Maybe there is a more fundamental way we can approach this. Part of me still likes the idea of splitting the id mapping and capability-in-userns parts, but that's not sufficient. Maybe looking over all the relevant CVEs would give a better hint. Eric, you said | If the concern is to reduce the attack surface everything this | proposed hook can do is already possible with the security_capable | security hook. I suppose I could envision an LSM which gets activated when we find out there was a net-ns-exacerbated 0-day, which refuses CAP_NET_ADMIN for a task not in init_user_ns? Ideally it would be more flexible than that. > idea. What is userspace going to do with this new feature that makes it > worth maintaining in the kernel? > > That is always the conversation we have when adding new features, and > that is exactly the conversation that has not happened here. Eric and Paul, I wonder, will you - or some people you'd like to represent you - be at plumbers in September? Should there be a BOF session there? (I won't be there, but could join over video) I think a brainstorming session for solutions to the above problems would be good.
> Adding a layer of indirection should not exempt a new feature from > needing to justify itself. > > Eric
On Thu, Aug 18, 2022 at 10:05 AM Serge E. Hallyn <serge@hallyn.com> wrote: > On Wed, Aug 17, 2022 at 04:24:28PM -0500, Eric W. Biederman wrote: > > Paul Moore <paul@paul-moore.com> writes: > > > On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > >> Paul Moore <paul@paul-moore.com> writes: > > >> > On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > >> >> Paul Moore <paul@paul-moore.com> writes: > > >> >> > > >> >> > At the end of the v4 patchset I suggested merging this into lsm/next > > >> >> > so it could get a full -rc cycle in linux-next, assuming no issues > > >> >> > were uncovered during testing > > >> >> > > >> >> What in the world can be uncovered in linux-next for code that has no in > > >> >> tree users. > > >> > > > >> > The patchset provides both BPF LSM and SELinux implementations of the > > >> > hooks along with a BPF LSM test under tools/testing/selftests/bpf/. > > >> > If no one beats me to it, I plan to work on adding a test to the > > >> > selinux-testsuite as soon as I'm done dealing with other urgent > > >> > LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I > > >> > run these tests multiple times a week (multiple times a day sometimes) > > >> > against the -rcX kernels with the lsm/next, selinux/next, and > > >> > audit/next branches applied on top. I know others do similar things. > > >> > > >> A layer of hooks that leaves all of the logic to userspace is not an > > >> in-tree user for purposes of understanding the logic of the code. > > > > > > The BPF LSM selftests which are part of this patchset live in-tree. > > > The SELinux hook implementation is completely in-tree with the > > > subject/verb/object relationship clearly described by the code itself. > > > After all, the selinux_userns_create() function consists of only two > > > lines, one of which is an assignment. 
Yes, it is true that the > > > SELinux policy lives outside the kernel, but that is because there is > > > no singular SELinux policy for everyone. From a practical > > > perspective, the SELinux policy is really just a configuration file > > > used to setup the kernel at runtime; it is not significantly different > > > than an iptables script, /etc/sysctl.conf, or any of the other myriad > > > of configuration files used to configure the kernel during boot. > > > > I object to adding the new system configuration knob. > > I do strongly sympathize with Eric's points. It will be very easy, once > user namespace creation has been further restricted in some distros, to > say "well see this stuff is silly" and go back to simply requiring root > to create all containers and namespaces, which is generally quite a bit > easier anyway. And then, of course, give everyone root so they can > start containers. That's assuming a lot. Many years have passed since namespaces were first introduced, and awareness of good security practices has improved, perhaps not as much as any of us would like, but to say that distros, system builders, and even users are the same as they were so many years ago is a bit of a stretch in my opinion. However, even ignoring that for a moment, do we really want to go to a place where we dictate how users compose and secure their systems? Linux "took over the world" because it offered a level of flexibility that wasn't really possible before, and it has flourished because it has kept that mentality. The Linux Kernel can be shoehorned onto most hardware that you can get your hands on these days, with driver support for most anything you can think to plug into the system. Do you want a single-user environment with no per-user separation? We can do that. Do you want a traditional DAC based system that leans heavily on ACLs and capabilities? We can do that.
Do you want a container host that allows you to carve up the system with a high degree of granularity thanks to the different namespaces? We can do that. How about a system that leverages the LSM to enforce a least privilege ideal, even on the most privileged root user? We can do that too. This patchset is about giving distro, system builders, and users another choice in how they build their system. We've seen both in this patchset and in previously failed attempts that there is a definite want from a user perspective for functionality such as this, and I think it's time we deliver it in the upstream kernel so they don't have to keep patching their own systems with out-of-tree patches. > Eric and Paul, I wonder, will you - or some people you'd like to represent > you - be at plumbers in September? Should there be a BOF session there? (I > won't be there, but could join over video) I think a brainstorming session > for solutions to the above problems would be good. Regardless of if Eric or I will be at LPC, it is doubtful that all of the people who have participated in this discussion will be able to attend, and I think it's important that the users who are asking for this patchset have a chance to be heard in each forum where this is discussed. While conferences are definitely nice - I definitely missed them over the past couple of years - we can't use them as a crutch to help us reach a conclusion on this issue; we've debated much more difficult things over the mailing lists, I see no reason why this would be any different.
On Thu, Aug 18, 2022 at 11:11:06AM -0400, Paul Moore wrote: > On Thu, Aug 18, 2022 at 10:05 AM Serge E. Hallyn <serge@hallyn.com> wrote: > > On Wed, Aug 17, 2022 at 04:24:28PM -0500, Eric W. Biederman wrote: > > > Paul Moore <paul@paul-moore.com> writes: > > > > On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > > >> Paul Moore <paul@paul-moore.com> writes: > > > >> > On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > > >> >> Paul Moore <paul@paul-moore.com> writes: > > > >> >> > > > >> >> > At the end of the v4 patchset I suggested merging this into lsm/next > > > >> >> > so it could get a full -rc cycle in linux-next, assuming no issues > > > >> >> > were uncovered during testing > > > >> >> > > > >> >> What in the world can be uncovered in linux-next for code that has no in > > > >> >> tree users. > > > >> > > > > >> > The patchset provides both BPF LSM and SELinux implementations of the > > > >> > hooks along with a BPF LSM test under tools/testing/selftests/bpf/. > > > >> > If no one beats me to it, I plan to work on adding a test to the > > > >> > selinux-testsuite as soon as I'm done dealing with other urgent > > > >> > LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I > > > >> > run these tests multiple times a week (multiple times a day sometimes) > > > >> > against the -rcX kernels with the lsm/next, selinux/next, and > > > >> > audit/next branches applied on top. I know others do similar things. > > > >> > > > >> A layer of hooks that leaves all of the logic to userspace is not an > > > >> in-tree user for purposes of understanding the logic of the code. > > > > > > > > The BPF LSM selftests which are part of this patchset live in-tree. > > > > The SELinux hook implementation is completely in-tree with the > > > > subject/verb/object relationship clearly described by the code itself. 
> > > > After all, the selinux_userns_create() function consists of only two > > > > lines, one of which is an assignment. Yes, it is true that the > > > > SELinux policy lives outside the kernel, but that is because there is > > > > no singular SELinux policy for everyone. From a practical > > > > perspective, the SELinux policy is really just a configuration file > > > > used to set up the kernel at runtime; it is not significantly different > > > > than an iptables script, /etc/sysctl.conf, or any of the other myriad > > > > of configuration files used to configure the kernel during boot. > > > > > > I object to adding the new system configuration knob. > > > > I do strongly sympathize with Eric's points. It will be very easy, once > > user namespace creation has been further restricted in some distros, to > > say "well see this stuff is silly" and go back to simply requiring root > > to create all containers and namespaces, which is generally quite a bit > > easier anyway. And then, of course, give everyone root so they can > > start containers. > > That's assuming a lot. Many years have passed since namespaces were > first introduced, and awareness of good security practices has > improved, perhaps not as much as any of us would like, but to say that > distros, system builders, and even users are the same as they were so > many years ago is a bit of a stretch in my opinion.
Maybe. But I do get a bit worried based on some of what I've been reading in mailing lists lately. Kernel dev definitely moves like fashion - remember when every API should have its own filesystem? That was not a different group of people.
> However, even ignoring that for a moment, do we really want to go to a > place where we dictate how users compose and secure their systems? > Linux "took over the world" because it offered a level of flexibility > that wasn't really possible before, and it has flourished because it > has kept that mentality.
The Linux Kernel can be shoehorned onto most > hardware that you can get your hands on these days, with driver > support for most anything you can think to plug into the system. Do > you want a single-user environment with no per-user separation? We > can do that. Do you want a traditional DAC based system that leans > heavily on ACLs and capabilities? We can do that. Do you want a > container host that allows you to carve up the system with a high > degree of granularity thanks to the different namespaces? We can do > that. How about a system that leverages the LSM to enforce a least > privilege ideal, even on the most privileged root user? We can do > that too. This patchset is about giving distro, system builders, and > users another choice in how they build their system. We've seen both
Oh, you misunderstand. Whereas I do feel there are important concerns in Eric's objections, and whereas I don't feel this set sufficiently addresses the problems that I see and outlined above, I do see value in this set, and was not aiming to deter it. We need better ways to mitigate a certain class of 0-days without completely disallowing use of user namespaces, and this may help.
> in this patchset and in previously failed attempts that there is a > definite want from a user perspective for functionality such as this, > and I think it's time we deliver it in the upstream kernel so they > don't have to keep patching their own systems with out-of-tree > patches. > > > Eric and Paul, I wonder, will you - or some people you'd like to represent > > you - be at plumbers in September? Should there be a BOF session there? (I > > won't be there, but could join over video) I think a brainstorming session > > for solutions to the above problems would be good.
> > Regardless of if Eric or I will be at LPC, it is doubtful that all of > the people who have participated in this discussion will be able to > attend, and I think it's important that the users who are asking for > this patchset have a chance to be heard in each forum where this is > discussed. While conferences are definitely nice - I definitely > missed them over the past couple of years - we can't use them as a > crutch to help us reach a conclusion on this issue; we've debated much No I wasn't thinking we would use LPC to decide on this patchset. As far as I can see, the patchset is merged. I am hoping we can come up with "something better" to address people's needs, make everyone happy, and bring forth world peace. Which would stack just fine with what's here for defense in depth. You may well not be interested in further work, and that's fine. I need to set aside a few days to think on this. > more difficult things over the mailing lists, I see no reason why this > would be any different. > > -- > paul-moore.com
On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: > On Thu, Aug 18, 2022 at 11:11:06AM -0400, Paul Moore wrote: > > On Thu, Aug 18, 2022 at 10:05 AM Serge E. Hallyn <serge@hallyn.com> wrote: ... > > > I do strongly sympathize with Eric's points. It will be very easy, once > > > user namespace creation has been further restricted in some distros, to > > > say "well see this stuff is silly" and go back to simply requiring root > > > to create all containers and namespaces, which is generally quite a bit > > > easier anyway. And then, of course, give everyone root so they can > > > start containers. > > > > That's assuming a lot. Many years have passed since namespaces were > > first introduced, and awareness of good security practices has > > improved, perhaps not as much as any of us would like, but to say that > > distros, system builders, and even users are the same as they were so > > many years ago is a bit of a stretch in my opinion. > > Maybe. But I do get a bit worried based on some of what I've been > reading in mailing lists lately. Kernel dev definitely moves like > fashion - remember when every API should have its own filesystem? > That was not a different group of people.
I'm not going to argue against the idea that kernel development is subject to fads; I just don't agree that adding an LSM control point for user namespace creation is going to be the end of user namespaces.
> > However, even ignoring that for a moment, do we really want to go to a > > place where we dictate how users compose and secure their systems? > > Linux "took over the world" because it offered a level of flexibility > > that wasn't really possible before, and it has flourished because it > > has kept that mentality. The Linux Kernel can be shoehorned onto most > > hardware that you can get your hands on these days, with driver > > support for most anything you can think to plug into the system.
Do > > you want a single-user environment with no per-user separation? We > > can do that. Do you want a traditional DAC based system that leans > > heavily on ACLs and capabilities? We can do that. Do you want a > > container host that allows you to carve up the system with a high > > degree of granularity thanks to the different namespaces? We can do > > that. How about a system that leverages the LSM to enforce a least > > privilege ideal, even on the most privileged root user? We can do > > that too. This patchset is about giving distro, system builders, and > > users another choice in how they build their system. We've seen both > > Oh, you misunderstand. Whereas I do feel there are important concerns in > Eric's objections, and whereas I don't feel this set sufficiently > addresses the problems that I see and outlined above, I do see value in > this set, and was not aiming to deter it. We need better ways to > mitigate a certain class of 0-days without completely disallowing use of > user namespaces, and this may help.
Ah, thanks for the explanation, I missed that (obviously) in your previous email. If I'm perfectly honest, I suppose the protracted debate with Eric has also left me a little overly sensitive to any perceived arguments against this patchset.
> > in this patchset and in previously failed attempts that there is a > > definite want from a user perspective for functionality such as this, > > and I think it's time we deliver it in the upstream kernel so they > > don't have to keep patching their own systems with out-of-tree > > patches. > > > > > Eric and Paul, I wonder, will you - or some people you'd like to represent > > > you - be at plumbers in September? Should there be a BOF session there? (I > > > won't be there, but could join over video) I think a brainstorming session > > > for solutions to the above problems would be good.
> > > > Regardless of if Eric or I will be at LPC, it is doubtful that all of > > the people who have participated in this discussion will be able to > > attend, and I think it's important that the users who are asking for > > this patchset have a chance to be heard in each forum where this is > > discussed. While conferences are definitely nice - I definitely > > missed them over the past couple of years - we can't use them as a > > crutch to help us reach a conclusion on this issue; we've debated much > > No I wasn't thinking we would use LPC to decide on this patchset. As far > as I can see, the patchset is merged. While I maintain that Frederick's patches are a good thing, I'm not going to consider them "merged" until I see them in Linus' tree or Linus decided to voice his support on the lists. These patches do have Eric's NACK, and a maintainer's NACK isn't something to take lightly. I certainly don't. > I am hoping we can come up with > "something better" to address people's needs, make everyone happy, and > bring forth world peace. Which would stack just fine with what's here > for defense in depth. > > You may well not be interested in further work, and that's fine. I need > to set aside a few days to think on this. I'm happy to continue the discussion as long as it's constructive; I think we all are. My gut feeling is that Frederick's approach falls closest to the sweet spot of "workable without being overly offensive" (*cough*), but if you've got an additional approach in mind, or an alternative approach that solves the same use case problems, I think we'd all love to hear about it.
Paul Moore <paul@paul-moore.com> writes: > On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: >> I am hoping we can come up with >> "something better" to address people's needs, make everyone happy, and >> bring forth world peace. Which would stack just fine with what's here >> for defense in depth. >> >> You may well not be interested in further work, and that's fine. I need >> to set aside a few days to think on this. > > I'm happy to continue the discussion as long as it's constructive; I > think we all are. My gut feeling is that Frederick's approach falls > closest to the sweet spot of "workable without being overly offensive" > (*cough*), but if you've got an additional approach in mind, or an > alternative approach that solves the same use case problems, I think > we'd all love to hear about it. I would love to actually hear the problems people are trying to solve so that we can have a sensible conversation about the trade offs. As best I can tell without more information people want to use the creation of a user namespace as a signal that the code is attempting an exploit. As such let me propose instead of returning an error code which will let the exploit continue, have the security hook return a bool. With true meaning the code can continue and on false it will trigger using SIGSYS to terminate the program like seccomp does. I am not super fond of that idea, but it means that userspace code is not expected to deal with the situation, and the only conversation a userspace application developer needs to enter into with a system administrator or security policy developer is one to prove they are not exploit code. Plus it makes much more sense to kill an exploit immediately instead of letting it run. 
In general when addressing code coverage concerns I think it makes more sense to use the security hooks to implement some variety of the principle of least privilege and only give applications access to the kernel facilities they are known to use. As far as I can tell creating a user namespace does not increase the attack surface. It is the creation of the other namespaces from a user namespace that begins to do that. So in general I would think restrictions should be in places they matter. Just like the bugs that have exploits that involve the user namespace are not user namespace bugs, but instead they are bugs in other subsystems that just happen to go through the user namespace as the easiest path to the buggy code, not the only path to the buggy code. Eric
On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > Paul Moore <paul@paul-moore.com> writes: > > On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: > >> I am hoping we can come up with > >> "something better" to address people's needs, make everyone happy, and > >> bring forth world peace. Which would stack just fine with what's here > >> for defense in depth. > >> > >> You may well not be interested in further work, and that's fine. I need > >> to set aside a few days to think on this. > > > > I'm happy to continue the discussion as long as it's constructive; I > > think we all are. My gut feeling is that Frederick's approach falls > > closest to the sweet spot of "workable without being overly offensive" > > (*cough*), but if you've got an additional approach in mind, or an > > alternative approach that solves the same use case problems, I think > > we'd all love to hear about it. > > I would love to actually hear the problems people are trying to solve so > that we can have a sensible conversation about the trade offs. Here are several taken from the previous threads, it's surely not a complete list, but it should give you a good idea: https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ > As best I can tell without more information people want to use > the creation of a user namespace as a signal that the code is > attempting an exploit. Some use cases are like that, there are several other use cases that go beyond this; see all of our previous discussions on this topic/patchset. As has been mentioned before, there are use cases that require improved observability, access control, or both. > As such let me propose instead of returning an error code which will let > the exploit continue, have the security hook return a bool. 
With true > meaning the code can continue and on false it will trigger using SIGSYS > to terminate the program like seccomp does. Having the kernel forcibly exit the process isn't something that most LSMs would likely want. I suppose we could modify the hook/caller so that *if* an LSM wanted to return SIGSYS the system would kill the process, but I would want that to be something in addition to returning an error code like LSMs normally do (e.g. EACCES).
> On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote: > > On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> Paul Moore <paul@paul-moore.com> writes: >>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: >>>> I am hoping we can come up with >>>> "something better" to address people's needs, make everyone happy, and >>>> bring forth world peace. Which would stack just fine with what's here >>>> for defense in depth. >>>> >>>> You may well not be interested in further work, and that's fine. I need >>>> to set aside a few days to think on this. >>> >>> I'm happy to continue the discussion as long as it's constructive; I >>> think we all are. My gut feeling is that Frederick's approach falls >>> closest to the sweet spot of "workable without being overly offensive" >>> (*cough*), but if you've got an additional approach in mind, or an >>> alternative approach that solves the same use case problems, I think >>> we'd all love to hear about it. >> >> I would love to actually hear the problems people are trying to solve so >> that we can have a sensible conversation about the trade offs. > > Here are several taken from the previous threads, it's surely not a > complete list, but it should give you a good idea: > > https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ > >> As best I can tell without more information people want to use >> the creation of a user namespace as a signal that the code is >> attempting an exploit. > > Some use cases are like that, there are several other use cases that > go beyond this; see all of our previous discussions on this > topic/patchset. As has been mentioned before, there are use cases > that require improved observability, access control, or both. > >> As such let me propose instead of returning an error code which will let >> the exploit continue, have the security hook return a bool. 
With true >> meaning the code can continue and on false it will trigger using SIGSYS >> to terminate the program like seccomp does. > > Having the kernel forcibly exit the process isn't something that most > LSMs would likely want. I suppose we could modify the hook/caller so > that *if* an LSM wanted to return SIGSYS the system would kill the > process, but I would want that to be something in addition to > returning an error code like LSMs normally do (e.g. EACCES).
I am new to user_namespace and security work, so please pardon me if anything below is very wrong.
IIUC, user_namespace is a tool that enables trusted userspace code to control the behavior of untrusted (or less trusted) userspace code. Failing create_user_ns() doesn't make the system more reliable. Specifically, we call create_user_ns() via two paths: fork/clone and unshare. For both paths, we need the userspace to use user_namespace, and to honor failed create_user_ns().
On the other hand, I would echo that killing the process is not practical in some use cases. Specifically, allowing the application to run in a less secure environment for a short period of time might be much better than killing it and taking down the whole service. Of course, there are other cases where security is more important, and taking down the whole service is the better choice.
I guess the ultimate solution is a way to enforce using user_namespace in the kernel (if it ever makes sense...). But I don't know how that is going to work. Until we have such a solution, maybe we only need a void hook for observability (or just a tracepoint; I am coming from a BPF background).
Thanks, Song
On Thu, Aug 25, 2022 at 5:58 PM Song Liu <songliubraving@fb.com> wrote: > > On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote: > > > > On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Paul Moore <paul@paul-moore.com> writes: > >>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: > >>>> I am hoping we can come up with > >>>> "something better" to address people's needs, make everyone happy, and > >>>> bring forth world peace. Which would stack just fine with what's here > >>>> for defense in depth. > >>>> > >>>> You may well not be interested in further work, and that's fine. I need > >>>> to set aside a few days to think on this. > >>> > >>> I'm happy to continue the discussion as long as it's constructive; I > >>> think we all are. My gut feeling is that Frederick's approach falls > >>> closest to the sweet spot of "workable without being overly offensive" > >>> (*cough*), but if you've got an additional approach in mind, or an > >>> alternative approach that solves the same use case problems, I think > >>> we'd all love to hear about it. > >> > >> I would love to actually hear the problems people are trying to solve so > >> that we can have a sensible conversation about the trade offs. > > > > Here are several taken from the previous threads, it's surely not a > > complete list, but it should give you a good idea: > > > > https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ > > > >> As best I can tell without more information people want to use > >> the creation of a user namespace as a signal that the code is > >> attempting an exploit. > > > > Some use cases are like that, there are several other use cases that > > go beyond this; see all of our previous discussions on this > > topic/patchset. As has been mentioned before, there are use cases > > that require improved observability, access control, or both. 
> > > >> As such let me propose instead of returning an error code which will let > >> the exploit continue, have the security hook return a bool. With true > >> meaning the code can continue and on false it will trigger using SIGSYS > >> to terminate the program like seccomp does. > > > > Having the kernel forcibly exit the process isn't something that most > > LSMs would likely want. I suppose we could modify the hook/caller so > > that *if* an LSM wanted to return SIGSYS the system would kill the > > process, but I would want that to be something in addition to > > returning an error code like LSMs normally do (e.g. EACCES). > > I am new to user_namespace and security work, so please pardon me if > anything below is very wrong. > > IIUC, user_namespace is a tool that enables trusted userspace code to > control the behavior of untrusted (or less trusted) userspace code. > Failing create_user_ns() doesn't make the system more reliable. > Specifically, we call create_user_ns() via two paths: fork/clone and > unshare. For both paths, we need the userspace to use user_namespace, > and to honor failed create_user_ns(). > > On the other hand, I would echo that killing the process is not > practical in some use cases. Specifically, allowing the application to > run in a less secure environment for a short period of time might be > much better than killing it and taking down the whole service. Of > course, there are other cases that security is more important, and > taking down the whole service is the better choice. > > I guess the ultimate solution is a way to enforce using user_namespace > in the kernel (if it ever makes sense...). The LSM framework, and the BPF and SELinux LSM implementations in this patchset, provide a mechanism to do just that: kernel enforced access controls using flexible security policies which can be tailored by the distro, solution provider, or end user to meet the specific needs of their use case.
> On Aug 25, 2022, at 3:10 PM, Paul Moore <paul@paul-moore.com> wrote: > > On Thu, Aug 25, 2022 at 5:58 PM Song Liu <songliubraving@fb.com> wrote: >>> On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote: >>> >>> On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >>>> Paul Moore <paul@paul-moore.com> writes: >>>>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: >>>>>> I am hoping we can come up with >>>>>> "something better" to address people's needs, make everyone happy, and >>>>>> bring forth world peace. Which would stack just fine with what's here >>>>>> for defense in depth. >>>>>> >>>>>> You may well not be interested in further work, and that's fine. I need >>>>>> to set aside a few days to think on this. >>>>> >>>>> I'm happy to continue the discussion as long as it's constructive; I >>>>> think we all are. My gut feeling is that Frederick's approach falls >>>>> closest to the sweet spot of "workable without being overly offensive" >>>>> (*cough*), but if you've got an additional approach in mind, or an >>>>> alternative approach that solves the same use case problems, I think >>>>> we'd all love to hear about it. >>>> >>>> I would love to actually hear the problems people are trying to solve so >>>> that we can have a sensible conversation about the trade offs. >>> >>> Here are several taken from the previous threads, it's surely not a >>> complete list, but it should give you a good idea: >>> >>> https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ >>> >>>> As best I can tell without more information people want to use >>>> the creation of a user namespace as a signal that the code is >>>> attempting an exploit. >>> >>> Some use cases are like that, there are several other use cases that >>> go beyond this; see all of our previous discussions on this >>> topic/patchset. 
As has been mentioned before, there are use cases >>> that require improved observability, access control, or both. >>> >>>> As such let me propose instead of returning an error code which will let >>>> the exploit continue, have the security hook return a bool. With true >>>> meaning the code can continue and on false it will trigger using SIGSYS >>>> to terminate the program like seccomp does. >>> >>> Having the kernel forcibly exit the process isn't something that most >>> LSMs would likely want. I suppose we could modify the hook/caller so >>> that *if* an LSM wanted to return SIGSYS the system would kill the >>> process, but I would want that to be something in addition to >>> returning an error code like LSMs normally do (e.g. EACCES). >> >> I am new to user_namespace and security work, so please pardon me if >> anything below is very wrong. >> >> IIUC, user_namespace is a tool that enables trusted userspace code to >> control the behavior of untrusted (or less trusted) userspace code. >> Failing create_user_ns() doesn't make the system more reliable. >> Specifically, we call create_user_ns() via two paths: fork/clone and >> unshare. For both paths, we need the userspace to use user_namespace, >> and to honor failed create_user_ns(). >> >> On the other hand, I would echo that killing the process is not >> practical in some use cases. Specifically, allowing the application to >> run in a less secure environment for a short period of time might be >> much better than killing it and taking down the whole service. Of >> course, there are other cases that security is more important, and >> taking down the whole service is the better choice. >> >> I guess the ultimate solution is a way to enforce using user_namespace >> in the kernel (if it ever makes sense...). 
> > The LSM framework, and the BPF and SELinux LSM implementations in this > patchset, provide a mechanism to do just that: kernel enforced access > controls using flexible security policies which can be tailored by the > distro, solution provider, or end user to meet the specific needs of > their use case.
In this case, I wouldn't say the kernel is enforcing access control. (I might be wrong). There are 3 components here: kernel, LSM, and trusted userspace (whoever calls unshare). AFAICT, the kernel simply passes the decision made by the LSM (BPF or SELinux) to the trusted userspace. It is up to the trusted userspace to honor the return value of unshare(). If the userspace simply ignores unshare failures, or does not call unshare(CLONE_NEWUSER), the kernel and LSM cannot do much about it, right?
This might still be useful in some cases. (I am far from an expert on these). I just feel this is not the typical solution to enforce something.
Thanks, Song
PS: If I said something very stupid, I would not be offended if someone pointed it out. :)
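For readers following the "honor the return value" point, here is a minimal userspace sketch of what a caller sees when user namespace creation is refused. It is an illustration only, not code from the patchset: probe_userns() is a hypothetical helper name, and which errno a given LSM policy would produce on denial is an assumption (Paul mentions EACCES as the usual kind of value). The probe runs in a throwaway child so the calling process never changes namespaces, and it uses the raw unshare system call; the unshare(2) wrapper from <sched.h> with _GNU_SOURCE is the more common spelling.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_NEWUSER */
#include <sys/syscall.h>   /* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to create a user namespace in a child process.
 * Returns 0 if the namespace was created, the errno value reported by the
 * kernel if creation was refused, or -1 if the probe itself could not run
 * (fork/waitpid failed, or the child was killed by a signal).
 */
int probe_userns(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: attempt the unshare and report the outcome via exit code. */
        long rc = syscall(SYS_unshare, CLONE_NEWUSER);
        _exit(rc == 0 ? 0 : (errno & 0xff));
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status))
        return -1;              /* e.g. terminated by a signal */
    return WEXITSTATUS(status);
}
```

A caller that gets a non-zero result has to decide for itself whether to abort or to continue without the namespace, which is exactly the choice Song describes as being left to userspace. A process that never attempts the call is not affected by the hook at all.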
On Thu, Aug 25, 2022 at 8:19 PM Paul Moore <paul@paul-moore.com> wrote: > > On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Paul Moore <paul@paul-moore.com> writes: > > > On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: > > >> I am hoping we can come up with > > >> "something better" to address people's needs, make everyone happy, and > > >> bring forth world peace. Which would stack just fine with what's here > > >> for defense in depth. > > >> > > >> You may well not be interested in further work, and that's fine. I need > > >> to set aside a few days to think on this. > > > > > > I'm happy to continue the discussion as long as it's constructive; I > > > think we all are. My gut feeling is that Frederick's approach falls > > > closest to the sweet spot of "workable without being overly offensive" > > > (*cough*), but if you've got an additional approach in mind, or an > > > alternative approach that solves the same use case problems, I think > > > we'd all love to hear about it. > > > > I would love to actually hear the problems people are trying to solve so > > that we can have a sensible conversation about the trade offs. > > Here are several taken from the previous threads, it's surely not a > complete list, but it should give you a good idea: > > https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ > > > As best I can tell without more information people want to use > > the creation of a user namespace as a signal that the code is > > attempting an exploit. > > Some use cases are like that, there are several other use cases that > go beyond this; see all of our previous discussions on this > topic/patchset. As has been mentioned before, there are use cases > that require improved observability, access control, or both. 
>
> > As such let me propose instead of returning an error code which will let
> > the exploit continue, have the security hook return a bool. With true
> > meaning the code can continue and on false it will trigger using SIGSYS
> > to terminate the program like seccomp does.
>
> Having the kernel forcibly exit the process isn't something that most
> LSMs would likely want. I suppose we could modify the hook/caller so
> that *if* an LSM wanted to return SIGSYS the system would kill the
> process, but I would want that to be something in addition to
> returning an error code like LSMs normally do (e.g. EACCES).

I would also add here that seccomp allows more flexibility than just
delivering SIGSYS to a violating application. We can program seccomp
bpf to:
  * deliver a signal
  * return a CUSTOM error code (and BTW somehow this does not trigger
    any requirements to change userapi or document in manpages: in my toy
    example in [1] I'm delivering ENETDOWN from a uname(2) system call,
    which is not documented in the man pages, but totally valid from a
    seccomp usage perspective)
  * do-nothing, but log the action

So I would say the seccomp reference supports the current approach
more than the alternative approach of delivering SIGSYS as technically
an LSM implementation of the hook (at least in-kernel one) can choose
to deliver a signal to a task via kernel-api, but BPF-LSM (and others)
can deliver custom error codes and log the actions as well.

Ignat

> --
> paul-moore.com

[1]: https://blog.cloudflare.com/sandboxing-in-linux-with-zero-lines-of-code/
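To make the "custom error code" action concrete, here is a minimal sketch
(it is not the example from [1]) of a seccomp filter that fails uname(2)
with ENETDOWN. The helper name uname_errno_under_filter() is made up for
illustration, the filter is installed in a forked child so the caller is
left unfiltered, and the architecture check that a production filter should
perform is omitted for brevity.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, install a filter there that fails
 * uname(2) with ENETDOWN, and return the errno the child observed.
 * Returns 0 if uname() succeeded and -1 if the filter could not be
 * installed or the child did not exit normally.
 */
static int uname_errno_under_filter(void)
{
	struct sock_filter insns[] = {
		/* Load the syscall number from struct seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* uname(2): fall through to the ERRNO action, else skip it. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_uname, 0, 1),
		/* Other actions exist, e.g. SECCOMP_RET_TRAP (SIGSYS) or
		 * SECCOMP_RET_LOG (allow, but log). */
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (ENETDOWN & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		struct utsname u;

		/* Unprivileged tasks must set no_new_privs before loading a filter. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
		    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
			_exit(255);
		_exit(uname(&u) == 0 ? 0 : errno);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

On a system where the filter can be installed, the helper should return
ENETDOWN even though uname(2) does not document that error.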
On Thu, Aug 25, 2022 at 6:42 PM Song Liu <songliubraving@fb.com> wrote: > > On Aug 25, 2022, at 3:10 PM, Paul Moore <paul@paul-moore.com> wrote: > > On Thu, Aug 25, 2022 at 5:58 PM Song Liu <songliubraving@fb.com> wrote: ... > >> I am new to user_namespace and security work, so please pardon me if > >> anything below is very wrong. > >> > >> IIUC, user_namespace is a tool that enables trusted userspace code to > >> control the behavior of untrusted (or less trusted) userspace code. > >> Failing create_user_ns() doesn't make the system more reliable. > >> Specifically, we call create_user_ns() via two paths: fork/clone and > >> unshare. For both paths, we need the userspace to use user_namespace, > >> and to honor failed create_user_ns(). > >> > >> On the other hand, I would echo that killing the process is not > >> practical in some use cases. Specifically, allowing the application to > >> run in a less secure environment for a short period of time might be > >> much better than killing it and taking down the whole service. Of > >> course, there are other cases that security is more important, and > >> taking down the whole service is the better choice. > >> > >> I guess the ultimate solution is a way to enforce using user_namespace > >> in the kernel (if it ever makes sense...). > > > > The LSM framework, and the BPF and SELinux LSM implementations in this > > patchset, provide a mechanism to do just that: kernel enforced access > > controls using flexible security policies which can be tailored by the > > distro, solution provider, or end user to meet the specific needs of > > their use case. > > In this case, I wouldn't call the kernel is enforcing access control. > (I might be wrong). There are 3 components here: kernel, LSM, and > trusted userspace (whoever calls unshare). The LSM layer, and the LSMs themselves are part of the kernel; look at the changes in this patchset to see the LSM, BPF LSM, and SELinux kernel changes. 
Explaining how the different LSMs work is quite a bit
beyond the scope of this discussion, but there is plenty of
information available online that should be able to serve as an
introduction, not to mention the kernel source itself. However, in
very broad terms you can think of the individual LSMs as somewhat
analogous to filesystem drivers, e.g. ext4, and the LSM framework itself
as the VFS layer.

> AFAICT, kernel simply passes
> the decision made by LSM (BPF or SELinux) to the trusted userspace. It
> is up to the trusted userspace to honor the return value of unshare().

With an LSM enabled and enforcing a security policy on user namespace
creation, which appears to be the case of most concern, the kernel
would make a decision on the namespace creation based on various
factors (e.g. for SELinux this would be the calling process' security
domain and the domain's permission set as determined by the configured
security policy) and if the operation was rejected an error code would
be returned to userspace. It is the exact same thing as what would
happen if the calling process is chrooted or doesn't have a proper
UID/GID mapping. Don't forget that the create_user_ns() function
already enforces a security policy and returns errors to userspace;
this patchset doesn't add anything new in that regard, it just allows
for a richer and more flexible security policy to be built on top of
the existing constraints.

> If the userspace simply ignores unshare failures, or does not call
> unshare(CLONE_NEWUSER), kernel and LSM cannot do much about it, right?

The process is still subject to any security policies that are active
and being enforced by the kernel. A malicious or misconfigured
application can still be constrained by the kernel using both the
kernel's legacy Discretionary Access Controls (DAC) as well as the
more comprehensive Mandatory Access Controls (MAC) provided by many of
the LSMs.
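As a rough illustration of what "an error code would be returned to
userspace" looks like from the caller's side, the sketch below attempts
unshare(CLONE_NEWUSER) in a throwaway child and reports the outcome.
probe_userns_create() is a hypothetical helper name; which errno a given
LSM policy produces depends on that policy and is not assumed here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to create a user namespace in a child so the
 * caller's own namespaces are left alone. Returns 0 if the kernel allowed
 * the operation, the errno from unshare(2) if it was denied, or -1 if the
 * probe itself failed.
 */
static int probe_userns_create(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(unshare(CLONE_NEWUSER) == 0 ? 0 : errno);
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Whether or not the caller inspects the return value, a denied unshare()
means no new namespace exists; the task simply keeps running under its
existing credentials and the policies already applied to them.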
On Fri, Aug 26, 2022 at 5:11 AM Ignat Korchagin <ignat@cloudflare.com> wrote:
> I would also add here that seccomp allows more flexibility than just
> delivering SIGSYS to a violating application. We can program seccomp
> bpf to:
> * deliver a signal
> * return a CUSTOM error code (and BTW somehow this does not trigger
> any requirements to change userapi or document in manpages: in my toy
> example in [1] I'm delivering ENETDOWN from a uname(2) system call,
> which is not documented in the man pages, but totally valid from a
> seccomp usage perspective)
> * do-nothing, but log the action
>
> So I would say the seccomp reference supports the current approach
> more than the alternative approach of delivering SIGSYS as technically
> an LSM implementation of the hook (at least in-kernel one) can choose
> to deliver a signal to a task via kernel-api, but BPF-LSM (and others)
> can deliver custom error codes and log the actions as well.

I agree that seccomp mode 2 allows for more flexibility than was
mentioned earlier; however, seccomp filtering has some limitations in
this particular case which can be an issue for some.

The first, and perhaps most important, is that some of the information
that a seccomp filter might want to inspect is effectively hidden with
the clone3(2) syscall due to the clone_args struct; this would make it
difficult for a seccomp filter to identify namespace related operations.

The second issue is that a seccomp mode 2 based approach requires the
applications themselves to "Do The Right Thing" and ensure that the
proper seccomp filter is loaded into the kernel before the target
fork()/clone()/unshare() call is executed; an LSM which implements a
proper mandatory access control mechanism does not rely on the
application, it enforces the system's security policy regardless of
what actions userspace performs.
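The clone3(2) limitation can be pictured with a small userspace model of
what a filter is handed in struct seccomp_data. This is only a sketch: the
helpers view_of_unshare(), view_of_clone3() and filter_verdict() are
hypothetical names, and the structs are filled in by hand here rather than
by the kernel. The point is that unshare(2) passes its flags in a register,
while clone3(2) passes a pointer to struct clone_args, which a classic-BPF
seccomp filter cannot dereference.

```c
#include <stdint.h>
#include <string.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_NEWUSER */
#include <linux/seccomp.h>	/* struct seccomp_data */
#include <sys/syscall.h>

/* Model of the filter's view of unshare(flags): flags sit in args[0]. */
static struct seccomp_data view_of_unshare(unsigned long flags)
{
	struct seccomp_data sd;

	memset(&sd, 0, sizeof(sd));
	sd.nr = __NR_unshare;
	sd.args[0] = flags;
	return sd;
}

/* Model of the filter's view of clone3(args, size): pointer and size only. */
static struct seccomp_data view_of_clone3(const struct clone_args *args)
{
	struct seccomp_data sd;

	memset(&sd, 0, sizeof(sd));
	sd.nr = __NR_clone3;
	sd.args[0] = (uint64_t)(uintptr_t)args;
	sd.args[1] = sizeof(*args);
	return sd;
}

/*
 * What a filter limited to register values can conclude:
 * 1 = creates a user namespace, 0 = does not, -1 = cannot tell.
 */
static int filter_verdict(const struct seccomp_data *sd)
{
	if (sd->nr == __NR_unshare)
		return (sd->args[0] & CLONE_NEWUSER) != 0;
	if (sd->nr == __NR_clone3)
		return -1;	/* args[0] is a user pointer, flags are behind it */
	return 0;
}
```

In this model a clone3() call with CLONE_NEWUSER set in clone_args.flags
still yields -1, so a filter is left to allow or deny clone3(2) wholesale.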
On Thu, Aug 25, 2022 at 01:15:46PM -0500, Eric W. Biederman wrote:
> Paul Moore <paul@paul-moore.com> writes:
>
> > On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote:
> >> I am hoping we can come up with
> >> "something better" to address people's needs, make everyone happy, and
> >> bring forth world peace. Which would stack just fine with what's here
> >> for defense in depth.
> >>
> >> You may well not be interested in further work, and that's fine. I need
> >> to set aside a few days to think on this.
> >
> > I'm happy to continue the discussion as long as it's constructive; I
> > think we all are. My gut feeling is that Frederick's approach falls
> > closest to the sweet spot of "workable without being overly offensive"
> > (*cough*), but if you've got an additional approach in mind, or an
> > alternative approach that solves the same use case problems, I think
> > we'd all love to hear about it.
>
> I would love to actually hear the problems people are trying to solve so
> that we can have a sensible conversation about the trade offs.
>
> As best I can tell without more information people want to use
> the creation of a user namespace as a signal that the code is
> attempting an exploit.

I don't think that's it at all. I think the problem is that it seems
you can pretty reliably get a root shell at some point in the future
by creating a user namespace, leaving it open for a bit, and waiting
for a new announcement of the latest netfilter or whatever exploit
that requires root in a user namespace. Then go back to your userns
shell and run the exploit.

So I was hoping we could do something more targeted. Be it splitting
off the ability to run code under capable_ns code from uid mapping (to
an extent), or maybe some limited-livepatch type of thing where
certain parts of code become inaccessible to code in a non-init userns
after some sysctl has been toggled, or something cooler that I've
failed to think of.

-serge
On Thu, Aug 25, 2022 at 09:58:46PM +0000, Song Liu wrote: > > > > On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote: > > > > On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Paul Moore <paul@paul-moore.com> writes: > >>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: > >>>> I am hoping we can come up with > >>>> "something better" to address people's needs, make everyone happy, and > >>>> bring forth world peace. Which would stack just fine with what's here > >>>> for defense in depth. > >>>> > >>>> You may well not be interested in further work, and that's fine. I need > >>>> to set aside a few days to think on this. > >>> > >>> I'm happy to continue the discussion as long as it's constructive; I > >>> think we all are. My gut feeling is that Frederick's approach falls > >>> closest to the sweet spot of "workable without being overly offensive" > >>> (*cough*), but if you've got an additional approach in mind, or an > >>> alternative approach that solves the same use case problems, I think > >>> we'd all love to hear about it. > >> > >> I would love to actually hear the problems people are trying to solve so > >> that we can have a sensible conversation about the trade offs. > > > > Here are several taken from the previous threads, it's surely not a > > complete list, but it should give you a good idea: > > > > https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ > > > >> As best I can tell without more information people want to use > >> the creation of a user namespace as a signal that the code is > >> attempting an exploit. > > > > Some use cases are like that, there are several other use cases that > > go beyond this; see all of our previous discussions on this > > topic/patchset. As has been mentioned before, there are use cases > > that require improved observability, access control, or both. 
> > > >> As such let me propose instead of returning an error code which will let > >> the exploit continue, have the security hook return a bool. With true > >> meaning the code can continue and on false it will trigger using SIGSYS > >> to terminate the program like seccomp does. > > > > Having the kernel forcibly exit the process isn't something that most > > LSMs would likely want. I suppose we could modify the hook/caller so > > that *if* an LSM wanted to return SIGSYS the system would kill the > > process, but I would want that to be something in addition to > > returning an error code like LSMs normally do (e.g. EACCES). > > I am new to user_namespace and security work, so please pardon me if > anything below is very wrong. > > IIUC, user_namespace is a tool that enables trusted userspace code to > control the behavior of untrusted (or less trusted) userspace code. No. user namespaces are not a way for more trusted code to control the behavior of less trusted code. > Failing create_user_ns() doesn't make the system more reliable. > Specifically, we call create_user_ns() via two paths: fork/clone and > unshare. For both paths, we need the userspace to use user_namespace, > and to honor failed create_user_ns(). > > On the other hand, I would echo that killing the process is not > practical in some use cases. Specifically, allowing the application to > run in a less secure environment for a short period of time might be > much better than killing it and taking down the whole service. Of > course, there are other cases that security is more important, and > taking down the whole service is the better choice. > > I guess the ultimate solution is a way to enforce using user_namespace > in the kernel (if it ever makes sense...). But I don't know how that > gonna work. Before we have such solution, maybe we only need an > void hook for observability (or just a tracepoint, coming from BPF > background). > > Thanks, > Song
> On Aug 26, 2022, at 8:02 AM, Paul Moore <paul@paul-moore.com> wrote: > > On Thu, Aug 25, 2022 at 6:42 PM Song Liu <songliubraving@fb.com> wrote: >>> On Aug 25, 2022, at 3:10 PM, Paul Moore <paul@paul-moore.com> wrote: >>> On Thu, Aug 25, 2022 at 5:58 PM Song Liu <songliubraving@fb.com> wrote: > > ... > >>>> I am new to user_namespace and security work, so please pardon me if >>>> anything below is very wrong. >>>> >>>> IIUC, user_namespace is a tool that enables trusted userspace code to >>>> control the behavior of untrusted (or less trusted) userspace code. >>>> Failing create_user_ns() doesn't make the system more reliable. >>>> Specifically, we call create_user_ns() via two paths: fork/clone and >>>> unshare. For both paths, we need the userspace to use user_namespace, >>>> and to honor failed create_user_ns(). >>>> >>>> On the other hand, I would echo that killing the process is not >>>> practical in some use cases. Specifically, allowing the application to >>>> run in a less secure environment for a short period of time might be >>>> much better than killing it and taking down the whole service. Of >>>> course, there are other cases that security is more important, and >>>> taking down the whole service is the better choice. >>>> >>>> I guess the ultimate solution is a way to enforce using user_namespace >>>> in the kernel (if it ever makes sense...). >>> >>> The LSM framework, and the BPF and SELinux LSM implementations in this >>> patchset, provide a mechanism to do just that: kernel enforced access >>> controls using flexible security policies which can be tailored by the >>> distro, solution provider, or end user to meet the specific needs of >>> their use case. >> >> In this case, I wouldn't call the kernel is enforcing access control. >> (I might be wrong). There are 3 components here: kernel, LSM, and >> trusted userspace (whoever calls unshare). 
>
> The LSM layer, and the LSMs themselves are part of the kernel; look at
> the changes in this patchset to see the LSM, BPF LSM, and SELinux
> kernel changes. Explaining how the different LSMs work is quite a bit
> beyond the scope of this discussion, but there is plenty of
> information available online that should be able to serve as an
> introduction, not to mention the kernel source itself. However, in
> very broad terms you can think of the individual LSMs as somewhat
> analogous to filesystem drivers, e.g. ext4, and the LSM itself as the
> VFS layer.

Thanks for the explanation. This matches my understanding of LSM.

>
>> AFAICT, kernel simply passes
>> the decision made by LSM (BPF or SELinux) to the trusted userspace. It
>> is up to the trusted userspace to honor the return value of unshare().
>
> With a LSM enabled and enforcing a security policy on user namespace
> creation, which appears to be the case of most concern, the kernel
> would make a decision on the namespace creation based on various
> factors (e.g. for SELinux this would be the calling process' security
> domain and the domain's permission set as determined by the configured
> security policy) and if the operation was rejected an error code would
> be returned to userspace and the operation rejected. It is the exact
> same thing as what would happen if the calling process is chrooted or
> doesn't have a proper UID/GID mapping. Don't forget that the
> create_user_ns() function already enforces a security policy and
> returns errors to userspace; this patchset doesn't add anything new in
> that regard, it just allows for a richer and more flexible security
> policy to be built on top of the existing constraints.

I don't think I understand user namespaces well enough to agree or
disagree here. I guess I should read more.

Thanks,
Song

>
>> If the userspace simply ignores unshare failures, or does not call
>> unshare(CLONE_NEWUSER), kernel and LSM cannot do much about it, right?
> On Aug 26, 2022, at 8:24 AM, Serge E. Hallyn <serge@hallyn.com> wrote: > > On Thu, Aug 25, 2022 at 09:58:46PM +0000, Song Liu wrote: >> >> >>> On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote: >>> >>> On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >>>> Paul Moore <paul@paul-moore.com> writes: >>>>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: >>>>>> I am hoping we can come up with >>>>>> "something better" to address people's needs, make everyone happy, and >>>>>> bring forth world peace. Which would stack just fine with what's here >>>>>> for defense in depth. >>>>>> >>>>>> You may well not be interested in further work, and that's fine. I need >>>>>> to set aside a few days to think on this. >>>>> >>>>> I'm happy to continue the discussion as long as it's constructive; I >>>>> think we all are. My gut feeling is that Frederick's approach falls >>>>> closest to the sweet spot of "workable without being overly offensive" >>>>> (*cough*), but if you've got an additional approach in mind, or an >>>>> alternative approach that solves the same use case problems, I think >>>>> we'd all love to hear about it. >>>> >>>> I would love to actually hear the problems people are trying to solve so >>>> that we can have a sensible conversation about the trade offs. >>> >>> Here are several taken from the previous threads, it's surely not a >>> complete list, but it should give you a good idea: >>> >>> https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ >>> >>>> As best I can tell without more information people want to use >>>> the creation of a user namespace as a signal that the code is >>>> attempting an exploit. >>> >>> Some use cases are like that, there are several other use cases that >>> go beyond this; see all of our previous discussions on this >>> topic/patchset. 
As has been mentioned before, there are use cases >>> that require improved observability, access control, or both. >>> >>>> As such let me propose instead of returning an error code which will let >>>> the exploit continue, have the security hook return a bool. With true >>>> meaning the code can continue and on false it will trigger using SIGSYS >>>> to terminate the program like seccomp does. >>> >>> Having the kernel forcibly exit the process isn't something that most >>> LSMs would likely want. I suppose we could modify the hook/caller so >>> that *if* an LSM wanted to return SIGSYS the system would kill the >>> process, but I would want that to be something in addition to >>> returning an error code like LSMs normally do (e.g. EACCES). >> >> I am new to user_namespace and security work, so please pardon me if >> anything below is very wrong. >> >> IIUC, user_namespace is a tool that enables trusted userspace code to >> control the behavior of untrusted (or less trusted) userspace code. > > No. user namespaces are not a way for more trusted code to control the > behavior of less trusted code. Hmm.. In this case, I think I really need to learn more. Thanks for pointing out my misunderstanding. Song > >> Failing create_user_ns() doesn't make the system more reliable. >> Specifically, we call create_user_ns() via two paths: fork/clone and >> unshare. For both paths, we need the userspace to use user_namespace, >> and to honor failed create_user_ns(). >> >> On the other hand, I would echo that killing the process is not >> practical in some use cases. Specifically, allowing the application to >> run in a less secure environment for a short period of time might be >> much better than killing it and taking down the whole service. Of >> course, there are other cases that security is more important, and >> taking down the whole service is the better choice. 
On Fri, Aug 26, 2022 at 05:00:51PM +0000, Song Liu wrote: > > > > On Aug 26, 2022, at 8:24 AM, Serge E. Hallyn <serge@hallyn.com> wrote: > > > > On Thu, Aug 25, 2022 at 09:58:46PM +0000, Song Liu wrote: > >> > >> > >>> On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote: > >>> > >>> On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > >>>> Paul Moore <paul@paul-moore.com> writes: > >>>>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: > >>>>>> I am hoping we can come up with > >>>>>> "something better" to address people's needs, make everyone happy, and > >>>>>> bring forth world peace. Which would stack just fine with what's here > >>>>>> for defense in depth. > >>>>>> > >>>>>> You may well not be interested in further work, and that's fine. I need > >>>>>> to set aside a few days to think on this. > >>>>> > >>>>> I'm happy to continue the discussion as long as it's constructive; I > >>>>> think we all are. My gut feeling is that Frederick's approach falls > >>>>> closest to the sweet spot of "workable without being overly offensive" > >>>>> (*cough*), but if you've got an additional approach in mind, or an > >>>>> alternative approach that solves the same use case problems, I think > >>>>> we'd all love to hear about it. > >>>> > >>>> I would love to actually hear the problems people are trying to solve so > >>>> that we can have a sensible conversation about the trade offs. > >>> > >>> Here are several taken from the previous threads, it's surely not a > >>> complete list, but it should give you a good idea: > >>> > >>> https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ > >>> > >>>> As best I can tell without more information people want to use > >>>> the creation of a user namespace as a signal that the code is > >>>> attempting an exploit. 
> >>> > >>> Some use cases are like that, there are several other use cases that > >>> go beyond this; see all of our previous discussions on this > >>> topic/patchset. As has been mentioned before, there are use cases > >>> that require improved observability, access control, or both. > >>> > >>>> As such let me propose instead of returning an error code which will let > >>>> the exploit continue, have the security hook return a bool. With true > >>>> meaning the code can continue and on false it will trigger using SIGSYS > >>>> to terminate the program like seccomp does. > >>> > >>> Having the kernel forcibly exit the process isn't something that most > >>> LSMs would likely want. I suppose we could modify the hook/caller so > >>> that *if* an LSM wanted to return SIGSYS the system would kill the > >>> process, but I would want that to be something in addition to > >>> returning an error code like LSMs normally do (e.g. EACCES). > >> > >> I am new to user_namespace and security work, so please pardon me if > >> anything below is very wrong. > >> > >> IIUC, user_namespace is a tool that enables trusted userspace code to > >> control the behavior of untrusted (or less trusted) userspace code. > > > > No. user namespaces are not a way for more trusted code to control the > > behavior of less trusted code. > > Hmm.. In this case, I think I really need to learn more. > > Thanks for pointing out my misunderstanding. (I thought maybe Eric would chime in with a better explanation, but I'll fill it in for now :) One of the main goals of user namespaces is to allow unprivileged users to do things like chroot and mount, which are very useful development tools, without needing admin privileges. So it's almost the opposite of what you said: rather than to enable trusted userspace code to control the behavior of less trusted code, it's to allow less privileged code to do things which do not affect other users, without having to assume *more* privilege. 
To be precise, the goals were:

1. uid mapping - allow two users to both "use uid 500" without conflicting
2. provide (unprivileged) users privilege over their own resources
3. absolutely no extra privilege over other resources
4. be able to nest

While (3) was technically achieved, the problem we have is that
(2) provides unprivileged users the ability to exercise kernel code
which they previously could not.

-serge
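Goal (1) can be sketched with the three-field extents that
/proc/<pid>/uid_map exposes (first uid inside the namespace, first uid in
the parent namespace, length). The struct and function names below are
illustrative only, and the host uids 100500 and 200500 are made-up values;
returning -1 for an unmapped uid is a simplification chosen for this
sketch.

```c
#include <stdint.h>

/* One uid_map extent: "<inside> <outside> <count>". */
struct uid_extent {
	uint32_t inside;	/* first uid as seen inside the namespace */
	uint32_t outside;	/* first uid as seen from the parent namespace */
	uint32_t count;		/* number of consecutive uids covered */
};

/*
 * Translate a uid seen inside the namespace into the parent's view.
 * Returns -1 when the uid falls outside the extent (no mapping).
 */
static int64_t uid_to_parent(const struct uid_extent *m, uint32_t uid)
{
	if (uid < m->inside || uid - m->inside >= m->count)
		return -1;
	return (int64_t)m->outside + (uid - m->inside);
}
```

With one namespace mapping "500 100500 1" and another mapping
"500 200500 1", both users see themselves as uid 500 while the parent
namespace sees two distinct uids, so their resources do not collide.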
> On Aug 26, 2022, at 2:00 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > > On Fri, Aug 26, 2022 at 05:00:51PM +0000, Song Liu wrote: >> >> >>> On Aug 26, 2022, at 8:24 AM, Serge E. Hallyn <serge@hallyn.com> wrote: >>> >>> On Thu, Aug 25, 2022 at 09:58:46PM +0000, Song Liu wrote: >>>> >>>> >>>>> On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote: >>>>> >>>>> On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >>>>>> Paul Moore <paul@paul-moore.com> writes: >>>>>>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote: >>>>>>>> I am hoping we can come up with >>>>>>>> "something better" to address people's needs, make everyone happy, and >>>>>>>> bring forth world peace. Which would stack just fine with what's here >>>>>>>> for defense in depth. >>>>>>>> >>>>>>>> You may well not be interested in further work, and that's fine. I need >>>>>>>> to set aside a few days to think on this. >>>>>>> >>>>>>> I'm happy to continue the discussion as long as it's constructive; I >>>>>>> think we all are. My gut feeling is that Frederick's approach falls >>>>>>> closest to the sweet spot of "workable without being overly offensive" >>>>>>> (*cough*), but if you've got an additional approach in mind, or an >>>>>>> alternative approach that solves the same use case problems, I think >>>>>>> we'd all love to hear about it. >>>>>> >>>>>> I would love to actually hear the problems people are trying to solve so >>>>>> that we can have a sensible conversation about the trade offs. >>>>> >>>>> Here are several taken from the previous threads, it's surely not a >>>>> complete list, but it should give you a good idea: >>>>> >>>>> https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/ >>>>> >>>>>> As best I can tell without more information people want to use >>>>>> the creation of a user namespace as a signal that the code is >>>>>> attempting an exploit. 
>>>>> >>>>> Some use cases are like that, there are several other use cases that >>>>> go beyond this; see all of our previous discussions on this >>>>> topic/patchset. As has been mentioned before, there are use cases >>>>> that require improved observability, access control, or both. >>>>> >>>>>> As such let me propose instead of returning an error code which will let >>>>>> the exploit continue, have the security hook return a bool. With true >>>>>> meaning the code can continue and on false it will trigger using SIGSYS >>>>>> to terminate the program like seccomp does. >>>>> >>>>> Having the kernel forcibly exit the process isn't something that most >>>>> LSMs would likely want. I suppose we could modify the hook/caller so >>>>> that *if* an LSM wanted to return SIGSYS the system would kill the >>>>> process, but I would want that to be something in addition to >>>>> returning an error code like LSMs normally do (e.g. EACCES). >>>> >>>> I am new to user_namespace and security work, so please pardon me if >>>> anything below is very wrong. >>>> >>>> IIUC, user_namespace is a tool that enables trusted userspace code to >>>> control the behavior of untrusted (or less trusted) userspace code. >>> >>> No. user namespaces are not a way for more trusted code to control the >>> behavior of less trusted code. >> >> Hmm.. In this case, I think I really need to learn more. >> >> Thanks for pointing out my misunderstanding. > > (I thought maybe Eric would chime in with a better explanation, but I'll > fill it in for now :) > > One of the main goals of user namespaces is to allow unprivileged users > to do things like chroot and mount, which are very useful development > tools, without needing admin privileges. 
> So it's almost the opposite
> of what you said: rather than to enable trusted userspace code to control
> the behavior of less trusted code, it's to allow less privileged code to
> do things which do not affect other users, without having to assume *more*
> privilege.

Thanks for the explanation!

>
> To be precise, the goals were:
>
> 1. uid mapping - allow two users to both "use uid 500" without conflicting
> 2. provide (unprivileged) users privilege over their own resources
> 3. absolutely no extra privilege over other resources
> 4. be able to nest

Now I have a better idea about the "what". But I am not quite sure about
how to do it. I will do more homework, and probably come back with more
questions. :)

>
> While (3) was technically achieved, the problem we have is that
> (2) provides unprivileged users the ability to exercise kernel code
> which they previously could not.

Do you mean this one?

"""
I think the problem is that it seems
you can pretty reliably get a root shell at some point in the future
by creating a user namespace, leaving it open for a bit, and waiting
for a new announcement of the latest netfilter or whatever exploit
that requires root in a user namespace. Then go back to your userns
shell and run the exploit.
"""

Please don't share how to do it yet. I want to use it as a test for my
study. :)

Thanks again!
Song
On Fri, Aug 26, 2022 at 04:00:39PM -0500, Serge Hallyn wrote:
> On Fri, Aug 26, 2022 at 05:00:51PM +0000, Song Liu wrote:
> >
> >
> > > On Aug 26, 2022, at 8:24 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > >
> > > On Thu, Aug 25, 2022 at 09:58:46PM +0000, Song Liu wrote:
> > >>
> > >>
> > >>> On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote:
> > >>>
> > >>> On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > >>>> Paul Moore <paul@paul-moore.com> writes:
> > >>>>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote:
> > >>>>>> I am hoping we can come up with
> > >>>>>> "something better" to address people's needs, make everyone happy, and
> > >>>>>> bring forth world peace. Which would stack just fine with what's here
> > >>>>>> for defense in depth.
> > >>>>>>
> > >>>>>> You may well not be interested in further work, and that's fine. I need
> > >>>>>> to set aside a few days to think on this.
> > >>>>>
> > >>>>> I'm happy to continue the discussion as long as it's constructive; I
> > >>>>> think we all are. My gut feeling is that Frederick's approach falls
> > >>>>> closest to the sweet spot of "workable without being overly offensive"
> > >>>>> (*cough*), but if you've got an additional approach in mind, or an
> > >>>>> alternative approach that solves the same use case problems, I think
> > >>>>> we'd all love to hear about it.
> > >>>>
> > >>>> I would love to actually hear the problems people are trying to solve so
> > >>>> that we can have a sensible conversation about the trade offs.
> > >>>
> > >>> Here are several taken from the previous threads, it's surely not a
> > >>> complete list, but it should give you a good idea:
> > >>>
> > >>> https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/
> > >>>
> > >>>> As best I can tell without more information people want to use
> > >>>> the creation of a user namespace as a signal that the code is
> > >>>> attempting an exploit.
> > >>>
> > >>> Some use cases are like that, there are several other use cases that
> > >>> go beyond this; see all of our previous discussions on this
> > >>> topic/patchset. As has been mentioned before, there are use cases
> > >>> that require improved observability, access control, or both.
> > >>>
> > >>>> As such let me propose instead of returning an error code which will let
> > >>>> the exploit continue, have the security hook return a bool. With true
> > >>>> meaning the code can continue and on false it will trigger using SIGSYS
> > >>>> to terminate the program like seccomp does.
> > >>>
> > >>> Having the kernel forcibly exit the process isn't something that most
> > >>> LSMs would likely want. I suppose we could modify the hook/caller so
> > >>> that *if* an LSM wanted to return SIGSYS the system would kill the
> > >>> process, but I would want that to be something in addition to
> > >>> returning an error code like LSMs normally do (e.g. EACCES).
> > >>
> > >> I am new to user_namespace and security work, so please pardon me if
> > >> anything below is very wrong.
> > >>
> > >> IIUC, user_namespace is a tool that enables trusted userspace code to
> > >> control the behavior of untrusted (or less trusted) userspace code.
> > >
> > > No. user namespaces are not a way for more trusted code to control the
> > > behavior of less trusted code.
> >
> > Hmm.. In this case, I think I really need to learn more.
> >
> > Thanks for pointing out my misunderstanding.
>
> (I thought maybe Eric would chime in with a better explanation, but I'll
> fill it in for now :)
>
> One of the main goals of user namespaces is to allow unprivileged users
> to do things like chroot and mount, which are very useful development
> tools, without needing admin privileges. So it's almost the opposite
> of what you said: rather than to enable trusted userspace code to control
> the behavior of less trusted code, it's to allow less privileged code to
> do things which do not affect other users, without having to assume *more*
> privilege.
>
> To be precise, the goals were:
>
> 1. uid mapping - allow two users to both "use uid 500" without conflicting
> 2. provide (unprivileged) users privilege over their own resources
> 3. absolutely no extra privilege over other resources
> 4. be able to nest
>
> While (3) was technically achieved, the problem we have is that
> (2) provides unprivileged users the ability to exercise kernel code
> which they previously could not.

The consequence of the refusal to give users any way to control whether
or not user namespaces are available to unprivileged users is that a
not insignificant number of distros have carried the same patch for about
10 years now that adds an unprivileged_userns_clone sysctl to restrict
them to privileged users. That includes current Debian and Archlinux btw.

The LSM hook is a simple way to allow administrators to control this and
will allow user namespaces to be enabled in scenarios where they
would otherwise not be accepted precisely because they are available to
unprivileged users.

I fully understand the motivation and usefulness in unprivileged
scenarios but it's an unfounded fear that giving users the ability to
control user namespace creation via an LSM hook will cause proliferation
of setuid binaries (Ignoring for a moment that any fully unprivileged
container with useful idmappings has to rely on the new{g,u}idmap setuid
binaries to set up useful mappings anyway.) or decrease system safety let
alone cause regressions (Which I don't think is an applicable term here
at all.). Distros that have unprivileged user namespaces turned on by
default are extremely unlikely to switch to an LSM profile that turns
them off and distros that already turn them off will continue to turn
them off whether or not that LSM hook is available.

It's much more likely that workloads that want to minimize their attack
surface while still getting the benefits of user namespaces for e.g.
service isolation will feel comfortable enabling them for the first time
since they can control them via an LSM profile.
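Whichever mechanism restricts creation (a distro sysctl or an LSM hook that returns an error code, as discussed earlier in the thread), an application would presumably see it as a failed namespace-creating call. The following is a minimal userspace sketch of that view, assuming Linux and the `unshare` system call; `probe_userns` is a hypothetical helper name, and the errno an administrator's policy yields depends on the mechanism in use.

```c
#include <assert.h>
#include <errno.h>
#include <linux/sched.h>   /* CLONE_NEWUSER */
#include <sys/syscall.h>   /* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Try to create a user namespace in a throwaway child, so the caller's own
 * namespaces are left untouched.
 * Returns 0 if creation succeeded, a positive errno value if it was refused,
 * or -1 if the probe itself could not run (fork failed, child was killed).
 */
int probe_userns(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the exit status carries 0 or the errno from unshare. */
        long rc = syscall(SYS_unshare, CLONE_NEWUSER);
        _exit(rc == 0 ? 0 : errno);
    }

    int status;
    pid_t done = waitpid(pid, &status, 0);
    if (done < 0)
        return -1;
    assert(done == pid);   /* waiting on a specific pid returns that pid */

    if (!WIFEXITED(status))
        return -1;         /* e.g. the child was terminated by a signal */
    return WEXITSTATUS(status);
}
```

A caller can treat 0 as "available", a positive value as "refused with that errno", and -1 as "could not tell", which also covers the case where the child is killed outright rather than handed an error.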
On Mon, Aug 29, 2022 at 05:33:04PM +0200, Christian Brauner wrote:
> On Fri, Aug 26, 2022 at 04:00:39PM -0500, Serge Hallyn wrote:
> > On Fri, Aug 26, 2022 at 05:00:51PM +0000, Song Liu wrote:
> > >
> > >
> > > > On Aug 26, 2022, at 8:24 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > > >
> > > > On Thu, Aug 25, 2022 at 09:58:46PM +0000, Song Liu wrote:
> > > >>
> > > >>
> > > >>> On Aug 25, 2022, at 12:19 PM, Paul Moore <paul@paul-moore.com> wrote:
> > > >>>
> > > >>> On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > >>>> Paul Moore <paul@paul-moore.com> writes:
> > > >>>>> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn <serge@hallyn.com> wrote:
> > > >>>>>> I am hoping we can come up with
> > > >>>>>> "something better" to address people's needs, make everyone happy, and
> > > >>>>>> bring forth world peace. Which would stack just fine with what's here
> > > >>>>>> for defense in depth.
> > > >>>>>>
> > > >>>>>> You may well not be interested in further work, and that's fine. I need
> > > >>>>>> to set aside a few days to think on this.
> > > >>>>>
> > > >>>>> I'm happy to continue the discussion as long as it's constructive; I
> > > >>>>> think we all are. My gut feeling is that Frederick's approach falls
> > > >>>>> closest to the sweet spot of "workable without being overly offensive"
> > > >>>>> (*cough*), but if you've got an additional approach in mind, or an
> > > >>>>> alternative approach that solves the same use case problems, I think
> > > >>>>> we'd all love to hear about it.
> > > >>>>
> > > >>>> I would love to actually hear the problems people are trying to solve so
> > > >>>> that we can have a sensible conversation about the trade offs.
> > > >>>
> > > >>> Here are several taken from the previous threads, it's surely not a
> > > >>> complete list, but it should give you a good idea:
> > > >>>
> > > >>> https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOFkTKPjjztsOa9Yrq+AeAQA@mail.gmail.com/
> > > >>>
> > > >>>> As best I can tell without more information people want to use
> > > >>>> the creation of a user namespace as a signal that the code is
> > > >>>> attempting an exploit.
> > > >>>
> > > >>> Some use cases are like that, there are several other use cases that
> > > >>> go beyond this; see all of our previous discussions on this
> > > >>> topic/patchset. As has been mentioned before, there are use cases
> > > >>> that require improved observability, access control, or both.
> > > >>>
> > > >>>> As such let me propose instead of returning an error code which will let
> > > >>>> the exploit continue, have the security hook return a bool. With true
> > > >>>> meaning the code can continue and on false it will trigger using SIGSYS
> > > >>>> to terminate the program like seccomp does.
> > > >>>
> > > >>> Having the kernel forcibly exit the process isn't something that most
> > > >>> LSMs would likely want. I suppose we could modify the hook/caller so
> > > >>> that *if* an LSM wanted to return SIGSYS the system would kill the
> > > >>> process, but I would want that to be something in addition to
> > > >>> returning an error code like LSMs normally do (e.g. EACCES).
> > > >>
> > > >> I am new to user_namespace and security work, so please pardon me if
> > > >> anything below is very wrong.
> > > >>
> > > >> IIUC, user_namespace is a tool that enables trusted userspace code to
> > > >> control the behavior of untrusted (or less trusted) userspace code.
> > > >
> > > > No. user namespaces are not a way for more trusted code to control the
> > > > behavior of less trusted code.
> > >
> > > Hmm.. In this case, I think I really need to learn more.
> > >
> > > Thanks for pointing out my misunderstanding.
> >
> > (I thought maybe Eric would chime in with a better explanation, but I'll
> > fill it in for now :)
> >
> > One of the main goals of user namespaces is to allow unprivileged users
> > to do things like chroot and mount, which are very useful development
> > tools, without needing admin privileges. So it's almost the opposite
> > of what you said: rather than to enable trusted userspace code to control
> > the behavior of less trusted code, it's to allow less privileged code to
> > do things which do not affect other users, without having to assume *more*
> > privilege.
> >
> > To be precise, the goals were:
> >
> > 1. uid mapping - allow two users to both "use uid 500" without conflicting
> > 2. provide (unprivileged) users privilege over their own resources
> > 3. absolutely no extra privilege over other resources
> > 4. be able to nest
> >
> > While (3) was technically achieved, the problem we have is that
> > (2) provides unprivileged users the ability to exercise kernel code
> > which they previously could not.
>
> The consequence of the refusal to give users any way to control whether
> or not user namespaces are available to unprivileged users is that a
> not insignificant number of distros have carried the same patch for about
> 10 years now that adds an unprivileged_userns_clone sysctl to restrict
> them to privileged users. That includes current Debian and Archlinux btw.

Hi Christian,

I'm wondering about your placement of this argument in the thread, and
whether you interpreted what I said above as an argument against this
patchset, or whether you're just expanding on what I said.

> The LSM hook is a simple way to allow administrators to control this and

(I think the "control" here is suboptimal, but I've not seen - nor conceived
of - anything better as of yet)

> will allow user namespaces to be enabled in scenarios where they
> would otherwise not be accepted precisely because they are available to
> unprivileged users.
>
> I fully understand the motivation and usefulness in unprivileged
> scenarios but it's an unfounded fear that giving users the ability to
> control user namespace creation via an LSM hook will cause proliferation
> of setuid binaries (Ignoring for a moment that any fully unprivileged
> container with useful idmappings has to rely on the new{g,u}idmap setuid
> binaries to set up useful mappings anyway.) or decrease system safety let
> alone cause regressions (Which I don't think is an applicable term here
> at all.). Distros that have unprivileged user namespaces turned on by
> default are extremely unlikely to switch to an LSM profile that turns
> them off and distros that already turn them off will continue to turn
> them off whether or not that LSM hook is available.
>
> It's much more likely that workloads that want to minimize their attack
> surface while still getting the benefits of user namespaces for e.g.
> service isolation will feel comfortable enabling them for the first time
> since they can control them via an LSM profile.