Message ID | 20171103004436.40026-1-mahesh@bandewar.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Quoting Mahesh Bandewar (mahesh@bandewar.net): > Init-user-ns is always uncontrolled and a process that has SYS_ADMIN > that belongs to uncontrolled user-ns can create another (child) user- > namespace that is uncontrolled. Any other process (that either does > not have SYS_ADMIN or belongs to a controlled user-ns) can only > create a user-ns that is controlled. That's a huge change though. It means that any system that previously used unprivileged containers will need new privileged code (which always risks more privilege leaks through the new code) to re-enable what was possible without privilege before. That's a regression. I'm very much interested in what you want to do, But it seems like it would be worth starting with some automated code analysis that shows exactly what code becomes accessible to unprivileged users with user namespaces which was accessible to unprivileged users before. Then we can reason about classifying that code and perhaps limiting access to some of it.
On Sat, Nov 4, 2017 at 4:53 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > > Quoting Mahesh Bandewar (mahesh@bandewar.net): > > Init-user-ns is always uncontrolled and a process that has SYS_ADMIN > > that belongs to uncontrolled user-ns can create another (child) user- > > namespace that is uncontrolled. Any other process (that either does > > not have SYS_ADMIN or belongs to a controlled user-ns) can only > > create a user-ns that is controlled. > > That's a huge change though. It means that any system that previously > used unprivileged containers will need new privileged code (which always > risks more privilege leaks through the new code) to re-enable what was > possible without privilege before. That's a regression. > I wouldn't call it a regression since the existing behavior is preserved as it is if the default-mask is not altered. i.e. uncontrolled process can create user-ns and have full control inside that user-ns. The only difference is - as an example if 'something' comes up which makes a specific capability expose ring-0, so admin can quickly remove the capability in question from the mask, so that no untrusted code can exploit that capability until either the kernel is patched or workloads are sanitized keeping in mind what was discovered. (I have given some real life example vulnerabilities published recently about CAP_NET_RAW in the cover letter) > I'm very much interested in what you want to do, But it seems like > it would be worth starting with some automated code analysis that shows > exactly what code becomes accessible to unprivileged users with user > namespaces which was accessible to unprivileged users before. Then we > can reason about classifying that code and perhaps limiting access to > some of it. I would like to look at this as 'a tool' that is available to admins who can quickly take possible-compromise-situation under-control probably at the cost of some functionality-loss until kernel is patched and the mask is restored to default value. I'm not sure if automated tools could discover anything since these changes should not alter behavior in any way.
Quoting Mahesh Bandewar (महेश बंडेवार) (maheshb@google.com): > On Sat, Nov 4, 2017 at 4:53 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > > > > Quoting Mahesh Bandewar (mahesh@bandewar.net): > > > Init-user-ns is always uncontrolled and a process that has SYS_ADMIN > > > that belongs to uncontrolled user-ns can create another (child) user- > > > namespace that is uncontrolled. Any other process (that either does > > > not have SYS_ADMIN or belongs to a controlled user-ns) can only > > > create a user-ns that is controlled. > > > > That's a huge change though. It means that any system that previously > > used unprivileged containers will need new privileged code (which always > > risks more privilege leaks through the new code) to re-enable what was > > possible without privilege before. That's a regression. > > > I wouldn't call it a regression since the existing behavior is > preserved as it is if the default-mask is not altered. i.e. > uncontrolled process can create user-ns and have full control inside > that user-ns. The only difference is - as an example if 'something' > comes up which makes a specific capability expose ring-0, so admin can > quickly remove the capability in question from the mask, so that no > untrusted code can exploit that capability until either the kernel is Oh, sorry, I misread then, and missed that step. I thought the default with this patchset was that there were no capabilities exposed to user namespaces. > patched or workloads are sanitized keeping in mind what was > discovered. (I have given some real life example vulnerabilities > published recently about CAP_NET_RAW in the cover letter) > > > I'm very much interested in what you want to do, But it seems like > > it would be worth starting with some automated code analysis that shows > > exactly what code becomes accessible to unprivileged users with user > > namespaces which was accessible to unprivileged users before. Then we > > can reason about classifying that code and perhaps limiting access to > > some of it. > I would like to look at this as 'a tool' that is available to admins > who can quickly take possible-compromise-situation under-control > probably at the cost of some functionality-loss until kernel is > patched and the mask is restored to default value. The thing that makes me hesitate with this set is that it is a permanent new feature to address what (I hope) is a temporary problem. What would you think about doing this as a stackable (yama-style) LSM? > I'm not sure if automated tools could discover anything since these > changes should not alter behavior in any way. Seems like there are two naive ways to do it, the first being to just look at all code under ns_capable() plus code called from there. It seems like looking at the result of that could be fruitful. -serge
Substantial added attack surface will never go away as a problem. There aren't a finite number of vulnerabilities to be found.
Quoting Daniel Micay (danielmicay@gmail.com): > Substantial added attack surface will never go away as a problem. There > aren't a finite number of vulnerabilities to be found. There's varying levels of usefulness and quality. There is code which I want to be able to use in a container, and code which I can't ever see a reason for using there. The latter, especially if it's also in a staging driver, would be nice to have a toggle to disable. You're not advocating dropping the added attack surface, only adding a way of dealing with an 0day after the fact. Privilege raising 0days can exist anywhere, not just in code which only root in a user namespace can exercise. So from that point of view, ksplice seems a more complete solution. Why not just actually fix the bad code block when we know about it? Finally, it has been well argued that you can gain many new caps from having only a few others. Given that, how could you ever be sure that, if an 0day is found which allows root in a user ns to abuse CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them would suffice? It seems to me that the existing control in /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape in that case. -serge
On Mon, Nov 06, 2017 at 04:14:18PM -0600, Serge Hallyn wrote: > Quoting Daniel Micay (danielmicay@gmail.com): > > Substantial added attack surface will never go away as a problem. There > > aren't a finite number of vulnerabilities to be found. > > There's varying levels of usefulness and quality. There is code which I > want to be able to use in a container, and code which I can't ever see a > reason for using there. The latter, especially if it's also in a > staging driver, would be nice to have a toggle to disable. > > You're not advocating dropping the added attack surface, only adding a > way of dealing with an 0day after the fact. Privilege raising 0days can > exist anywhere, not just in code which only root in a user namespace can > exercise. So from that point of view, ksplice seems a more complete > solution. Why not just actually fix the bad code block when we know > about it? > > Finally, it has been well argued that you can gain many new caps from > having only a few others. Given that, how could you ever be sure that, > if an 0day is found which allows root in a user ns to abuse > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them > would suffice? It seems to me that the existing control in > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape > in that case. I agree that /proc/sys/kernel/unprivileged_userns_clone is the most reasonable thing to do. This patch introduces a layer of complexity to fine-tune user namespace creation that - in the relevant security critical scenario - should simply be turned of entirely. Is /proc/sys/kernel/unprivileged_userns_clone upstreamed or is this still only carried downstream?
On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > Quoting Daniel Micay (danielmicay@gmail.com): >> Substantial added attack surface will never go away as a problem. There >> aren't a finite number of vulnerabilities to be found. > > There's varying levels of usefulness and quality. There is code which I > want to be able to use in a container, and code which I can't ever see a > reason for using there. The latter, especially if it's also in a > staging driver, would be nice to have a toggle to disable. > > You're not advocating dropping the added attack surface, only adding a > way of dealing with an 0day after the fact. Privilege raising 0days can > exist anywhere, not just in code which only root in a user namespace can > exercise. So from that point of view, ksplice seems a more complete > solution. Why not just actually fix the bad code block when we know > about it? > > Finally, it has been well argued that you can gain many new caps from > having only a few others. Given that, how could you ever be sure that, > if an 0day is found which allows root in a user ns to abuse > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them > would suffice? It seems to me that the existing control in > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape > in that case. > > -serge This seems to be heading toward "we need full zones in Linux" with their own procfs and sysfs namespace and a stricter isolation model for resources and capabilities. So long as things can happen in a namespace which have a privileged relationship with host resources, this is going to be cat-and-mouse to one degree or another. Containers and namespaces dont have a one-to-one relationship, so i'm not sure that's the best term to use in the kernel security context since there's a bunch of userspace and implementation delta across the different systems (with their own security models and so forth). Without accounting for what a specific implementation may or may not do, and only looking at "how do we reduce privileged impact on parent context from unprivileged namespaces," this patch does seem to provide a logical way of reducing the privileges available in such a namespace and often needed to mount escapes/impact parent context. -Boris
Quoting Boris Lukashev (blukashev@sempervictus.com): > On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > > Quoting Daniel Micay (danielmicay@gmail.com): > >> Substantial added attack surface will never go away as a problem. There > >> aren't a finite number of vulnerabilities to be found. > > > > There's varying levels of usefulness and quality. There is code which I > > want to be able to use in a container, and code which I can't ever see a > > reason for using there. The latter, especially if it's also in a > > staging driver, would be nice to have a toggle to disable. > > > > You're not advocating dropping the added attack surface, only adding a > > way of dealing with an 0day after the fact. Privilege raising 0days can > > exist anywhere, not just in code which only root in a user namespace can > > exercise. So from that point of view, ksplice seems a more complete > > solution. Why not just actually fix the bad code block when we know > > about it? > > > > Finally, it has been well argued that you can gain many new caps from > > having only a few others. Given that, how could you ever be sure that, > > if an 0day is found which allows root in a user ns to abuse > > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them > > would suffice? It seems to me that the existing control in > > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape > > in that case. > > > > -serge > > This seems to be heading toward "we need full zones in Linux" with > their own procfs and sysfs namespace and a stricter isolation model > for resources and capabilities. So long as things can happen in a > namespace which have a privileged relationship with host resources, > this is going to be cat-and-mouse to one degree or another. > > Containers and namespaces dont have a one-to-one relationship, so i'm > not sure that's the best term to use in the kernel security context Sorry - what's not the best term to use? > since there's a bunch of userspace and implementation delta across the > different systems (with their own security models and so forth). > Without accounting for what a specific implementation may or may not > do, and only looking at "how do we reduce privileged impact on parent > context from unprivileged namespaces," this patch does seem to provide > a logical way of reducing the privileges available in such a namespace > and often needed to mount escapes/impact parent context. What different implementations do is irrelevant - as an unprivileged user I can always, with no help, create a new user namespace mapping my current uid to root, and exercise this code. So the security model implemented by a particular userspace namespace-using driver doesn't matter, as it only restricts me if I choose to use it. But, I guess you're actually saying that some program might know that it should never use network code so want to drop CAP_NET_*? And you're saying that a "global capability bounding set" might be useful? Would it be better to actually implement it as a new bounding set that is maintained across user namespace creations, but is per-task (inherted by children of course)? Instead of a sysctl? -serge
On Mon, Nov 6, 2017 at 6:39 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > Quoting Boris Lukashev (blukashev@sempervictus.com): >> On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@hallyn.com> wrote: >> > Quoting Daniel Micay (danielmicay@gmail.com): >> >> Substantial added attack surface will never go away as a problem. There >> >> aren't a finite number of vulnerabilities to be found. >> > >> > There's varying levels of usefulness and quality. There is code which I >> > want to be able to use in a container, and code which I can't ever see a >> > reason for using there. The latter, especially if it's also in a >> > staging driver, would be nice to have a toggle to disable. >> > >> > You're not advocating dropping the added attack surface, only adding a >> > way of dealing with an 0day after the fact. Privilege raising 0days can >> > exist anywhere, not just in code which only root in a user namespace can >> > exercise. So from that point of view, ksplice seems a more complete >> > solution. Why not just actually fix the bad code block when we know >> > about it? >> > >> > Finally, it has been well argued that you can gain many new caps from >> > having only a few others. Given that, how could you ever be sure that, >> > if an 0day is found which allows root in a user ns to abuse >> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them >> > would suffice? It seems to me that the existing control in >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape >> > in that case. >> > >> > -serge >> >> This seems to be heading toward "we need full zones in Linux" with >> their own procfs and sysfs namespace and a stricter isolation model >> for resources and capabilities. So long as things can happen in a >> namespace which have a privileged relationship with host resources, >> this is going to be cat-and-mouse to one degree or another. >> >> Containers and namespaces dont have a one-to-one relationship, so i'm >> not sure that's the best term to use in the kernel security context > > Sorry - what's not the best term to use? Pardon, "containers," since they're namespaces+system construct. > >> since there's a bunch of userspace and implementation delta across the >> different systems (with their own security models and so forth). >> Without accounting for what a specific implementation may or may not >> do, and only looking at "how do we reduce privileged impact on parent >> context from unprivileged namespaces," this patch does seem to provide >> a logical way of reducing the privileges available in such a namespace >> and often needed to mount escapes/impact parent context. > > What different implementations do is irrelevant - as an unprivileged user > I can always, with no help, create a new user namespace mapping my current > uid to root, and exercise this code. So the security model implemented > by a particular userspace namespace-using driver doesn't matter, as it > only restricts me if I choose to use it. > > But, I guess you're actually saying that some program might know that it > should never use network code so want to drop CAP_NET_*? And you're > saying that a "global capability bounding set" might be useful? > The "global capability bounding set" with forced inheritance can be used to prevent the vector you describe wherein the capability of UID 0 in the child NS is restricted from the parent implicitly, so yes, that nomenclature seems appropriate. > Would it be better to actually implement it as a new bounding set that > is maintained across user namespace creations, but is per-task (inherted > by children of course)? Instead of a sysctl? > > -serge In line with the previous comment, the inheritance across subsequent invocations should be forced to prevent the context you described. Please pardon my ignorance, not sure what you mean in terms of "per-task" across namespace creation. -Boris
On Mon, 2017-11-06 at 16:14 -0600, Serge E. Hallyn wrote: > Quoting Daniel Micay (danielmicay@gmail.com): > > Substantial added attack surface will never go away as a problem. > > There > > aren't a finite number of vulnerabilities to be found. > > There's varying levels of usefulness and quality. There is code which > I > want to be able to use in a container, and code which I can't ever see > a > reason for using there. The latter, especially if it's also in a > staging driver, would be nice to have a toggle to disable. > > You're not advocating dropping the added attack surface, only adding a > way of dealing with an 0day after the fact. Privilege raising 0days > can > exist anywhere, not just in code which only root in a user namespace > can > exercise. So from that point of view, ksplice seems a more complete > solution. Why not just actually fix the bad code block when we know > about it? That's not what I'm advocating. I only care about it for proactive attack surface reduction downstream. I have no interest in using it to block access to known vulnerabilities. > Finally, it has been well argued that you can gain many new caps from > having only a few others. Given that, how could you ever be sure > that, > if an 0day is found which allows root in a user ns to abuse > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them > would suffice? I didn't suggest using it that way... > It seems to me that the existing control in > /proc/sys/kernel/unprivileged_userns_clone might be the better duct > tape > in that case. There's no such thing as unprivileged_userns_clone in mainline. The advantage of this over unprivileged_userns_clone in Debian and maybe some other distributions is not giving up unprivileged app containers / sandboxes implemented via user namespaces. For example, Chromium's user namespace sandbox likely only needs to have CAP_SYS_CHROOT. Chromium will be dropping their setuid sandbox, forcing usage of user namespaces to avoid losing the sandbox which will greatly increase local kernel attack surface on the host by exposing netfilter management, etc. to unprivileged users. The proposed approach isn't necessarily the best way to implement this kind of mitigation but I think it's filling a real need.
On Mon, Nov 06, 2017 at 09:16:03PM -0500, Daniel Micay wrote: > On Mon, 2017-11-06 at 16:14 -0600, Serge E. Hallyn wrote: > > Quoting Daniel Micay (danielmicay@gmail.com): > > > Substantial added attack surface will never go away as a problem. > > > There > > > aren't a finite number of vulnerabilities to be found. > > > > There's varying levels of usefulness and quality. There is code which > > I > > want to be able to use in a container, and code which I can't ever see > > a > > reason for using there. The latter, especially if it's also in a > > staging driver, would be nice to have a toggle to disable. > > > > You're not advocating dropping the added attack surface, only adding a > > way of dealing with an 0day after the fact. Privilege raising 0days > > can > > exist anywhere, not just in code which only root in a user namespace > > can > > exercise. So from that point of view, ksplice seems a more complete > > solution. Why not just actually fix the bad code block when we know > > about it? > > That's not what I'm advocating. I only care about it for proactive > attack surface reduction downstream. I have no interest in using it to > block access to known vulnerabilities. > > > Finally, it has been well argued that you can gain many new caps from > > having only a few others. Given that, how could you ever be sure > > that, > > if an 0day is found which allows root in a user ns to abuse > > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them > > would suffice? > > I didn't suggest using it that way... > > > It seems to me that the existing control in > > /proc/sys/kernel/unprivileged_userns_clone might be the better duct > > tape > > in that case. > > There's no such thing as unprivileged_userns_clone in mainline. Hm. I was sure Kees had gotten that in... I guess I was wrong. > The advantage of this over unprivileged_userns_clone in Debian and maybe > some other distributions is not giving up unprivileged app containers / > sandboxes implemented via user namespaces. For example, Chromium's user > namespace sandbox likely only needs to have CAP_SYS_CHROOT. Chromium > will be dropping their setuid sandbox, forcing usage of user namespaces > to avoid losing the sandbox which will greatly increase local kernel > attack surface on the host by exposing netfilter management, etc. to > unprivileged users. > > The proposed approach isn't necessarily the best way to implement this > kind of mitigation but I think it's filling a real need. I think I definately prefer what I mentioned in the email to Boris. Basically a "permanent capability bounding set". The normal bounding set gets reset to a full set on every new user_ns creation. In this proposal, it would instead be set to the calling task's permanent capability set, which starts (at boot) full, and which privileged tasks can pull capabilities out of. -serge
On Mon, Nov 06, 2017 at 07:01:58PM -0500, Boris Lukashev wrote: > On Mon, Nov 6, 2017 at 6:39 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > > Quoting Boris Lukashev (blukashev@sempervictus.com): > >> On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > >> > Quoting Daniel Micay (danielmicay@gmail.com): > >> >> Substantial added attack surface will never go away as a problem. There > >> >> aren't a finite number of vulnerabilities to be found. > >> > > >> > There's varying levels of usefulness and quality. There is code which I > >> > want to be able to use in a container, and code which I can't ever see a > >> > reason for using there. The latter, especially if it's also in a > >> > staging driver, would be nice to have a toggle to disable. > >> > > >> > You're not advocating dropping the added attack surface, only adding a > >> > way of dealing with an 0day after the fact. Privilege raising 0days can > >> > exist anywhere, not just in code which only root in a user namespace can > >> > exercise. So from that point of view, ksplice seems a more complete > >> > solution. Why not just actually fix the bad code block when we know > >> > about it? > >> > > >> > Finally, it has been well argued that you can gain many new caps from > >> > having only a few others. Given that, how could you ever be sure that, > >> > if an 0day is found which allows root in a user ns to abuse > >> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them > >> > would suffice? It seems to me that the existing control in > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape > >> > in that case. > >> > > >> > -serge > >> > >> This seems to be heading toward "we need full zones in Linux" with > >> their own procfs and sysfs namespace and a stricter isolation model > >> for resources and capabilities. So long as things can happen in a > >> namespace which have a privileged relationship with host resources, > >> this is going to be cat-and-mouse to one degree or another. > >> > >> Containers and namespaces dont have a one-to-one relationship, so i'm > >> not sure that's the best term to use in the kernel security context > > > > Sorry - what's not the best term to use? > > Pardon, "containers," since they're namespaces+system construct. > > > > >> since there's a bunch of userspace and implementation delta across the > >> different systems (with their own security models and so forth). > >> Without accounting for what a specific implementation may or may not > >> do, and only looking at "how do we reduce privileged impact on parent > >> context from unprivileged namespaces," this patch does seem to provide > >> a logical way of reducing the privileges available in such a namespace > >> and often needed to mount escapes/impact parent context. > > > > What different implementations do is irrelevant - as an unprivileged user > > I can always, with no help, create a new user namespace mapping my current > > uid to root, and exercise this code. So the security model implemented > > by a particular userspace namespace-using driver doesn't matter, as it > > only restricts me if I choose to use it. > > > > But, I guess you're actually saying that some program might know that it > > should never use network code so want to drop CAP_NET_*? And you're > > saying that a "global capability bounding set" might be useful? > > > > The "global capability bounding set" with forced inheritance can be > used to prevent the vector you describe wherein the capability of UID > 0 in the child NS is restricted from the parent implicitly, so yes, > that nomenclature seems appropriate. > > > Would it be better to actually implement it as a new bounding set that > > is maintained across user namespace creations, but is per-task (inherted > > by children of course)? Instead of a sysctl? > > > > -serge > > In line with the previous comment, the inheritance across subsequent > invocations should be forced to prevent the context you described. > Please pardon my ignorance, not sure what you mean in terms of > "per-task" across namespace creation. I meant each task has a perm_cap_bset next to the cap_bset. So task p1 (if it has privilege) can drop CAP_SYS_ADMIN from perm_cap_bset, p2 (if it has privilege) can drop CAP_NET_ADMIN. When p1 creates a new user_ns, that init task has its cap_bset set to all caps but CAP_SYS_ADMIN. I think for simplicity perm_cap_bset would *only* affect the filling of cap_bset at user namespace creation. So if you wanted to drop a capability from your own cap_bset as well, you'd have to do that separately.
Sorry folks I was traveling and seems like lot happened on this thread. :p I will try to response few of these comments selectively - > The thing that makes me hesitate with this set is that it is a > permanent new feature to address what (I hope) is a temporary > problem. I agree this is permanent new feature but it's not solving a temporary problem. It's impossible to assess what and when new vulnerability that could show up. I think Daniel summed it up appropriately in his response > Seems like there are two naive ways to do it, the first being to just > look at all code under ns_capable() plus code called from there. It > seems like looking at the result of that could be fruitful. This is really hard. The main issue that there were features designed and developed before user-ns days with an assumption that unprivileged users will never get certain capabilities which only root user gets. Now that is not true anymore with user-ns creation with mapping root for any process. Also at the same time blocking user-ns creation for eveyone is a big-hammer which is not needed too. So it's not that easy to just perform a code-walk-though and correct those decisions now. > It seems to me that the existing control in > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape > in that case. This solution is essentially blocking unprivileged users from using the user-namespaces entirely. This is not really a solution that can work. The solution that this patch-set adds allows unprivileged users to create user-namespaces. Actually the proposed solution is more fine-grained approach than the unprivileged_userns_clone solution since you can selectively block capabilities rather than completely blocking the functionality. > I meant each task has a perm_cap_bset next to the cap_bset. So task > p1 (if it has privilege) can drop CAP_SYS_ADMIN from perm_cap_bset, > p2 (if it has privilege) can drop CAP_NET_ADMIN. When p1 creates a > new user_ns, that init task has its cap_bset set to all caps but > CAP_SYS_ADMIN. > > I think for simplicity perm_cap_bset would *only* affect the filling > of cap_bset at user namespace creation. So if you wanted to drop a > capability from your own cap_bset as well, you'd have to do that > separately. My original intention is to reduce the attack surface when vulnerabilities are discovered / published, but I don't see how this is solving that issue. Also the reason to have sysctl is to have simplistic control across the board to contain the situation. If that is not addressed then we might need some other solution on top of this.
On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote: > Sorry folks I was traveling and seems like lot happened on this thread. :p > > I will try to response few of these comments selectively - > > > The thing that makes me hesitate with this set is that it is a > > permanent new feature to address what (I hope) is a temporary > > problem. > I agree this is permanent new feature but it's not solving a temporary > problem. It's impossible to assess what and when new vulnerability > that could show up. I think Daniel summed it up appropriately in his > response > > > Seems like there are two naive ways to do it, the first being to just > > look at all code under ns_capable() plus code called from there. It > > seems like looking at the result of that could be fruitful. > This is really hard. The main issue that there were features designed > and developed before user-ns days with an assumption that unprivileged > users will never get certain capabilities which only root user gets. > Now that is not true anymore with user-ns creation with mapping root > for any process. Also at the same time blocking user-ns creation for > eveyone is a big-hammer which is not needed too. So it's not that easy > to just perform a code-walk-though and correct those decisions now. > > > It seems to me that the existing control in > > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape > > in that case. > This solution is essentially blocking unprivileged users from using > the user-namespaces entirely. This is not really a solution that can > work. The solution that this patch-set adds allows unprivileged users > to create user-namespaces. Actually the proposed solution is more > fine-grained approach than the unprivileged_userns_clone solution > since you can selectively block capabilities rather than completely > blocking the functionality. I've been talking to Stéphane today about this and we should also keep in mind that we have: chb@conventiont|~ > ls -al /proc/sys/user/ total 0 dr-xr-xr-x 1 root root 0 Nov 6 23:32 . dr-xr-xr-x 1 root root 0 Nov 2 22:13 .. -rw-r--r-- 1 root root 0 Nov 8 19:48 max_cgroup_namespaces -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_instances -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_watches -rw-r--r-- 1 root root 0 Nov 8 19:48 max_ipc_namespaces -rw-r--r-- 1 root root 0 Nov 8 19:48 max_mnt_namespaces -rw-r--r-- 1 root root 0 Nov 8 19:48 max_net_namespaces -rw-r--r-- 1 root root 0 Nov 8 19:48 max_pid_namespaces -rw-r--r-- 1 root root 0 Nov 8 19:48 max_user_namespaces -rw-r--r-- 1 root root 0 Nov 8 19:48 max_uts_namespaces These files allow you to limit the number of namespaces that can be created *per namespace* type. So let's say your system runs a bunch of user namespaces you can do: chb@conventiont|~ > echo 0 > /proc/sys/user/max_user_namespaces So that the next time you try to create a user namespaces you'd see: chb@conventiont|~ > unshare -U unshare: unshare failed: No space left on device So there's not even a need to upstream a new sysctl since we have ways of blocking this. Also I'd like to point out that a lot of capability checks and actual security vulnerabilities are associated with CAP_SYS_ADMIN. So what you likely want to do is block CAP_SYS_ADMIN in user namespaces but at this point they become basically useless for a lot of interesting use cases. In addition, this patch would add another layer of complexity that is - imho - not really warranted given what we already have. The relationship between capabilities and user namespaces should stay as simply as possible so that it stays maintaineable. User namespaces already introduce a proper layer of complexity. Just my two cents. I might be totally off here of course. Christian
On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner <christian.brauner@canonical.com> wrote: > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote: >> Sorry folks I was traveling and seems like lot happened on this thread. :p >> >> I will try to response few of these comments selectively - >> >> > The thing that makes me hesitate with this set is that it is a >> > permanent new feature to address what (I hope) is a temporary >> > problem. >> I agree this is permanent new feature but it's not solving a temporary >> problem. It's impossible to assess what and when new vulnerability >> that could show up. I think Daniel summed it up appropriately in his >> response >> >> > Seems like there are two naive ways to do it, the first being to just >> > look at all code under ns_capable() plus code called from there. It >> > seems like looking at the result of that could be fruitful. >> This is really hard. The main issue that there were features designed >> and developed before user-ns days with an assumption that unprivileged >> users will never get certain capabilities which only root user gets. >> Now that is not true anymore with user-ns creation with mapping root >> for any process. Also at the same time blocking user-ns creation for >> eveyone is a big-hammer which is not needed too. So it's not that easy >> to just perform a code-walk-though and correct those decisions now. >> >> > It seems to me that the existing control in >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape >> > in that case. >> This solution is essentially blocking unprivileged users from using >> the user-namespaces entirely. This is not really a solution that can >> work. The solution that this patch-set adds allows unprivileged users >> to create user-namespaces. Actually the proposed solution is more >> fine-grained approach than the unprivileged_userns_clone solution >> since you can selectively block capabilities rather than completely >> blocking the functionality. > > I've been talking to Stéphane today about this and we should also keep in mind > that we have: > > chb@conventiont|~ >> ls -al /proc/sys/user/ > total 0 > dr-xr-xr-x 1 root root 0 Nov 6 23:32 . > dr-xr-xr-x 1 root root 0 Nov 2 22:13 .. > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_cgroup_namespaces > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_instances > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_watches > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_ipc_namespaces > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_mnt_namespaces > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_net_namespaces > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_pid_namespaces > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_user_namespaces > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_uts_namespaces > > These files allow you to limit the number of namespaces that can be created > *per namespace* type. So let's say your system runs a bunch of user namespaces > you can do: > > chb@conventiont|~ >> echo 0 > /proc/sys/user/max_user_namespaces > > So that the next time you try to create a user namespaces you'd see: > > chb@conventiont|~ >> unshare -U > unshare: unshare failed: No space left on device > > So there's not even a need to upstream a new sysctl since we have ways of > blocking this. > I'm not sure how it's solving the problem that my patch-set is addressing? I agree though that the need for unprivileged_userns_clone sysctl goes away as this is equivalent to setting that sysctl to 0 as you have described above. However as I mentioned earlier, blocking processes from creating user-namespaces is not the solution. Processes should be able to create namespaces as they are designed but at the same time we need to have controls to 'contain' them if a need arise. Setting max_no to 0 is not the solution that I'm looking for since it doesn't solve the problem. > Also I'd like to point out that a lot of capability checks and actual security > vulnerabilities are associated with CAP_SYS_ADMIN. So what you likely want to do > is block CAP_SYS_ADMIN in user namespaces but at this point they become > basically useless for a lot of interesting use cases. In addition, this patch > would add another layer of complexity that is - imho - not really warranted > given what we already have. I disagree. I'm not sure how this patch is adding complexity? Simply the functionality is maintained exactly as it is with an extra knob which allows you to take control back if a situation arise. Once the kernel is patched for whatever was discovered, life returns back to normal by readjusting the knob. It's as simple as that! > The relationship between capabilities and user > namespaces should stay as simply as possible so that it stays maintaineable. > User namespaces already introduce a proper layer of complexity. > Just my two cents. I might be totally off here of course. The side effect of the solution is that you have sort-of scaled-down / broken functionality without blocking the feature completely until life returns to normal. If the workload needs the exact same capability that is being controlled, then tough luck but chances of you having a workload that is not in contention with the capability that is being controlled is very high. In a situation like that you wont even notice the change but at the same time admin can ensure that the exploit is contained. This has a very high value. > > Christian
On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote: > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner > <christian.brauner@canonical.com> wrote: > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote: > >> Sorry folks I was traveling and seems like lot happened on this thread. :p > >> > >> I will try to response few of these comments selectively - > >> > >> > The thing that makes me hesitate with this set is that it is a > >> > permanent new feature to address what (I hope) is a temporary > >> > problem. > >> I agree this is permanent new feature but it's not solving a temporary > >> problem. It's impossible to assess what and when new vulnerability > >> that could show up. I think Daniel summed it up appropriately in his > >> response > >> > >> > Seems like there are two naive ways to do it, the first being to just > >> > look at all code under ns_capable() plus code called from there. It > >> > seems like looking at the result of that could be fruitful. > >> This is really hard. The main issue that there were features designed > >> and developed before user-ns days with an assumption that unprivileged > >> users will never get certain capabilities which only root user gets. > >> Now that is not true anymore with user-ns creation with mapping root > >> for any process. Also at the same time blocking user-ns creation for > >> eveyone is a big-hammer which is not needed too. So it's not that easy > >> to just perform a code-walk-though and correct those decisions now. > >> > >> > It seems to me that the existing control in > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape > >> > in that case. > >> This solution is essentially blocking unprivileged users from using > >> the user-namespaces entirely. This is not really a solution that can > >> work. The solution that this patch-set adds allows unprivileged users > >> to create user-namespaces. Actually the proposed solution is more > >> fine-grained approach than the unprivileged_userns_clone solution > >> since you can selectively block capabilities rather than completely > >> blocking the functionality. > > > > I've been talking to Stéphane today about this and we should also keep in mind > > that we have: > > > > chb@conventiont|~ > >> ls -al /proc/sys/user/ > > total 0 > > dr-xr-xr-x 1 root root 0 Nov 6 23:32 . > > dr-xr-xr-x 1 root root 0 Nov 2 22:13 .. > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_cgroup_namespaces > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_instances > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_watches > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_ipc_namespaces > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_mnt_namespaces > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_net_namespaces > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_pid_namespaces > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_user_namespaces > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_uts_namespaces > > > > These files allow you to limit the number of namespaces that can be created > > *per namespace* type. So let's say your system runs a bunch of user namespaces > > you can do: > > > > chb@conventiont|~ > >> echo 0 > /proc/sys/user/max_user_namespaces > > > > So that the next time you try to create a user namespaces you'd see: > > > > chb@conventiont|~ > >> unshare -U > > unshare: unshare failed: No space left on device > > > > So there's not even a need to upstream a new sysctl since we have ways of > > blocking this. > > > I'm not sure how it's solving the problem that my patch-set is addressing? > I agree though that the need for unprivileged_userns_clone sysctl goes > away as this is equivalent to setting that sysctl to 0 as you have > described above. oh right that was the reasoning iirc for not needing the other sysctl. > However as I mentioned earlier, blocking processes from creating > user-namespaces is not the solution. Processes should be able to > create namespaces as they are designed but at the same time we need to > have controls to 'contain' them if a need arise. Setting max_no to 0 > is not the solution that I'm looking for since it doesn't solve the > problem. well yesterday we were told that was explicitly not the goal, but that was not by you ... i just mention it to explain why we seem to be walking in circles a bit. anyway the bounding set doesn't actually make sense so forget that. the question then is just whether it makes sense to allow things to continue at all in this situation. would you mind indulging me by giving one or two concrete examples in the previous known cves of what capabilities you would have dropped tto allow the rest to continue to be safely used? thanks, serge
On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote: >> On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner >> <christian.brauner@canonical.com> wrote: >> > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote: >> >> Sorry folks I was traveling and seems like lot happened on this thread. :p >> >> >> >> I will try to response few of these comments selectively - >> >> >> >> > The thing that makes me hesitate with this set is that it is a >> >> > permanent new feature to address what (I hope) is a temporary >> >> > problem. >> >> I agree this is permanent new feature but it's not solving a temporary >> >> problem. It's impossible to assess what and when new vulnerability >> >> that could show up. I think Daniel summed it up appropriately in his >> >> response >> >> >> >> > Seems like there are two naive ways to do it, the first being to just >> >> > look at all code under ns_capable() plus code called from there. It >> >> > seems like looking at the result of that could be fruitful. >> >> This is really hard. The main issue that there were features designed >> >> and developed before user-ns days with an assumption that unprivileged >> >> users will never get certain capabilities which only root user gets. >> >> Now that is not true anymore with user-ns creation with mapping root >> >> for any process. Also at the same time blocking user-ns creation for >> >> eveyone is a big-hammer which is not needed too. So it's not that easy >> >> to just perform a code-walk-though and correct those decisions now. >> >> >> >> > It seems to me that the existing control in >> >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape >> >> > in that case. >> >> This solution is essentially blocking unprivileged users from using >> >> the user-namespaces entirely. This is not really a solution that can >> >> work. The solution that this patch-set adds allows unprivileged users >> >> to create user-namespaces. Actually the proposed solution is more >> >> fine-grained approach than the unprivileged_userns_clone solution >> >> since you can selectively block capabilities rather than completely >> >> blocking the functionality. >> > >> > I've been talking to Stéphane today about this and we should also keep in mind >> > that we have: >> > >> > chb@conventiont|~ >> >> ls -al /proc/sys/user/ >> > total 0 >> > dr-xr-xr-x 1 root root 0 Nov 6 23:32 . >> > dr-xr-xr-x 1 root root 0 Nov 2 22:13 .. >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_cgroup_namespaces >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_instances >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_watches >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_ipc_namespaces >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_mnt_namespaces >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_net_namespaces >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_pid_namespaces >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_user_namespaces >> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_uts_namespaces >> > >> > These files allow you to limit the number of namespaces that can be created >> > *per namespace* type. So let's say your system runs a bunch of user namespaces >> > you can do: >> > >> > chb@conventiont|~ >> >> echo 0 > /proc/sys/user/max_user_namespaces >> > >> > So that the next time you try to create a user namespaces you'd see: >> > >> > chb@conventiont|~ >> >> unshare -U >> > unshare: unshare failed: No space left on device >> > >> > So there's not even a need to upstream a new sysctl since we have ways of >> > blocking this. >> > >> I'm not sure how it's solving the problem that my patch-set is addressing? >> I agree though that the need for unprivileged_userns_clone sysctl goes >> away as this is equivalent to setting that sysctl to 0 as you have >> described above. > > oh right that was the reasoning iirc for not needing the other sysctl. > >> However as I mentioned earlier, blocking processes from creating >> user-namespaces is not the solution. Processes should be able to >> create namespaces as they are designed but at the same time we need to >> have controls to 'contain' them if a need arise. Setting max_no to 0 >> is not the solution that I'm looking for since it doesn't solve the >> problem. > > well yesterday we were told that was explicitly not the goal, but that was > not by you ... i just mention it to explain why we seem to be walking in > circles a bit. > > anyway the bounding set doesn't actually make sense so forget that. the > question then is just whether it makes sense to allow things to continue > at all in this situation. would you mind indulging me by giving one or two > concrete examples in the previous known cves of what capabilities you would > have dropped tto allow the rest to continue to be safely used? > Of course. Let's take an example of the CVE that I have mentioned in my cover-letter - CVE-2017-7308 <https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308>. It's well documented and even has a exploit c-program <https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308> that can demonstrate how it can be used against non-patched kernel. There is very nice blog post <https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html> about this vulnerability by Andrey Konovalov. This is about the AF_PACKET socket interface that is protected behind NET_RAW capability. This capability is not available to unprivileged user. However, any unprivileged user can get NET_RAW capability (as demonstrated in the cover-letter code that I have attached in this patch series) so this NET_RAW capability is available to any unprivileged user on the host if the kernel has user-namespaces available. With this patch-set applied, all that is needed is to flip a bit with the sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated below - root@lphh6:~# uname -a Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist kernel.controlled_userns_caps_whitelist = 1f,ffffffff Now when I run the program (demo from the cover-letter) as a normal unprivileged user I can't create a RAW socket in init-ns but I can in the child-ns. dumbo@lphh6:~$ /tmp/acquire_raw Attempting to open RAW socket before unshare()... socket() SOCK_RAW failed: : Operation not permitted Attempting to open RAW socket after unshare()... Successfully opened RAW-Sock after unshare(). dumbo@lphh6:~$ Now as a root user. Take off CAP_NET_RAW root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff kernel.controlled_userns_caps_whitelist = 1f,ffffdfff root@lphh6:~# Now run the same program as an unprivileged user - dumbo@lphh6:~$ /tmp/acquire_raw Attempting to open RAW socket before unshare()... socket() SOCK_RAW failed: : Operation not permitted Attempting to open RAW socket after unshare()... socket() SOCK_RAW failed: : Operation not permitted dumbo@lphh6:~$ Notice that it has failed to create a raw socket in init and in child namespace. It's not blocking creation of user-namespaces but allowing admin turn individual capability bits on and off. This is very simplistic example of just demonstrating how capability bits turn-on/off works. So let's assume a sandboxed environment where we don't know what a binary that we are about run in an environment which is identified as susceptible. By turning off the NET_RAW bit, the admin gets an assurance that system is safe and if binary fails because it's not getting this capability then that bad but a sad consequence (without compromising the host integrity) but if it doesn't use the NET_RAW capability but any other combination of remaining 36 capabilities, it would get whatever is necessary. This means we can safely allow processes to create user-namespaces by taking off certain capabilities in question for temporary/extended period until proper fix is applied without compromising the system integrity. The impact will vary based on which capability is taken off and admin would / should be ware of for the environment that he/she is dealing with. thanks, --mahesh.. > thanks, > serge
[resend response as earlier one failed because of formatting issues] On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > > On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote: > > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner > > <christian.brauner@canonical.com> wrote: > > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote: > > >> Sorry folks I was traveling and seems like lot happened on this thread. :p > > >> > > >> I will try to response few of these comments selectively - > > >> > > >> > The thing that makes me hesitate with this set is that it is a > > >> > permanent new feature to address what (I hope) is a temporary > > >> > problem. > > >> I agree this is permanent new feature but it's not solving a temporary > > >> problem. It's impossible to assess what and when new vulnerability > > >> that could show up. I think Daniel summed it up appropriately in his > > >> response > > >> > > >> > Seems like there are two naive ways to do it, the first being to just > > >> > look at all code under ns_capable() plus code called from there. It > > >> > seems like looking at the result of that could be fruitful. > > >> This is really hard. The main issue that there were features designed > > >> and developed before user-ns days with an assumption that unprivileged > > >> users will never get certain capabilities which only root user gets. > > >> Now that is not true anymore with user-ns creation with mapping root > > >> for any process. Also at the same time blocking user-ns creation for > > >> eveyone is a big-hammer which is not needed too. So it's not that easy > > >> to just perform a code-walk-though and correct those decisions now. > > >> > > >> > It seems to me that the existing control in > > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape > > >> > in that case. > > >> This solution is essentially blocking unprivileged users from using > > >> the user-namespaces entirely. This is not really a solution that can > > >> work. The solution that this patch-set adds allows unprivileged users > > >> to create user-namespaces. Actually the proposed solution is more > > >> fine-grained approach than the unprivileged_userns_clone solution > > >> since you can selectively block capabilities rather than completely > > >> blocking the functionality. > > > > > > I've been talking to Stéphane today about this and we should also keep in mind > > > that we have: > > > > > > chb@conventiont|~ > > >> ls -al /proc/sys/user/ > > > total 0 > > > dr-xr-xr-x 1 root root 0 Nov 6 23:32 . > > > dr-xr-xr-x 1 root root 0 Nov 2 22:13 .. > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_cgroup_namespaces > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_instances > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_watches > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_ipc_namespaces > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_mnt_namespaces > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_net_namespaces > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_pid_namespaces > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_user_namespaces > > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_uts_namespaces > > > > > > These files allow you to limit the number of namespaces that can be created > > > *per namespace* type. So let's say your system runs a bunch of user namespaces > > > you can do: > > > > > > chb@conventiont|~ > > >> echo 0 > /proc/sys/user/max_user_namespaces > > > > > > So that the next time you try to create a user namespaces you'd see: > > > > > > chb@conventiont|~ > > >> unshare -U > > > unshare: unshare failed: No space left on device > > > > > > So there's not even a need to upstream a new sysctl since we have ways of > > > blocking this. > > > > > I'm not sure how it's solving the problem that my patch-set is addressing? > > I agree though that the need for unprivileged_userns_clone sysctl goes > > away as this is equivalent to setting that sysctl to 0 as you have > > described above. > > oh right that was the reasoning iirc for not needing the other sysctl. > > > However as I mentioned earlier, blocking processes from creating > > user-namespaces is not the solution. Processes should be able to > > create namespaces as they are designed but at the same time we need to > > have controls to 'contain' them if a need arise. Setting max_no to 0 > > is not the solution that I'm looking for since it doesn't solve the > > problem. > > well yesterday we were told that was explicitly not the goal, but that was > not by you ... i just mention it to explain why we seem to be walking in > circles a bit. > > anyway the bounding set doesn't actually make sense so forget that. the > question then is just whether it makes sense to allow things to continue > at all in this situation. would you mind indulging me by giving one or two > concrete examples in the previous known cves of what capabilities you would > have dropped tto allow the rest to continue to be safely used? > Of course. Let's take an example of the CVE that I have mentioned in my cover-letter - CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308). It's well documented and even has a exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308) c-program that can demonstrate how it can be used against non-patched kernel. There is very nice blog post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html) about this vulnerability by Andrey Konovalov. This is about the AF_PACKET socket interface that is protected behind NET_RAW capability. This capability is not available to unprivileged user. However, any unprivileged user can get NET_RAW capability (as demonstrated in the cover-letter code that I have attached in this patch series) so this NET_RAW capability is available to any unprivileged user on the host if the kernel has user-namespaces available. With this patch-set applied, all that is needed is to flip a bit with the sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated below - root@lphh6:~# uname -a Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist kernel.controlled_userns_caps_whitelist = 1f,ffffffff Now when I run the program (demo from the cover-letter) as a normal unprivileged user I can't create a RAW socket in init-ns but I can in the child-ns. dumbo@lphh6:~$ /tmp/acquire_raw Attempting to open RAW socket before unshare()... socket() SOCK_RAW failed: : Operation not permitted Attempting to open RAW socket after unshare()... Successfully opened RAW-Sock after unshare(). dumbo@lphh6:~$ Now as a root user. Take off CAP_NET_RAW root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff kernel.controlled_userns_caps_whitelist = 1f,ffffdfff root@lphh6:~# Now run the same program as an unprivileged user - dumbo@lphh6:~$ /tmp/acquire_raw Attempting to open RAW socket before unshare()... socket() SOCK_RAW failed: : Operation not permitted Attempting to open RAW socket after unshare()... socket() SOCK_RAW failed: : Operation not permitted dumbo@lphh6:~$ Notice that it has failed to create a raw socket in init and in child namespace. It's not blocking creation of user-namespaces but allowing admin turn individual capability bits on and off. This is very simplistic example of just demonstrating how capability bits turn-on/off works. So let's assume a sandboxed environment where we don't know what a binary that we are about run in an environment which is identified as susceptible. By turning off the NET_RAW bit, the admin gets an assurance that system is safe and if binary fails because it's not getting this capability then that bad but a sad consequence (without compromising the host integrity) but if it doesn't use the NET_RAW capability but any other combination of remaining 36 capabilities, it would get whatever is necessary. This means we can safely allow processes to create user-namespaces by taking off certain capabilities in question for temporary/extended period until proper fix is applied without compromising the system integrity. The impact will vary based on which capability is taken off and admin would / should be ware of for the environment that he/she is dealing with. thanks, --mahesh.. > thanks, > serge
Quoting Mahesh Bandewar (महेश बंडेवार) (maheshb@google.com): > Of course. Let's take an example of the CVE that I have mentioned in > my cover-letter - > CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308). > It's well documented and even has a > exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308) > c-program that can demonstrate how it can be used against non-patched > kernel. There is very nice blog > post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html) > about this vulnerability by Andrey Konovalov. Ok, thanks. It's a good example because the fix for this CVE actually came by itself (http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/changelog). Normally multiple CVEs come at the same time, which would make a workaround for one now helpful. This is a good counter-example. I'm going to maintain that I really don't like this. But it looks useful, so ack on the concept, I'll just have to look again at the code now. Thanks for indulging me. -serge
Quoting Mahesh Bandewar (mahesh@bandewar.net): > From: Mahesh Bandewar <maheshb@google.com> > > With this new notion of "controlled" user-namespaces, the controlled > user-namespaces are marked at the time of their creation while the > capabilities of processes that belong to them are controlled using the > global mask. > > Init-user-ns is always uncontrolled and a process that has SYS_ADMIN > that belongs to uncontrolled user-ns can create another (child) user- > namespace that is uncontrolled. Any other process (that either does > not have SYS_ADMIN or belongs to a controlled user-ns) can only > create a user-ns that is controlled. > > global-capability-whitelist (controlled_userns_caps_whitelist) is used > at the capability check-time and keeps the semantics for the processes > that belong to uncontrolled user-ns as it is. Processes that belong to > controlled user-ns however are subjected to different checks- > > (a) if the capability in question is controlled and process belongs > to controlled user-ns, then it's always denied. > (b) if the capability in question is NOT controlled then fall back > to the traditional check. > > Signed-off-by: Mahesh Bandewar <maheshb@google.com> > --- > include/linux/capability.h | 1 + > include/linux/user_namespace.h | 20 ++++++++++++++++++++ > kernel/capability.c | 5 +++++ > kernel/user_namespace.c | 3 +++ > security/commoncap.c | 8 ++++++++ > 5 files changed, 37 insertions(+) > > diff --git a/include/linux/capability.h b/include/linux/capability.h > index 6c0b9677c03f..b8c6cac18658 100644 > --- a/include/linux/capability.h > +++ b/include/linux/capability.h > @@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns); > extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps); > int proc_douserns_caps_whitelist(struct ctl_table *table, int write, > void __user *buff, size_t *lenp, loff_t *ppos); > +bool is_capability_controlled(int cap); > > extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size); > > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h > index c18e01252346..e890fe81b47e 100644 > --- a/include/linux/user_namespace.h > +++ b/include/linux/user_namespace.h > @@ -22,6 +22,7 @@ struct uid_gid_map { /* 64 bytes -- 1 cache line */ > }; > > #define USERNS_SETGROUPS_ALLOWED 1UL > +#define USERNS_CONTROLLED 2UL > > #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED > > @@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns) > __put_user_ns(ns); > } > > +static inline bool is_user_ns_controlled(const struct user_namespace *ns) > +{ > + return ns->flags & USERNS_CONTROLLED; > +} > + > +static inline void mark_user_ns_controlled(struct user_namespace *ns) > +{ > + ns->flags |= USERNS_CONTROLLED; > +} > + > struct seq_operations; > extern const struct seq_operations proc_uid_seq_operations; > extern const struct seq_operations proc_gid_seq_operations; > @@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns) > { > return ERR_PTR(-EPERM); > } > + > +static inline bool is_user_ns_controlled(const struct user_namespace *ns) > +{ > + return false; > +} > + > +static inline void mark_user_ns_controlled(struct user_namespace *ns) > +{ > +} > #endif > > #endif /* _LINUX_USER_H */ > diff --git a/kernel/capability.c b/kernel/capability.c > index 62dbe3350c1b..40a38cc4ff43 100644 > --- a/kernel/capability.c > +++ b/kernel/capability.c > @@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns) > } > > /* Controlled-userns capabilities routines */ > +bool is_capability_controlled(int cap) > +{ > + return !cap_raised(controlled_userns_caps_whitelist, cap); > +} > + > #ifdef CONFIG_SYSCTL > int proc_douserns_caps_whitelist(struct ctl_table *table, int write, > void __user *buff, size_t *lenp, loff_t *ppos) > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index c490f1e4313b..f393ea5108f0 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) > cred->cap_effective = CAP_FULL_SET; > cred->cap_ambient = CAP_EMPTY_SET; > cred->cap_bset = CAP_FULL_SET; > + if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) || > + is_user_ns_controlled(user_ns->parent)) > + mark_user_ns_controlled(user_ns); Hm, why do this here, rather than at create_user_ns()? It shouldn't be recalculated when someone does setns() should it?
On 11/06/2017 10:23 PM, Serge E. Hallyn wrote: > I think I definately prefer what I mentioned in the email to Boris. > Basically a "permanent capability bounding set". The normal bounding > set gets reset to a full set on every new user_ns creation. In this > proposal, it would instead be set to the calling task's permanent > capability set, which starts (at boot) full, and which privileged > tasks can pull capabilities out of. Actually, this may solve a similar problem I've been looking at. The idea was basically at strategic points in the kernel (possibly LSM hook sites, still evaluating, and probably syscall entry) validate that a task has not "magically" acquired capabilities that it or parent specifically said it cannot have and then take some action like say killing it immediately. Using your terms, basically make the "permanent capability set" a write-once privilege escalation defense. To handle the 0-day threat, perhaps make it writable but only with more "restrictive" values. -chrish
Quoting chris hyser (chris.hyser@oracle.com): > On 11/06/2017 10:23 PM, Serge E. Hallyn wrote: > >I think I definately prefer what I mentioned in the email to Boris. > >Basically a "permanent capability bounding set". The normal bounding > >set gets reset to a full set on every new user_ns creation. In this > >proposal, it would instead be set to the calling task's permanent > >capability set, which starts (at boot) full, and which privileged > >tasks can pull capabilities out of. > > Actually, this may solve a similar problem I've been looking at. The > idea was basically at strategic points in the kernel (possibly LSM > hook sites, still evaluating, and probably syscall entry) validate > that a task has not "magically" acquired capabilities that it or > parent specifically said it cannot have and then take some action > like say killing it immediately. Using your terms, basically make > the "permanent capability set" a write-once privilege escalation > defense. To handle the 0-day threat, perhaps make it writable but > only with more "restrictive" values. Would the existing capability bounding set not suffice for that? The 'permanent' bounding set turns out to not be a good fit for the problem being discussed in this thread, but please feel free to start a new thread if you want to discuss your use case.
On 11/09/2017 01:05 PM, Serge E. Hallyn wrote: > Would the existing capability bounding set not suffice for that? > > The 'permanent' bounding set turns out to not be a good fit for > the problem being discussed in this thread, but please feel free > to start a new thread if you want to discuss your use case. Sure. I will formulate something for a new thread. What seems to be asked for here is a way to globally patch the capability sets of a entire process subtree. -chrish
"Mahesh Bandewar (महेश बंडेवार)" <maheshb@google.com> writes: > [resend response as earlier one failed because of formatting issues] > > On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn <serge@hallyn.com> wrote: >> >> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote: >> > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner >> > <christian.brauner@canonical.com> wrote: >> > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote: >> > >> Sorry folks I was traveling and seems like lot happened on this thread. :p >> > >> >> > >> I will try to response few of these comments selectively - >> > >> >> > >> > The thing that makes me hesitate with this set is that it is a >> > >> > permanent new feature to address what (I hope) is a temporary >> > >> > problem. >> > >> I agree this is permanent new feature but it's not solving a temporary >> > >> problem. It's impossible to assess what and when new vulnerability >> > >> that could show up. I think Daniel summed it up appropriately in his >> > >> response >> > >> >> > >> > Seems like there are two naive ways to do it, the first being to just >> > >> > look at all code under ns_capable() plus code called from there. It >> > >> > seems like looking at the result of that could be fruitful. >> > >> This is really hard. The main issue that there were features designed >> > >> and developed before user-ns days with an assumption that unprivileged >> > >> users will never get certain capabilities which only root user gets. >> > >> Now that is not true anymore with user-ns creation with mapping root >> > >> for any process. Also at the same time blocking user-ns creation for >> > >> eveyone is a big-hammer which is not needed too. So it's not that easy >> > >> to just perform a code-walk-though and correct those decisions now. >> > >> >> > >> > It seems to me that the existing control in >> > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape >> > >> > in that case. >> > >> This solution is essentially blocking unprivileged users from using >> > >> the user-namespaces entirely. This is not really a solution that can >> > >> work. The solution that this patch-set adds allows unprivileged users >> > >> to create user-namespaces. Actually the proposed solution is more >> > >> fine-grained approach than the unprivileged_userns_clone solution >> > >> since you can selectively block capabilities rather than completely >> > >> blocking the functionality. >> > > >> > > I've been talking to Stéphane today about this and we should also keep in mind >> > > that we have: >> > > >> > > chb@conventiont|~ >> > >> ls -al /proc/sys/user/ >> > > total 0 >> > > dr-xr-xr-x 1 root root 0 Nov 6 23:32 . >> > > dr-xr-xr-x 1 root root 0 Nov 2 22:13 .. >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_cgroup_namespaces >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_instances >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_watches >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_ipc_namespaces >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_mnt_namespaces >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_net_namespaces >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_pid_namespaces >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_user_namespaces >> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_uts_namespaces >> > > >> > > These files allow you to limit the number of namespaces that can be created >> > > *per namespace* type. So let's say your system runs a bunch of user namespaces >> > > you can do: >> > > >> > > chb@conventiont|~ >> > >> echo 0 > /proc/sys/user/max_user_namespaces >> > > >> > > So that the next time you try to create a user namespaces you'd see: >> > > >> > > chb@conventiont|~ >> > >> unshare -U >> > > unshare: unshare failed: No space left on device >> > > >> > > So there's not even a need to upstream a new sysctl since we have ways of >> > > blocking this. >> > > >> > I'm not sure how it's solving the problem that my patch-set is addressing? >> > I agree though that the need for unprivileged_userns_clone sysctl goes >> > away as this is equivalent to setting that sysctl to 0 as you have >> > described above. >> >> oh right that was the reasoning iirc for not needing the other sysctl. >> >> > However as I mentioned earlier, blocking processes from creating >> > user-namespaces is not the solution. Processes should be able to >> > create namespaces as they are designed but at the same time we need to >> > have controls to 'contain' them if a need arise. Setting max_no to 0 >> > is not the solution that I'm looking for since it doesn't solve the >> > problem. >> >> well yesterday we were told that was explicitly not the goal, but that was >> not by you ... i just mention it to explain why we seem to be walking in >> circles a bit. >> >> anyway the bounding set doesn't actually make sense so forget that. the >> question then is just whether it makes sense to allow things to continue >> at all in this situation. would you mind indulging me by giving one or two >> concrete examples in the previous known cves of what capabilities you would >> have dropped tto allow the rest to continue to be safely used? >> > Of course. Let's take an example of the CVE that I have mentioned in > my cover-letter - > CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308). > It's well documented and even has a > exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308) > c-program that can demonstrate how it can be used against non-patched > kernel. There is very nice blog > post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html) > about this vulnerability by Andrey Konovalov. > > This is about the AF_PACKET socket interface that is protected behind > NET_RAW capability. This capability is not available to unprivileged > user. However, any unprivileged user can get NET_RAW capability (as > demonstrated in the cover-letter code that I have attached in this > patch series) so this NET_RAW capability is available to any > unprivileged user on the host if the kernel has user-namespaces > available. > > With this patch-set applied, all that is needed is to flip a bit with > the sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated > below - > > root@lphh6:~# uname -a > Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux > root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist > kernel.controlled_userns_caps_whitelist = 1f,ffffffff > > Now when I run the program (demo from the cover-letter) as a normal > unprivileged user I can't create a RAW socket in init-ns but I can in > the child-ns. > > dumbo@lphh6:~$ /tmp/acquire_raw > Attempting to open RAW socket before unshare()... > socket() SOCK_RAW failed: : Operation not permitted > Attempting to open RAW socket after unshare()... > Successfully opened RAW-Sock after unshare(). > dumbo@lphh6:~$ > > Now as a root user. Take off CAP_NET_RAW > > root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff > kernel.controlled_userns_caps_whitelist = 1f,ffffdfff > root@lphh6:~# > > Now run the same program as an unprivileged user - > > dumbo@lphh6:~$ /tmp/acquire_raw > Attempting to open RAW socket before unshare()... > socket() SOCK_RAW failed: : Operation not permitted > Attempting to open RAW socket after unshare()... > socket() SOCK_RAW failed: : Operation not permitted > dumbo@lphh6:~$ > > Notice that it has failed to create a raw socket in init and in child > namespace. It's not blocking creation of user-namespaces but allowing > admin turn individual capability bits on and off. > > This is very simplistic example of just demonstrating how capability > bits turn-on/off works. So let's assume a sandboxed environment where > we don't know what a binary that we are about run in an environment > which is identified as susceptible. By turning off the NET_RAW bit, > the admin gets an assurance that system is safe and if binary fails > because it's not getting this capability then that bad but a sad > consequence (without compromising the host integrity) but if it > doesn't use the NET_RAW capability but any other combination of > remaining 36 capabilities, it would get whatever is necessary. This > means we can safely allow processes to create user-namespaces by > taking off certain capabilities in question for temporary/extended > period until proper fix is applied without compromising the system > integrity. The impact will vary based on which capability is taken off > and admin would / should be ware of for the environment that he/she is > dealing with. My challenge with this reasoning is that I don't know that it meanifully generalizes to any other capability. I can in the sandbox today create a user namespace and then set max_net_namespaces to 0, and drop CAP_NET_RAW and that blocks the attack. (Possibly with a little spice to prevent a suid root program from reacquiring CAP_NET_RAW). So while your solution doesn't look horrible especially if it can be done at a user namespace level so the restrictions can be limited to a single sandbox. I am not at all certain that the capabilities is the proper place to limit code reachability. I would very much like to see which capabilities that are available with ns_capable, are more meaningful to limit than just dropping the capability during sandbox creation and denying the creation of the corresponding namespace. CAP_NET_RAW is one. Are there any other capabilities that are meanginful to limit? Eric
On Fri, Nov 10, 2017 at 2:25 AM, Serge E. Hallyn <serge@hallyn.com> wrote: > Quoting Mahesh Bandewar (mahesh@bandewar.net): >> From: Mahesh Bandewar <maheshb@google.com> >> >> With this new notion of "controlled" user-namespaces, the controlled >> user-namespaces are marked at the time of their creation while the >> capabilities of processes that belong to them are controlled using the >> global mask. >> >> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN >> that belongs to uncontrolled user-ns can create another (child) user- >> namespace that is uncontrolled. Any other process (that either does >> not have SYS_ADMIN or belongs to a controlled user-ns) can only >> create a user-ns that is controlled. >> >> global-capability-whitelist (controlled_userns_caps_whitelist) is used >> at the capability check-time and keeps the semantics for the processes >> that belong to uncontrolled user-ns as it is. Processes that belong to >> controlled user-ns however are subjected to different checks- >> >> (a) if the capability in question is controlled and process belongs >> to controlled user-ns, then it's always denied. >> (b) if the capability in question is NOT controlled then fall back >> to the traditional check. >> >> Signed-off-by: Mahesh Bandewar <maheshb@google.com> >> --- >> include/linux/capability.h | 1 + >> include/linux/user_namespace.h | 20 ++++++++++++++++++++ >> kernel/capability.c | 5 +++++ >> kernel/user_namespace.c | 3 +++ >> security/commoncap.c | 8 ++++++++ >> 5 files changed, 37 insertions(+) >> >> diff --git a/include/linux/capability.h b/include/linux/capability.h >> index 6c0b9677c03f..b8c6cac18658 100644 >> --- a/include/linux/capability.h >> +++ b/include/linux/capability.h >> @@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns); >> extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps); >> int proc_douserns_caps_whitelist(struct ctl_table *table, int write, >> void __user *buff, size_t *lenp, loff_t *ppos); >> +bool is_capability_controlled(int cap); >> >> extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size); >> >> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h >> index c18e01252346..e890fe81b47e 100644 >> --- a/include/linux/user_namespace.h >> +++ b/include/linux/user_namespace.h >> @@ -22,6 +22,7 @@ struct uid_gid_map { /* 64 bytes -- 1 cache line */ >> }; >> >> #define USERNS_SETGROUPS_ALLOWED 1UL >> +#define USERNS_CONTROLLED 2UL >> >> #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED >> >> @@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns) >> __put_user_ns(ns); >> } >> >> +static inline bool is_user_ns_controlled(const struct user_namespace *ns) >> +{ >> + return ns->flags & USERNS_CONTROLLED; >> +} >> + >> +static inline void mark_user_ns_controlled(struct user_namespace *ns) >> +{ >> + ns->flags |= USERNS_CONTROLLED; >> +} >> + >> struct seq_operations; >> extern const struct seq_operations proc_uid_seq_operations; >> extern const struct seq_operations proc_gid_seq_operations; >> @@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns) >> { >> return ERR_PTR(-EPERM); >> } >> + >> +static inline bool is_user_ns_controlled(const struct user_namespace *ns) >> +{ >> + return false; >> +} >> + >> +static inline void mark_user_ns_controlled(struct user_namespace *ns) >> +{ >> +} >> #endif >> >> #endif /* _LINUX_USER_H */ >> diff --git a/kernel/capability.c b/kernel/capability.c >> index 62dbe3350c1b..40a38cc4ff43 100644 >> --- a/kernel/capability.c >> +++ b/kernel/capability.c >> @@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns) >> } >> >> /* Controlled-userns capabilities routines */ >> +bool is_capability_controlled(int cap) >> +{ >> + return !cap_raised(controlled_userns_caps_whitelist, cap); >> +} >> + >> #ifdef CONFIG_SYSCTL >> int proc_douserns_caps_whitelist(struct ctl_table *table, int write, >> void __user *buff, size_t *lenp, loff_t *ppos) >> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c >> index c490f1e4313b..f393ea5108f0 100644 >> --- a/kernel/user_namespace.c >> +++ b/kernel/user_namespace.c >> @@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) >> cred->cap_effective = CAP_FULL_SET; >> cred->cap_ambient = CAP_EMPTY_SET; >> cred->cap_bset = CAP_FULL_SET; >> + if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) || >> + is_user_ns_controlled(user_ns->parent)) >> + mark_user_ns_controlled(user_ns); > > Hm, why do this here, rather than at create_user_ns()? It > shouldn't be recalculated when someone does setns() should it? > You are absolutely right! It doesn't make sense to recalculate for every setns() call. It's a side effect of couple of iterations / approaches that I tried before finalizing this one. I'll move this block to create_user_ns() after the set_cred_user_ns() call so that this wont be triggered in setns() path. Thanks, --mahesh..
On Fri, Nov 10, 2017 at 6:58 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > "Mahesh Bandewar (महेश बंडेवार)" <maheshb@google.com> writes: > >> [resend response as earlier one failed because of formatting issues] >> >> On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn <serge@hallyn.com> wrote: >>> >>> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote: >>> > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner >>> > <christian.brauner@canonical.com> wrote: >>> > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote: >>> > >> Sorry folks I was traveling and seems like lot happened on this thread. :p >>> > >> >>> > >> I will try to response few of these comments selectively - >>> > >> >>> > >> > The thing that makes me hesitate with this set is that it is a >>> > >> > permanent new feature to address what (I hope) is a temporary >>> > >> > problem. >>> > >> I agree this is permanent new feature but it's not solving a temporary >>> > >> problem. It's impossible to assess what and when new vulnerability >>> > >> that could show up. I think Daniel summed it up appropriately in his >>> > >> response >>> > >> >>> > >> > Seems like there are two naive ways to do it, the first being to just >>> > >> > look at all code under ns_capable() plus code called from there. It >>> > >> > seems like looking at the result of that could be fruitful. >>> > >> This is really hard. The main issue that there were features designed >>> > >> and developed before user-ns days with an assumption that unprivileged >>> > >> users will never get certain capabilities which only root user gets. >>> > >> Now that is not true anymore with user-ns creation with mapping root >>> > >> for any process. Also at the same time blocking user-ns creation for >>> > >> eveyone is a big-hammer which is not needed too. So it's not that easy >>> > >> to just perform a code-walk-though and correct those decisions now. >>> > >> >>> > >> > It seems to me that the existing control in >>> > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape >>> > >> > in that case. >>> > >> This solution is essentially blocking unprivileged users from using >>> > >> the user-namespaces entirely. This is not really a solution that can >>> > >> work. The solution that this patch-set adds allows unprivileged users >>> > >> to create user-namespaces. Actually the proposed solution is more >>> > >> fine-grained approach than the unprivileged_userns_clone solution >>> > >> since you can selectively block capabilities rather than completely >>> > >> blocking the functionality. >>> > > >>> > > I've been talking to Stéphane today about this and we should also keep in mind >>> > > that we have: >>> > > >>> > > chb@conventiont|~ >>> > >> ls -al /proc/sys/user/ >>> > > total 0 >>> > > dr-xr-xr-x 1 root root 0 Nov 6 23:32 . >>> > > dr-xr-xr-x 1 root root 0 Nov 2 22:13 .. >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_cgroup_namespaces >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_instances >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_watches >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_ipc_namespaces >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_mnt_namespaces >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_net_namespaces >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_pid_namespaces >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_user_namespaces >>> > > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_uts_namespaces >>> > > >>> > > These files allow you to limit the number of namespaces that can be created >>> > > *per namespace* type. So let's say your system runs a bunch of user namespaces >>> > > you can do: >>> > > >>> > > chb@conventiont|~ >>> > >> echo 0 > /proc/sys/user/max_user_namespaces >>> > > >>> > > So that the next time you try to create a user namespaces you'd see: >>> > > >>> > > chb@conventiont|~ >>> > >> unshare -U >>> > > unshare: unshare failed: No space left on device >>> > > >>> > > So there's not even a need to upstream a new sysctl since we have ways of >>> > > blocking this. >>> > > >>> > I'm not sure how it's solving the problem that my patch-set is addressing? >>> > I agree though that the need for unprivileged_userns_clone sysctl goes >>> > away as this is equivalent to setting that sysctl to 0 as you have >>> > described above. >>> >>> oh right that was the reasoning iirc for not needing the other sysctl. >>> >>> > However as I mentioned earlier, blocking processes from creating >>> > user-namespaces is not the solution. Processes should be able to >>> > create namespaces as they are designed but at the same time we need to >>> > have controls to 'contain' them if a need arise. Setting max_no to 0 >>> > is not the solution that I'm looking for since it doesn't solve the >>> > problem. >>> >>> well yesterday we were told that was explicitly not the goal, but that was >>> not by you ... i just mention it to explain why we seem to be walking in >>> circles a bit. >>> >>> anyway the bounding set doesn't actually make sense so forget that. the >>> question then is just whether it makes sense to allow things to continue >>> at all in this situation. would you mind indulging me by giving one or two >>> concrete examples in the previous known cves of what capabilities you would >>> have dropped tto allow the rest to continue to be safely used? >>> >> Of course. Let's take an example of the CVE that I have mentioned in >> my cover-letter - >> CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308). >> It's well documented and even has a >> exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308) >> c-program that can demonstrate how it can be used against non-patched >> kernel. There is very nice blog >> post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html) >> about this vulnerability by Andrey Konovalov. >> >> This is about the AF_PACKET socket interface that is protected behind >> NET_RAW capability. This capability is not available to unprivileged >> user. However, any unprivileged user can get NET_RAW capability (as >> demonstrated in the cover-letter code that I have attached in this >> patch series) so this NET_RAW capability is available to any >> unprivileged user on the host if the kernel has user-namespaces >> available. >> >> With this patch-set applied, all that is needed is to flip a bit with >> the sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated >> below - >> >> root@lphh6:~# uname -a >> Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux >> root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist >> kernel.controlled_userns_caps_whitelist = 1f,ffffffff >> >> Now when I run the program (demo from the cover-letter) as a normal >> unprivileged user I can't create a RAW socket in init-ns but I can in >> the child-ns. >> >> dumbo@lphh6:~$ /tmp/acquire_raw >> Attempting to open RAW socket before unshare()... >> socket() SOCK_RAW failed: : Operation not permitted >> Attempting to open RAW socket after unshare()... >> Successfully opened RAW-Sock after unshare(). >> dumbo@lphh6:~$ >> >> Now as a root user. Take off CAP_NET_RAW >> >> root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff >> kernel.controlled_userns_caps_whitelist = 1f,ffffdfff >> root@lphh6:~# >> >> Now run the same program as an unprivileged user - >> >> dumbo@lphh6:~$ /tmp/acquire_raw >> Attempting to open RAW socket before unshare()... >> socket() SOCK_RAW failed: : Operation not permitted >> Attempting to open RAW socket after unshare()... >> socket() SOCK_RAW failed: : Operation not permitted >> dumbo@lphh6:~$ >> >> Notice that it has failed to create a raw socket in init and in child >> namespace. It's not blocking creation of user-namespaces but allowing >> admin turn individual capability bits on and off. >> >> This is very simplistic example of just demonstrating how capability >> bits turn-on/off works. So let's assume a sandboxed environment where >> we don't know what a binary that we are about run in an environment >> which is identified as susceptible. By turning off the NET_RAW bit, >> the admin gets an assurance that system is safe and if binary fails >> because it's not getting this capability then that bad but a sad >> consequence (without compromising the host integrity) but if it >> doesn't use the NET_RAW capability but any other combination of >> remaining 36 capabilities, it would get whatever is necessary. This >> means we can safely allow processes to create user-namespaces by >> taking off certain capabilities in question for temporary/extended >> period until proper fix is applied without compromising the system >> integrity. The impact will vary based on which capability is taken off >> and admin would / should be ware of for the environment that he/she is >> dealing with. > > My challenge with this reasoning is that I don't know that it meanifully > generalizes to any other capability. > > I can in the sandbox today create a user namespace and then set > max_net_namespaces to 0, and drop CAP_NET_RAW and that blocks > the attack. (Possibly with a little spice to prevent a suid root > program from reacquiring CAP_NET_RAW). > This is problematic since you are expecting the user-namespace creator to perform this operation and then block the child process from creating the user-namespace. This is similar to making user-namespace creation a privileged operation discussed previously. > So while your solution doesn't look horrible especially if it can be > done at a user namespace level so the restrictions can be limited to a > single sandbox. I am not at all certain that the capabilities is the > proper place to limit code reachability. > > I would very much like to see which capabilities that are available with > ns_capable, are more meaningful to limit than just dropping the > capability during sandbox creation and denying the creation of the > corresponding namespace. > The primary assumption in this approach is that we can drop capabilities before running the workload and then not allowing workload to create the user-namespace. This does not work for cases where workload needs to create user-namespaces. > CAP_NET_RAW is one. Are there any other capabilities that are > meanginful to limit? > There are currently 37 capabilities and I see many of those are currently namespace aware (with ns_capable() call). Also there seems to be disproportionate amount of capable() to ns_capable() calls. This could be a result of not every feature available kernel-wide being namespace aware/capable etc. and this will evolve and mature i.e. ns_capable() will continue to grow where this would be applicable. Also Probably I'm the wrong person to ask this question to since I understand networking more than anything else. However, the main point is that we cannot predict which vulnerability is going to get published tomorrow networking or non-networking, so having a tool that gives controls to admin while allowing user-namespace creation is super useful. thanks, --mahesh.. > Eric
Quoting Eric W. Biederman (ebiederm@xmission.com): > single sandbox. I am not at all certain that the capabilities is the > proper place to limit code reachability. Right, I keep having this gut feeling that there is another way we should be doing that. Maybe based on ksplice or perf, or maybe more based on subsystems. And I hope someone pursues that. But I can't put my finger on it, and meanwhile the capability checks obviously *are* in fact gates... -serge
On Fri, Nov 10, 2017 at 1:46 PM, Serge E. Hallyn <serge@hallyn.com> wrote: > Quoting Eric W. Biederman (ebiederm@xmission.com): >> single sandbox. I am not at all certain that the capabilities is the >> proper place to limit code reachability. > > Right, I keep having this gut feeling that there is another way we > should be doing that. Maybe based on ksplice or perf, or maybe more > based on subsystems. And I hope someone pursues that. But I can't put > my finger on it, and meanwhile the capability checks obviously *are* in > fact gates... > Well, I don't mind if there is a better solution available. The proposed solution is not adding too much or complex code and using a bit and a sysctl and will be sitting dormant. When we have complete solution, this addition should not be a burden to maintain because of it's non-invasive footprint. I will push the next version of the patch-set that implements Serge's finding. Thanks, --mahesh.. [PS: I'll be soon traveling again and moving to an area where connectivity will be scarce / unreliable. So please expect lot more delays in my responses.] > -serge
diff --git a/include/linux/capability.h b/include/linux/capability.h index 6c0b9677c03f..b8c6cac18658 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns); extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps); int proc_douserns_caps_whitelist(struct ctl_table *table, int write, void __user *buff, size_t *lenp, loff_t *ppos); +bool is_capability_controlled(int cap); extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size); diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index c18e01252346..e890fe81b47e 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -22,6 +22,7 @@ struct uid_gid_map { /* 64 bytes -- 1 cache line */ }; #define USERNS_SETGROUPS_ALLOWED 1UL +#define USERNS_CONTROLLED 2UL #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED @@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns) __put_user_ns(ns); } +static inline bool is_user_ns_controlled(const struct user_namespace *ns) +{ + return ns->flags & USERNS_CONTROLLED; +} + +static inline void mark_user_ns_controlled(struct user_namespace *ns) +{ + ns->flags |= USERNS_CONTROLLED; +} + struct seq_operations; extern const struct seq_operations proc_uid_seq_operations; extern const struct seq_operations proc_gid_seq_operations; @@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns) { return ERR_PTR(-EPERM); } + +static inline bool is_user_ns_controlled(const struct user_namespace *ns) +{ + return false; +} + +static inline void mark_user_ns_controlled(struct user_namespace *ns) +{ +} #endif #endif /* _LINUX_USER_H */ diff --git a/kernel/capability.c b/kernel/capability.c index 62dbe3350c1b..40a38cc4ff43 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns) } /* Controlled-userns capabilities routines */ +bool is_capability_controlled(int cap) +{ + return !cap_raised(controlled_userns_caps_whitelist, cap); +} + #ifdef CONFIG_SYSCTL int proc_douserns_caps_whitelist(struct ctl_table *table, int write, void __user *buff, size_t *lenp, loff_t *ppos) diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index c490f1e4313b..f393ea5108f0 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) cred->cap_effective = CAP_FULL_SET; cred->cap_ambient = CAP_EMPTY_SET; cred->cap_bset = CAP_FULL_SET; + if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) || + is_user_ns_controlled(user_ns->parent)) + mark_user_ns_controlled(user_ns); #ifdef CONFIG_KEYS key_put(cred->request_key_auth); cred->request_key_auth = NULL; diff --git a/security/commoncap.c b/security/commoncap.c index fc46f5b85251..89103f16ac37 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct user_namespace *targ_ns, { struct user_namespace *ns = targ_ns; + /* If the capability is controlled and user-ns that process + * belongs-to is 'controlled' then return EPERM and no need + * to check the user-ns hierarchy. + */ + if (is_user_ns_controlled(cred->user_ns) && + is_capability_controlled(cap)) + return -EPERM; + /* See if cred has the capability in the target user namespace * by examining the target user namespace and all of the target * user namespace's parents.