Message ID | 20190506165439.9155-6-cyphar@cyphar.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | namei: resolveat(2) path resolution restriction API | expand |
On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > The need to be able to scope path resolution of interpreters became > clear with one of the possible vectors used in CVE-2019-5736 (which > most major container runtimes were vulnerable to). > > Naively, it might seem that openat(2) -- which supports path scoping -- > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the > binary being executed. Unfortunately, a "bad binary" (usually a symlink) > could be written as a #!-style script with the symlink target as the > interpreter -- which would be completely missed by just scoping the > openat(2). An example of this being exploitable is CVE-2019-5736. > > In order to get around this, we need to pass down to each binfmt_* > implementation the scoping flags requested in execveat(2). In order to > maintain backwards-compatibility we only pass the scoping AT_* flags. > > To avoid breaking userspace (in the exceptionally rare cases where you > have #!-scripts with a relative path being execveat(2)-ed with dfd != > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are > set in execveat(2). This seems extremely dangerous. I like the overall series, but not this patch. > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, > > sched_exec(); > > + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | > + AT_THIS_ROOT); [...] > +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ So now what happens if there is a setuid root ELF binary with program interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that going to let the unprivileged user decide which interpreter the setuid-root process should use? From a high-level perspective, opening the interpreter should be controlled by the program that is being loaded, not by the program that invoked it. In my opinion, CVE-2019-5736 points out two different problems: The big problem: The __ptrace_may_access() logic has a special-case short-circuit for "introspection" that you can't opt out of; this makes it possible to open things in procfs that are related to the current process even if the credentials of the process wouldn't permit accessing another process like it. I think the proper fix to deal with this would be to add a prctl() flag for "set whether introspection is allowed for this process", and if userspace has manually un-set that flag, any introspection special-case logic would be skipped. An additional problem: /proc/*/exe can be used to open a file for writing; I think it may have been Andy Lutomirski who pointed out some time ago that it would be nice if you couldn't use /proc/*/fd/* to re-open files with more privileges, which is sort of the same thing.
On 2019-05-06, Jann Horn <jannh@google.com> wrote: > On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > The need to be able to scope path resolution of interpreters became > > clear with one of the possible vectors used in CVE-2019-5736 (which > > most major container runtimes were vulnerable to). > > > > Naively, it might seem that openat(2) -- which supports path scoping -- > > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the > > binary being executed. Unfortunately, a "bad binary" (usually a symlink) > > could be written as a #!-style script with the symlink target as the > > interpreter -- which would be completely missed by just scoping the > > openat(2). An example of this being exploitable is CVE-2019-5736. > > > > In order to get around this, we need to pass down to each binfmt_* > > implementation the scoping flags requested in execveat(2). In order to > > maintain backwards-compatibility we only pass the scoping AT_* flags. > > > > To avoid breaking userspace (in the exceptionally rare cases where you > > have #!-scripts with a relative path being execveat(2)-ed with dfd != > > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are > > set in execveat(2). > > This seems extremely dangerous. I like the overall series, but not this patch. > > > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, > > > > sched_exec(); > > > > + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | > > + AT_THIS_ROOT); > [...] > > +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ > > So now what happens if there is a setuid root ELF binary with program > interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an > unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that > going to let the unprivileged user decide which interpreter the > setuid-root process should use? From a high-level perspective, opening > the interpreter should be controlled by the program that is being > loaded, not by the program that invoked it. I went a bit nuts with openat_exec(), and I did end up adding it to the ELF interpreter lookup (and you're completely right that this is a bad idea -- I will drop it from this patch if it's included in the next series). The proposed solutions you give below are much nicer than this patch so I can drop it and work on fixing those issues separately. > In my opinion, CVE-2019-5736 points out two different problems: > > The big problem: The __ptrace_may_access() logic has a special-case > short-circuit for "introspection" that you can't opt out of; this > makes it possible to open things in procfs that are related to the > current process even if the credentials of the process wouldn't permit > accessing another process like it. I think the proper fix to deal with > this would be to add a prctl() flag for "set whether introspection is > allowed for this process", and if userspace has manually un-set that > flag, any introspection special-case logic would be skipped. We could do PR_SET_DUMPABLE=3 for this, I guess? > An additional problem: /proc/*/exe can be used to open a file for > writing; I think it may have been Andy Lutomirski who pointed out some > time ago that it would be nice if you couldn't use /proc/*/fd/* to > re-open files with more privileges, which is sort of the same thing. This is something I'm currently working on a series for, which would boil down to some restrictions on how re-opening of file descriptors works through procfs. However, execveat() of a procfs magiclink is a bit hard to block -- there is no way for userspace to to represent a file being "open for execute" so they are all "open for execute" by default and blocking it outright seems a bit extreme (though I actually hope to eventually add the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence why I've reserved some bits). (Thinking more about it, there is an argument that I should include the above patch into this series so that we can block re-opening of fds opened through resolveat(2) without explicit flags from the outset.)
> On May 6, 2019, at 12:17 PM, Aleksa Sarai <cyphar@cyphar.com> wrote: > >> On 2019-05-06, Jann Horn <jannh@google.com> wrote: >>> On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote: >>> The need to be able to scope path resolution of interpreters became >>> clear with one of the possible vectors used in CVE-2019-5736 (which >>> most major container runtimes were vulnerable to). >>> >>> Naively, it might seem that openat(2) -- which supports path scoping -- >>> can be combined with execveat(AT_EMPTY_PATH) to trivially scope the >>> binary being executed. Unfortunately, a "bad binary" (usually a symlink) >>> could be written as a #!-style script with the symlink target as the >>> interpreter -- which would be completely missed by just scoping the >>> openat(2). An example of this being exploitable is CVE-2019-5736. >>> >>> In order to get around this, we need to pass down to each binfmt_* >>> implementation the scoping flags requested in execveat(2). In order to >>> maintain backwards-compatibility we only pass the scoping AT_* flags. >>> >>> To avoid breaking userspace (in the exceptionally rare cases where you >>> have #!-scripts with a relative path being execveat(2)-ed with dfd != >>> AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are >>> set in execveat(2). >> >> This seems extremely dangerous. I like the overall series, but not this patch. >> >>> @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, >>> >>> sched_exec(); >>> >>> + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | >>> + AT_THIS_ROOT); >> [...] >>> +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ >> >> So now what happens if there is a setuid root ELF binary with program >> interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an >> unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that >> going to let the unprivileged user decide which interpreter the >> setuid-root process should use? From a high-level perspective, opening >> the interpreter should be controlled by the program that is being >> loaded, not by the program that invoked it. > > I went a bit nuts with openat_exec(), and I did end up adding it to the > ELF interpreter lookup (and you're completely right that this is a bad > idea -- I will drop it from this patch if it's included in the next > series). > > The proposed solutions you give below are much nicer than this patch so > I can drop it and work on fixing those issues separately. > >> In my opinion, CVE-2019-5736 points out two different problems: >> >> The big problem: The __ptrace_may_access() logic has a special-case >> short-circuit for "introspection" that you can't opt out of; this >> makes it possible to open things in procfs that are related to the >> current process even if the credentials of the process wouldn't permit >> accessing another process like it. I think the proper fix to deal with >> this would be to add a prctl() flag for "set whether introspection is >> allowed for this process", and if userspace has manually un-set that >> flag, any introspection special-case logic would be skipped. > > We could do PR_SET_DUMPABLE=3 for this, I guess? > >> An additional problem: /proc/*/exe can be used to open a file for >> writing; I think it may have been Andy Lutomirski who pointed out some >> time ago that it would be nice if you couldn't use /proc/*/fd/* to >> re-open files with more privileges, which is sort of the same thing. > > This is something I'm currently working on a series for, which would > boil down to some restrictions on how re-opening of file descriptors > works through procfs. > > However, execveat() of a procfs magiclink is a bit hard to block -- > there is no way for userspace to to represent a file being "open for > execute" so they are all "open for execute" by default and blocking it > outright seems a bit extreme (though I actually hope to eventually add > the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence > why I've reserved some bits). There’s an O_MAYEXEC series floating around. > > (Thinking more about it, there is an argument that I should include the > above patch into this series so that we can block re-opening of fds > opened through resolveat(2) without explicit flags from the outset.) > > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/>
Jann Horn <jannh@google.com> writes: > > In my opinion, CVE-2019-5736 points out two different problems: > > The big problem: The __ptrace_may_access() logic has a special-case > short-circuit for "introspection" that you can't opt out of; Once upon a time in a galaxy far far away I fixed a bug where we missing ptrace_may_access checks on various proc files and systems using selinux stopped working. At the time selinux did not allow ptrace like access to yourself. The "introspection" special case was the quick and simple work-around. There is nothing fundamental in having the "introspection" special case except that various lsms have probably grown to depend upon it being there. I expect without difficulty we could move the check down into the various lsms. Which would get that check out of the core kernel code. Then the special case would the lsms challenge to keep or remove. Eric
On 2019-05-07, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2019-05-06, Jann Horn <jannh@google.com> wrote: > > On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > The need to be able to scope path resolution of interpreters became > > > clear with one of the possible vectors used in CVE-2019-5736 (which > > > most major container runtimes were vulnerable to). > > > > > > Naively, it might seem that openat(2) -- which supports path scoping -- > > > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the > > > binary being executed. Unfortunately, a "bad binary" (usually a symlink) > > > could be written as a #!-style script with the symlink target as the > > > interpreter -- which would be completely missed by just scoping the > > > openat(2). An example of this being exploitable is CVE-2019-5736. > > > > > > In order to get around this, we need to pass down to each binfmt_* > > > implementation the scoping flags requested in execveat(2). In order to > > > maintain backwards-compatibility we only pass the scoping AT_* flags. > > > > > > To avoid breaking userspace (in the exceptionally rare cases where you > > > have #!-scripts with a relative path being execveat(2)-ed with dfd != > > > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are > > > set in execveat(2). > > > > This seems extremely dangerous. I like the overall series, but not this patch. > > > > > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, > > > > > > sched_exec(); > > > > > > + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | > > > + AT_THIS_ROOT); > > [...] > > > +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ > > > > So now what happens if there is a setuid root ELF binary with program > > interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an > > unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that > > going to let the unprivileged user decide which interpreter the > > setuid-root process should use? From a high-level perspective, opening > > the interpreter should be controlled by the program that is being > > loaded, not by the program that invoked it. > > I went a bit nuts with openat_exec(), and I did end up adding it to the > ELF interpreter lookup (and you're completely right that this is a bad > idea -- I will drop it from this patch if it's included in the next > series). > > The proposed solutions you give below are much nicer than this patch so > I can drop it and work on fixing those issues separately. Another possible solution would be to only allow (for instance) AT_NO_MAGICLINKS for execveat(2). That way you cannot scope the resolution but you can block the most concerning cases -- those involving /proc-related access. I've posted a v7 with this patch dropped (because we can always add AT_* flags later in time), but I think having at least NO_MAGICLINKS would be useful.
On Tue, May 07, 2019 at 07:38:58PM -0500, Eric W. Biederman wrote: > Jann Horn <jannh@google.com> writes: > > In my opinion, CVE-2019-5736 points out two different problems: > > > > The big problem: The __ptrace_may_access() logic has a special-case > > short-circuit for "introspection" that you can't opt out of; > > Once upon a time in a galaxy far far away I fixed a bug where we missing > ptrace_may_access checks on various proc files and systems using selinux > stopped working. At the time selinux did not allow ptrace like access > to yourself. The "introspection" special case was the quick and simple > work-around. > > There is nothing fundamental in having the "introspection" special case > except that various lsms have probably grown to depend upon it being > there. I expect without difficulty we could move the check down > into the various lsms. Which would get that check out of the core > kernel code. Oh, if that's an option, that would be great, I think. But this means, for example, that a non-root, non-dumpable process can't open /proc/self/maps anymore, or open /proc/self/fd/*, and things like that, without making itself dumpable. I would be surprised if there is no code out there that relies on that. From what I can tell, without the introspection special case, introspection would fail in the following cases (assuming that the process is not capable and isn't using sys_setfs[ug]id()): - ruid/euid/suid are not all the same - rgid/egid/sgid are not all the same - process is not dumpable I think that there probably should be some way for a non-dumpable process to look at its own procfs entries? If we could start from a clean slate, I'd propose an opt-in flag to openat() for that, but since we don't have a clean slate, I'd be afraid of breaking things with that. But maybe I'm just being overly careful here?
On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote: > On 2019-05-06, Jann Horn <jannh@google.com> wrote: > > In my opinion, CVE-2019-5736 points out two different problems: > > > > The big problem: The __ptrace_may_access() logic has a special-case > > short-circuit for "introspection" that you can't opt out of; this > > makes it possible to open things in procfs that are related to the > > current process even if the credentials of the process wouldn't permit > > accessing another process like it. I think the proper fix to deal with > > this would be to add a prctl() flag for "set whether introspection is > > allowed for this process", and if userspace has manually un-set that > > flag, any introspection special-case logic would be skipped. > > We could do PR_SET_DUMPABLE=3 for this, I guess? Hmm... I'd make it a new prctl() command, since introspection is somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I think the introspection flag should be per-thread. > > An additional problem: /proc/*/exe can be used to open a file for > > writing; I think it may have been Andy Lutomirski who pointed out some > > time ago that it would be nice if you couldn't use /proc/*/fd/* to > > re-open files with more privileges, which is sort of the same thing. > > This is something I'm currently working on a series for, which would > boil down to some restrictions on how re-opening of file descriptors > works through procfs. Ah, nice! > However, execveat() of a procfs magiclink is a bit hard to block -- > there is no way for userspace to to represent a file being "open for > execute" so they are all "open for execute" by default and blocking it > outright seems a bit extreme (though I actually hope to eventually add > the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence > why I've reserved some bits). (For what it's worth, I'm mostly concerned about read vs write, not really about execute, since execute really is just another form of reading in my opinion.) > (Thinking more about it, there is an argument that I should include the > above patch into this series so that we can block re-opening of fds > opened through resolveat(2) without explicit flags from the outset.)
On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote: > > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote: > > On 2019-05-06, Jann Horn <jannh@google.com> wrote: > > > In my opinion, CVE-2019-5736 points out two different problems: > > > > > > The big problem: The __ptrace_may_access() logic has a special-case > > > short-circuit for "introspection" that you can't opt out of; this > > > makes it possible to open things in procfs that are related to the > > > current process even if the credentials of the process wouldn't permit > > > accessing another process like it. I think the proper fix to deal with > > > this would be to add a prctl() flag for "set whether introspection is > > > allowed for this process", and if userspace has manually un-set that > > > flag, any introspection special-case logic would be skipped. > > > > We could do PR_SET_DUMPABLE=3 for this, I guess? > > Hmm... I'd make it a new prctl() command, since introspection is > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I > think the introspection flag should be per-thread. I've lost track of the context here, but it seems to me that mitigating attacks involving accidental following of /proc links shouldn't depend on dumpability. What's the actual problem this is trying to solve again?
On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote: > On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote: > > > > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote: > > > On 2019-05-06, Jann Horn <jannh@google.com> wrote: > > > > In my opinion, CVE-2019-5736 points out two different problems: > > > > > > > > The big problem: The __ptrace_may_access() logic has a special-case > > > > short-circuit for "introspection" that you can't opt out of; this > > > > makes it possible to open things in procfs that are related to the > > > > current process even if the credentials of the process wouldn't permit > > > > accessing another process like it. I think the proper fix to deal with > > > > this would be to add a prctl() flag for "set whether introspection is > > > > allowed for this process", and if userspace has manually un-set that > > > > flag, any introspection special-case logic would be skipped. > > > > > > We could do PR_SET_DUMPABLE=3 for this, I guess? > > > > Hmm... I'd make it a new prctl() command, since introspection is > > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I > > think the introspection flag should be per-thread. > > I've lost track of the context here, but it seems to me that > mitigating attacks involving accidental following of /proc links > shouldn't depend on dumpability. What's the actual problem this is > trying to solve again? The one actual security problem that I've seen related to this is CVE-2019-5736. There is a write-up of it at <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html> under "Successful approach", but it goes more or less as follows: A container is running that doesn't use user namespaces (because for some reason I don't understand, apparently some people still do that). An evil process is running inside the container with UID 0 (as in, GLOBAL_ROOT_UID); so if the evil process inside the container was able to reach root-owned files on the host filesystem, it could write into them. The container engine wants to spawn a new process inside the container. It forks off a child that joins the container's namespaces (including PID and mount namespaces), and then the child calls execve() on some path in the container. The attacker replaces the executable in the container with a symlink to /proc/self/exe and replaces a library inside the container with a malicious one. When the container engine calls execve(), intending to run an executable inside the container, it instead goes through ptrace_may_access() using the introspection short-circuit and re-executes its own executable through the jumped symlink /proc/self/exe (which is normally unreachable for the container). After the execve(), the process loads an evil library from inside the container and is under the control of the container. Now the container controls a process whose /proc/self/exe is a jumped symlink to a host executable, and the container can write into it. Some container engines are now using an extremely ugly hack to work around this - whenever they want to enter a container, they copy the host binary into a new memfd and execute that to avoid exposing the original host binary to containers: <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b> In my opinion, the problems here are: - Apparently some people run untrusted containers without user namespaces. It would be really nice if people could not do that. (Probably the biggest problem here.) - ptrace_may_access() has a short-circuit that permits a process to unintentionally look at itself even if it has dropped privileges - here, it permits the execve("/proc/self/exe", ...) that would normally be blocked by the check for CAP_SYS_PTRACE if the process is nondumpable. - You can use /proc/*/exe to get a writable fd.
On Sat, May 11, 2019 at 12:55 AM Jann Horn <jannh@google.com> wrote: > > On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote: > > On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote: > > > > > > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote: > > > > On 2019-05-06, Jann Horn <jannh@google.com> wrote: > > > > > In my opinion, CVE-2019-5736 points out two different problems: > > > > > > > > > > The big problem: The __ptrace_may_access() logic has a special-case > > > > > short-circuit for "introspection" that you can't opt out of; this > > > > > makes it possible to open things in procfs that are related to the > > > > > current process even if the credentials of the process wouldn't permit > > > > > accessing another process like it. I think the proper fix to deal with > > > > > this would be to add a prctl() flag for "set whether introspection is > > > > > allowed for this process", and if userspace has manually un-set that > > > > > flag, any introspection special-case logic would be skipped. > > > > > > > > We could do PR_SET_DUMPABLE=3 for this, I guess? > > > > > > Hmm... I'd make it a new prctl() command, since introspection is > > > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I > > > think the introspection flag should be per-thread. > > > > I've lost track of the context here, but it seems to me that > > mitigating attacks involving accidental following of /proc links > > shouldn't depend on dumpability. What's the actual problem this is > > trying to solve again? > > The one actual security problem that I've seen related to this is > CVE-2019-5736. There is a write-up of it at > <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html> > under "Successful approach", but it goes more or less as follows: > > A container is running that doesn't use user namespaces (because for > some reason I don't understand, apparently some people still do that). > An evil process is running inside the container with UID 0 (as in, > GLOBAL_ROOT_UID); so if the evil process inside the container was able > to reach root-owned files on the host filesystem, it could write into > them. > > The container engine wants to spawn a new process inside the container. > It forks off a child that joins the container's namespaces (including > PID and mount namespaces), and then the child calls execve() on some > path in the container. > The attacker replaces the executable in the container with a symlink > to /proc/self/exe and replaces a library inside the container with a > malicious one. > When the container engine calls execve(), intending to run an executable > inside the container, it instead goes through ptrace_may_access() using > the introspection short-circuit and re-executes its own executable > through the jumped symlink /proc/self/exe (which is normally unreachable > for the container). After the execve(), the process loads an evil > library from inside the container and is under the control of the > container. > Now the container controls a process whose /proc/self/exe is a jumped > symlink to a host executable, and the container can write into it. > > Some container engines are now using an extremely ugly hack to work > around this - whenever they want to enter a container, they copy the > host binary into a new memfd and execute that to avoid exposing the > original host binary to containers: > <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b> > > > In my opinion, the problems here are: > > - Apparently some people run untrusted containers without user > namespaces. It would be really nice if people could not do that. > (Probably the biggest problem here.) I know I sound like a broken record since I've been going on about this forever together with a lot of other people but honestly, the fact that people are running untrusted workloads in privileged containers is the real issue here. Aleksa is a good friend of mine and we have discussed this a lot so I hope he doesn't hate me for saying this again: it is crazy that there are container runtimes out there that promise (or at least do not state the opposite) containers without user namespaces or containers with user namespaces that allow to map the host root id to anything can be safe. They cannot. Even if this /proc/*/exe thing is somehow blocked there are other ways of escaping from a privileged container. We (i.e. LXC) literally do not accept CVEs for privileged containers because we do not consider them safe by design. It seems to me to be heading in the wrong direction to keep up the illusion that with enough effort we can make this all nice and safe. Yes, the userspace memfd hack we came up with is as ugly as a security patch can be but if you make promises you can't keep you better be prepared to pay the price when things start to fall apart. So if this part of the patch is just needed to handle this do we really want to do all that tricky work or is there more to gain from this that makes it worth it. Christian
On 2019-05-11, Christian Brauner <christian@brauner.io> wrote: > > In my opinion, the problems here are: > > > > - Apparently some people run untrusted containers without user > > namespaces. It would be really nice if people could not do that. > > (Probably the biggest problem here.) > > I know I sound like a broken record since I've been going on about this > forever together with a lot of other people but honestly, > the fact that people are running untrusted workloads in privileged containers > is the real issue here. I completely agree. It's a shit-show, and it's caused by bad defaults in Docker and (now) podman. To be fair, they both now support rootless containers but the default is still privileged containers. They do support user namespaces (though it should be noted that LXD's support is much nicer from a security standpoint) but unless it's the default the support is almost pointless. In the case of Docker it can lead to some usability issues when you enable it (which I believe is the main justification for it not being the default). > Aleksa is a good friend of mine and we have discussed this a lot so I hope > he doesn't hate me for saying this again: it is crazy that there are container > runtimes out there that promise (or at least do not state the opposite) > containers without user namespaces or containers with user namespaces > that allow to map the host root id to anything can be safe. They cannot. Yeah, the fact that we (runc) don't scream from the rooftops that this setup is insecure is definitely a problem. I have mentioned this whenever I've had a chance, but the fact that the most popular runtimes (which use runc) don't use user namespaces compounds the issue. I'm willing to bet that >90% of users of runc-based runtimes don't use user namespaces at all, and this is all down to bad defaults. There are also some other misfeatures we have in runc that we're basically forced to support because some users use them, and we can't really break entire projects (even though it's the projects' fault they have an insecure setup). > It seems to me to be heading in the wrong direction to keep up the > illusion that with enough effort we can make this all nice and safe. > Yes, the userspace memfd hack we came up with is as ugly as a security > patch can be but if you make promises you can't keep you better be > prepared to pay the price when things start to fall apart. > So if this part of the patch is just needed to handle this do we really > want to do all that tricky work or is there more to gain from this that > makes it worth it. I dropped this patch in v7, I don't think it's required for the overarching feature. Looking back on it, it doesn't make much sense given the context that privileged containers are unsafe in the first place. I do think that being able to block introspection might be a useful hardening feature though. During attachment it would be nice to be sure that nothing will be able to touch the attaching process's /proc/$pid -- even itself.
> On May 10, 2019, at 3:55 PM, Jann Horn <jannh@google.com> wrote: > >> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote: >>> On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote: >>> >>>> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote: >>>>> On 2019-05-06, Jann Horn <jannh@google.com> wrote: >>>>> In my opinion, CVE-2019-5736 points out two different problems: >>>>> >>>>> The big problem: The __ptrace_may_access() logic has a special-case >>>>> short-circuit for "introspection" that you can't opt out of; this >>>>> makes it possible to open things in procfs that are related to the >>>>> current process even if the credentials of the process wouldn't permit >>>>> accessing another process like it. I think the proper fix to deal with >>>>> this would be to add a prctl() flag for "set whether introspection is >>>>> allowed for this process", and if userspace has manually un-set that >>>>> flag, any introspection special-case logic would be skipped. >>>> >>>> We could do PR_SET_DUMPABLE=3 for this, I guess? >>> >>> Hmm... I'd make it a new prctl() command, since introspection is >>> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I >>> think the introspection flag should be per-thread. >> >> I've lost track of the context here, but it seems to me that >> mitigating attacks involving accidental following of /proc links >> shouldn't depend on dumpability. What's the actual problem this is >> trying to solve again? > > The one actual security problem that I've seen related to this is > CVE-2019-5736. There is a write-up of it at > <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html> > under "Successful approach", but it goes more or less as follows: > > A container is running that doesn't use user namespaces (because for > some reason I don't understand, apparently some people still do that). > An evil process is running inside the container with UID 0 (as in, > GLOBAL_ROOT_UID); so if the evil process inside the container was able > to reach root-owned files on the host filesystem, it could write into > them. > > The container engine wants to spawn a new process inside the container. > It forks off a child that joins the container's namespaces (including > PID and mount namespaces), and then the child calls execve() on some > path in the container. I think that, at this point, the task should be considered owned by the container. Maybe we should have a better API than execve() to execute a program in a safer way, but fiddling with dumpability seems like a band-aid. In fact, the process is arguably pwned even *before* execve. A better “spawn” API should fix this. In the mean time, I think it should be assumed that, if you join a container’s namespaces, you are at its mercy. > The attacker replaces the executable in the container with a symlink > to /proc/self/exe and replaces a library inside the container with a > malicious one. Cute. > When the container engine calls execve(), intending to run an executable > inside the container, it instead goes through ptrace_may_access() using > the introspection short-circuit and re-executes its own executable > through the jumped symlink /proc/self/exe (which is normally unreachable > for the container). After the execve(), the process loads an evil > library from inside the container and is under the control of the > container. > Now the container controls a process whose /proc/self/exe is a jumped > symlink to a host executable, and the container can write into it. > > Some container engines are now using an extremely ugly hack to work > around this - whenever they want to enter a container, they copy the > host binary into a new memfd and execute that to avoid exposing the > original host binary to containers: > <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b> > > > In my opinion, the problems here are: > > - Apparently some people run untrusted containers without user > namespaces. It would be really nice if people could not do that. > (Probably the biggest problem here.) > - ptrace_may_access() has a short-circuit that permits a process to > unintentionally look at itself even if it has dropped privileges - > here, it permits the execve("/proc/self/exe", ...) that would > normally be blocked by the check for CAP_SYS_PTRACE if the process > is nondumpable. I don’t see this as a problem. Dumpable is about protecting a task from others, not about protecting a task against itself. > - You can use /proc/*/exe to get a writable fd. This is IMO the real bug.
On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@amacapital.net> wrote: > > A better “spawn” API should fix this. Andy, stop with the "spawn would be better". Spawn is garbage. It's garbage because it's fundamentally too inflexible, and it's garbage because it is quite complex to try to work around the inflexibility by having those complex "action pointer arrays" to make up for its failings. And spawn() would fundamentally have all the same permission issues that you now point to execve() as having, so it doesn't even *solve* anything. You've said this whole "spawn would fix things" thing before, and it's wrong. Spawn isn't better. Really. If fixes absolutely zero things, and the only reason for spawn existing is because VMS and NT had that broken and inflexible model. There's at least one paper from some MS people about how "spawn()" is wonderful, and maybe you bought into the garbage from that. But that paper is about how they hate fork(), not because of execve(). And if you hate fork, use "vfork()" instead (preferably with an immediate call to a non-returning function in the child to avoid the stack re-use issue that makes it so simple to screw up vfork() in hard to debug ways). execve() is a _fine_ model. That's not the problem in this whole issue at all - never was, and never will be. The problem in this discussion is (a) having privileges you shouldn't have and (b) having other interfaces that make it easyish to change the filesystem layout to confuse those entities with privileges. So the reason the open flags can be problematic is exactly because they effectively change filesystem layout. And no, it's not just AT_THIS_ROOT, although that's the obvious one. Things like "you can't follow symlinks" can also effectively change the layout: imagine if you have a PATH-like lookup model, and you end up having symlinks as part of the standard filesystem layout.. Now a "don't follow symlinks" can turn the *standard* executable into something that isn't found, and then you might end up executing something else instead (think root having '.' as the last entry in path, which some people used to suggest as the fix for the completely bad "first entry" case).. Notice? None of the real problems are about execve or would be solved by any spawn API. You just think that because you've apparently been talking to too many MS people that think fork (and thus indirectly execve()) is bad process management. Linus
On 2019-05-11, Andy Lutomirski <luto@amacapital.net> wrote: > >> I've lost track of the context here, but it seems to me that > >> mitigating attacks involving accidental following of /proc links > >> shouldn't depend on dumpability. What's the actual problem this is > >> trying to solve again? > > > > The one actual security problem that I've seen related to this is > > CVE-2019-5736. There is a write-up of it at > > <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html> > > under "Successful approach", but it goes more or less as follows: > > > > A container is running that doesn't use user namespaces (because for > > some reason I don't understand, apparently some people still do that). > > An evil process is running inside the container with UID 0 (as in, > > GLOBAL_ROOT_UID); so if the evil process inside the container was able > > to reach root-owned files on the host filesystem, it could write into > > them. > > > > The container engine wants to spawn a new process inside the container. > > It forks off a child that joins the container's namespaces (including > > PID and mount namespaces), and then the child calls execve() on some > > path in the container. > > I think that, at this point, the task should be considered owned by > the container. Maybe we should have a better API than execve() to > execute a program in a safer way, but fiddling with dumpability seems > like a band-aid. In fact, the process is arguably pwned even *before* > execve. Yeah, execve is just the vector (though in this case it's done in order to clear mm->dumpable). An earlier CVE (CVE-2016-9962) was very similar but was attacking a dirfd that runc had open into the container (LXC had a very similar bug too) -- setting !mm->dumpable was one of the workarounds we had for this. > A better “spawn” API should fix this. In the mean time, I think it > should be assumed that, if you join a container’s namespaces, you are > at its mercy. This is generally how we treat containers as runtime authors, but it's not a trivial thing to get right. In many cases the kernel APIs are working against you -- Christian and myself have written a fair few patches to fix holes in the kernel APIs so we can avoid these kinds of assumptions. But yes, one of the most risky parts of a container runtime is when you're attaching to a running container because all of the helpful introspection APIs in /proc/ suddenly become a security nightmare. A better "spawn a process in these namespaces" API might help improve the situation (or at least, I hope it would). > > - You can use /proc/*/exe to get a writable fd. > > This is IMO the real bug. I will try to send an RFC of the patchset I have for this next week or so. Funnily enough, currently /proc/*/exe has the write bit set in its "mode" (my series fixes this).
On Sat, May 11, 2019 at 1:21 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Notice? None of the real problems are about execve or would be solved > by any spawn API. You just think that because you've apparently been > talking to too many MS people that think fork (and thus indirectly > execve()) is bad process management. Side note: a good policy has been (and remains) to make suid binaries not be dynamically linked. And in the absence of that, the dynamic linker at least resets the library path when it notices itself being dynamic, and it certainly doesn't inherit any open flags from the non-trusted environment. And by the same logic, a suid interpreter must *definitely* should not inherit any execve() flags from the non-trusted environment. So I think Aleksa's patch to use the passed-in open flags is *exactly* the wrong thing to do for security reasons. It doesn't close holes, it opens them. Linus
On 2019-05-11, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, May 11, 2019 at 1:21 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Notice? None of the real problems are about execve or would be solved > > by any spawn API. You just think that because you've apparently been > > talking to too many MS people that think fork (and thus indirectly > > execve()) is bad process management. > > Side note: a good policy has been (and remains) to make suid binaries > not be dynamically linked. And in the absence of that, the dynamic > linker at least resets the library path when it notices itself being > dynamic, and it certainly doesn't inherit any open flags from the > non-trusted environment. > > And by the same logic, a suid interpreter must *definitely* should not > inherit any execve() flags from the non-trusted environment. So I > think Aleksa's patch to use the passed-in open flags is *exactly* the > wrong thing to do for security reasons. It doesn't close holes, it > opens them. Yup, I've dropped the patch for the next version. (To be honest, I'm not sure why I included any of the other flags -- the only one that would've been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > Yup, I've dropped the patch for the next version. (To be honest, I'm not > sure why I included any of the other flags -- the only one that would've > been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.) I do wonder if we could try to just set AT_NO_MAGICLINKS unconditionally for execve() (and certainly for the suid case). I'd rather try to do these things across the board, than have "suid binaries are treated specially" if at all possible. The main use case for having /proc/<pid>/exe thing is for finding open file descriptors, and for 'ps' kind of use, or to find the startup directory when people don't populate the execve() environment fully (ie "readlink(/proc/self/exe)" is afaik pretty common. Sadly, googling for execve /proc/self/exe does actually find hits, including one that implies that chrome does exactly that. So it might not be possible. Somewhat odd, but it does just confirm the whole "users will at some point do everything in their power to use every odd special case, intended or not". Linus
On May 11, 2019 7:43:44 PM GMT+02:00, Linus Torvalds <torvalds@linux-foundation.org> wrote: >On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote: >> >> Yup, I've dropped the patch for the next version. (To be honest, I'm >not >> sure why I included any of the other flags -- the only one that >would've >> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.) > >I do wonder if we could try to just set AT_NO_MAGICLINKS >unconditionally for execve() (and certainly for the suid case). > >I'd rather try to do these things across the board, than have "suid >binaries are treated specially" if at all possible. > >The main use case for having /proc/<pid>/exe thing is for finding open >file descriptors, and for 'ps' kind of use, or to find the startup >directory when people don't populate the execve() environment fully >(ie "readlink(/proc/self/exe)" is afaik pretty common. > >Sadly, googling for > > execve /proc/self/exe > >does actually find hits, including one that implies that chrome does >exactly that. So it might not be possible. > >Somewhat odd, but it does just confirm the whole "users will at some >point do everything in their power to use every odd special case, >intended or not". > > Linus Sadly I have to admit that we are using this. Also, execveat on glibc is implemented via /proc/self/fd/<nr> on kernels that do not have a proper execveat. See fexecve... Christian
On 2019-05-11, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > Yup, I've dropped the patch for the next version. (To be honest, I'm not > > sure why I included any of the other flags -- the only one that would've > > been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.) > > I do wonder if we could try to just set AT_NO_MAGICLINKS > unconditionally for execve() (and certainly for the suid case). > > I'd rather try to do these things across the board, than have "suid > binaries are treated specially" if at all possible. > > The main use case for having /proc/<pid>/exe thing is for finding open > file descriptors, and for 'ps' kind of use, or to find the startup > directory when people don't populate the execve() environment fully > (ie "readlink(/proc/self/exe)" is afaik pretty common. > > Sadly, googling for > > execve /proc/self/exe > > does actually find hits, including one that implies that chrome does > exactly that. So it might not be possible. > > Somewhat odd, but it does just confirm the whole "users will at some > point do everything in their power to use every odd special case, > intended or not". *sheepishly* Actually we use this in runc very liberally. It's done because we need to run namespace-related code but runc is written in Go so (long story short) we re-exec ourselves in order to run some __attribute__((constructor)) code which sets up the namespaces and then lets the Go runtime boot. I suspect just writing everything in C would've been orders of magnitude simpler, but I wasn't around when that decision was made. :P Also as Christian mentioned, fexecve(3) in glibc is implemented using /proc/self/fd on old kernels (then again, if we change the behaviour on new kernels it won't matter because glibc uses execveat(AT_EMPTY_PATH) if it's available).
> On May 11, 2019, at 10:21 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@amacapital.net> wrote: >> >> A better “spawn” API should fix this. > > Andy, stop with the "spawn would be better". It doesn’t have to be spawn per se. But the current situation sucks. > > Notice? None of the real problems are about execve or would be solved > by any spawn API. You just think that because you've apparently been > talking to too many MS people that think fork (and thus indirectly > execve()) is bad process management. > > I’ve literally never spoken to an MS person about it. What container managers and init systems *want* is a way to drop privileges, change namespaces, etc and then run something in a controlled way so that the intermediate states aren’t dangerous. An API for this could be spawn-like or exec-like — that particular distinction is beside the point. Having personally written code that mucks with namepsaces, I've wanted two particular abilities that are both quite awkward: a) Change all my UIDs and GIDs to match a container, enter that container's namespaces, and run some binary in the container's filesystem, all atomically enough that I don't need to worry about accidentally leaking privileges into the container. A super-duper-non-dumpable mode would kind of allow this, but I'd worry that there's some other hole besides ptrace() and /proc/self. b) Change all my UIDs and GIDs to match a container, enter that container's namespaces, and run some binary that is *not* in the container's filesystem. This happens, for example, if the container's mount namespace has no exec mounts at all. We don't have a fantastic way to do this at all right now due to /proc/self/exe. Regardless, the actual CVE at hand would have been nicely avoided if writing to /proc/self/exe didn’t work, and I see no reason we can’t make that happen. I suppose we could also consider a change to disable /proc/self/exe if it's not reachable from /proc/self/root. By "disable", I mean that readlink() should maybe still work, but actually trying to open it could probably fail safely.
On Sat, May 11, 2019 at 07:08:49PM -0400, Linus Torvalds wrote: > [ on mobile again, power is off and the wifi doesn't work, so I'm reading > email on my phone and apologizing once more for horrible html email.. ] > > On Sat, May 11, 2019, 18:40 Andy Lutomirski <luto@kernel.org> wrote: > > > > > a) Change all my UIDs and GIDs to match a container, enter that > > container's namespaces, and run some binary in the container's For the namespace part, an idea I had and presented at LPC a while ago was to make setns() interpret the nstype argument as a flag argument and to introduce an fd that can refer to multiple namespaces at the same time. This way you could do: setns(namespaces_fd, CLONE_NEWNS | CLONE_NEWUSER | CLONE_NEWPID) that could still be done. But I since have come to think that there's a better way now that we have CLONE_PIDFD. We should just make setns() take a pidfd as argument and then be able to call: setns(pidfd, 0); which would cause the calling thread to join all of the namespaces of the process referred to by pidfd at the same time. That really shouldn't be hard to do. I could probably get this going rather quickly and it would really help out container managers a lot. > > filesystem, all atomically enough that I don't need to worry about > > accidentally leaking privileges into the container. A > > super-duper-non-dumpable mode would kind of allow this, but I'd worry > > that there's some other hole besides ptrace() and /proc/self. > > > > So I would expect that you'd want to do this even *without* doing an execve > at all, which is why I still don't think this is actually about > spawn/execve at all. > > But I agree that what you that what you want sounds reasonable. But I think I have pitched an api like that a while ago (see [1]) - when I first started this fd for processes thing - that would allow you to do things like that. The gist was: 1. int pidfd_create aka CLONE_PIDFD now will return an fd that creates a process context. The fd returned by does not refer to any existing process and has not actually been started yet. So non-configuration operations on it or trying to interact with it would fail with e.g. ESRCH/EINVAL. We essentially have this now with CLONE_PIDFD. The bit that is missing is an "unscheduled" process. 2. int pidfd_config takes a CLONE_PIDFD and can be used to configure a process context before it is alive. Some things that I would like to be able to do with this syscall are: - configure signals - set clone flags - write idmappings if the process runs in a new user namespace - configure what happens when all procfds referring to the process are gone - ... 3. int pidfd_info 4. int pidfd_manage Parts of that are captured in pidfd_send_signal(). Just to get a very rough feel for this without detailing parameters right now: /* process should have own mountns */ pidfd_config(fd, PROC_SET_NAMESPACE, CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWUSR, <potentially additional arguments>) /* process should get SIGKILL when all procfds are closed */ pidfd_config(fd, PROC_SET_CLOSE, SIGKILL, <potentially additional arguments>) After the caller is done configuring the process there would be a final step: pidfd_config(fd, PROC_CREATE, 0, <potentially additional arguments>) which would create the process and (either as return value or through a parameter) return the pid of the newly created process. I had more thoughts on this and had started prototyping some of it but then there wasn't much interest it seemed. Maybe because it's crazy. [1]: https://lore.kernel.org/lkml/20181118174148.nvkc4ox2uorfatbm@brauner.io/ > the "dumpable" flag has always been a complete hack, and very unreliable > with random meanings (and random ways to then get around it). > > We have actually had lots of people wanting to run "lists of system calls" > kinds of things. Sometimes for performance reasons, sometimes for other > random reasons Maybe that "atomicity" angle would be another one, although > we would have to make the setuid etc things be something that happens at > the end. > > So rather than "spawn", is much rather see people think about ways to just > batch operations in general, rather than think it is about batching things > just before a process change. > > b) Change all my UIDs and GIDs to match a container, enter that > > container's namespaces, and run some binary that is *not* in the > > container's filesystem. > > > > Hey, you could probably do something very close to that by opening the > executable you want to run first, making it O_CLOEXEC, then doing all the > other things (which now makes the executable inaccessible), and finally > doing execveat() on the file descriptor.. I think that's somewhat similar to what I've suggested above. > > I say "something very close" because even though it's O_CLOEXEC, only the > file would be closed, and /proc/self/exe would still exist. > > But we really could make that execveat of a O_CLOEXEC file thing also > disable access to /proc/*/exe, and it sounds like a sane and simple > extension in general to do.. > > Linus > > >
On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote: > > I bet this will break something that already exists. An execveat() flag to turn off /proc/self/exe would do the trick, though. Thinking more about it, I suspect it is (once again) wrong to let the thing that does the execve() control that bit. Generally, the less we allow people to affect the lifetime and environment of a suid executable, the better off we are. But maybe we could limit /proc/*/exe to at least not honor suid'ness of the target? Or does chrome/runc depend on that too? Linus
On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote: > > I bet this will break something that already exists. An execveat() > > flag to turn off /proc/self/exe would do the trick, though. > > Thinking more about it, I suspect it is (once again) wrong to let the > thing that does the execve() control that bit. > > Generally, the less we allow people to affect the lifetime and > environment of a suid executable, the better off we are. > > But maybe we could limit /proc/*/exe to at least not honor suid'ness > of the target? Or does chrome/runc depend on that too? Speaking on the runc side, we don't depend on this. It's possible someone depends on this for fexecve(3) -- but as mentioned before in newer kernels glibc uses execve(AT_EMPTY_PATH). I would like to point out though that I'm a little bit cautious about /proc/self/exe-specific restrictions -- because a trivial way to get around them would be to just open it with O_PATH (and you end up with a /proc/self/fd/ which is equivalent). Unfortunately blocking setuid exec on all O_PATH descriptors would break even execve(AT_EMPTY_PATH) of setuid descriptors. The patches I mentioned (which Andy and I discussed off-list) would effectively make the magiclink modes in /proc/ affect how you can operate on the path (no write bit in the mode, cannot re-open it write). One aspect of this is how to handle O_PATH and in particular how do we handle an O_PATH re-open of an already-restricted magiclink. Maybe we could make it so that setuid is disallowed if you are dealing with an O_PATH fd which was a magiclink. Effectively, on O_PATH open you get an fmode_t saying FMODE_SETUID_EXEC_ALLOWED *but* if the path is a magiclink this fmode gets dropped and when the fd is given to execveat(AT_EMPTY_PATH) the fmode is checked and setuid-exec is not allowed. [I assume in this discussion "setuid" means "setuid + setcap", right?]
On 2019-05-12, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > I bet this will break something that already exists. An execveat() > > > flag to turn off /proc/self/exe would do the trick, though. > > > > Thinking more about it, I suspect it is (once again) wrong to let the > > thing that does the execve() control that bit. > > > > Generally, the less we allow people to affect the lifetime and > > environment of a suid executable, the better off we are. > > > > But maybe we could limit /proc/*/exe to at least not honor suid'ness > > of the target? Or does chrome/runc depend on that too? > > Speaking on the runc side, we don't depend on this. It's possible > someone depends on this for fexecve(3) -- but as mentioned before in > newer kernels glibc uses execve(AT_EMPTY_PATH). > > I would like to point out though that I'm a little bit cautious about > /proc/self/exe-specific restrictions -- because a trivial way to get > around them would be to just open it with O_PATH (and you end up with a > /proc/self/fd/ which is equivalent). Unfortunately blocking setuid exec > on all O_PATH descriptors would break even execve(AT_EMPTY_PATH) of > setuid descriptors. > > The patches I mentioned (which Andy and I discussed off-list) would > effectively make the magiclink modes in /proc/ affect how you can > operate on the path (no write bit in the mode, cannot re-open it write). > One aspect of this is how to handle O_PATH and in particular how do we > handle an O_PATH re-open of an already-restricted magiclink. > > Maybe we could make it so that setuid is disallowed if you are dealing > with an O_PATH fd which was a magiclink. Effectively, on O_PATH open you > get an fmode_t saying FMODE_SETUID_EXEC_ALLOWED *but* if the path is a > magiclink this fmode gets dropped and when the fd is given to > execveat(AT_EMPTY_PATH) the fmode is checked and setuid-exec is not > allowed. ... and obviously /proc/self/exe would have an fmode ~FMODE_SETUID_EXEC_ALLOWED from the outset. The reason for this slightly odd semantic would be to continue to allow O_PATH setuid-exec as long as the O_PATH was opened from an actual path rather than a magiclink.
> On May 12, 2019, at 6:35 AM, Aleksa Sarai <cyphar@cyphar.com> wrote: > >> On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote: >>> I bet this will break something that already exists. An execveat() >>> flag to turn off /proc/self/exe would do the trick, though. >> >> Thinking more about it, I suspect it is (once again) wrong to let the >> thing that does the execve() control that bit. >> >> Generally, the less we allow people to affect the lifetime and >> environment of a suid executable, the better off we are. >> >> But maybe we could limit /proc/*/exe to at least not honor suid'ness >> of the target? Or does chrome/runc depend on that too? > > Speaking on the runc side, we don't depend on this. It's possible > someone depends on this for fexecve(3) -- but as mentioned before in > newer kernels glibc uses execve(AT_EMPTY_PATH). Why are we concerned about suid? Don’t we already block suid if the path being execed doesn’t come from the current mountns? That should mostly cover the things we care about, no? I suppose we could also block suid for deleted files, so that deleting a known-buggy suid binary becomes more reliable. But every sensible package tool should already be chmoding the suid away before unlinking.
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 7d09d125f148..473420eed477 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -772,7 +772,7 @@ static int load_elf_binary(struct linux_binprm *bprm) if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0') goto out_free_interp; - interpreter = open_exec(elf_interpreter); + interpreter = openat_exec(bprm->dfd, elf_interpreter, bprm->flags); retval = PTR_ERR(interpreter); if (IS_ERR(interpreter)) goto out_free_interp; diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c index b53bb3729ac1..c463c6428f77 100644 --- a/fs/binfmt_elf_fdpic.c +++ b/fs/binfmt_elf_fdpic.c @@ -263,7 +263,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm) kdebug("Using ELF interpreter %s", interpreter_name); /* replace the program with the interpreter */ - interpreter = open_exec(interpreter_name); + interpreter = openat_exec(bprm->dfd, interpreter_name, bprm->flags); retval = PTR_ERR(interpreter); if (IS_ERR(interpreter)) { interpreter = NULL; diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c index dd2d3f0cd55d..3ee46b0dc0d4 100644 --- a/fs/binfmt_em86.c +++ b/fs/binfmt_em86.c @@ -81,10 +81,10 @@ static int load_em86(struct linux_binprm *bprm) /* * OK, now restart the process with the interpreter's inode. - * Note that we use open_exec() as the name is now in kernel + * Note that we use openat_exec() as the name is now in kernel * space, and we don't need to copy it. */ - file = open_exec(interp); + file = openat_exec(binprm->dfd, interp, binprm->flags); if (IS_ERR(file)) return PTR_ERR(file); diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c index aa4a7a23ff99..573ef06ff5a1 100644 --- a/fs/binfmt_misc.c +++ b/fs/binfmt_misc.c @@ -209,7 +209,7 @@ static int load_misc_binary(struct linux_binprm *bprm) if (!IS_ERR(interp_file)) deny_write_access(interp_file); } else { - interp_file = open_exec(fmt->interpreter); + interp_file = openat_exec(bprm->dfd, fmt->interpreter, bprm->flags); } retval = PTR_ERR(interp_file); if (IS_ERR(interp_file)) diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c index e996174cbfc0..aaff12d12bfd 100644 --- a/fs/binfmt_script.c +++ b/fs/binfmt_script.c @@ -137,7 +137,7 @@ static int load_script(struct linux_binprm *bprm) /* * OK, now restart the process with the interpreter's dentry. */ - file = open_exec(i_name); + file = openat_exec(bprm->dfd, i_name, bprm->flags); if (IS_ERR(file)) return PTR_ERR(file); diff --git a/fs/exec.c b/fs/exec.c index 2e0033348d8e..d0f244d9a541 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -846,12 +846,24 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) .lookup_flags = LOOKUP_FOLLOW, }; - if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0) + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH | AT_BENEATH | + AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | + AT_THIS_ROOT)) != 0) return ERR_PTR(-EINVAL); if (flags & AT_SYMLINK_NOFOLLOW) open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW; if (flags & AT_EMPTY_PATH) open_exec_flags.lookup_flags |= LOOKUP_EMPTY; + if (flags & AT_BENEATH) + open_exec_flags.lookup_flags |= LOOKUP_BENEATH; + if (flags & AT_XDEV) + open_exec_flags.lookup_flags |= LOOKUP_XDEV; + if (flags & AT_NO_MAGICLINKS) + open_exec_flags.lookup_flags |= LOOKUP_NO_MAGICLINKS; + if (flags & AT_NO_SYMLINKS) + open_exec_flags.lookup_flags |= LOOKUP_NO_SYMLINKS; + if (flags & AT_THIS_ROOT) + open_exec_flags.lookup_flags |= LOOKUP_IN_ROOT; file = do_filp_open(fd, name, &open_exec_flags); if (IS_ERR(file)) @@ -879,18 +891,18 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) return ERR_PTR(err); } -struct file *open_exec(const char *name) +struct file *openat_exec(int dfd, const char *name, int flags) { struct filename *filename = getname_kernel(name); struct file *f = ERR_CAST(filename); if (!IS_ERR(filename)) { - f = do_open_execat(AT_FDCWD, filename, 0); + f = do_open_execat(dfd, filename, flags); putname(filename); } return f; } -EXPORT_SYMBOL(open_exec); +EXPORT_SYMBOL(openat_exec); int kernel_read_file(struct file *file, void **buf, loff_t *size, loff_t max_size, enum kernel_read_file_id id) @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, sched_exec(); + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | + AT_THIS_ROOT); + bprm->dfd = AT_FDCWD; + if (bprm->flags) + bprm->dfd = fd; + bprm->file = file; if (!filename) { bprm->filename = "none"; diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 688ab0de7810..e4da2d36e97f 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -50,6 +50,7 @@ struct linux_binprm { unsigned int taso:1; #endif unsigned int recursion_depth; /* only for search_binary_handler() */ + int dfd, flags; /* passed down to execat_open() */ struct file * file; struct cred *cred; /* new credentials */ int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */ diff --git a/include/linux/fs.h b/include/linux/fs.h index fe94c48481a6..cdcd2e021101 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2945,8 +2945,13 @@ extern int kernel_read_file_from_fd(int, void **, loff_t *, loff_t, extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *); extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *); extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *); -extern struct file * open_exec(const char *); - + +extern struct file *openat_exec(int, const char *, int); +static inline struct file *open_exec(const char *name) +{ + return openat_exec(AT_FDCWD, name, 0); +} + /* fs/dcache.c -- generic fs support functions */ extern bool is_subdir(struct dentry *, struct dentry *); extern bool path_is_under(const struct path *, const struct path *); diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 1d338357df8a..bfca7f87cd2a 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -93,5 +93,11 @@ #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ +#define AT_RESOLUTION_TYPE 0x1F0000 /* Type of path-resolution scoping we are applying. */ +#define AT_BENEATH 0x010000 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */ +#define AT_XDEV 0x020000 /* - Block mount-point crossings (includes bind-mounts). */ +#define AT_NO_MAGICLINKS 0x040000 /* - Block procfs-style "magic" symlinks. */ +#define AT_NO_SYMLINKS 0x080000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */ +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ #endif /* _UAPI_LINUX_FCNTL_H */
The need to be able to scope path resolution of interpreters became clear with one of the possible vectors used in CVE-2019-5736 (which most major container runtimes were vulnerable to). Naively, it might seem that openat(2) -- which supports path scoping -- can be combined with execveat(AT_EMPTY_PATH) to trivially scope the binary being executed. Unfortunately, a "bad binary" (usually a symlink) could be written as a #!-style script with the symlink target as the interpreter -- which would be completely missed by just scoping the openat(2). An example of this being exploitable is CVE-2019-5736. In order to get around this, we need to pass down to each binfmt_* implementation the scoping flags requested in execveat(2). In order to maintain backwards-compatibility we only pass the scoping AT_* flags. To avoid breaking userspace (in the exceptionally rare cases where you have #!-scripts with a relative path being execveat(2)-ed with dfd != AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are set in execveat(2). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/binfmt_elf.c | 2 +- fs/binfmt_elf_fdpic.c | 2 +- fs/binfmt_em86.c | 4 ++-- fs/binfmt_misc.c | 2 +- fs/binfmt_script.c | 2 +- fs/exec.c | 26 ++++++++++++++++++++++---- include/linux/binfmts.h | 1 + include/linux/fs.h | 9 +++++++-- include/uapi/linux/fcntl.h | 6 ++++++ 9 files changed, 42 insertions(+), 12 deletions(-)