Message ID | 20200904203147.2908430-7-krisman@collabora.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Syscall User Dispatch | expand |
On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote: > Syscall User Dispatch (SUD) must take precedence over seccomp, since the > use case is emulation (it can be invoked with a different ABI) such that > seccomp filtering by syscall number doesn't make sense in the first > place. In addition, either the syscall is dispatched back to userspace, > in which case there is no resource for seccomp to protect, or the Tbh, I'm torn here. I'm not a super clever attacker but it feels to me that this is still at least a clever way to circumvent a seccomp sandbox. If I'd be confined by a seccomp profile that would cause me to be SIGKILLed when I try do open() I could prctl() myself to do user dispatch to prevent that from happening, no? > syscall will be executed, and seccomp will execute next. > > Regarding ptrace, I experimented with before and after, and while the > same ABI argument applies, I felt it was easier to debug if I let ptrace > happen for syscalls that are dispatched back to userspace. In addition, > doing it after ptrace makes the code in syscall_exit_work slightly > simpler, since it doesn't require special handling for this feature. > > Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> > --- > kernel/entry/common.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/kernel/entry/common.c b/kernel/entry/common.c > index 44fd089d59da..fdb0c543539d 100644 > --- a/kernel/entry/common.c > +++ b/kernel/entry/common.c > @@ -6,6 +6,8 @@ > #include <linux/audit.h> > #include <linux/syscall_intercept.h> > > +#include "common.h" > + > #define CREATE_TRACE_POINTS > #include <trace/events/syscalls.h> > > @@ -47,6 +49,12 @@ static inline long do_syscall_intercept(struct pt_regs *regs) > int sysint_work = READ_ONCE(current->syscall_intercept); > int ret; > > + if (sysint_work & SYSINT_USER_DISPATCH) { > + ret = do_syscall_user_dispatch(regs); > + if (ret == -1L) > + return ret; > + } > + > if (sysint_work & SYSINT_SECCOMP) { > ret = __secure_computing(NULL); > if (ret == -1L) > -- > 2.28.0
> On Sep 7, 2020, at 3:15 AM, Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote: >> Syscall User Dispatch (SUD) must take precedence over seccomp, since the >> use case is emulation (it can be invoked with a different ABI) such that >> seccomp filtering by syscall number doesn't make sense in the first >> place. In addition, either the syscall is dispatched back to userspace, >> in which case there is no resource for seccomp to protect, or the > > Tbh, I'm torn here. I'm not a super clever attacker but it feels to me > that this is still at least a clever way to circumvent a seccomp > sandbox. > If I'd be confined by a seccomp profile that would cause me to be > SIGKILLed when I try do open() I could prctl() myself to do user > dispatch to prevent that from happening, no? > Not really, I think. The idea is that you didn’t actually do open(). You did a SYSCALL instruction which meant something else, and the syscall dispatch correctly prevented the kernel from misinterpreting it as open().
On Mon, Sep 07, 2020 at 07:15:52AM -0700, Andy Lutomirski wrote: > > > > On Sep 7, 2020, at 3:15 AM, Christian Brauner <christian.brauner@ubuntu.com> wrote: > > > > On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote: > >> Syscall User Dispatch (SUD) must take precedence over seccomp, since the > >> use case is emulation (it can be invoked with a different ABI) such that > >> seccomp filtering by syscall number doesn't make sense in the first > >> place. In addition, either the syscall is dispatched back to userspace, > >> in which case there is no resource for seccomp to protect, or the > > > > Tbh, I'm torn here. I'm not a super clever attacker but it feels to me > > that this is still at least a clever way to circumvent a seccomp > > sandbox. > > If I'd be confined by a seccomp profile that would cause me to be > > SIGKILLed when I try do open() I could prctl() myself to do user > > dispatch to prevent that from happening, no? > > > > Not really, I think. The idea is that you didn’t actually do open(). > You did a SYSCALL instruction which meant something else, and the > syscall dispatch correctly prevented the kernel from misinterpreting > it as open(). Right, for the case where you're e.g. emulating windows syscalls that's true. I was thinking when you're running natively on Linux: couldn't I first load a seccomp profile "kill me if someone does an open()", then I exec() the target binary and that binary is setup to do prctl(USER_DISPATCH) first thing. I guess, it's ok because as far as I had time to read it this is a nothing or all mechanism, i.e. _all_ system calls are re-routed in contrast to e.g. seccomp where I could do this per-syscall. So for user-dispatch it wouldn't make sense to use it on Linux per se. Still makes me a little uneasy. :) Christian
On Mon, Sep 7, 2020 at 7:25 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Mon, Sep 07, 2020 at 07:15:52AM -0700, Andy Lutomirski wrote: > > > > > > > On Sep 7, 2020, at 3:15 AM, Christian Brauner <christian.brauner@ubuntu.com> wrote: > > > > > > On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote: > > >> Syscall User Dispatch (SUD) must take precedence over seccomp, since the > > >> use case is emulation (it can be invoked with a different ABI) such that > > >> seccomp filtering by syscall number doesn't make sense in the first > > >> place. In addition, either the syscall is dispatched back to userspace, > > >> in which case there is no resource for seccomp to protect, or the > > > > > > Tbh, I'm torn here. I'm not a super clever attacker but it feels to me > > > that this is still at least a clever way to circumvent a seccomp > > > sandbox. > > > If I'd be confined by a seccomp profile that would cause me to be > > > SIGKILLed when I try do open() I could prctl() myself to do user > > > dispatch to prevent that from happening, no? > > > > > > > Not really, I think. The idea is that you didn’t actually do open(). > > You did a SYSCALL instruction which meant something else, and the > > syscall dispatch correctly prevented the kernel from misinterpreting > > it as open(). > > Right, for the case where you're e.g. emulating windows syscalls that's > true. I was thinking when you're running natively on Linux: couldn't I > first load a seccomp profile "kill me if someone does an open()", then > I exec() the target binary and that binary is setup to do > prctl(USER_DISPATCH) first thing. I guess, it's ok because as far as I > had time to read it this is a nothing or all mechanism, i.e. _all_ > system calls are re-routed in contrast to e.g. seccomp where I could do > this per-syscall. So for user-dispatch it wouldn't make sense to use it > on Linux per se. Still makes me a little uneasy. :) There's an escape hatch, so processes using this can still make syscalls. Maybe think about it another way: a process using user dispatch should definitely *not* trigger seccomp user notifiers, errno returns, or ptrace events, since they'll all do the wrong thing. IMO RET_KILL is the same. Barring some very severe defect, there's no way a program can use user dispatch to escape seccomp -- a program could use user dispatch to allow them to do: mov $__NR_open, %rax syscall without dying despite the presence of a filter that would kill the process if it tried to do open(), but this doesn't bypass the filter at all. The process could just as easily have done: mov $__NR_open jmp magic_stub(%rip) without tripping the filter, since no system call actually happens here. --Andy
On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote: > Syscall User Dispatch (SUD) must take precedence over seccomp, since the > use case is emulation (it can be invoked with a different ABI) such that > seccomp filtering by syscall number doesn't make sense in the first > place. In addition, either the syscall is dispatched back to userspace, > in which case there is no resource for seccomp to protect, or the > syscall will be executed, and seccomp will execute next. > > Regarding ptrace, I experimented with before and after, and while the > same ABI argument applies, I felt it was easier to debug if I let ptrace > happen for syscalls that are dispatched back to userspace. In addition, > doing it after ptrace makes the code in syscall_exit_work slightly > simpler, since it doesn't require special handling for this feature. I think I'm with Andy that this should be before ptrace(). ptrace() users will attempt to interpret things like they're regular syscalls, and that's definitely not the case.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 44fd089d59da..fdb0c543539d 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -6,6 +6,8 @@ #include <linux/audit.h> #include <linux/syscall_intercept.h> +#include "common.h" + #define CREATE_TRACE_POINTS #include <trace/events/syscalls.h> @@ -47,6 +49,12 @@ static inline long do_syscall_intercept(struct pt_regs *regs) int sysint_work = READ_ONCE(current->syscall_intercept); int ret; + if (sysint_work & SYSINT_USER_DISPATCH) { + ret = do_syscall_user_dispatch(regs); + if (ret == -1L) + return ret; + } + if (sysint_work & SYSINT_SECCOMP) { ret = __secure_computing(NULL); if (ret == -1L)
Syscall User Dispatch (SUD) must take precedence over seccomp, since the use case is emulation (it can be invoked with a different ABI) such that seccomp filtering by syscall number doesn't make sense in the first place. In addition, either the syscall is dispatched back to userspace, in which case there is no resource for seccomp to protect, or the syscall will be executed, and seccomp will execute next. Regarding ptrace, I experimented with before and after, and while the same ABI argument applies, I felt it was easier to debug if I let ptrace happen for syscalls that are dispatched back to userspace. In addition, doing it after ptrace makes the code in syscall_exit_work slightly simpler, since it doesn't require special handling for this feature. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> --- kernel/entry/common.c | 8 ++++++++ 1 file changed, 8 insertions(+)