diff mbox series

[v6,6/9] kernel: entry: Support Syscall User Dispatch for common syscall entry

Message ID 20200904203147.2908430-7-krisman@collabora.com (mailing list archive)
State New
Headers show
Series Syscall User Dispatch | expand

Commit Message

Gabriel Krisman Bertazi Sept. 4, 2020, 8:31 p.m. UTC
Syscall User Dispatch (SUD) must take precedence over seccomp, since the
use case is emulation (it can be invoked with a different ABI) such that
seccomp filtering by syscall number doesn't make sense in the first
place.  In addition, either the syscall is dispatched back to userspace,
in which case there is no resource for seccomp to protect, or the
syscall will be executed, and seccomp will execute next.

Regarding ptrace, I experimented with before and after, and while the
same ABI argument applies, I felt it was easier to debug if I let ptrace
happen for syscalls that are dispatched back to userspace.  In addition,
doing it after ptrace makes the code in syscall_exit_work slightly
simpler, since it doesn't require special handling for this feature.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 kernel/entry/common.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Christian Brauner Sept. 7, 2020, 10:15 a.m. UTC | #1
On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote:
> Syscall User Dispatch (SUD) must take precedence over seccomp, since the
> use case is emulation (it can be invoked with a different ABI) such that
> seccomp filtering by syscall number doesn't make sense in the first
> place.  In addition, either the syscall is dispatched back to userspace,
> in which case there is no resource for seccomp to protect, or the

Tbh, I'm torn here. I'm not a super clever attacker but it feels to me
that this is still at least a clever way to circumvent a seccomp
sandbox.
If I'd be confined by a seccomp profile that would cause me to be
SIGKILLed when I try do open() I could prctl() myself to do user
dispatch to prevent that from happening, no?

> syscall will be executed, and seccomp will execute next.
> 
> Regarding ptrace, I experimented with before and after, and while the
> same ABI argument applies, I felt it was easier to debug if I let ptrace
> happen for syscalls that are dispatched back to userspace.  In addition,
> doing it after ptrace makes the code in syscall_exit_work slightly
> simpler, since it doesn't require special handling for this feature.
> 
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> ---
>  kernel/entry/common.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 44fd089d59da..fdb0c543539d 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,8 @@
>  #include <linux/audit.h>
>  #include <linux/syscall_intercept.h>
>  
> +#include "common.h"
> +
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/syscalls.h>
>  
> @@ -47,6 +49,12 @@ static inline long do_syscall_intercept(struct pt_regs *regs)
>  	int sysint_work = READ_ONCE(current->syscall_intercept);
>  	int ret;
>  
> +	if (sysint_work & SYSINT_USER_DISPATCH) {
> +		ret = do_syscall_user_dispatch(regs);
> +		if (ret == -1L)
> +			return ret;
> +	}
> +
>  	if (sysint_work & SYSINT_SECCOMP) {
>  		ret = __secure_computing(NULL);
>  		if (ret == -1L)
> -- 
> 2.28.0
Andy Lutomirski Sept. 7, 2020, 2:15 p.m. UTC | #2
> On Sep 7, 2020, at 3:15 AM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> 
> On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote:
>> Syscall User Dispatch (SUD) must take precedence over seccomp, since the
>> use case is emulation (it can be invoked with a different ABI) such that
>> seccomp filtering by syscall number doesn't make sense in the first
>> place.  In addition, either the syscall is dispatched back to userspace,
>> in which case there is no resource for seccomp to protect, or the
> 
> Tbh, I'm torn here. I'm not a super clever attacker but it feels to me
> that this is still at least a clever way to circumvent a seccomp
> sandbox.
> If I'd be confined by a seccomp profile that would cause me to be
> SIGKILLed when I try do open() I could prctl() myself to do user
> dispatch to prevent that from happening, no?
> 

Not really, I think. The idea is that you didn’t actually do open(). You did a SYSCALL instruction which meant something else, and the syscall dispatch correctly prevented the kernel from misinterpreting it as open().
Christian Brauner Sept. 7, 2020, 2:25 p.m. UTC | #3
On Mon, Sep 07, 2020 at 07:15:52AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Sep 7, 2020, at 3:15 AM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > 
> > On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote:
> >> Syscall User Dispatch (SUD) must take precedence over seccomp, since the
> >> use case is emulation (it can be invoked with a different ABI) such that
> >> seccomp filtering by syscall number doesn't make sense in the first
> >> place.  In addition, either the syscall is dispatched back to userspace,
> >> in which case there is no resource for seccomp to protect, or the
> > 
> > Tbh, I'm torn here. I'm not a super clever attacker but it feels to me
> > that this is still at least a clever way to circumvent a seccomp
> > sandbox.
> > If I'd be confined by a seccomp profile that would cause me to be
> > SIGKILLed when I try do open() I could prctl() myself to do user
> > dispatch to prevent that from happening, no?
> > 
> 
> Not really, I think. The idea is that you didn’t actually do open().
> You did a SYSCALL instruction which meant something else, and the
> syscall dispatch correctly prevented the kernel from misinterpreting
> it as open().

Right, for the case where you're e.g. emulating windows syscalls that's
true. I was thinking when you're running natively on Linux: couldn't I
first load a seccomp profile "kill me if someone does an open()", then
I exec() the target binary and that binary is setup to do
prctl(USER_DISPATCH) first thing. I guess, it's ok because as far as I
had time to read it this is a nothing or all mechanism, i.e. _all_
system calls are re-routed in contrast to e.g. seccomp where I could do
this per-syscall. So for user-dispatch it wouldn't make sense to use it
on Linux per se. Still makes me a little uneasy. :)

Christian
Andy Lutomirski Sept. 7, 2020, 8:20 p.m. UTC | #4
On Mon, Sep 7, 2020 at 7:25 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Mon, Sep 07, 2020 at 07:15:52AM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Sep 7, 2020, at 3:15 AM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote:
> > >> Syscall User Dispatch (SUD) must take precedence over seccomp, since the
> > >> use case is emulation (it can be invoked with a different ABI) such that
> > >> seccomp filtering by syscall number doesn't make sense in the first
> > >> place.  In addition, either the syscall is dispatched back to userspace,
> > >> in which case there is no resource for seccomp to protect, or the
> > >
> > > Tbh, I'm torn here. I'm not a super clever attacker but it feels to me
> > > that this is still at least a clever way to circumvent a seccomp
> > > sandbox.
> > > If I'd be confined by a seccomp profile that would cause me to be
> > > SIGKILLed when I try do open() I could prctl() myself to do user
> > > dispatch to prevent that from happening, no?
> > >
> >
> > Not really, I think. The idea is that you didn’t actually do open().
> > You did a SYSCALL instruction which meant something else, and the
> > syscall dispatch correctly prevented the kernel from misinterpreting
> > it as open().
>
> Right, for the case where you're e.g. emulating windows syscalls that's
> true. I was thinking when you're running natively on Linux: couldn't I
> first load a seccomp profile "kill me if someone does an open()", then
> I exec() the target binary and that binary is setup to do
> prctl(USER_DISPATCH) first thing. I guess, it's ok because as far as I
> had time to read it this is a nothing or all mechanism, i.e. _all_
> system calls are re-routed in contrast to e.g. seccomp where I could do
> this per-syscall. So for user-dispatch it wouldn't make sense to use it
> on Linux per se. Still makes me a little uneasy. :)

There's an escape hatch, so processes using this can still make syscalls.

Maybe think about it another way: a process using user dispatch should
definitely *not* trigger seccomp user notifiers, errno returns, or
ptrace events, since they'll all do the wrong thing.  IMO RET_KILL is
the same.

Barring some very severe defect, there's no way a program can use user
dispatch to escape seccomp -- a program could use user dispatch to
allow them to do:

mov $__NR_open, %rax
syscall

without dying despite the presence of a filter that would kill the
process if it tried to do open(), but this doesn't bypass the filter
at all.  The process could just as easily have done:

mov $__NR_open
jmp magic_stub(%rip)

without tripping the filter, since no system call actually happens here.

--Andy
Peter Zijlstra Sept. 11, 2020, 9:46 a.m. UTC | #5
On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote:
> Syscall User Dispatch (SUD) must take precedence over seccomp, since the
> use case is emulation (it can be invoked with a different ABI) such that
> seccomp filtering by syscall number doesn't make sense in the first
> place.  In addition, either the syscall is dispatched back to userspace,
> in which case there is no resource for seccomp to protect, or the
> syscall will be executed, and seccomp will execute next.
> 
> Regarding ptrace, I experimented with before and after, and while the
> same ABI argument applies, I felt it was easier to debug if I let ptrace
> happen for syscalls that are dispatched back to userspace.  In addition,
> doing it after ptrace makes the code in syscall_exit_work slightly
> simpler, since it doesn't require special handling for this feature.

I think I'm with Andy that this should be before ptrace(). ptrace()
users will attempt to interpret things like they're regular syscalls,
and that's definitely not the case.
diff mbox series

Patch

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 44fd089d59da..fdb0c543539d 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,8 @@ 
 #include <linux/audit.h>
 #include <linux/syscall_intercept.h>
 
+#include "common.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
@@ -47,6 +49,12 @@  static inline long do_syscall_intercept(struct pt_regs *regs)
 	int sysint_work = READ_ONCE(current->syscall_intercept);
 	int ret;
 
+	if (sysint_work & SYSINT_USER_DISPATCH) {
+		ret = do_syscall_user_dispatch(regs);
+		if (ret == -1L)
+			return ret;
+	}
+
 	if (sysint_work & SYSINT_SECCOMP) {
 		ret = __secure_computing(NULL);
 		if (ret == -1L)