seccomp: passthrough uretprobe systemcall without filtering

Message ID	20250117005539.325887-1-eyal.birger@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-pl1-f180.google.com (mail-pl1-f180.google.com [209.85.214.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E3A3F4964F; Fri, 17 Jan 2025 00:56:05 +0000 (UTC) From: Eyal Birger <eyal.birger@gmail.com> To: kees@kernel.org, luto@amacapital.net, wad@chromium.org, oleg@redhat.com, mhiramat@kernel.org, andrii@kernel.org, jolsa@kernel.org Cc: alexei.starovoitov@gmail.com, olsajiri@gmail.com, cyphar@cyphar.com, songliubraving@fb.com, yhs@fb.com, john.fastabend@gmail.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, daniel@iogearbox.net, ast@kernel.org, andrii.nakryiko@gmail.com, rostedt@goodmis.org, rafi@rbk.io, shmulik.ladkani@gmail.com, bpf@vger.kernel.org, linux-api@vger.kernel.org, linux-trace-kernel@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org, Eyal Birger <eyal.birger@gmail.com>, stable@vger.kernel.org Subject: [PATCH] seccomp: passthrough uretprobe systemcall without filtering Date: Thu, 16 Jan 2025 16:55:39 -0800 Message-ID: <20250117005539.325887-1-eyal.birger@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	seccomp: passthrough uretprobe systemcall without filtering \| expand seccomp: passthrough uretprobe systemcall without filtering

Context	Check	Description
netdev/tree_selection	success	Not a local patch

Eyal Birger Jan. 17, 2025, 12:55 a.m. UTC

When attaching uretprobes to processes running inside docker, the attached
process is segfaulted when encountering the retprobe.

The reason is that now that uretprobe is a system call the default seccomp
filters in docker block it as they only allow a specific set of known
syscalls. This is true for other userspace applications which use seccomp
to control their syscall surface.

Since uretprobe is a "kernel implementation detail" system call which is
not used by userspace application code directly, it is impractical and
there's very little point in forcing all userspace applications to
explicitly allow it in order to avoid crashing tracked processes.

Pass this systemcall through seccomp without depending on configuration.

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Rafael Buchbinder <rafi@rbk.io>
Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
---

The following reproduction script synthetically demonstrates the problem:

cat > /tmp/x.c << EOF

char *syscalls[] = {
	"write",
	"exit_group",
	"fstat",
};

__attribute__((noinline)) int probed(void)
{
	printf("Probed\n");
	return 1;
}

void apply_seccomp_filter(char **syscalls, int num_syscalls)
{
	scmp_filter_ctx ctx;

	ctx = seccomp_init(SCMP_ACT_KILL);
	for (int i = 0; i < num_syscalls; i++) {
		seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
				 seccomp_syscall_resolve_name(syscalls[i]), 0);
	}
	seccomp_load(ctx);
	seccomp_release(ctx);
}

int main(int argc, char *argv[])
{
	int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);

	apply_seccomp_filter(syscalls, num_syscalls);

	probed();

	return 0;
}
EOF

cat > /tmp/trace.bt << EOF
uretprobe:/tmp/x:probed
{
    printf("ret=%d\n", retval);
}
EOF

gcc -o /tmp/x /tmp/x.c -lseccomp

/usr/bin/bpftrace /tmp/trace.bt &

sleep 5 # wait for uretprobe attach
/tmp/x

pkill bpftrace

rm /tmp/x /tmp/x.c /tmp/trace.bt
---
 kernel/seccomp.c | 5 +++++
 1 file changed, 5 insertions(+)

Oleg Nesterov Jan. 17, 2025, 1:39 a.m. UTC | #1

On 01/16, Eyal Birger wrote:
>
> Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> Reported-by: Rafael Buchbinder <rafi@rbk.io>
> Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> Cc: stable@vger.kernel.org
...
> @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
>  	this_syscall = sd ? sd->nr :
>  		syscall_get_nr(current, current_pt_regs());
>
> +#ifdef CONFIG_X86_64
> +	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> +		return 0;
> +#endif

Acked-by: Oleg Nesterov <oleg@redhat.com>

A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this
patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

The problem is that we need a simple patch for -stable which fixes the real
problem. We can cleanup this logic later, I think.

Oleg.

Masami Hiramatsu (Google) Jan. 17, 2025, 8:02 a.m. UTC | #2

On Fri, 17 Jan 2025 02:39:28 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 01/16, Eyal Birger wrote:
> >
> > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> > Cc: stable@vger.kernel.org
> ...
> > @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
> >  	this_syscall = sd ? sd->nr :
> >  		syscall_get_nr(current, current_pt_regs());
> >
> > +#ifdef CONFIG_X86_64
> > +	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> > +		return 0;
> > +#endif
> 
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> 
> 
> A note for the seccomp maintainers...
> 
> I don't know what do you think, but I agree in advance that the very fact this
> patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> 

Indeed. in_ia32_syscall() depends arch/x86 too.
We can add an inline function like;

``` uprobes.h
static inline bool is_uprobe_syscall(int syscall)
{
	// arch_is_uprobe_syscall check can be replaced by Kconfig,
	// something like CONFIG_ARCH_URETPROBE_SYSCALL.
#ifdef arch_is_uprobe_syscall
	return arch_is_uprobe_syscall(syscall)
#else
	return false;
#endif
}
```
and 
``` arch/x86/include/asm/uprobes.h
#define arch_is_uprobe_syscall(syscall) \
	(IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
```

> The problem is that we need a simple patch for -stable which fixes the real
> problem. We can cleanup this logic later, I think.

Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
do not pollute the seccomp subsystem with #ifdef.

Thank you,



> 
> Oleg.
>

Eyal Birger Jan. 17, 2025, 1:36 p.m. UTC | #3

On Fri, Jan 17, 2025 at 12:02 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 17 Jan 2025 02:39:28 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 01/16, Eyal Birger wrote:
> > >
> > > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > > Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> > > Cc: stable@vger.kernel.org
> > ...
> > > @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
> > >     this_syscall = sd ? sd->nr :
> > >             syscall_get_nr(current, current_pt_regs());
> > >
> > > +#ifdef CONFIG_X86_64
> > > +   if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> > > +           return 0;
> > > +#endif
> >
> > Acked-by: Oleg Nesterov <oleg@redhat.com>
> >
> >
> > A note for the seccomp maintainers...
> >
> > I don't know what do you think, but I agree in advance that the very fact this
> > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> >
>
> Indeed. in_ia32_syscall() depends arch/x86 too.
> We can add an inline function like;
>
> ``` uprobes.h
> static inline bool is_uprobe_syscall(int syscall)
> {
>         // arch_is_uprobe_syscall check can be replaced by Kconfig,
>         // something like CONFIG_ARCH_URETPROBE_SYSCALL.
> #ifdef arch_is_uprobe_syscall
>         return arch_is_uprobe_syscall(syscall)
> #else
>         return false;
> #endif
> }
> ```
> and
> ``` arch/x86/include/asm/uprobes.h
> #define arch_is_uprobe_syscall(syscall) \
>         (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
> ```

Notice it'll need to be enclosed in ifdef CONFIG_X86_64 as __NR_uretprobe
isn't defined otherwise so the IS_ENABLED isn't needed.

>
> > The problem is that we need a simple patch for -stable which fixes the real
> > problem. We can cleanup this logic later, I think.
>
> Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
> do not pollute the seccomp subsystem with #ifdef.

I like this approach.

Notice it does add a couple of includes that weren't there before:
- arch/x86/include/asm/uprobes.h would include asm/unistd.h
- seccomp.c would include linux/uprobes.h

So it's a less "trivial" patch... If that's ok I can repost with these
changes.

Eyal.

Oleg Nesterov Jan. 17, 2025, 2:09 p.m. UTC | #4

On 01/17, Masami Hiramatsu wrote:
>
> On Fri, 17 Jan 2025 02:39:28 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > A note for the seccomp maintainers...
> >
> > I don't know what do you think, but I agree in advance that the very fact this
> > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> >
>
> Indeed. in_ia32_syscall() depends arch/x86 too.
> We can add an inline function like;
>
> ``` uprobes.h
> static inline bool is_uprobe_syscall(int syscall)
> {

We can, and this is what I tried to suggest from the very beginning.
But I agree with Eyal who decided to send the most trivial fix for
-stable, we can add the helper later.

I don't think it should live in uprobes.h and I'd prefer something
like arch_seccomp_ignored(int) but I won't insist.

> 	// arch_is_uprobe_syscall check can be replaced by Kconfig,
> 	// something like CONFIG_ARCH_URETPROBE_SYSCALL.

Or sysctl or both. This is another issue.

> ``` arch/x86/include/asm/uprobes.h
> #define arch_is_uprobe_syscall(syscall) \
> 	(IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
> ```

This won't compile if IS_ENABLED(CONFIG_X86_64) == false, __NR_uretprobe
will be undefined.

> > The problem is that we need a simple patch for -stable which fixes the real
> > problem. We can cleanup this logic later, I think.
>
> Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
> do not pollute the seccomp subsystem with #ifdef.

See above. But I won't insist.

Oleg.

Andrii Nakryiko Jan. 17, 2025, 5:51 p.m. UTC | #5

On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/17, Masami Hiramatsu wrote:
> >
> > On Fri, 17 Jan 2025 02:39:28 +0100
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > A note for the seccomp maintainers...
> > >
> > > I don't know what do you think, but I agree in advance that the very fact this
> > > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> > >
> >
> > Indeed. in_ia32_syscall() depends arch/x86 too.
> > We can add an inline function like;
> >
> > ``` uprobes.h
> > static inline bool is_uprobe_syscall(int syscall)
> > {
>
> We can, and this is what I tried to suggest from the very beginning.
> But I agree with Eyal who decided to send the most trivial fix for
> -stable, we can add the helper later.
>
> I don't think it should live in uprobes.h and I'd prefer something
> like arch_seccomp_ignored(int) but I won't insist.

yep, I think this is the way, keeping it as a general category. Should
we also put rt_sigreturn there explicitly as well? Also, wouldn't it
be better to have it as a non-arch-specific function for something
like rt_sigreturn where defining it per each arch is cumbersome, and
have the default implementation also call into an arch-specific
function?

>
> >       // arch_is_uprobe_syscall check can be replaced by Kconfig,
> >       // something like CONFIG_ARCH_URETPROBE_SYSCALL.
>
> Or sysctl or both. This is another issue.
>
> > ``` arch/x86/include/asm/uprobes.h
> > #define arch_is_uprobe_syscall(syscall) \
> >       (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
> > ```
>
> This won't compile if IS_ENABLED(CONFIG_X86_64) == false, __NR_uretprobe
> will be undefined.
>
> > > The problem is that we need a simple patch for -stable which fixes the real
> > > problem. We can cleanup this logic later, I think.
> >
> > Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
> > do not pollute the seccomp subsystem with #ifdef.
>
> See above. But I won't insist.
>
> Oleg.
>

Dmitry V. Levin Jan. 17, 2025, 6:34 p.m. UTC | #6

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> When attaching uretprobes to processes running inside docker, the attached
> process is segfaulted when encountering the retprobe.
> 
> The reason is that now that uretprobe is a system call the default seccomp
> filters in docker block it as they only allow a specific set of known
> syscalls. This is true for other userspace applications which use seccomp
> to control their syscall surface.
> 
> Since uretprobe is a "kernel implementation detail" system call which is
> not used by userspace application code directly, it is impractical and
> there's very little point in forcing all userspace applications to
> explicitly allow it in order to avoid crashing tracked processes.
> 
> Pass this systemcall through seccomp without depending on configuration.
> 
> Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> Reported-by: Rafael Buchbinder <rafi@rbk.io>
> Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
> ---
> 
> The following reproduction script synthetically demonstrates the problem:
> 
> cat > /tmp/x.c << EOF
> 
> char *syscalls[] = {
> 	"write",
> 	"exit_group",
> 	"fstat",
> };
> 
> __attribute__((noinline)) int probed(void)
> {
> 	printf("Probed\n");
> 	return 1;
> }
> 
> void apply_seccomp_filter(char **syscalls, int num_syscalls)
> {
> 	scmp_filter_ctx ctx;
> 
> 	ctx = seccomp_init(SCMP_ACT_KILL);
> 	for (int i = 0; i < num_syscalls; i++) {
> 		seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
> 				 seccomp_syscall_resolve_name(syscalls[i]), 0);
> 	}
> 	seccomp_load(ctx);
> 	seccomp_release(ctx);
> }
> 
> int main(int argc, char *argv[])
> {
> 	int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
> 
> 	apply_seccomp_filter(syscalls, num_syscalls);
> 
> 	probed();
> 
> 	return 0;
> }
> EOF
> 
> cat > /tmp/trace.bt << EOF
> uretprobe:/tmp/x:probed
> {
>     printf("ret=%d\n", retval);
> }
> EOF
> 
> gcc -o /tmp/x /tmp/x.c -lseccomp
> 
> /usr/bin/bpftrace /tmp/trace.bt &
> 
> sleep 5 # wait for uretprobe attach
> /tmp/x
> 
> pkill bpftrace
> 
> rm /tmp/x /tmp/x.c /tmp/trace.bt
> ---
>  kernel/seccomp.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 385d48293a5f..10a55c9b5c18 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
>  	this_syscall = sd ? sd->nr :
>  		syscall_get_nr(current, current_pt_regs());
>  
> +#ifdef CONFIG_X86_64
> +	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> +		return 0;
> +#endif
> +
>  	switch (mode) {
>  	case SECCOMP_MODE_STRICT:
>  		__secure_computing_strict(this_syscall);  /* may call do_exit */

This seems to be a hot fix to bypass some SECCOMP_RET_ERRNO filters.
However, this way it bypasses seccomp completely, including
SECCOMP_RET_TRACE, making it invisible to strace --seccomp,
and I wonder why do you want that.

Eyal Birger Jan. 17, 2025, 6:52 p.m. UTC | #7

On Fri, Jan 17, 2025 at 10:34 AM Dmitry V. Levin <ldv@strace.io> wrote:
>
> On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> > When attaching uretprobes to processes running inside docker, the attached
> > process is segfaulted when encountering the retprobe.
> >
> > The reason is that now that uretprobe is a system call the default seccomp
> > filters in docker block it as they only allow a specific set of known
> > syscalls. This is true for other userspace applications which use seccomp
> > to control their syscall surface.
> >
> > Since uretprobe is a "kernel implementation detail" system call which is
> > not used by userspace application code directly, it is impractical and
> > there's very little point in forcing all userspace applications to
> > explicitly allow it in order to avoid crashing tracked processes.
> >
> > Pass this systemcall through seccomp without depending on configuration.
> >
> > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
> > ---
> >
> > The following reproduction script synthetically demonstrates the problem:
> >
> > cat > /tmp/x.c << EOF
> >
> > char *syscalls[] = {
> >       "write",
> >       "exit_group",
> >       "fstat",
> > };
> >
> > __attribute__((noinline)) int probed(void)
> > {
> >       printf("Probed\n");
> >       return 1;
> > }
> >
> > void apply_seccomp_filter(char **syscalls, int num_syscalls)
> > {
> >       scmp_filter_ctx ctx;
> >
> >       ctx = seccomp_init(SCMP_ACT_KILL);
> >       for (int i = 0; i < num_syscalls; i++) {
> >               seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
> >                                seccomp_syscall_resolve_name(syscalls[i]), 0);
> >       }
> >       seccomp_load(ctx);
> >       seccomp_release(ctx);
> > }
> >
> > int main(int argc, char *argv[])
> > {
> >       int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
> >
> >       apply_seccomp_filter(syscalls, num_syscalls);
> >
> >       probed();
> >
> >       return 0;
> > }
> > EOF
> >
> > cat > /tmp/trace.bt << EOF
> > uretprobe:/tmp/x:probed
> > {
> >     printf("ret=%d\n", retval);
> > }
> > EOF
> >
> > gcc -o /tmp/x /tmp/x.c -lseccomp
> >
> > /usr/bin/bpftrace /tmp/trace.bt &
> >
> > sleep 5 # wait for uretprobe attach
> > /tmp/x
> >
> > pkill bpftrace
> >
> > rm /tmp/x /tmp/x.c /tmp/trace.bt
> > ---
> >  kernel/seccomp.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 385d48293a5f..10a55c9b5c18 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
> >       this_syscall = sd ? sd->nr :
> >               syscall_get_nr(current, current_pt_regs());
> >
> > +#ifdef CONFIG_X86_64
> > +     if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> > +             return 0;
> > +#endif
> > +
> >       switch (mode) {
> >       case SECCOMP_MODE_STRICT:
> >               __secure_computing_strict(this_syscall);  /* may call do_exit */
>
> This seems to be a hot fix to bypass some SECCOMP_RET_ERRNO filters.

It's a little broader than just SECCOMP_RET_ERRNO, but yes, this is a
hotfix to avoid filtering this system call in seccomp.

The rationale is that this is not a userspace created system call - the
kernel uses it to instrument the function - and the fact that it's a
system call is just an implementation detail. Ideally, userspace wouldn't
need to know or care about it.

> However, this way it bypasses seccomp completely, including
> SECCOMP_RET_TRACE, making it invisible to strace --seccomp,
> and I wonder why do you want that.

It's a good question. I could move this check to both "strict" seccomp and
after the BPF verdict is received, but before it's applied, but I fear this
would make the fix more error prone, and way harder to backmerge. So I'm
wondering whether supporting strace --seccomp-bpf for this particular
syscall is a priority.

Eyal.

seccomp: passthrough uretprobe systemcall without filtering

Checks

Commit Message

Comments

Patch