diff mbox series

seccomp: passthrough uretprobe systemcall without filtering

Message ID 20250117005539.325887-1-eyal.birger@gmail.com (mailing list archive)
State New
Headers show
Series seccomp: passthrough uretprobe systemcall without filtering | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Eyal Birger Jan. 17, 2025, 12:55 a.m. UTC
When attaching uretprobes to processes running inside docker, the attached
process is segfaulted when encountering the retprobe.

The reason is that now that uretprobe is a system call the default seccomp
filters in docker block it as they only allow a specific set of known
syscalls. This is true for other userspace applications which use seccomp
to control their syscall surface.

Since uretprobe is a "kernel implementation detail" system call which is
not used by userspace application code directly, it is impractical and
there's very little point in forcing all userspace applications to
explicitly allow it in order to avoid crashing tracked processes.

Pass this systemcall through seccomp without depending on configuration.

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Rafael Buchbinder <rafi@rbk.io>
Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
---

The following reproduction script synthetically demonstrates the problem:

cat > /tmp/x.c << EOF

char *syscalls[] = {
	"write",
	"exit_group",
	"fstat",
};

__attribute__((noinline)) int probed(void)
{
	printf("Probed\n");
	return 1;
}

void apply_seccomp_filter(char **syscalls, int num_syscalls)
{
	scmp_filter_ctx ctx;

	ctx = seccomp_init(SCMP_ACT_KILL);
	for (int i = 0; i < num_syscalls; i++) {
		seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
				 seccomp_syscall_resolve_name(syscalls[i]), 0);
	}
	seccomp_load(ctx);
	seccomp_release(ctx);
}

int main(int argc, char *argv[])
{
	int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);

	apply_seccomp_filter(syscalls, num_syscalls);

	probed();

	return 0;
}
EOF

cat > /tmp/trace.bt << EOF
uretprobe:/tmp/x:probed
{
    printf("ret=%d\n", retval);
}
EOF

gcc -o /tmp/x /tmp/x.c -lseccomp

/usr/bin/bpftrace /tmp/trace.bt &

sleep 5 # wait for uretprobe attach
/tmp/x

pkill bpftrace

rm /tmp/x /tmp/x.c /tmp/trace.bt
---
 kernel/seccomp.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Oleg Nesterov Jan. 17, 2025, 1:39 a.m. UTC | #1
On 01/16, Eyal Birger wrote:
>
> Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> Reported-by: Rafael Buchbinder <rafi@rbk.io>
> Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> Cc: stable@vger.kernel.org
...
> @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
>  	this_syscall = sd ? sd->nr :
>  		syscall_get_nr(current, current_pt_regs());
>
> +#ifdef CONFIG_X86_64
> +	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> +		return 0;
> +#endif

Acked-by: Oleg Nesterov <oleg@redhat.com>


A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this
patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

The problem is that we need a simple patch for -stable which fixes the real
problem. We can cleanup this logic later, I think.

Oleg.
Masami Hiramatsu (Google) Jan. 17, 2025, 8:02 a.m. UTC | #2
On Fri, 17 Jan 2025 02:39:28 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 01/16, Eyal Birger wrote:
> >
> > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> > Cc: stable@vger.kernel.org
> ...
> > @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
> >  	this_syscall = sd ? sd->nr :
> >  		syscall_get_nr(current, current_pt_regs());
> >
> > +#ifdef CONFIG_X86_64
> > +	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> > +		return 0;
> > +#endif
> 
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> 
> 
> A note for the seccomp maintainers...
> 
> I don't know what do you think, but I agree in advance that the very fact this
> patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> 

Indeed. in_ia32_syscall() depends arch/x86 too.
We can add an inline function like;

``` uprobes.h
static inline bool is_uprobe_syscall(int syscall)
{
	// arch_is_uprobe_syscall check can be replaced by Kconfig,
	// something like CONFIG_ARCH_URETPROBE_SYSCALL.
#ifdef arch_is_uprobe_syscall
	return arch_is_uprobe_syscall(syscall)
#else
	return false;
#endif
}
```
and 
``` arch/x86/include/asm/uprobes.h
#define arch_is_uprobe_syscall(syscall) \
	(IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
```

> The problem is that we need a simple patch for -stable which fixes the real
> problem. We can cleanup this logic later, I think.

Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
do not pollute the seccomp subsystem with #ifdef.

Thank you,



> 
> Oleg.
>
Eyal Birger Jan. 17, 2025, 1:36 p.m. UTC | #3
On Fri, Jan 17, 2025 at 12:02 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 17 Jan 2025 02:39:28 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 01/16, Eyal Birger wrote:
> > >
> > > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > > Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> > > Cc: stable@vger.kernel.org
> > ...
> > > @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
> > >     this_syscall = sd ? sd->nr :
> > >             syscall_get_nr(current, current_pt_regs());
> > >
> > > +#ifdef CONFIG_X86_64
> > > +   if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> > > +           return 0;
> > > +#endif
> >
> > Acked-by: Oleg Nesterov <oleg@redhat.com>
> >
> >
> > A note for the seccomp maintainers...
> >
> > I don't know what do you think, but I agree in advance that the very fact this
> > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> >
>
> Indeed. in_ia32_syscall() depends arch/x86 too.
> We can add an inline function like;
>
> ``` uprobes.h
> static inline bool is_uprobe_syscall(int syscall)
> {
>         // arch_is_uprobe_syscall check can be replaced by Kconfig,
>         // something like CONFIG_ARCH_URETPROBE_SYSCALL.
> #ifdef arch_is_uprobe_syscall
>         return arch_is_uprobe_syscall(syscall)
> #else
>         return false;
> #endif
> }
> ```
> and
> ``` arch/x86/include/asm/uprobes.h
> #define arch_is_uprobe_syscall(syscall) \
>         (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
> ```

Notice it'll need to be enclosed in ifdef CONFIG_X86_64 as __NR_uretprobe
isn't defined otherwise so the IS_ENABLED isn't needed.

>
> > The problem is that we need a simple patch for -stable which fixes the real
> > problem. We can cleanup this logic later, I think.
>
> Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
> do not pollute the seccomp subsystem with #ifdef.

I like this approach.

Notice it does add a couple of includes that weren't there before:
- arch/x86/include/asm/uprobes.h would include asm/unistd.h
- seccomp.c would include linux/uprobes.h

So it's a less "trivial" patch... If that's ok I can repost with these
changes.

Eyal.
diff mbox series

Patch

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 385d48293a5f..10a55c9b5c18 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1359,6 +1359,11 @@  int __secure_computing(const struct seccomp_data *sd)
 	this_syscall = sd ? sd->nr :
 		syscall_get_nr(current, current_pt_regs());
 
+#ifdef CONFIG_X86_64
+	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
+		return 0;
+#endif
+
 	switch (mode) {
 	case SECCOMP_MODE_STRICT:
 		__secure_computing_strict(this_syscall);  /* may call do_exit */