答复: 答复: new seccomp mode aims to improve performance
diff mbox series

Message ID 07ce4c1273054955a350e67f2dc35812@huawei.com
State New
Headers show
Series
  • 答复: 答复: new seccomp mode aims to improve performance
Related show

Commit Message

zhujianwei (C) June 2, 2020, 11:34 a.m. UTC
> > > This is the test result on linux 5.7.0-rc7 for aarch64.
> > > And retpoline disabled default.
> > > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
> > > Not affected
> > >
> > > bpf_jit_enable 1
> > > bpf_jit_harden 0
> > >
> > > We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one).
> > > The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. The following is our result (./syscall 10 m):
> > >
> > > original:
> > >         seccomp_off:                    10684939
> > >         seccomp_on_1_filters:   8513805         overhead:19.8%
> > >         seccomp_on_4_filters:   7105592         overhead:33.0%
> > >         seccomp_on_32_filters:  2308677         overhead:78.3%
> > >
> > > after replacing bpf_prog_run_pin_on_cpu:
> > >         seccomp_off:                    10685244
> > >         seccomp_on_1_filters:   9146483         overhead:14.1%
> > >         seccomp_on_4_filters:   8969886         overhead:16.0%
> > >         seccomp_on_32_filters:  6454372         overhead:39.6%
> > >
> > > N-filter bpf overhead:
> > >         1_filters:              5.7%
> > >         4_filters:              17.0%
> > >         32_filters:     38.7%
> > >
> > > // kernel code modification place
> > > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct 
> > > bpf_prog *prog, const void *ctx) {
> > >         return SECCOMP_RET_ALLOW;
> > > }
> > 
> > >This is apples to oranges.
> > >As explained earlier:
> > >https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mb
> > >p.dhcp.thefacebook.com/T/#u Please use __weak instead of static and 
> > >redo the numbers.
> > 
> > 
> > we have replaced ‘static’ with ‘__weak’, tested with the same way, and got almostly the same result, in our test environment(aarch64).
> > 
> > -static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct 
> > bpf_prog *prog, const void *ctx)
> > +__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct 
> > +bpf_prog *prog, const void *ctx)
> > 
> > original:
> > 	seccomp_off:			10684939
> > 	seccomp_on_1_filters:	8513805		overhead:19.8%
> > 	seccomp_on_4_filters:	7105592		overhead:33.0%
> > 	seccomp_on_32_filters:	2308677		overhead:78.3%
> > 	
> > after replacing bpf_prog_run_pin_on_cpu:
> > 	seccomp_off:			10667195
> > 	seccomp_on_1_filters:	9147454		overhead:14.2%
> > 	seccomp_on_4_filters:	8927605		overhead:16.1%
> > 	seccomp_on_32_filters:	6355476		overhead:40.6%

> > are you saying that by replacing 'static' with '__weak' it got slower?!
> > Something doesn't add up. Please check generated assembly.
> > By having such 'static noinline bpf_prog_run_pin_on_cpu' you're telling compiler to remove most of seccomp_run_filters() code which now will return only two possible values. Which further means that large 'switch'
> > statement in __seccomp_filter() is also optimized. populate_seccomp_data() is removed. Etc, etc. That explains 14% vs 19% difference.
> > May be you have some debug on? Like cant_migrate() is not a nop?
> > Or static_branch is not supported?
> > The sure way is to check assembly.

> No, we say that by replacing 'static' with '__weak' it got the same result, in our testcase which filters 20 allowed syscall num (for details, see the previous post). 
>
> static case:
>	N-filter bpf overhead:
>	1_filters:		5.7%
>	4_filters:		17.0%
>	32_filters:	38.7%
>
> __weak case:
>	N-filter bpf overhead:
>	1_filters:		5.6%
>	4_filters:		16.9%
>	32_filters:	37.7%

And in many scenarios, the requirement for syscall filter is usually simple, and does not need complex filter rules, for example, just configure a syscall black or white list. However, we have noticed that seccomp will have a performance overhead that cannot be ignored in this simple scenario. For example, referring to Kees's t est data, this cost is almost 41/636 = 6.5%, and Alex's data is 17/226 = 7.5%, based on single rule of filtering (getpid); Our data for this overhead is 19.8% (refer to the previous 'orignal' test results), filtering based on our 20 rules (unixbench syscall).

To improve the filtering performance in these scenarios, as a PoC, we replaced calling seccomp_computing() in syscall_trace_enter(), with a light_syscall_filter() which uses a bitmap for syscall filter, and only filter syscall-num-only case without the syscall-args case. The light_syscall_filter() implemented as a sub-branch in seccomp_computing().

To measure the performance, we use the same unixbench syscall testcase with 20 allowed syscall-num rules. The result shows that the light_syscall_filter only imposed 1.2% overhead. 
Can we take a deep discussion to add this light filter mode?

This is the details:

light_syscall_filter:	//run unixbench syscall testcase 10 times and get the average.
	seccomp_off:			10684018
	seccomp_on_1_filters:	10553233
	overhead:				1.2%

// kernel modification
+	}
 
 	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, regs->syscallno);

Comments

Kees Cook June 2, 2020, 6:32 p.m. UTC | #1
On Tue, Jun 02, 2020 at 11:34:04AM +0000, zhujianwei (C) wrote:
> And in many scenarios, the requirement for syscall filter is usually
> simple, and does not need complex filter rules, for example, just
> configure a syscall black or white list. However, we have noticed that
> seccomp will have a performance overhead that cannot be ignored in this
> simple scenario. For example, referring to Kees's t est data, this cost
> is almost 41/636 = 6.5%, and Alex's data is 17/226 = 7.5%, based on
> single rule of filtering (getpid); Our data for this overhead is 19.8%
> (refer to the previous 'orignal' test results), filtering based on our
> 20 rules (unixbench syscall).

I wonder if aarch64 has higher overhead for calling into the TIF_WORK
trace stuff? (Or if aarch64's BPF JIT is not as efficient as x86?)

> // kernel modification
> --- linux-5.7-rc7_1/arch/arm64/kernel/ptrace.c	2020-05-25 06:32:54.000000000 +0800
> +++ linux-5.7-rc7/arch/arm64/kernel/ptrace.c	2020-06-02 12:35:04.412000000 +0800
> @@ -1827,6 +1827,46 @@
>  	regs->regs[regno] = saved_reg;
>  }
>  
> +#define PID_MAX    1000000
> +#define SYSNUM_MAX 0x220

You can use NR_syscalls here, I think.

> +
> +/* all zero*/
> +bool g_light_filter_switch[PID_MAX] = {0};
> +bool g_light_filter_bitmap[PID_MAX][SYSNUM_MAX] = {0};

These can be static, and I would double-check your allocation size -- I
suspect this is allocating a byte for each bool. I would recommend
DECLARE_BITMAP() and friends.

> +static int __light_syscall_filter(void) {
> +   int pid;
> +	int this_syscall;
> +
> +   pid = current->pid;
> +	this_syscall = syscall_get_nr(current, task_pt_regs(current));
> +
> +   if(g_light_filter_bitmap[pid][this_syscall] == true) {
> +       printk(KERN_ERR "light syscall filter: syscall num %d denied.\n", this_syscall);
> +		goto skip;
> +   }
> +
> +	return 0;
> +skip:	
> +	return -1;
> +}
> +
> +static inline int light_syscall_filter(void) {
> +	if (unlikely(test_thread_flag(TIF_SECCOMP))) {
> +                 return __light_syscall_filter();
> +        }
> +
> +	return 0;
> +}
> +
>  int syscall_trace_enter(struct pt_regs *regs)
>  {
>  	unsigned long flags = READ_ONCE(current_thread_info()->flags);
> @@ -1837,9 +1877,10 @@
>  			return -1;
>  	}
>  
> -	/* Do the secure computing after ptrace; failures should be fast. */
> -	if (secure_computing() == -1)
> +	/* light check for syscall-num-only rule. */
> +	if (light_syscall_filter() == -1) {
>  		return -1;
> +	}
>  
>  	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
>  		trace_sys_enter(regs, regs->syscallno);

Given that you're still doing this in syscall_trace_enter(), I imagine
it could live in secure_computing().

Anyway, the functionality here is similar to what I've been working
on for bitmaps (having a global preallocated bitmap isn't going to be
upstreamable, but it's good for PoC). The complications are with handling
differing architecture (for compat systems), tracking/choosing between
the various basic SECCOMP_RET_* behaviors, etc.

-Kees
zhujianwei (C) June 3, 2020, 4:51 a.m. UTC | #2
> Given that you're still doing this in syscall_trace_enter(), I imagine
> it could live in secure_computing().

Indeed, We agree with that adding light_syscall_filter in seccomp_computing(). 

> I wonder if aarch64 has higher overhead for calling into the TIF_WORK
> trace stuff? (Or if aarch64's BPF JIT is not as efficient as x86?)

We also guess that there are many possible reasons.
And we think that placing the bitmap filter the further forward the better. Our test results show that placing the bitmap filter forward can solve single filter total overhead. What is your opinion about that?

> Anyway, the functionality here is similar to what I've been working
> on for bitmaps (having a global preallocated bitmap isn't going to be
> upstreamable, but it's good for PoC). The complications are with handling
> differing architecture (for compat systems), tracking/choosing between
> the various basic SECCOMP_RET_* behaviors, etc.

Firstly, thank you for correction in code, yes, it is just a PoC for performance test.
Indeed, your bitmap idea is basicly same with us. And, we are trying to find a solution to improve the seccomp performance for product use. 
So What is your plan to have it done? 
Could we do something to help you proceed it?

Patch
diff mbox series

--- linux-5.7-rc7_1/arch/arm64/kernel/ptrace.c	2020-05-25 06:32:54.000000000 +0800
+++ linux-5.7-rc7/arch/arm64/kernel/ptrace.c	2020-06-02 12:35:04.412000000 +0800
@@ -1827,6 +1827,46 @@ 
 	regs->regs[regno] = saved_reg;
 }
 
+#define PID_MAX    1000000
+#define SYSNUM_MAX 0x220
+
+/* all zero*/
+bool g_light_filter_switch[PID_MAX] = {0};
+bool g_light_filter_bitmap[PID_MAX][SYSNUM_MAX] = {0};
+
+
+static int __light_syscall_filter(void) {
+   int pid;
+	int this_syscall;
+
+   pid = current->pid;
+	this_syscall = syscall_get_nr(current, task_pt_regs(current));
+
+   if(g_light_filter_bitmap[pid][this_syscall] == true) {
+       printk(KERN_ERR "light syscall filter: syscall num %d denied.\n", this_syscall);
+		goto skip;
+   }
+
+	return 0;
+skip:	
+	return -1;
+}
+
+static inline int light_syscall_filter(void) {
+	if (unlikely(test_thread_flag(TIF_SECCOMP))) {
+                 return __light_syscall_filter();
+        }
+
+	return 0;
+}
+
 int syscall_trace_enter(struct pt_regs *regs)
 {
 	unsigned long flags = READ_ONCE(current_thread_info()->flags);
@@ -1837,9 +1877,10 @@ 
 			return -1;
 	}
 
-	/* Do the secure computing after ptrace; failures should be fast. */
-	if (secure_computing() == -1)
+	/* light check for syscall-num-only rule. */
+	if (light_syscall_filter() == -1) {
 		return -1;