Message ID | 5c7333ea4bec2fad1b47a8fa2db7c31e4ffc4f14.1663334978.git.josh@joshtriplett.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm | expand |
On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote: > Currently, execve allocates an mm and parses argv and envp before > checking if the path exists. However, the common case of a $PATH search > may have several failed calls to exec before a single success. Do a > filename lookup for the purposes of returning ENOENT before doing more > expensive operations. At first I didn't understand how you were seeing this, since I'm so used to watching shell scripts under tracing, which correctly use stat(): $ strace bash -c foo stat("/home/keescook/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) stat("/usr/local/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) stat("/usr/local/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) stat("/usr/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) stat("/usr/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) stat("/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) stat("/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) But I see, yes, glibc tries to actually call execve(), which, as you say, is extremely heavy: $ strace ./execvpe ... execve("/home/kees/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) execve("/usr/local/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) execve("/usr/local/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) execve("/usr/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) execve("/usr/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) execve("/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) execve("/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) This really seems much more like a glibc bug. The shell does it correctly... -Kees
On Fri, Sep 16, 2022 at 07:38:37AM -0700, Kees Cook wrote: > On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote: > > Currently, execve allocates an mm and parses argv and envp before > > checking if the path exists. However, the common case of a $PATH search > > may have several failed calls to exec before a single success. Do a > > filename lookup for the purposes of returning ENOENT before doing more > > expensive operations. > > At first I didn't understand how you were seeing this, since I'm so used > to watching shell scripts under tracing, which correctly use stat(): > > $ strace bash -c foo > stat("/home/keescook/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) > stat("/usr/local/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) > stat("/usr/local/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) > stat("/usr/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) > stat("/usr/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) > stat("/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) > stat("/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory) > > But I see, yes, glibc tries to actually call execve(), which, as you > say, is extremely heavy: > > $ strace ./execvpe > ... > execve("/home/kees/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) > execve("/usr/local/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) > execve("/usr/local/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) > execve("/usr/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) > execve("/usr/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) > execve("/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) > execve("/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory) > > This really seems much more like a glibc bug. The shell does it correctly... musl does the same thing, as do python and perl (likely via execvp or posix_spawnp). As does gcc when it executes `as`. And I've seen more than a few programs hand-implement a PATH search the same way. Seems worth optimizing for. And with io_uring_spawn, it'll be the substantially easier approach, since it'll just require one pass (series of execs) rather than two (stats then exec). - Josh Triplett
[Hi Peter, apologies for dumping you into the middle of this thread. I've got a question about sched_exec() below...] On Fri, Sep 16, 2022 at 09:13:44PM +0100, Josh Triplett wrote: > musl does the same thing, as do python and perl (likely via execvp or > posix_spawnp). As does gcc when it executes `as`. And I've seen more > than a few programs hand-implement a PATH search the same way. Seems > worth optimizing for. Yeah, it does seem like a simple way to eliminate needless work, though I'd really like to see some kind of perf count of "in a given kernel build, how many execve() system calls fail due to path search vs succeed", just to get a better sense of the scale of the problem. I don't like the idea of penalizing the _succeeding_ case, though, which happens if we do the path walk twice. So, I went and refactoring the setup order, moving the do_open_execat() up into alloc_bprm() instead of where it was in bprm_exec(). The result makes it so it is, as you observed, before the mm creation and generally expensive argument copying. The difference to your patch seems to only be the allocation of the file table entry, but avoids the double lookup, so I'm hoping the result is actually even faster. This cleanup is actually quite satisfying organizationally too -- the fd and filename were passed around rather oddly. The interaction with sched_exec() should be no worse (the file is opened before it in either case), but in reading that function, it talks about taking the opportunity to move the process to another CPU (IIUC) since, paraphrasing, "it is at its lowest memory/cache size." But I wonder if there is an existing accidental pessimistic result in that the process stack has already been allocated. I am only passingly familiar with how tasks get moved around under NUMA -- is the scheduler going to move this process onto a different NUMA node and now it will be forced to have the userspace process stack on one node and the program text and heap on another? Or is that totally lost in the noise? More specifically, I was wondering if processes would benefit from having sched_exec() moved before the mm creation? Regardless, here's a very lightly tested patch. Can you take this for a spin and check your benchmark? Thanks! -Kees diff --git a/fs/exec.c b/fs/exec.c index 9a5ca7b82bfc..5534301d67ca 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -898,6 +898,10 @@ EXPORT_SYMBOL(transfer_args_to_stack); #endif /* CONFIG_MMU */ +/* + * On success, callers must call do_close_execat() on the returned + * struct file. + */ static struct file *do_open_execat(int fd, struct filename *name, int flags) { struct file *file; @@ -945,6 +949,16 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) return ERR_PTR(err); } +/** + * open_exec - Open a path name for execution + * + * @name: path name to open with the intent of executing it. + * + * Returns ERR_PTR on failure or allocated struct file on success. + * + * As this is a wrapper for the internal do_open_execat(), callers + * must call allow_write_access() before fput() on release. + */ struct file *open_exec(const char *name) { struct filename *filename = getname_kernel(name); @@ -1485,6 +1499,15 @@ static int prepare_bprm_creds(struct linux_binprm *bprm) return -ENOMEM; } +/* Matches do_open_execat() */ +static void do_close_execat(struct file *file) +{ + if (!file) + return; + allow_write_access(file); + fput(file); +} + static void free_bprm(struct linux_binprm *bprm) { if (bprm->mm) { @@ -1496,10 +1519,7 @@ static void free_bprm(struct linux_binprm *bprm) mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } - if (bprm->file) { - allow_write_access(bprm->file); - fput(bprm->file); - } + do_close_execat(bprm->file); if (bprm->executable) fput(bprm->executable); /* If a binfmt changed the interp, free it. */ @@ -1509,12 +1529,26 @@ static void free_bprm(struct linux_binprm *bprm) kfree(bprm); } -static struct linux_binprm *alloc_bprm(int fd, struct filename *filename) +static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, + int flags) { - struct linux_binprm *bprm = kzalloc(sizeof(*bprm), GFP_KERNEL); - int retval = -ENOMEM; - if (!bprm) + struct linux_binprm *bprm; + struct file *file; + int retval; + + file = do_open_execat(fd, filename, flags); + if (IS_ERR(file)) { + retval = PTR_ERR(file); goto out; + } + + retval = -ENOMEM; + bprm = kzalloc(sizeof(*bprm), GFP_KERNEL); + if (!bprm) { + do_close_execat(file); + goto out; + } + bprm->file = file; if (fd == AT_FDCWD || filename->name[0] == '/') { bprm->filename = filename->name; @@ -1531,6 +1565,18 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename) } bprm->interp = bprm->filename; + /* + * Record that a name derived from an O_CLOEXEC fd will be + * inaccessible after exec. This allows the code in exec to + * choose to fail when the executable is not mmaped into the + * interpreter and an open file descriptor is not passed to + * the interpreter. This makes for a better user experience + * than having the interpreter start and then immediately fail + * when it finds the executable is inaccessible. + */ + if (bprm->fdpath && get_close_on_exec(fd)) + bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE; + retval = bprm_mm_init(bprm); if (retval) goto out_free; @@ -1803,10 +1849,8 @@ static int exec_binprm(struct linux_binprm *bprm) /* * sys_execve() executes a new program. */ -static int bprm_execve(struct linux_binprm *bprm, - int fd, struct filename *filename, int flags) +static int bprm_execve(struct linux_binprm *bprm) { - struct file *file; int retval; retval = prepare_bprm_creds(bprm); @@ -1816,26 +1860,8 @@ static int bprm_execve(struct linux_binprm *bprm, check_unsafe_exec(bprm); current->in_execve = 1; - file = do_open_execat(fd, filename, flags); - retval = PTR_ERR(file); - if (IS_ERR(file)) - goto out_unmark; - sched_exec(); - bprm->file = file; - /* - * Record that a name derived from an O_CLOEXEC fd will be - * inaccessible after exec. This allows the code in exec to - * choose to fail when the executable is not mmaped into the - * interpreter and an open file descriptor is not passed to - * the interpreter. This makes for a better user experience - * than having the interpreter start and then immediately fail - * when it finds the executable is inaccessible. - */ - if (bprm->fdpath && get_close_on_exec(fd)) - bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE; - /* Set the unchanging part of bprm->cred */ retval = security_bprm_creds_for_exec(bprm); if (retval) @@ -1863,7 +1889,6 @@ static int bprm_execve(struct linux_binprm *bprm, if (bprm->point_of_no_return && !fatal_signal_pending(current)) force_fatal_sig(SIGSEGV); -out_unmark: current->fs->in_exec = 0; current->in_execve = 0; @@ -1897,7 +1922,7 @@ static int do_execveat_common(int fd, struct filename *filename, * further execve() calls fail. */ current->flags &= ~PF_NPROC_EXCEEDED; - bprm = alloc_bprm(fd, filename); + bprm = alloc_bprm(fd, filename, flags); if (IS_ERR(bprm)) { retval = PTR_ERR(bprm); goto out_ret; @@ -1946,7 +1971,7 @@ static int do_execveat_common(int fd, struct filename *filename, bprm->argc = 1; } - retval = bprm_execve(bprm, fd, filename, flags); + retval = bprm_execve(bprm); out_free: free_bprm(bprm); @@ -1971,7 +1996,7 @@ int kernel_execve(const char *kernel_filename, if (IS_ERR(filename)) return PTR_ERR(filename); - bprm = alloc_bprm(fd, filename); + bprm = alloc_bprm(fd, filename, 0); if (IS_ERR(bprm)) { retval = PTR_ERR(bprm); goto out_ret; @@ -2006,7 +2031,7 @@ int kernel_execve(const char *kernel_filename, if (retval < 0) goto out_free; - retval = bprm_execve(bprm, fd, filename, 0); + retval = bprm_execve(bprm); out_free: free_bprm(bprm); out_ret:
On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote: > I don't like the idea of penalizing the _succeeding_ case, though, which > happens if we do the path walk twice. So, I went and refactoring the setup > order, moving the do_open_execat() up into alloc_bprm() instead of where > it was in bprm_exec(). The result makes it so it is, as you observed, > before the mm creation and generally expensive argument copying. The > difference to your patch seems to only be the allocation of the file > table entry, but avoids the double lookup, so I'm hoping the result is > actually even faster. Thanks for giving this a try; I'd wondered how feasible it would be to just do one lookup. However, on the same test system with the same test setup, with your refactor it seems to go slower: fork/execvpe: 38087ns fork/execve: 33758ns For comparison, the previous numbers (which I re-confirmed): Without fast-path: fork/execvpe: 49876ns fork/execve: 32773ns With my original separate-lookup fast-path: fork/execvpe: 36890ns fork/execve: 31551ns I tried several runs of each, and I seem to get reasonably consistent results. My test program just creates a pipe once, then loops on clock_gettime/fork/execvpe/read, with the spawned child process doing clock_gettime/write/exit (in asm to minimize overhead). The test PATH is PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. with the test program in the current directory. - Josh Triplett
On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote: > The interaction with sched_exec() should be no worse (the file is opened > before it in either case), but in reading that function, it talks about > taking the opportunity to move the process to another CPU (IIUC) since, > paraphrasing, "it is at its lowest memory/cache size." But I wonder if > there is an existing accidental pessimistic result in that the process > stack has already been allocated. I am only passingly familiar with how > tasks get moved around under NUMA -- is the scheduler going to move > this process onto a different NUMA node and now it will be forced to > have the userspace process stack on one node and the program text and > heap on another? Or is that totally lost in the noise? Probably noise; text is going to be a crap-shoot anyway due to DSOs only having a single copy in the page-cache. And the stack will be relatively small at this point and also, numa-balance can migrate those pages around if they matter. > More specifically, I was wondering if processes would benefit from having > sched_exec() moved before the mm creation? Can't hurt I think.
On Sat, Sep 17, 2022 at 01:50:24AM +0100, Josh Triplett wrote: > On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote: > > I don't like the idea of penalizing the _succeeding_ case, though, which > > happens if we do the path walk twice. So, I went and refactoring the setup > > order, moving the do_open_execat() up into alloc_bprm() instead of where > > it was in bprm_exec(). The result makes it so it is, as you observed, > > before the mm creation and generally expensive argument copying. The > > difference to your patch seems to only be the allocation of the file > > table entry, but avoids the double lookup, so I'm hoping the result is > > actually even faster. > > Thanks for giving this a try; I'd wondered how feasible it would be to > just do one lookup. > > However, on the same test system with the same test setup, with your > refactor it seems to go slower: > fork/execvpe: 38087ns > fork/execve: 33758ns > > For comparison, the previous numbers (which I re-confirmed): > > Without fast-path: > fork/execvpe: 49876ns > fork/execve: 32773ns > > With my original separate-lookup fast-path: > fork/execvpe: 36890ns > fork/execve: 31551ns Hmm, this shows as slower in the *normal* case, which I find rather surprising -- it's the same work, just reordered. Can you post a URL to your tests? I'd like to reproduce this and maybe throw perf at it as well. > I tried several runs of each, and I seem to get reasonably consistent > results. > > My test program just creates a pipe once, then loops on > clock_gettime/fork/execvpe/read, with the spawned child process doing > clock_gettime/write/exit (in asm to minimize overhead). The test PATH is > PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. with > the test program in the current directory. I'm also curious about less synthetic testing. It'd be nice to see real workloads with these changes, etc.
Greeting, FYI, we noticed the following commit (built with gcc-11): commit: 0a276ae2d224d48c9ddbeb9fa14bc2fd0893a7b9 ("[PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm") url: https://github.com/intel-lab-lkp/linux/commits/Josh-Triplett/fs-exec-c-Add-fast-path-for-ENOENT-on-PATH-search-before-allocating-mm/20220916-214429 base: https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git for-next/execve patch link: https://lore.kernel.org/linux-mm/5c7333ea4bec2fad1b47a8fa2db7c31e4ffc4f14.1663334978.git.josh@joshtriplett.org in testcase: trinity version: trinity-i386-4d2343bd-1_20200320 with following parameters: runtime: 300s group: group-01 test-description: Trinity is a linux system call fuzz tester. test-url: http://codemonkey.org.uk/projects/trinity/ on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): [ 129.613450][ C0] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 43s! [ 129.616497][ C0] Showing busy workqueues and worker pools: [ 129.620580][ C0] workqueue events: flags=0x0 [ 129.623078][ C0] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 refcnt=5 [ 129.623118][ C0] in-flight: 52:do_free_init do_free_init, 29:key_garbage_collector key_garbage_collector [ 129.623151][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3 [ 129.623161][ C0] pending: vmstat_shepherd, drm_fb_helper_damage_work [drm_kms_helper] [ 129.637767][ C0] workqueue events_highpri: flags=0x10 [ 129.640231][ C0] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2 [ 129.640252][ C0] pending: mix_interrupt_randomness [ 129.640271][ C0] workqueue events_freezable_power_: flags=0x84 [ 129.650217][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 129.650238][ C0] pending: disk_events_workfn [ 129.650254][ C0] workqueue rcu_gp: flags=0x8 [ 129.659366][ C0] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 129.659387][ C0] in-flight: 185:wait_rcu_exp_gp [ 129.659405][ C0] workqueue mm_percpu_wq: flags=0x8 [ 129.668545][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=4 [ 129.668565][ C0] pending: vmstat_update, lru_add_drain_per_cpu BAR(35) [ 129.668595][ C0] workqueue kblockd: flags=0x18 [ 129.678480][ C0] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2 [ 129.678499][ C0] pending: blk_mq_timeout_work [ 129.678560][ C0] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 3764 22 [ 175.181511][ T185] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-.... } 88583 jiffies s: 41 root: 0x1/. [ 175.185095][ T185] rcu: blocking rcu_node structures (internal RCU debug): [ 175.188233][ T185] Task dump for CPU 0: [ 175.191019][ T185] task:trinity-subchil state:R running task stack: 0 pid: 3731 ppid: 3652 flags:0x2000000c [ 175.198313][ T185] Call Trace: [ 175.200246][ T185] <TASK> [ 175.202827][ T185] ? do_fast_syscall_32 (arch/x86/entry/common.c:203) [ 175.205560][ T185] ? entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:122) [ 175.208517][ T185] </TASK> [ 186.313437][ C0] rcu: INFO: rcu_sched self-detected stall on CPU [ 186.315637][ C0] rcu: 0-....: (99854 ticks this GP) idle=64dc/1/0x4000000000000000 softirq=25122/25122 fqs=30028 [ 186.317670][ C0] (t=100000 jiffies g=32689 q=29567 ncpus=2) [ 186.319650][ C0] NMI backtrace for cpu 0 [ 186.321784][ C0] CPU: 0 PID: 3731 Comm: trinity-subchil Not tainted 6.0.0-rc2-00003-g0a276ae2d224 #1 [ 186.324188][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 186.326774][ C0] Call Trace: [ 186.328778][ C0] <IRQ> [ 186.331085][ C0] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1)) [ 186.332793][ C0] nmi_cpu_backtrace.cold (lib/nmi_backtrace.c:111) [ 186.334549][ C0] ? lapic_can_unplug_cpu (arch/x86/kernel/apic/hw_nmi.c:33) [ 186.336150][ C0] nmi_trigger_cpumask_backtrace (lib/nmi_backtrace.c:62) [ 186.337777][ C0] trigger_single_cpu_backtrace (include/linux/nmi.h:168) [ 186.339411][ C0] rcu_dump_cpu_stacks (kernel/rcu/tree_stall.h:371) [ 186.341082][ C0] print_cpu_stall.cold (kernel/rcu/tree_stall.h:662) [ 186.342889][ C0] check_cpu_stall (kernel/rcu/tree_stall.h:745) [ 186.344423][ C0] rcu_sched_clock_irq (kernel/rcu/tree_nocb.h:913 kernel/rcu/tree.c:3663 kernel/rcu/tree.c:2342) [ 186.346071][ C0] update_process_times (arch/x86/include/asm/preempt.h:27 kernel/time/timer.c:1841) [ 186.347626][ C0] tick_sched_handle (kernel/time/tick-sched.c:244) [ 186.349235][ C0] tick_sched_timer (kernel/time/tick-sched.c:1480) [ 186.350780][ C0] ? tick_sched_do_timer (kernel/time/tick-sched.c:1467) [ 186.352495][ C0] __hrtimer_run_queues (kernel/time/hrtimer.c:1685 kernel/time/hrtimer.c:1749) [ 186.354484][ C0] ? enqueue_hrtimer (kernel/time/hrtimer.c:1719) [ 186.356550][ C0] ? kvm_clock_get_cycles (arch/x86/include/asm/preempt.h:85 arch/x86/kernel/kvmclock.c:80 arch/x86/kernel/kvmclock.c:86) [ 186.358499][ C0] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:292 (discriminator 3) kernel/time/timekeeping.c:388 (discriminator 3) kernel/time/timekeeping.c:2320 (discriminator 3)) [ 186.360609][ C0] hrtimer_interrupt (kernel/time/hrtimer.c:1814) [ 186.362418][ C0] __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1113) [ 186.364295][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1106 (discriminator 14)) [ 186.366201][ C0] </IRQ> [ 186.367490][ C0] <TASK> [ 186.369287][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649) [ 186.371191][ C0] RIP: 0010:native_queued_spin_lock_slowpath (arch/x86/include/asm/vdso/processor.h:19 arch/x86/include/asm/qspinlock.h:101 kernel/locking/qspinlock.c:327) [ 186.373248][ C0] Code: 00 00 00 65 48 2b 04 25 28 00 00 00 0f 85 2f 08 00 00 48 81 c4 88 00 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc f3 90 <e9> 71 ff ff ff 44 8b 74 24 48 41 81 fe 00 01 00 00 0f 84 e3 00 00 All code ======== 0: 00 00 add %al,(%rax) 2: 00 65 48 add %ah,0x48(%rbp) 5: 2b 04 25 28 00 00 00 sub 0x28,%eax c: 0f 85 2f 08 00 00 jne 0x841 12: 48 81 c4 88 00 00 00 add $0x88,%rsp 19: 5b pop %rbx 1a: 5d pop %rbp 1b: 41 5c pop %r12 1d: 41 5d pop %r13 1f: 41 5e pop %r14 21: 41 5f pop %r15 23: c3 retq 24: cc int3 25: cc int3 26: cc int3 27: cc int3 28: f3 90 pause 2a:* e9 71 ff ff ff jmpq 0xffffffffffffffa0 <-- trapping instruction 2f: 44 8b 74 24 48 mov 0x48(%rsp),%r14d 34: 41 81 fe 00 01 00 00 cmp $0x100,%r14d 3b: 0f .byte 0xf 3c: 84 e3 test %ah,%bl ... Code starting with the faulting instruction =========================================== 0: e9 71 ff ff ff jmpq 0xffffffffffffff76 5: 44 8b 74 24 48 mov 0x48(%rsp),%r14d a: 41 81 fe 00 01 00 00 cmp $0x100,%r14d 11: 0f .byte 0xf 12: 84 e3 test %ah,%bl ... [ 186.377327][ C0] RSP: 0018:ffffc90000e17cb8 EFLAGS: 00000202 [ 186.379142][ C0] RAX: 0000000000000000 RBX: 00000000587993b0 RCX: ffffffff812fc8ec [ 186.380912][ C0] RDX: fffff520001c2ff7 RSI: 0000000000000004 RDI: ffffc90000e17fb0 [ 186.383091][ C0] RBP: ffffc90000e17fb0 R08: 0000000000000000 R09: ffffc90000e17fb3 [ 186.384750][ C0] R10: fffff520001c2ff6 R11: 0000000000000001 R12: 0000000000000003 [ 186.386354][ C0] R13: fffff520001c2ff6 R14: 0000000000000001 R15: 1ffff920001c2f98 [ 186.388126][ C0] ? native_queued_spin_lock_slowpath (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 arch/x86/include/asm/qspinlock.h:100 kernel/locking/qspinlock.c:327) [ 186.389979][ C0] ? native_queued_spin_lock_slowpath (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 arch/x86/include/asm/qspinlock.h:100 kernel/locking/qspinlock.c:327) [ 186.391917][ C0] ? pv_hash (kernel/locking/qspinlock.c:317) [ 186.393367][ C0] ? lockref_put_return (arch/x86/include/asm/atomic.h:29 (discriminator 5) include/linux/atomic/atomic-instrumented.h:28 (discriminator 5) include/asm-generic/qspinlock.h:73 (discriminator 5) lib/lockref.c:122 (discriminator 5)) [ 186.394784][ C0] _raw_spin_lock (arch/x86/include/asm/paravirt.h:591 arch/x86/include/asm/qspinlock.h:51 include/asm-generic/qspinlock.h:114 include/linux/spinlock.h:185 include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) [ 186.396052][ C0] ? _raw_write_lock_irq (kernel/locking/spinlock.c:153) [ 186.397502][ C0] ? allocate_slab (mm/slub.c:378 mm/slub.c:1929 mm/slub.c:1995) [ 186.399281][ C0] dput (fs/dcache.c:790 fs/dcache.c:900) [ 186.400885][ C0] path_put (fs/namei.c:559) [ 186.402250][ C0] do_execveat_common+0x5ec/0x7c0 [ 186.403558][ C0] ? __check_object_size (mm/memremap.c:37) [ 186.404848][ C0] ? alloc_bprm (fs/exec.c:1866) [ 186.406039][ C0] __ia32_compat_sys_execve (fs/exec.c:2117) [ 186.407302][ C0] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178) [ 186.408512][ C0] do_fast_syscall_32 (arch/x86/entry/common.c:203) [ 186.409678][ C0] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:122) [ 186.410947][ C0] RIP: 0023:0xf7f3c549 [ 186.412085][ C0] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00 All code ======== 0: 03 74 c0 01 add 0x1(%rax,%rax,8),%esi 4: 10 05 03 74 b8 01 adc %al,0x1b87403(%rip) # 0x1b8740d a: 10 06 adc %al,(%rsi) c: 03 74 b4 01 add 0x1(%rsp,%rsi,4),%esi 10: 10 07 adc %al,(%rdi) 12: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi 16: 10 08 adc %cl,(%rax) 18: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi 1c: 00 00 add %al,(%rax) 1e: 00 00 add %al,(%rax) 20: 00 51 52 add %dl,0x52(%rcx) 23: 55 push %rbp 24: 89 e5 mov %esp,%ebp 26: 0f 34 sysenter 28: cd 80 int $0x80 2a:* 5d pop %rbp <-- trapping instruction 2b: 5a pop %rdx 2c: 59 pop %rcx 2d: c3 retq 2e: 90 nop 2f: 90 nop 30: 90 nop 31: 90 nop 32: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi 39: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi Code starting with the faulting instruction =========================================== 0: 5d pop %rbp 1: 5a pop %rdx 2: 59 pop %rcx 3: c3 retq 4: 90 nop 5: 90 nop 6: 90 nop 7: 90 nop 8: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi f: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi If you fix the issue, kindly add following tag | Reported-by: kernel test robot <yujie.liu@intel.com> | Link: https://lore.kernel.org/r/202209221401.90061e56-yujie.liu@intel.com To reproduce: # build kernel cd linux cp config-6.0.0-rc2-00003-g0a276ae2d224 .config make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install cd <mod-install-dir> find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email # if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
On Mon, Sep 19, 2022 at 01:02:51PM -0700, Kees Cook wrote: > On Sat, Sep 17, 2022 at 01:50:24AM +0100, Josh Triplett wrote: > > On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote: > > > I don't like the idea of penalizing the _succeeding_ case, though, which > > > happens if we do the path walk twice. So, I went and refactoring the setup > > > order, moving the do_open_execat() up into alloc_bprm() instead of where > > > it was in bprm_exec(). The result makes it so it is, as you observed, > > > before the mm creation and generally expensive argument copying. The > > > difference to your patch seems to only be the allocation of the file > > > table entry, but avoids the double lookup, so I'm hoping the result is > > > actually even faster. > > > > Thanks for giving this a try; I'd wondered how feasible it would be to > > just do one lookup. > > > > However, on the same test system with the same test setup, with your > > refactor it seems to go slower: > > fork/execvpe: 38087ns > > fork/execve: 33758ns > > > > For comparison, the previous numbers (which I re-confirmed): > > > > Without fast-path: > > fork/execvpe: 49876ns > > fork/execve: 32773ns > > > > With my original separate-lookup fast-path: > > fork/execvpe: 36890ns > > fork/execve: 31551ns > > Hmm, this shows as slower in the *normal* case, which I find rather > surprising -- it's the same work, just reordered. > > Can you post a URL to your tests? I'd like to reproduce this and maybe > throw perf at it as well. Sure. Sorry for the delay, needed to integrate some fixes (such as aarch64 support) and factor out the bits that won't build if you don't have a patched liburing. https://github.com/joshtriplett/spawnbench - Josh Triplett
On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote: > Currently, execve allocates an mm and parses argv and envp before > checking if the path exists. However, the common case of a $PATH search > may have several failed calls to exec before a single success. Do a > filename lookup for the purposes of returning ENOENT before doing more > expensive operations. > > This does not create a TOCTTOU race, because this can only happen if the > file didn't exist at some point during the exec call, and that point is > permitted to be when we did our lookup. > > To measure performance, I ran 2000 fork and execvpe calls with a > seven-element PATH in which the file was found in the seventh directory > (representative of the common case as /usr/bin is the seventh directory > on my $PATH), as well as 2000 fork and execve calls with an absolute > path to an existing binary. I recorded the minimum time for each, to > eliminate noise from context switches and similar. > > Without fast-path: > fork/execvpe: 49876ns > fork/execve: 32773ns > > With fast-path: > fork/execvpe: 36890ns > fork/execve: 32069ns > > The cost of the additional lookup seems to be in the noise for a > successful exec, but it provides a 26% improvement for the path search > case by speeding up the six failed execs. > > Signed-off-by: Josh Triplett <josh@joshtriplett.org> *thread necromancy* I'll snag this patch after -rc1 is out. Based on the research we both did in the rest of this thread, this original patch is a clear win. Let's get it into linux-next and see if anything else falls out of it. I did, however, scratch my head over the 0-day report: https://lore.kernel.org/lkml/202209221401.90061e56-yujie.liu@intel.com/ But I can't see why this patch could trigger those problems... Thanks! -Kees
On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote: > Currently, execve allocates an mm and parses argv and envp before > checking if the path exists. However, the common case of a $PATH search > may have several failed calls to exec before a single success. Do a > filename lookup for the purposes of returning ENOENT before doing more > expensive operations. > > This does not create a TOCTTOU race, because this can only happen if the > file didn't exist at some point during the exec call, and that point is > permitted to be when we did our lookup. > > To measure performance, I ran 2000 fork and execvpe calls with a > seven-element PATH in which the file was found in the seventh directory > (representative of the common case as /usr/bin is the seventh directory > on my $PATH), as well as 2000 fork and execve calls with an absolute > path to an existing binary. I recorded the minimum time for each, to > eliminate noise from context switches and similar. > > Without fast-path: > fork/execvpe: 49876ns > fork/execve: 32773ns > > With fast-path: > fork/execvpe: 36890ns > fork/execve: 32069ns > > The cost of the additional lookup seems to be in the noise for a > successful exec, but it provides a 26% improvement for the path search > case by speeding up the six failed execs. > > Signed-off-by: Josh Triplett <josh@joshtriplett.org> > --- > > Discussed this at Plumbers with Kees Cook; turned out to be even more of > a win than anticipated. > > fs/exec.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/fs/exec.c b/fs/exec.c > index 9a5ca7b82bfc..fe786aeb2f1b 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1881,6 +1881,16 @@ static int do_execveat_common(int fd, struct filename *filename, > if (IS_ERR(filename)) > return PTR_ERR(filename); > > + /* Fast-path ENOENT for $PATH search failures, before we alloc an mm or > + * parse arguments. */ > + if (fd == AT_FDCWD && flags == 0 && filename->name[0] == '/') { > + struct path path; > + retval = filename_lookup(AT_FDCWD, filename, 0, &path, NULL); > + if (retval == -ENOENT) Oh, actually, I see the 0-day problem. This should be: if (retval < 0) > + goto out_ret; > + path_put(&path); Otherwise this put will happen for an non-successful lookup that wans't ENOENT. I've fixed this in my tree. -Kees > + } > + > /* > * We move the actual failure in case of RLIMIT_NPROC excess from > * set*uid() to execve() because too many poorly written programs > -- > 2.37.2 >
On Tue, Nov 07, 2023 at 12:30:37PM -0800, Kees Cook wrote: > On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote: > > Currently, execve allocates an mm and parses argv and envp before > > checking if the path exists. However, the common case of a $PATH search > > may have several failed calls to exec before a single success. Do a > > filename lookup for the purposes of returning ENOENT before doing more > > expensive operations. > > > > This does not create a TOCTTOU race, because this can only happen if the > > file didn't exist at some point during the exec call, and that point is > > permitted to be when we did our lookup. > > > > To measure performance, I ran 2000 fork and execvpe calls with a > > seven-element PATH in which the file was found in the seventh directory > > (representative of the common case as /usr/bin is the seventh directory > > on my $PATH), as well as 2000 fork and execve calls with an absolute > > path to an existing binary. I recorded the minimum time for each, to > > eliminate noise from context switches and similar. > > > > Without fast-path: > > fork/execvpe: 49876ns > > fork/execve: 32773ns > > > > With fast-path: > > fork/execvpe: 36890ns > > fork/execve: 32069ns > > > > The cost of the additional lookup seems to be in the noise for a > > successful exec, but it provides a 26% improvement for the path search > > case by speeding up the six failed execs. > > > > Signed-off-by: Josh Triplett <josh@joshtriplett.org> > > *thread necromancy* > > I'll snag this patch after -rc1 is out. Based on the research we both > did in the rest of this thread, this original patch is a clear win. > Let's get it into linux-next and see if anything else falls out of it. > So the obvious question is why not store lookup result within bprm, instead of doing the lookup again later. Turns out you had very same question and even wrote a patch to sort it out: https://lore.kernel.org/all/202209161637.9EDAF6B18@keescook/ Why do you intend to go with this patch instead?
On 11/7/23, Mateusz Guzik <mjguzik@gmail.com> wrote: > On Tue, Nov 07, 2023 at 12:30:37PM -0800, Kees Cook wrote: >> On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote: >> > Currently, execve allocates an mm and parses argv and envp before >> > checking if the path exists. However, the common case of a $PATH search >> > may have several failed calls to exec before a single success. Do a >> > filename lookup for the purposes of returning ENOENT before doing more >> > expensive operations. >> > >> > This does not create a TOCTTOU race, because this can only happen if >> > the >> > file didn't exist at some point during the exec call, and that point is >> > permitted to be when we did our lookup. >> > >> > To measure performance, I ran 2000 fork and execvpe calls with a >> > seven-element PATH in which the file was found in the seventh directory >> > (representative of the common case as /usr/bin is the seventh directory >> > on my $PATH), as well as 2000 fork and execve calls with an absolute >> > path to an existing binary. I recorded the minimum time for each, to >> > eliminate noise from context switches and similar. >> > >> > Without fast-path: >> > fork/execvpe: 49876ns >> > fork/execve: 32773ns >> > >> > With fast-path: >> > fork/execvpe: 36890ns >> > fork/execve: 32069ns >> > >> > The cost of the additional lookup seems to be in the noise for a >> > successful exec, but it provides a 26% improvement for the path search >> > case by speeding up the six failed execs. >> > >> > Signed-off-by: Josh Triplett <josh@joshtriplett.org> >> >> *thread necromancy* >> >> I'll snag this patch after -rc1 is out. Based on the research we both >> did in the rest of this thread, this original patch is a clear win. >> Let's get it into linux-next and see if anything else falls out of it. >> > > So the obvious question is why not store lookup result within bprm, > instead of doing the lookup again later. > > Turns out you had very same question and even wrote a patch to sort it > out: https://lore.kernel.org/all/202209161637.9EDAF6B18@keescook/ > > Why do you intend to go with this patch instead? > Welp, should have read the remaining follow up discussion. Even so, the patch which is only doing the lookup once cannot be legitimately slower. Note there is often perf variance between different boots of the same kernel and I suspect this is what justifies it. I would suggest adding a runtime knob to control whether 1. lookups are done late (stock kernel) 2. there is one upfront lookup like in the original patch and finally 3. stuffing file into bprm Then a bunch of runs with the knob cycling between them could serve as a justification. If the patch which dodges second lookup still somehow appears slower a flamegraph or other profile would be nice. I can volunteer to take a look at what's going on provided above measurements will be done and show funkyness.
On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote: > If the patch which dodges second lookup still somehow appears slower a > flamegraph or other profile would be nice. I can volunteer to take a > look at what's going on provided above measurements will be done and > show funkyness. When I looked at this last, it seemed like all the work done in do_filp_open() (my patch, which moved the lookup earlier) was heavier than the duplicate filename_lookup(). What I didn't test was moving the sched_exec() before the mm creation, which Peter confirmed shouldn't be a problem, but I think that might be only a tiny benefit, if at all. If you can do some comparisons, that would be great; it always takes me a fair bit of time to get set up for flame graph generation, etc. :) -Kees
On 11/7/23, Kees Cook <keescook@chromium.org> wrote: > On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote: >> If the patch which dodges second lookup still somehow appears slower a >> flamegraph or other profile would be nice. I can volunteer to take a >> look at what's going on provided above measurements will be done and >> show funkyness. > > When I looked at this last, it seemed like all the work done in > do_filp_open() (my patch, which moved the lookup earlier) was heavier > than the duplicate filename_lookup(). > > What I didn't test was moving the sched_exec() before the mm creation, > which Peter confirmed shouldn't be a problem, but I think that might be > only a tiny benefit, if at all. > > If you can do some comparisons, that would be great; it always takes me > a fair bit of time to get set up for flame graph generation, etc. :) > So I spawned *one* process executing one statocally linked binary in a loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c . The profile is definitely not what I expected: 5.85% [kernel] [k] asm_exc_page_fault 5.84% [kernel] [k] __pv_queued_spin_lock_slowpath [snip] I'm going to have to recompile with lock profiling, meanwhile according to bpftrace (bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }') top hits would be: @[ __pv_queued_spin_lock_slowpath+1 _raw_spin_lock+37 __schedule+192 schedule_idle+38 do_idle+366 cpu_startup_entry+38 start_secondary+282 secondary_startup_64_no_verify+381 ]: 181 @[ __pv_queued_spin_lock_slowpath+1 _raw_spin_lock_irq+43 wait_for_completion+141 stop_one_cpu+127 sched_exec+165 bprm_execve+328 do_execveat_common.isra.0+429 __x64_sys_execve+50 do_syscall_64+46 entry_SYSCALL_64_after_hwframe+110 ]: 206 I did not see this coming for sure. I'll poke around maybe this weekend.
On November 7, 2023 3:08:47 PM PST, Mateusz Guzik <mjguzik@gmail.com> wrote: >On 11/7/23, Kees Cook <keescook@chromium.org> wrote: >> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote: >>> If the patch which dodges second lookup still somehow appears slower a >>> flamegraph or other profile would be nice. I can volunteer to take a >>> look at what's going on provided above measurements will be done and >>> show funkyness. >> >> When I looked at this last, it seemed like all the work done in >> do_filp_open() (my patch, which moved the lookup earlier) was heavier >> than the duplicate filename_lookup(). >> >> What I didn't test was moving the sched_exec() before the mm creation, >> which Peter confirmed shouldn't be a problem, but I think that might be >> only a tiny benefit, if at all. >> >> If you can do some comparisons, that would be great; it always takes me >> a fair bit of time to get set up for flame graph generation, etc. :) >> > >So I spawned *one* process executing one statocally linked binary in a >loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c . > >The profile is definitely not what I expected: > 5.85% [kernel] [k] asm_exc_page_fault > 5.84% [kernel] [k] __pv_queued_spin_lock_slowpath >[snip] > >I'm going to have to recompile with lock profiling, meanwhile >according to bpftrace >(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }') >top hits would be: > >@[ > __pv_queued_spin_lock_slowpath+1 > _raw_spin_lock+37 > __schedule+192 > schedule_idle+38 > do_idle+366 > cpu_startup_entry+38 > start_secondary+282 > secondary_startup_64_no_verify+381 >]: 181 >@[ > __pv_queued_spin_lock_slowpath+1 > _raw_spin_lock_irq+43 > wait_for_completion+141 > stop_one_cpu+127 > sched_exec+165 There's the suspicious sched_exec() I was talking about! :) I think it needs to be moved, and perhaps _later_ instead of earlier? Hmm... -Kees > bprm_execve+328 > do_execveat_common.isra.0+429 > __x64_sys_execve+50 > do_syscall_64+46 > entry_SYSCALL_64_after_hwframe+110 >]: 206 > >I did not see this coming for sure. I'll poke around maybe this weekend. >
On 11/8/23, Kees Cook <kees@kernel.org> wrote: > > > On November 7, 2023 3:08:47 PM PST, Mateusz Guzik <mjguzik@gmail.com> > wrote: >>On 11/7/23, Kees Cook <keescook@chromium.org> wrote: >>> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote: >>>> If the patch which dodges second lookup still somehow appears slower a >>>> flamegraph or other profile would be nice. I can volunteer to take a >>>> look at what's going on provided above measurements will be done and >>>> show funkyness. >>> >>> When I looked at this last, it seemed like all the work done in >>> do_filp_open() (my patch, which moved the lookup earlier) was heavier >>> than the duplicate filename_lookup(). >>> >>> What I didn't test was moving the sched_exec() before the mm creation, >>> which Peter confirmed shouldn't be a problem, but I think that might be >>> only a tiny benefit, if at all. >>> >>> If you can do some comparisons, that would be great; it always takes me >>> a fair bit of time to get set up for flame graph generation, etc. :) >>> >> >>So I spawned *one* process executing one statocally linked binary in a >>loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c . >> >>The profile is definitely not what I expected: >> 5.85% [kernel] [k] asm_exc_page_fault >> 5.84% [kernel] [k] __pv_queued_spin_lock_slowpath >>[snip] >> >>I'm going to have to recompile with lock profiling, meanwhile >>according to bpftrace >>(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = >> count(); }') >>top hits would be: >> >>@[ >> __pv_queued_spin_lock_slowpath+1 >> _raw_spin_lock+37 >> __schedule+192 >> schedule_idle+38 >> do_idle+366 >> cpu_startup_entry+38 >> start_secondary+282 >> secondary_startup_64_no_verify+381 >>]: 181 >>@[ >> __pv_queued_spin_lock_slowpath+1 >> _raw_spin_lock_irq+43 >> wait_for_completion+141 >> stop_one_cpu+127 >> sched_exec+165 > > There's the suspicious sched_exec() I was talking about! :) > > I think it needs to be moved, and perhaps _later_ instead of earlier? > Hmm... > I'm getting around 3.4k execs/s. However, if I "taskset -c 3 ./static-doexec 1" the number goes up to about 9.5k and lock contention disappears from the profile. So off hand looks like the task is walking around the box when it perhaps could be avoided -- it is idle apart from running the test. Again this is going to require a serious look instead of ad hoc pokes. Side note I actually read your patch this time around instead of skimming through it and assuming it did what I thought. do_filp_open is of course very expensive and kmalloc + kfree are slow. On top of it deallocating a file object even after a failed open was very expensive due to delegation to task_work (recently fixed). What I claim should be clear-cut faster is that lookup as in the original patch and only messing with file allocation et al if it succeeds.
On Wed, Nov 08, 2023 at 01:03:33AM +0100, Mateusz Guzik wrote: > [...] > >>@[ > >> __pv_queued_spin_lock_slowpath+1 > >> _raw_spin_lock_irq+43 > >> wait_for_completion+141 > >> stop_one_cpu+127 > >> sched_exec+165 > > > > There's the suspicious sched_exec() I was talking about! :) > > > > I think it needs to be moved, and perhaps _later_ instead of earlier? > > Hmm... > > > > I'm getting around 3.4k execs/s. However, if I "taskset -c 3 > ./static-doexec 1" the number goes up to about 9.5k and lock > contention disappears from the profile. So off hand looks like the > task is walking around the box when it perhaps could be avoided -- it > is idle apart from running the test. Again this is going to require a > serious look instead of ad hoc pokes. Hm, that is pretty interesting. I'll see if I can go find the original rationale for adding sched_exec() in there... > Side note I actually read your patch this time around instead of > skimming through it and assuming it did what I thought. > > do_filp_open is of course very expensive and kmalloc + kfree are slow. > On top of it deallocating a file object even after a failed open was > very expensive due to delegation to task_work (recently fixed). > > What I claim should be clear-cut faster is that lookup as in the > original patch and only messing with file allocation et al if it > succeeds. I'm less familiar with the VFS guts here -- I'm open to alternatives! :)
On Wed, Nov 08, 2023 at 01:03:33AM +0100, Mateusz Guzik wrote: > On 11/8/23, Kees Cook <kees@kernel.org> wrote: > > > > > > On November 7, 2023 3:08:47 PM PST, Mateusz Guzik <mjguzik@gmail.com> > > wrote: > >>On 11/7/23, Kees Cook <keescook@chromium.org> wrote: > >>> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote: > >>>> If the patch which dodges second lookup still somehow appears slower a > >>>> flamegraph or other profile would be nice. I can volunteer to take a > >>>> look at what's going on provided above measurements will be done and > >>>> show funkyness. > >>> > >>> When I looked at this last, it seemed like all the work done in > >>> do_filp_open() (my patch, which moved the lookup earlier) was heavier > >>> than the duplicate filename_lookup(). > >>> > >>> What I didn't test was moving the sched_exec() before the mm creation, > >>> which Peter confirmed shouldn't be a problem, but I think that might be > >>> only a tiny benefit, if at all. > >>> > >>> If you can do some comparisons, that would be great; it always takes me > >>> a fair bit of time to get set up for flame graph generation, etc. :) > >>> > >> > >>So I spawned *one* process executing one statocally linked binary in a > >>loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c . > >> > >>The profile is definitely not what I expected: > >> 5.85% [kernel] [k] asm_exc_page_fault > >> 5.84% [kernel] [k] __pv_queued_spin_lock_slowpath > >>[snip] > >> > >>I'm going to have to recompile with lock profiling, meanwhile > >>according to bpftrace > >>(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = > >> count(); }') > >>top hits would be: > >> > >>@[ > >> __pv_queued_spin_lock_slowpath+1 > >> _raw_spin_lock+37 > >> __schedule+192 > >> schedule_idle+38 > >> do_idle+366 > >> cpu_startup_entry+38 > >> start_secondary+282 > >> secondary_startup_64_no_verify+381 > >>]: 181 > >>@[ > >> __pv_queued_spin_lock_slowpath+1 > >> _raw_spin_lock_irq+43 > >> wait_for_completion+141 > >> stop_one_cpu+127 > >> sched_exec+165 > > > > There's the suspicious sched_exec() I was talking about! :) > > > > I think it needs to be moved, and perhaps _later_ instead of earlier? > > Hmm... > > > > I'm getting around 3.4k execs/s. However, if I "taskset -c 3 > ./static-doexec 1" the number goes up to about 9.5k and lock > contention disappears from the profile. So off hand looks like the > task is walking around the box when it perhaps could be avoided -- it > is idle apart from running the test. Again this is going to require a > serious look instead of ad hoc pokes. Peter, is this something you can speak to? It seems like execve() forces a change in running CPU. Is this really something we want to be doing? Or is there some better way to keep it on the same CPU unless there is contention? -Kees
On 11/8/23, Kees Cook <keescook@chromium.org> wrote: > On Wed, Nov 08, 2023 at 01:03:33AM +0100, Mateusz Guzik wrote: >> On 11/8/23, Kees Cook <kees@kernel.org> wrote: >> > >> > >> > On November 7, 2023 3:08:47 PM PST, Mateusz Guzik <mjguzik@gmail.com> >> > wrote: >> >>On 11/7/23, Kees Cook <keescook@chromium.org> wrote: >> >>> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote: >> >>>> If the patch which dodges second lookup still somehow appears slower >> >>>> a >> >>>> flamegraph or other profile would be nice. I can volunteer to take a >> >>>> look at what's going on provided above measurements will be done and >> >>>> show funkyness. >> >>> >> >>> When I looked at this last, it seemed like all the work done in >> >>> do_filp_open() (my patch, which moved the lookup earlier) was heavier >> >>> than the duplicate filename_lookup(). >> >>> >> >>> What I didn't test was moving the sched_exec() before the mm >> >>> creation, >> >>> which Peter confirmed shouldn't be a problem, but I think that might >> >>> be >> >>> only a tiny benefit, if at all. >> >>> >> >>> If you can do some comparisons, that would be great; it always takes >> >>> me >> >>> a fair bit of time to get set up for flame graph generation, etc. :) >> >>> >> >> >> >>So I spawned *one* process executing one statocally linked binary in a >> >>loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c . >> >> >> >>The profile is definitely not what I expected: >> >> 5.85% [kernel] [k] asm_exc_page_fault >> >> 5.84% [kernel] [k] __pv_queued_spin_lock_slowpath >> >>[snip] >> >> >> >>I'm going to have to recompile with lock profiling, meanwhile >> >>according to bpftrace >> >>(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = >> >> count(); }') >> >>top hits would be: >> >> >> >>@[ >> >> __pv_queued_spin_lock_slowpath+1 >> >> _raw_spin_lock+37 >> >> __schedule+192 >> >> schedule_idle+38 >> >> do_idle+366 >> >> cpu_startup_entry+38 >> >> start_secondary+282 >> >> secondary_startup_64_no_verify+381 >> >>]: 181 >> >>@[ >> >> __pv_queued_spin_lock_slowpath+1 >> >> _raw_spin_lock_irq+43 >> >> wait_for_completion+141 >> >> stop_one_cpu+127 >> >> sched_exec+165 >> > >> > There's the suspicious sched_exec() I was talking about! :) >> > >> > I think it needs to be moved, and perhaps _later_ instead of earlier? >> > Hmm... >> > >> >> I'm getting around 3.4k execs/s. However, if I "taskset -c 3 >> ./static-doexec 1" the number goes up to about 9.5k and lock >> contention disappears from the profile. So off hand looks like the >> task is walking around the box when it perhaps could be avoided -- it >> is idle apart from running the test. Again this is going to require a >> serious look instead of ad hoc pokes. > > Peter, is this something you can speak to? It seems like execve() forces > a change in running CPU. Is this really something we want to be doing? > Or is there some better way to keep it on the same CPU unless there is > contention? > sched_exec causes migration only for only few % of execs in the bench, but when it does happen there is tons of overhead elsewhere. I expect real programs which get past execve will be prone to migrating anyway, regardless of what sched_exec is doing. That is to say, while sched_exec buggering off here would be nice, I think for real-world wins the thing to investigate is the overhead which comes from migration to begin with.
Mateusz Guzik <mjguzik@gmail.com> writes: > On 11/8/23, Kees Cook <keescook@chromium.org> wrote: >> On Wed, Nov 08, 2023 at 01:03:33AM +0100, Mateusz Guzik wrote: >>> I'm getting around 3.4k execs/s. However, if I "taskset -c 3 >>> ./static-doexec 1" the number goes up to about 9.5k and lock >>> contention disappears from the profile. So off hand looks like the >>> task is walking around the box when it perhaps could be avoided -- it >>> is idle apart from running the test. Again this is going to require a >>> serious look instead of ad hoc pokes. >> >> Peter, is this something you can speak to? It seems like execve() forces >> a change in running CPU. Is this really something we want to be doing? >> Or is there some better way to keep it on the same CPU unless there is >> contention? >> > > sched_exec causes migration only for only few % of execs in the bench, > but when it does happen there is tons of overhead elsewhere. > > I expect real programs which get past execve will be prone to > migrating anyway, regardless of what sched_exec is doing. > > That is to say, while sched_exec buggering off here would be nice, I > think for real-world wins the thing to investigate is the overhead > which comes from migration to begin with. I have a vague memory that the idea is that there is a point during exec when it should be much less expensive than normal to allow migration between cpus because all of the old state has gone away. Assuming that is the rationale, if we are getting lock contention then either there is a global lock in there, or there is the potential to pick a less expensive location within exec. Just to confirm my memory I dug a little deeper and I found the original commit that added sched_exec (in tglx's git tree of the bit keeper history). commit f01419fd6d4e5b32fef19d206bc3550cc04567a9 Author: Martin J. Bligh <mbligh@aracnet.com> Date: Wed Jan 15 19:46:10 2003 -0800 [PATCH] (2/3) Initial load balancing Patch from Michael Hohnbaum This adds a hook, sched_balance_exec(), to the exec code, to make it place the exec'ed task on the least loaded queue. We have less state to move at exec time than fork time, so this is the cheapest point to cross-node migrate. Experience in Dynix/PTX and testing on Linux has confirmed that this is the cheapest time to move tasks between nodes. It also macro-wraps changes to nr_running, to allow us to keep track of per-node nr_running as well. Again, no impact on non-NUMA machines. Eric
On 11/9/23, Eric W. Biederman <ebiederm@xmission.com> wrote: > Mateusz Guzik <mjguzik@gmail.com> writes: >> sched_exec causes migration only for only few % of execs in the bench, >> but when it does happen there is tons of overhead elsewhere. >> >> I expect real programs which get past execve will be prone to >> migrating anyway, regardless of what sched_exec is doing. >> >> That is to say, while sched_exec buggering off here would be nice, I >> think for real-world wins the thing to investigate is the overhead >> which comes from migration to begin with. > > I have a vague memory that the idea is that there is a point during exec > when it should be much less expensive than normal to allow migration > between cpus because all of the old state has gone away. > > Assuming that is the rationale, if we are getting lock contention > then either there is a global lock in there, or there is the potential > to pick a less expensive location within exec. > Given the commit below I think the term "migration cost" is overloaded here. By migration cost in my previous mail I meant the immediate cost (stop_one_cpu and so on), but also the aftermath -- for example tlb flushes on another CPU when tearing down your now-defunct mm after you switched. For testing purposes I verified commenting out sched_exec and not using taskset still gives me about 9.5k ops/s. I 100% agree should the task be moved between NUMA domains, it makes sense to do it when it has the smallest footprint. I don't know what the original patch did, the current code just picks a CPU and migrates to it, regardless of NUMA considerations. I will note that the goal would still be achieved by comparing domains and doing nothing if they match. I think this would be nice to fix, but it is definitely not a big deal. I guess the question is to Peter Zijlstra if this sounds reasonable. > Just to confirm my memory I dug a little deeper and I found the original > commit that added sched_exec (in tglx's git tree of the bit keeper > history). > > commit f01419fd6d4e5b32fef19d206bc3550cc04567a9 > Author: Martin J. Bligh <mbligh@aracnet.com> > Date: Wed Jan 15 19:46:10 2003 -0800 > > [PATCH] (2/3) Initial load balancing > > Patch from Michael Hohnbaum > > This adds a hook, sched_balance_exec(), to the exec code, to make it > place the exec'ed task on the least loaded queue. We have less state > to move at exec time than fork time, so this is the cheapest point > to cross-node migrate. Experience in Dynix/PTX and testing on Linux > has confirmed that this is the cheapest time to move tasks between > nodes. > > It also macro-wraps changes to nr_running, to allow us to keep track of > per-node nr_running as well. Again, no impact on non-NUMA machines. > > > Eric > >
Mateusz Guzik <mjguzik@gmail.com> writes: > On 11/9/23, Eric W. Biederman <ebiederm@xmission.com> wrote: >> Mateusz Guzik <mjguzik@gmail.com> writes: >>> sched_exec causes migration only for only few % of execs in the bench, >>> but when it does happen there is tons of overhead elsewhere. >>> >>> I expect real programs which get past execve will be prone to >>> migrating anyway, regardless of what sched_exec is doing. >>> >>> That is to say, while sched_exec buggering off here would be nice, I >>> think for real-world wins the thing to investigate is the overhead >>> which comes from migration to begin with. >> >> I have a vague memory that the idea is that there is a point during exec >> when it should be much less expensive than normal to allow migration >> between cpus because all of the old state has gone away. >> >> Assuming that is the rationale, if we are getting lock contention >> then either there is a global lock in there, or there is the potential >> to pick a less expensive location within exec. >> > > Given the commit below I think the term "migration cost" is overloaded here. > > By migration cost in my previous mail I meant the immediate cost > (stop_one_cpu and so on), but also the aftermath -- for example tlb > flushes on another CPU when tearing down your now-defunct mm after you > switched. > > For testing purposes I verified commenting out sched_exec and not > using taskset still gives me about 9.5k ops/s. > > I 100% agree should the task be moved between NUMA domains, it makes > sense to do it when it has the smallest footprint. I don't know what > the original patch did, the current code just picks a CPU and migrates > to it, regardless of NUMA considerations. I will note that the goal > would still be achieved by comparing domains and doing nothing if they > match. > > I think this would be nice to fix, but it is definitely not a big > deal. I guess the question is to Peter Zijlstra if this sounds > reasonable. Perhaps I misread the trace. My point was simply that the sched_exec seemed to be causing lock contention because what was on one cpu is now on another cpu, and we are now getting cross cpu lock ping-pongs. If the sched_exec is causing exec to cause cross cpu lock ping-pongs, then we can move sched_exec to a better place within exec. It has already happened once, shortly after it was introduced. Ultimately we want the sched_exec to be in the cheapest place within exec that we can find. Eric
diff --git a/fs/exec.c b/fs/exec.c index 9a5ca7b82bfc..fe786aeb2f1b 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1881,6 +1881,16 @@ static int do_execveat_common(int fd, struct filename *filename, if (IS_ERR(filename)) return PTR_ERR(filename); + /* Fast-path ENOENT for $PATH search failures, before we alloc an mm or + * parse arguments. */ + if (fd == AT_FDCWD && flags == 0 && filename->name[0] == '/') { + struct path path; + retval = filename_lookup(AT_FDCWD, filename, 0, &path, NULL); + if (retval == -ENOENT) + goto out_ret; + path_put(&path); + } + /* * We move the actual failure in case of RLIMIT_NPROC excess from * set*uid() to execve() because too many poorly written programs
Currently, execve allocates an mm and parses argv and envp before checking if the path exists. However, the common case of a $PATH search may have several failed calls to exec before a single success. Do a filename lookup for the purposes of returning ENOENT before doing more expensive operations. This does not create a TOCTTOU race, because this can only happen if the file didn't exist at some point during the exec call, and that point is permitted to be when we did our lookup. To measure performance, I ran 2000 fork and execvpe calls with a seven-element PATH in which the file was found in the seventh directory (representative of the common case as /usr/bin is the seventh directory on my $PATH), as well as 2000 fork and execve calls with an absolute path to an existing binary. I recorded the minimum time for each, to eliminate noise from context switches and similar. Without fast-path: fork/execvpe: 49876ns fork/execve: 32773ns With fast-path: fork/execvpe: 36890ns fork/execve: 32069ns The cost of the additional lookup seems to be in the noise for a successful exec, but it provides a 26% improvement for the path search case by speeding up the six failed execs. Signed-off-by: Josh Triplett <josh@joshtriplett.org> --- Discussed this at Plumbers with Kees Cook; turned out to be even more of a win than anticipated. fs/exec.c | 10 ++++++++++ 1 file changed, 10 insertions(+)