diff mbox series

fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

Message ID 5c7333ea4bec2fad1b47a8fa2db7c31e4ffc4f14.1663334978.git.josh@joshtriplett.org (mailing list archive)
State New, archived
Headers show
Series fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm | expand

Commit Message

Josh Triplett Sept. 16, 2022, 1:41 p.m. UTC
Currently, execve allocates an mm and parses argv and envp before
checking if the path exists. However, the common case of a $PATH search
may have several failed calls to exec before a single success. Do a
filename lookup for the purposes of returning ENOENT before doing more
expensive operations.

This does not create a TOCTTOU race, because this can only happen if the
file didn't exist at some point during the exec call, and that point is
permitted to be when we did our lookup.

To measure performance, I ran 2000 fork and execvpe calls with a
seven-element PATH in which the file was found in the seventh directory
(representative of the common case as /usr/bin is the seventh directory
on my $PATH), as well as 2000 fork and execve calls with an absolute
path to an existing binary. I recorded the minimum time for each, to
eliminate noise from context switches and similar.

Without fast-path:
fork/execvpe: 49876ns
fork/execve:  32773ns

With fast-path:
fork/execvpe: 36890ns
fork/execve:  32069ns

The cost of the additional lookup seems to be in the noise for a
successful exec, but it provides a 26% improvement for the path search
case by speeding up the six failed execs.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---

Discussed this at Plumbers with Kees Cook; turned out to be even more of
a win than anticipated.

 fs/exec.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Kees Cook Sept. 16, 2022, 2:38 p.m. UTC | #1
On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote:
> Currently, execve allocates an mm and parses argv and envp before
> checking if the path exists. However, the common case of a $PATH search
> may have several failed calls to exec before a single success. Do a
> filename lookup for the purposes of returning ENOENT before doing more
> expensive operations.

At first I didn't understand how you were seeing this, since I'm so used
to watching shell scripts under tracing, which correctly use stat():

$ strace bash -c foo
stat("/home/keescook/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/usr/local/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/foo", 0x7ffe1f9ddea0)   = -1 ENOENT (No such file or directory)
stat("/usr/bin/foo", 0x7ffe1f9ddea0)    = -1 ENOENT (No such file or directory)
stat("/sbin/foo", 0x7ffe1f9ddea0)       = -1 ENOENT (No such file or directory)
stat("/bin/foo", 0x7ffe1f9ddea0)        = -1 ENOENT (No such file or directory)

But I see, yes, glibc tries to actually call execve(), which, as you
say, is extremely heavy:

$ strace ./execvpe
...
execve("/home/kees/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)

This really seems much more like a glibc bug. The shell does it correctly...

-Kees
Josh Triplett Sept. 16, 2022, 8:13 p.m. UTC | #2
On Fri, Sep 16, 2022 at 07:38:37AM -0700, Kees Cook wrote:
> On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote:
> > Currently, execve allocates an mm and parses argv and envp before
> > checking if the path exists. However, the common case of a $PATH search
> > may have several failed calls to exec before a single success. Do a
> > filename lookup for the purposes of returning ENOENT before doing more
> > expensive operations.
> 
> At first I didn't understand how you were seeing this, since I'm so used
> to watching shell scripts under tracing, which correctly use stat():
> 
> $ strace bash -c foo
> stat("/home/keescook/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/usr/local/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/usr/local/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/usr/sbin/foo", 0x7ffe1f9ddea0)   = -1 ENOENT (No such file or directory)
> stat("/usr/bin/foo", 0x7ffe1f9ddea0)    = -1 ENOENT (No such file or directory)
> stat("/sbin/foo", 0x7ffe1f9ddea0)       = -1 ENOENT (No such file or directory)
> stat("/bin/foo", 0x7ffe1f9ddea0)        = -1 ENOENT (No such file or directory)
> 
> But I see, yes, glibc tries to actually call execve(), which, as you
> say, is extremely heavy:
> 
> $ strace ./execvpe
> ...
> execve("/home/kees/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/usr/local/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/usr/local/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/usr/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/usr/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> 
> This really seems much more like a glibc bug. The shell does it correctly...

musl does the same thing, as do python and perl (likely via execvp or
posix_spawnp). As does gcc when it executes `as`. And I've seen more
than a few programs hand-implement a PATH search the same way. Seems
worth optimizing for.

And with io_uring_spawn, it'll be the substantially easier approach,
since it'll just require one pass (series of execs) rather than two
(stats then exec).

- Josh Triplett
Kees Cook Sept. 17, 2022, 12:11 a.m. UTC | #3
[Hi Peter, apologies for dumping you into the middle of this thread.
I've got a question about sched_exec() below...]

On Fri, Sep 16, 2022 at 09:13:44PM +0100, Josh Triplett wrote:
> musl does the same thing, as do python and perl (likely via execvp or
> posix_spawnp). As does gcc when it executes `as`. And I've seen more
> than a few programs hand-implement a PATH search the same way. Seems
> worth optimizing for.

Yeah, it does seem like a simple way to eliminate needless work, though
I'd really like to see some kind of perf count of "in a given kernel
build, how many execve() system calls fail due to path search vs succeed",
just to get a better sense of the scale of the problem.

I don't like the idea of penalizing the _succeeding_ case, though, which
happens if we do the path walk twice. So, I went and refactoring the setup
order, moving the do_open_execat() up into alloc_bprm() instead of where
it was in bprm_exec(). The result makes it so it is, as you observed,
before the mm creation and generally expensive argument copying. The
difference to your patch seems to only be the allocation of the file
table entry, but avoids the double lookup, so I'm hoping the result is
actually even faster.

This cleanup is actually quite satisfying organizationally too -- the
fd and filename were passed around rather oddly.

The interaction with sched_exec() should be no worse (the file is opened
before it in either case), but in reading that function, it talks about
taking the opportunity to move the process to another CPU (IIUC) since,
paraphrasing, "it is at its lowest memory/cache size." But I wonder if
there is an existing accidental pessimistic result in that the process
stack has already been allocated. I am only passingly familiar with how
tasks get moved around under NUMA -- is the scheduler going to move
this process onto a different NUMA node and now it will be forced to
have the userspace process stack on one node and the program text and
heap on another? Or is that totally lost in the noise?

More specifically, I was wondering if processes would benefit from having
sched_exec() moved before the mm creation?

Regardless, here's a very lightly tested patch. Can you take this for a
spin and check your benchmark? Thanks!

-Kees

diff --git a/fs/exec.c b/fs/exec.c
index 9a5ca7b82bfc..5534301d67ca 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -898,6 +898,10 @@ EXPORT_SYMBOL(transfer_args_to_stack);
 
 #endif /* CONFIG_MMU */
 
+/*
+ * On success, callers must call do_close_execat() on the returned
+ * struct file.
+ */
 static struct file *do_open_execat(int fd, struct filename *name, int flags)
 {
 	struct file *file;
@@ -945,6 +949,16 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 	return ERR_PTR(err);
 }
 
+/**
+ * open_exec - Open a path name for execution
+ *
+ * @name: path name to open with the intent of executing it.
+ *
+ * Returns ERR_PTR on failure or allocated struct file on success.
+ *
+ * As this is a wrapper for the internal do_open_execat(), callers
+ * must call allow_write_access() before fput() on release.
+ */
 struct file *open_exec(const char *name)
 {
 	struct filename *filename = getname_kernel(name);
@@ -1485,6 +1499,15 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 	return -ENOMEM;
 }
 
+/* Matches do_open_execat() */
+static void do_close_execat(struct file *file)
+{
+	if (!file)
+		return;
+	allow_write_access(file);
+	fput(file);
+}
+
 static void free_bprm(struct linux_binprm *bprm)
 {
 	if (bprm->mm) {
@@ -1496,10 +1519,7 @@ static void free_bprm(struct linux_binprm *bprm)
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
-	if (bprm->file) {
-		allow_write_access(bprm->file);
-		fput(bprm->file);
-	}
+	do_close_execat(bprm->file);
 	if (bprm->executable)
 		fput(bprm->executable);
 	/* If a binfmt changed the interp, free it. */
@@ -1509,12 +1529,26 @@ static void free_bprm(struct linux_binprm *bprm)
 	kfree(bprm);
 }
 
-static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
+static struct linux_binprm *alloc_bprm(int fd, struct filename *filename,
+				       int flags)
 {
-	struct linux_binprm *bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
-	int retval = -ENOMEM;
-	if (!bprm)
+	struct linux_binprm *bprm;
+	struct file *file;
+	int retval;
+
+	file = do_open_execat(fd, filename, flags);
+	if (IS_ERR(file)) {
+		retval = PTR_ERR(file);
 		goto out;
+	}
+
+	retval = -ENOMEM;
+	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+	if (!bprm) {
+		do_close_execat(file);
+		goto out;
+	}
+	bprm->file = file;
 
 	if (fd == AT_FDCWD || filename->name[0] == '/') {
 		bprm->filename = filename->name;
@@ -1531,6 +1565,18 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
 	}
 	bprm->interp = bprm->filename;
 
+	/*
+	 * Record that a name derived from an O_CLOEXEC fd will be
+	 * inaccessible after exec.  This allows the code in exec to
+	 * choose to fail when the executable is not mmaped into the
+	 * interpreter and an open file descriptor is not passed to
+	 * the interpreter.  This makes for a better user experience
+	 * than having the interpreter start and then immediately fail
+	 * when it finds the executable is inaccessible.
+	 */
+	if (bprm->fdpath && get_close_on_exec(fd))
+		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
+
 	retval = bprm_mm_init(bprm);
 	if (retval)
 		goto out_free;
@@ -1803,10 +1849,8 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int bprm_execve(struct linux_binprm *bprm,
-		       int fd, struct filename *filename, int flags)
+static int bprm_execve(struct linux_binprm *bprm)
 {
-	struct file *file;
 	int retval;
 
 	retval = prepare_bprm_creds(bprm);
@@ -1816,26 +1860,8 @@ static int bprm_execve(struct linux_binprm *bprm,
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 
-	file = do_open_execat(fd, filename, flags);
-	retval = PTR_ERR(file);
-	if (IS_ERR(file))
-		goto out_unmark;
-
 	sched_exec();
 
-	bprm->file = file;
-	/*
-	 * Record that a name derived from an O_CLOEXEC fd will be
-	 * inaccessible after exec.  This allows the code in exec to
-	 * choose to fail when the executable is not mmaped into the
-	 * interpreter and an open file descriptor is not passed to
-	 * the interpreter.  This makes for a better user experience
-	 * than having the interpreter start and then immediately fail
-	 * when it finds the executable is inaccessible.
-	 */
-	if (bprm->fdpath && get_close_on_exec(fd))
-		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
-
 	/* Set the unchanging part of bprm->cred */
 	retval = security_bprm_creds_for_exec(bprm);
 	if (retval)
@@ -1863,7 +1889,6 @@ static int bprm_execve(struct linux_binprm *bprm,
 	if (bprm->point_of_no_return && !fatal_signal_pending(current))
 		force_fatal_sig(SIGSEGV);
 
-out_unmark:
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 
@@ -1897,7 +1922,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	 * further execve() calls fail. */
 	current->flags &= ~PF_NPROC_EXCEEDED;
 
-	bprm = alloc_bprm(fd, filename);
+	bprm = alloc_bprm(fd, filename, flags);
 	if (IS_ERR(bprm)) {
 		retval = PTR_ERR(bprm);
 		goto out_ret;
@@ -1946,7 +1971,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 		bprm->argc = 1;
 	}
 
-	retval = bprm_execve(bprm, fd, filename, flags);
+	retval = bprm_execve(bprm);
 out_free:
 	free_bprm(bprm);
 
@@ -1971,7 +1996,7 @@ int kernel_execve(const char *kernel_filename,
 	if (IS_ERR(filename))
 		return PTR_ERR(filename);
 
-	bprm = alloc_bprm(fd, filename);
+	bprm = alloc_bprm(fd, filename, 0);
 	if (IS_ERR(bprm)) {
 		retval = PTR_ERR(bprm);
 		goto out_ret;
@@ -2006,7 +2031,7 @@ int kernel_execve(const char *kernel_filename,
 	if (retval < 0)
 		goto out_free;
 
-	retval = bprm_execve(bprm, fd, filename, 0);
+	retval = bprm_execve(bprm);
 out_free:
 	free_bprm(bprm);
 out_ret:
Josh Triplett Sept. 17, 2022, 12:50 a.m. UTC | #4
On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote:
> I don't like the idea of penalizing the _succeeding_ case, though, which
> happens if we do the path walk twice. So, I went and refactoring the setup
> order, moving the do_open_execat() up into alloc_bprm() instead of where
> it was in bprm_exec(). The result makes it so it is, as you observed,
> before the mm creation and generally expensive argument copying. The
> difference to your patch seems to only be the allocation of the file
> table entry, but avoids the double lookup, so I'm hoping the result is
> actually even faster.

Thanks for giving this a try; I'd wondered how feasible it would be to
just do one lookup.

However, on the same test system with the same test setup, with your
refactor it seems to go slower:
fork/execvpe: 38087ns
fork/execve:  33758ns

For comparison, the previous numbers (which I re-confirmed):

Without fast-path:
fork/execvpe: 49876ns
fork/execve:  32773ns

With my original separate-lookup fast-path:
fork/execvpe: 36890ns
fork/execve:  31551ns


I tried several runs of each, and I seem to get reasonably consistent
results.

My test program just creates a pipe once, then loops on
clock_gettime/fork/execvpe/read, with the spawned child process doing
clock_gettime/write/exit (in asm to minimize overhead). The test PATH is
PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. with
the test program in the current directory.

- Josh Triplett
Peter Zijlstra Sept. 19, 2022, 2:34 p.m. UTC | #5
On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote:

> The interaction with sched_exec() should be no worse (the file is opened
> before it in either case), but in reading that function, it talks about
> taking the opportunity to move the process to another CPU (IIUC) since,
> paraphrasing, "it is at its lowest memory/cache size." But I wonder if
> there is an existing accidental pessimistic result in that the process
> stack has already been allocated. I am only passingly familiar with how
> tasks get moved around under NUMA -- is the scheduler going to move
> this process onto a different NUMA node and now it will be forced to
> have the userspace process stack on one node and the program text and
> heap on another? Or is that totally lost in the noise?

Probably noise; text is going to be a crap-shoot anyway due to DSOs only
having a single copy in the page-cache. And the stack will be relatively
small at this point and also, numa-balance can migrate those pages
around if they matter.

> More specifically, I was wondering if processes would benefit from having
> sched_exec() moved before the mm creation?

Can't hurt I think.
Kees Cook Sept. 19, 2022, 8:02 p.m. UTC | #6
On Sat, Sep 17, 2022 at 01:50:24AM +0100, Josh Triplett wrote:
> On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote:
> > I don't like the idea of penalizing the _succeeding_ case, though, which
> > happens if we do the path walk twice. So, I went and refactoring the setup
> > order, moving the do_open_execat() up into alloc_bprm() instead of where
> > it was in bprm_exec(). The result makes it so it is, as you observed,
> > before the mm creation and generally expensive argument copying. The
> > difference to your patch seems to only be the allocation of the file
> > table entry, but avoids the double lookup, so I'm hoping the result is
> > actually even faster.
> 
> Thanks for giving this a try; I'd wondered how feasible it would be to
> just do one lookup.
> 
> However, on the same test system with the same test setup, with your
> refactor it seems to go slower:
> fork/execvpe: 38087ns
> fork/execve:  33758ns
> 
> For comparison, the previous numbers (which I re-confirmed):
> 
> Without fast-path:
> fork/execvpe: 49876ns
> fork/execve:  32773ns
> 
> With my original separate-lookup fast-path:
> fork/execvpe: 36890ns
> fork/execve:  31551ns

Hmm, this shows as slower in the *normal* case, which I find rather
surprising -- it's the same work, just reordered.

Can you post a URL to your tests? I'd like to reproduce this and maybe
throw perf at it as well.

> I tried several runs of each, and I seem to get reasonably consistent
> results.
> 
> My test program just creates a pipe once, then loops on
> clock_gettime/fork/execvpe/read, with the spawned child process doing
> clock_gettime/write/exit (in asm to minimize overhead). The test PATH is
> PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. with
> the test program in the current directory.

I'm also curious about less synthetic testing. It'd be nice to see real
workloads with these changes, etc.
kernel test robot Sept. 22, 2022, 7:27 a.m. UTC | #7
Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 0a276ae2d224d48c9ddbeb9fa14bc2fd0893a7b9 ("[PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm")
url: https://github.com/intel-lab-lkp/linux/commits/Josh-Triplett/fs-exec-c-Add-fast-path-for-ENOENT-on-PATH-search-before-allocating-mm/20220916-214429
base: https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git for-next/execve
patch link: https://lore.kernel.org/linux-mm/5c7333ea4bec2fad1b47a8fa2db7c31e4ffc4f14.1663334978.git.josh@joshtriplett.org

in testcase: trinity
version: trinity-i386-4d2343bd-1_20200320
with following parameters:

	runtime: 300s
	group: group-01

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


[  129.613450][    C0] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 43s!
[  129.616497][    C0] Showing busy workqueues and worker pools:
[  129.620580][    C0] workqueue events: flags=0x0
[  129.623078][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 refcnt=5
[  129.623118][    C0]     in-flight: 52:do_free_init do_free_init, 29:key_garbage_collector key_garbage_collector
[  129.623151][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[  129.623161][    C0]     pending: vmstat_shepherd, drm_fb_helper_damage_work [drm_kms_helper]
[  129.637767][    C0] workqueue events_highpri: flags=0x10
[  129.640231][    C0]   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[  129.640252][    C0]     pending: mix_interrupt_randomness
[  129.640271][    C0] workqueue events_freezable_power_: flags=0x84
[  129.650217][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  129.650238][    C0]     pending: disk_events_workfn
[  129.650254][    C0] workqueue rcu_gp: flags=0x8
[  129.659366][    C0]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  129.659387][    C0]     in-flight: 185:wait_rcu_exp_gp
[  129.659405][    C0] workqueue mm_percpu_wq: flags=0x8
[  129.668545][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=4
[  129.668565][    C0]     pending: vmstat_update, lru_add_drain_per_cpu BAR(35)
[  129.668595][    C0] workqueue kblockd: flags=0x18
[  129.678480][    C0]   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[  129.678499][    C0]     pending: blk_mq_timeout_work
[  129.678560][    C0] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 3764 22
[  175.181511][  T185] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-.... } 88583 jiffies s: 41 root: 0x1/.
[  175.185095][  T185] rcu: blocking rcu_node structures (internal RCU debug):
[  175.188233][  T185] Task dump for CPU 0:
[  175.191019][  T185] task:trinity-subchil state:R  running task     stack:    0 pid: 3731 ppid:  3652 flags:0x2000000c
[  175.198313][  T185] Call Trace:
[  175.200246][  T185]  <TASK>
[ 175.202827][ T185] ? do_fast_syscall_32 (arch/x86/entry/common.c:203) 
[ 175.205560][ T185] ? entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:122) 
[  175.208517][  T185]  </TASK>
[  186.313437][    C0] rcu: INFO: rcu_sched self-detected stall on CPU
[  186.315637][    C0] rcu: 	0-....: (99854 ticks this GP) idle=64dc/1/0x4000000000000000 softirq=25122/25122 fqs=30028
[  186.317670][    C0] 	(t=100000 jiffies g=32689 q=29567 ncpus=2)
[  186.319650][    C0] NMI backtrace for cpu 0
[  186.321784][    C0] CPU: 0 PID: 3731 Comm: trinity-subchil Not tainted 6.0.0-rc2-00003-g0a276ae2d224 #1
[  186.324188][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[  186.326774][    C0] Call Trace:
[  186.328778][    C0]  <IRQ>
[ 186.331085][ C0] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1)) 
[ 186.332793][ C0] nmi_cpu_backtrace.cold (lib/nmi_backtrace.c:111) 
[ 186.334549][ C0] ? lapic_can_unplug_cpu (arch/x86/kernel/apic/hw_nmi.c:33) 
[ 186.336150][ C0] nmi_trigger_cpumask_backtrace (lib/nmi_backtrace.c:62) 
[ 186.337777][ C0] trigger_single_cpu_backtrace (include/linux/nmi.h:168) 
[ 186.339411][ C0] rcu_dump_cpu_stacks (kernel/rcu/tree_stall.h:371) 
[ 186.341082][ C0] print_cpu_stall.cold (kernel/rcu/tree_stall.h:662) 
[ 186.342889][ C0] check_cpu_stall (kernel/rcu/tree_stall.h:745) 
[ 186.344423][ C0] rcu_sched_clock_irq (kernel/rcu/tree_nocb.h:913 kernel/rcu/tree.c:3663 kernel/rcu/tree.c:2342) 
[ 186.346071][ C0] update_process_times (arch/x86/include/asm/preempt.h:27 kernel/time/timer.c:1841) 
[ 186.347626][ C0] tick_sched_handle (kernel/time/tick-sched.c:244) 
[ 186.349235][ C0] tick_sched_timer (kernel/time/tick-sched.c:1480) 
[ 186.350780][ C0] ? tick_sched_do_timer (kernel/time/tick-sched.c:1467) 
[ 186.352495][ C0] __hrtimer_run_queues (kernel/time/hrtimer.c:1685 kernel/time/hrtimer.c:1749) 
[ 186.354484][ C0] ? enqueue_hrtimer (kernel/time/hrtimer.c:1719) 
[ 186.356550][ C0] ? kvm_clock_get_cycles (arch/x86/include/asm/preempt.h:85 arch/x86/kernel/kvmclock.c:80 arch/x86/kernel/kvmclock.c:86) 
[ 186.358499][ C0] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:292 (discriminator 3) kernel/time/timekeeping.c:388 (discriminator 3) kernel/time/timekeeping.c:2320 (discriminator 3)) 
[ 186.360609][ C0] hrtimer_interrupt (kernel/time/hrtimer.c:1814) 
[ 186.362418][ C0] __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1113) 
[ 186.364295][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1106 (discriminator 14)) 
[  186.366201][    C0]  </IRQ>
[  186.367490][    C0]  <TASK>
[ 186.369287][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649) 
[ 186.371191][ C0] RIP: 0010:native_queued_spin_lock_slowpath (arch/x86/include/asm/vdso/processor.h:19 arch/x86/include/asm/qspinlock.h:101 kernel/locking/qspinlock.c:327) 
[ 186.373248][ C0] Code: 00 00 00 65 48 2b 04 25 28 00 00 00 0f 85 2f 08 00 00 48 81 c4 88 00 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc f3 90 <e9> 71 ff ff ff 44 8b 74 24 48 41 81 fe 00 01 00 00 0f 84 e3 00 00
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	00 65 48             	add    %ah,0x48(%rbp)
   5:	2b 04 25 28 00 00 00 	sub    0x28,%eax
   c:	0f 85 2f 08 00 00    	jne    0x841
  12:	48 81 c4 88 00 00 00 	add    $0x88,%rsp
  19:	5b                   	pop    %rbx
  1a:	5d                   	pop    %rbp
  1b:	41 5c                	pop    %r12
  1d:	41 5d                	pop    %r13
  1f:	41 5e                	pop    %r14
  21:	41 5f                	pop    %r15
  23:	c3                   	retq   
  24:	cc                   	int3   
  25:	cc                   	int3   
  26:	cc                   	int3   
  27:	cc                   	int3   
  28:	f3 90                	pause  
  2a:*	e9 71 ff ff ff       	jmpq   0xffffffffffffffa0		<-- trapping instruction
  2f:	44 8b 74 24 48       	mov    0x48(%rsp),%r14d
  34:	41 81 fe 00 01 00 00 	cmp    $0x100,%r14d
  3b:	0f                   	.byte 0xf
  3c:	84 e3                	test   %ah,%bl
	...

Code starting with the faulting instruction
===========================================
   0:	e9 71 ff ff ff       	jmpq   0xffffffffffffff76
   5:	44 8b 74 24 48       	mov    0x48(%rsp),%r14d
   a:	41 81 fe 00 01 00 00 	cmp    $0x100,%r14d
  11:	0f                   	.byte 0xf
  12:	84 e3                	test   %ah,%bl
	...
[  186.377327][    C0] RSP: 0018:ffffc90000e17cb8 EFLAGS: 00000202
[  186.379142][    C0] RAX: 0000000000000000 RBX: 00000000587993b0 RCX: ffffffff812fc8ec
[  186.380912][    C0] RDX: fffff520001c2ff7 RSI: 0000000000000004 RDI: ffffc90000e17fb0
[  186.383091][    C0] RBP: ffffc90000e17fb0 R08: 0000000000000000 R09: ffffc90000e17fb3
[  186.384750][    C0] R10: fffff520001c2ff6 R11: 0000000000000001 R12: 0000000000000003
[  186.386354][    C0] R13: fffff520001c2ff6 R14: 0000000000000001 R15: 1ffff920001c2f98
[ 186.388126][ C0] ? native_queued_spin_lock_slowpath (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 arch/x86/include/asm/qspinlock.h:100 kernel/locking/qspinlock.c:327) 
[ 186.389979][ C0] ? native_queued_spin_lock_slowpath (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 arch/x86/include/asm/qspinlock.h:100 kernel/locking/qspinlock.c:327) 
[ 186.391917][ C0] ? pv_hash (kernel/locking/qspinlock.c:317) 
[ 186.393367][ C0] ? lockref_put_return (arch/x86/include/asm/atomic.h:29 (discriminator 5) include/linux/atomic/atomic-instrumented.h:28 (discriminator 5) include/asm-generic/qspinlock.h:73 (discriminator 5) lib/lockref.c:122 (discriminator 5)) 
[ 186.394784][ C0] _raw_spin_lock (arch/x86/include/asm/paravirt.h:591 arch/x86/include/asm/qspinlock.h:51 include/asm-generic/qspinlock.h:114 include/linux/spinlock.h:185 include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) 
[ 186.396052][ C0] ? _raw_write_lock_irq (kernel/locking/spinlock.c:153) 
[ 186.397502][ C0] ? allocate_slab (mm/slub.c:378 mm/slub.c:1929 mm/slub.c:1995) 
[ 186.399281][ C0] dput (fs/dcache.c:790 fs/dcache.c:900) 
[ 186.400885][ C0] path_put (fs/namei.c:559) 
[ 186.402250][ C0] do_execveat_common+0x5ec/0x7c0 
[ 186.403558][ C0] ? __check_object_size (mm/memremap.c:37) 
[ 186.404848][ C0] ? alloc_bprm (fs/exec.c:1866) 
[ 186.406039][ C0] __ia32_compat_sys_execve (fs/exec.c:2117) 
[ 186.407302][ C0] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178) 
[ 186.408512][ C0] do_fast_syscall_32 (arch/x86/entry/common.c:203) 
[ 186.409678][ C0] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:122) 
[  186.410947][    C0] RIP: 0023:0xf7f3c549
[ 186.412085][ C0] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
All code
========
   0:	03 74 c0 01          	add    0x1(%rax,%rax,8),%esi
   4:	10 05 03 74 b8 01    	adc    %al,0x1b87403(%rip)        # 0x1b8740d
   a:	10 06                	adc    %al,(%rsi)
   c:	03 74 b4 01          	add    0x1(%rsp,%rsi,4),%esi
  10:	10 07                	adc    %al,(%rdi)
  12:	03 74 b0 01          	add    0x1(%rax,%rsi,4),%esi
  16:	10 08                	adc    %cl,(%rax)
  18:	03 74 d8 01          	add    0x1(%rax,%rbx,8),%esi
  1c:	00 00                	add    %al,(%rax)
  1e:	00 00                	add    %al,(%rax)
  20:	00 51 52             	add    %dl,0x52(%rcx)
  23:	55                   	push   %rbp
  24:	89 e5                	mov    %esp,%ebp
  26:	0f 34                	sysenter 
  28:	cd 80                	int    $0x80
  2a:*	5d                   	pop    %rbp		<-- trapping instruction
  2b:	5a                   	pop    %rdx
  2c:	59                   	pop    %rcx
  2d:	c3                   	retq   
  2e:	90                   	nop
  2f:	90                   	nop
  30:	90                   	nop
  31:	90                   	nop
  32:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
  39:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi

Code starting with the faulting instruction
===========================================
   0:	5d                   	pop    %rbp
   1:	5a                   	pop    %rdx
   2:	59                   	pop    %rcx
   3:	c3                   	retq   
   4:	90                   	nop
   5:	90                   	nop
   6:	90                   	nop
   7:	90                   	nop
   8:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
   f:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <yujie.liu@intel.com>
| Link: https://lore.kernel.org/r/202209221401.90061e56-yujie.liu@intel.com


To reproduce:

        # build kernel
	cd linux
	cp config-6.0.0-rc2-00003-g0a276ae2d224 .config
	make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
	make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
	cd <mod-install-dir>
	find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
Josh Triplett Oct. 1, 2022, 4:01 p.m. UTC | #8
On Mon, Sep 19, 2022 at 01:02:51PM -0700, Kees Cook wrote:
> On Sat, Sep 17, 2022 at 01:50:24AM +0100, Josh Triplett wrote:
> > On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote:
> > > I don't like the idea of penalizing the _succeeding_ case, though, which
> > > happens if we do the path walk twice. So, I went and refactoring the setup
> > > order, moving the do_open_execat() up into alloc_bprm() instead of where
> > > it was in bprm_exec(). The result makes it so it is, as you observed,
> > > before the mm creation and generally expensive argument copying. The
> > > difference to your patch seems to only be the allocation of the file
> > > table entry, but avoids the double lookup, so I'm hoping the result is
> > > actually even faster.
> > 
> > Thanks for giving this a try; I'd wondered how feasible it would be to
> > just do one lookup.
> > 
> > However, on the same test system with the same test setup, with your
> > refactor it seems to go slower:
> > fork/execvpe: 38087ns
> > fork/execve:  33758ns
> > 
> > For comparison, the previous numbers (which I re-confirmed):
> > 
> > Without fast-path:
> > fork/execvpe: 49876ns
> > fork/execve:  32773ns
> > 
> > With my original separate-lookup fast-path:
> > fork/execvpe: 36890ns
> > fork/execve:  31551ns
> 
> Hmm, this shows as slower in the *normal* case, which I find rather
> surprising -- it's the same work, just reordered.
> 
> Can you post a URL to your tests? I'd like to reproduce this and maybe
> throw perf at it as well.

Sure. Sorry for the delay, needed to integrate some fixes (such as
aarch64 support) and factor out the bits that won't build if you don't
have a patched liburing.

https://github.com/joshtriplett/spawnbench

- Josh Triplett
Kees Cook Nov. 7, 2023, 8:30 p.m. UTC | #9
On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote:
> Currently, execve allocates an mm and parses argv and envp before
> checking if the path exists. However, the common case of a $PATH search
> may have several failed calls to exec before a single success. Do a
> filename lookup for the purposes of returning ENOENT before doing more
> expensive operations.
> 
> This does not create a TOCTTOU race, because this can only happen if the
> file didn't exist at some point during the exec call, and that point is
> permitted to be when we did our lookup.
> 
> To measure performance, I ran 2000 fork and execvpe calls with a
> seven-element PATH in which the file was found in the seventh directory
> (representative of the common case as /usr/bin is the seventh directory
> on my $PATH), as well as 2000 fork and execve calls with an absolute
> path to an existing binary. I recorded the minimum time for each, to
> eliminate noise from context switches and similar.
> 
> Without fast-path:
> fork/execvpe: 49876ns
> fork/execve:  32773ns
> 
> With fast-path:
> fork/execvpe: 36890ns
> fork/execve:  32069ns
> 
> The cost of the additional lookup seems to be in the noise for a
> successful exec, but it provides a 26% improvement for the path search
> case by speeding up the six failed execs.
> 
> Signed-off-by: Josh Triplett <josh@joshtriplett.org>

*thread necromancy*

I'll snag this patch after -rc1 is out. Based on the research we both
did in the rest of this thread, this original patch is a clear win.
Let's get it into linux-next and see if anything else falls out of it.

I did, however, scratch my head over the 0-day report:
https://lore.kernel.org/lkml/202209221401.90061e56-yujie.liu@intel.com/

But I can't see why this patch could trigger those problems...

Thanks!

-Kees
Kees Cook Nov. 7, 2023, 8:37 p.m. UTC | #10
On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote:
> Currently, execve allocates an mm and parses argv and envp before
> checking if the path exists. However, the common case of a $PATH search
> may have several failed calls to exec before a single success. Do a
> filename lookup for the purposes of returning ENOENT before doing more
> expensive operations.
> 
> This does not create a TOCTTOU race, because this can only happen if the
> file didn't exist at some point during the exec call, and that point is
> permitted to be when we did our lookup.
> 
> To measure performance, I ran 2000 fork and execvpe calls with a
> seven-element PATH in which the file was found in the seventh directory
> (representative of the common case as /usr/bin is the seventh directory
> on my $PATH), as well as 2000 fork and execve calls with an absolute
> path to an existing binary. I recorded the minimum time for each, to
> eliminate noise from context switches and similar.
> 
> Without fast-path:
> fork/execvpe: 49876ns
> fork/execve:  32773ns
> 
> With fast-path:
> fork/execvpe: 36890ns
> fork/execve:  32069ns
> 
> The cost of the additional lookup seems to be in the noise for a
> successful exec, but it provides a 26% improvement for the path search
> case by speeding up the six failed execs.
> 
> Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> ---
> 
> Discussed this at Plumbers with Kees Cook; turned out to be even more of
> a win than anticipated.
> 
>  fs/exec.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 9a5ca7b82bfc..fe786aeb2f1b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1881,6 +1881,16 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	if (IS_ERR(filename))
>  		return PTR_ERR(filename);
>  
> +	/* Fast-path ENOENT for $PATH search failures, before we alloc an mm or
> +	 * parse arguments. */
> +	if (fd == AT_FDCWD && flags == 0 && filename->name[0] == '/') {
> +		struct path path;
> +		retval = filename_lookup(AT_FDCWD, filename, 0, &path, NULL);
> +		if (retval == -ENOENT)

Oh, actually, I see the 0-day problem. This should be:

		if (retval < 0)

> +			goto out_ret;
> +		path_put(&path);

Otherwise this put will happen for an non-successful lookup that wans't
ENOENT. I've fixed this in my tree.

-Kees

> +	}
> +
>  	/*
>  	 * We move the actual failure in case of RLIMIT_NPROC excess from
>  	 * set*uid() to execve() because too many poorly written programs
> -- 
> 2.37.2
>
Mateusz Guzik Nov. 7, 2023, 8:51 p.m. UTC | #11
On Tue, Nov 07, 2023 at 12:30:37PM -0800, Kees Cook wrote:
> On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote:
> > Currently, execve allocates an mm and parses argv and envp before
> > checking if the path exists. However, the common case of a $PATH search
> > may have several failed calls to exec before a single success. Do a
> > filename lookup for the purposes of returning ENOENT before doing more
> > expensive operations.
> > 
> > This does not create a TOCTTOU race, because this can only happen if the
> > file didn't exist at some point during the exec call, and that point is
> > permitted to be when we did our lookup.
> > 
> > To measure performance, I ran 2000 fork and execvpe calls with a
> > seven-element PATH in which the file was found in the seventh directory
> > (representative of the common case as /usr/bin is the seventh directory
> > on my $PATH), as well as 2000 fork and execve calls with an absolute
> > path to an existing binary. I recorded the minimum time for each, to
> > eliminate noise from context switches and similar.
> > 
> > Without fast-path:
> > fork/execvpe: 49876ns
> > fork/execve:  32773ns
> > 
> > With fast-path:
> > fork/execvpe: 36890ns
> > fork/execve:  32069ns
> > 
> > The cost of the additional lookup seems to be in the noise for a
> > successful exec, but it provides a 26% improvement for the path search
> > case by speeding up the six failed execs.
> > 
> > Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> 
> *thread necromancy*
> 
> I'll snag this patch after -rc1 is out. Based on the research we both
> did in the rest of this thread, this original patch is a clear win.
> Let's get it into linux-next and see if anything else falls out of it.
> 

So the obvious question is why not store lookup result within bprm,
instead of doing the lookup again later.

Turns out you had very same question and even wrote a patch to sort it
out: https://lore.kernel.org/all/202209161637.9EDAF6B18@keescook/

Why do you intend to go with this patch instead?
Mateusz Guzik Nov. 7, 2023, 9:23 p.m. UTC | #12
On 11/7/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Tue, Nov 07, 2023 at 12:30:37PM -0800, Kees Cook wrote:
>> On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote:
>> > Currently, execve allocates an mm and parses argv and envp before
>> > checking if the path exists. However, the common case of a $PATH search
>> > may have several failed calls to exec before a single success. Do a
>> > filename lookup for the purposes of returning ENOENT before doing more
>> > expensive operations.
>> >
>> > This does not create a TOCTTOU race, because this can only happen if
>> > the
>> > file didn't exist at some point during the exec call, and that point is
>> > permitted to be when we did our lookup.
>> >
>> > To measure performance, I ran 2000 fork and execvpe calls with a
>> > seven-element PATH in which the file was found in the seventh directory
>> > (representative of the common case as /usr/bin is the seventh directory
>> > on my $PATH), as well as 2000 fork and execve calls with an absolute
>> > path to an existing binary. I recorded the minimum time for each, to
>> > eliminate noise from context switches and similar.
>> >
>> > Without fast-path:
>> > fork/execvpe: 49876ns
>> > fork/execve:  32773ns
>> >
>> > With fast-path:
>> > fork/execvpe: 36890ns
>> > fork/execve:  32069ns
>> >
>> > The cost of the additional lookup seems to be in the noise for a
>> > successful exec, but it provides a 26% improvement for the path search
>> > case by speeding up the six failed execs.
>> >
>> > Signed-off-by: Josh Triplett <josh@joshtriplett.org>
>>
>> *thread necromancy*
>>
>> I'll snag this patch after -rc1 is out. Based on the research we both
>> did in the rest of this thread, this original patch is a clear win.
>> Let's get it into linux-next and see if anything else falls out of it.
>>
>
> So the obvious question is why not store lookup result within bprm,
> instead of doing the lookup again later.
>
> Turns out you had very same question and even wrote a patch to sort it
> out: https://lore.kernel.org/all/202209161637.9EDAF6B18@keescook/
>
> Why do you intend to go with this patch instead?
>

Welp, should have read the remaining follow up discussion.

Even so, the patch which is only doing the lookup once cannot be
legitimately slower.

Note there is often perf variance between different boots of the same
kernel and I suspect this is what justifies it.

I would suggest adding a runtime knob to control whether 1. lookups
are done late (stock kernel) 2. there is one upfront lookup like in
the original patch and finally 3. stuffing file into bprm

Then a bunch of runs with the knob cycling between them could serve as
a justification.

If the patch which dodges second lookup still somehow appears slower a
flamegraph or other profile would be nice. I can volunteer to take a
look at what's going on provided above measurements will be done and
show funkyness.
Kees Cook Nov. 7, 2023, 10:50 p.m. UTC | #13
On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote:
> If the patch which dodges second lookup still somehow appears slower a
> flamegraph or other profile would be nice. I can volunteer to take a
> look at what's going on provided above measurements will be done and
> show funkyness.

When I looked at this last, it seemed like all the work done in
do_filp_open() (my patch, which moved the lookup earlier) was heavier
than the duplicate filename_lookup().

What I didn't test was moving the sched_exec() before the mm creation,
which Peter confirmed shouldn't be a problem, but I think that might be
only a tiny benefit, if at all.

If you can do some comparisons, that would be great; it always takes me
a fair bit of time to get set up for flame graph generation, etc. :)

-Kees
Mateusz Guzik Nov. 7, 2023, 11:08 p.m. UTC | #14
On 11/7/23, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote:
>> If the patch which dodges second lookup still somehow appears slower a
>> flamegraph or other profile would be nice. I can volunteer to take a
>> look at what's going on provided above measurements will be done and
>> show funkyness.
>
> When I looked at this last, it seemed like all the work done in
> do_filp_open() (my patch, which moved the lookup earlier) was heavier
> than the duplicate filename_lookup().
>
> What I didn't test was moving the sched_exec() before the mm creation,
> which Peter confirmed shouldn't be a problem, but I think that might be
> only a tiny benefit, if at all.
>
> If you can do some comparisons, that would be great; it always takes me
> a fair bit of time to get set up for flame graph generation, etc. :)
>

So I spawned *one* process executing one statocally linked binary in a
loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c .

The profile is definitely not what I expected:
   5.85%  [kernel]           [k] asm_exc_page_fault
   5.84%  [kernel]           [k] __pv_queued_spin_lock_slowpath
[snip]

I'm going to have to recompile with lock profiling, meanwhile
according to bpftrace
(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }')
top hits would be:

@[
    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock+37
    __schedule+192
    schedule_idle+38
    do_idle+366
    cpu_startup_entry+38
    start_secondary+282
    secondary_startup_64_no_verify+381
]: 181
@[
    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irq+43
    wait_for_completion+141
    stop_one_cpu+127
    sched_exec+165
    bprm_execve+328
    do_execveat_common.isra.0+429
    __x64_sys_execve+50
    do_syscall_64+46
    entry_SYSCALL_64_after_hwframe+110
]: 206

I did not see this coming for sure. I'll poke around maybe this weekend.
Kees Cook Nov. 7, 2023, 11:39 p.m. UTC | #15
On November 7, 2023 3:08:47 PM PST, Mateusz Guzik <mjguzik@gmail.com> wrote:
>On 11/7/23, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote:
>>> If the patch which dodges second lookup still somehow appears slower a
>>> flamegraph or other profile would be nice. I can volunteer to take a
>>> look at what's going on provided above measurements will be done and
>>> show funkyness.
>>
>> When I looked at this last, it seemed like all the work done in
>> do_filp_open() (my patch, which moved the lookup earlier) was heavier
>> than the duplicate filename_lookup().
>>
>> What I didn't test was moving the sched_exec() before the mm creation,
>> which Peter confirmed shouldn't be a problem, but I think that might be
>> only a tiny benefit, if at all.
>>
>> If you can do some comparisons, that would be great; it always takes me
>> a fair bit of time to get set up for flame graph generation, etc. :)
>>
>
>So I spawned *one* process executing one statocally linked binary in a
>loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c .
>
>The profile is definitely not what I expected:
>   5.85%  [kernel]           [k] asm_exc_page_fault
>   5.84%  [kernel]           [k] __pv_queued_spin_lock_slowpath
>[snip]
>
>I'm going to have to recompile with lock profiling, meanwhile
>according to bpftrace
>(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }')
>top hits would be:
>
>@[
>    __pv_queued_spin_lock_slowpath+1
>    _raw_spin_lock+37
>    __schedule+192
>    schedule_idle+38
>    do_idle+366
>    cpu_startup_entry+38
>    start_secondary+282
>    secondary_startup_64_no_verify+381
>]: 181
>@[
>    __pv_queued_spin_lock_slowpath+1
>    _raw_spin_lock_irq+43
>    wait_for_completion+141
>    stop_one_cpu+127
>    sched_exec+165

There's the suspicious sched_exec() I was talking about! :)

I think it needs to be moved, and perhaps _later_ instead of earlier? Hmm...

-Kees

>    bprm_execve+328
>    do_execveat_common.isra.0+429
>    __x64_sys_execve+50
>    do_syscall_64+46
>    entry_SYSCALL_64_after_hwframe+110
>]: 206
>
>I did not see this coming for sure. I'll poke around maybe this weekend.
>
Mateusz Guzik Nov. 8, 2023, 12:03 a.m. UTC | #16
On 11/8/23, Kees Cook <kees@kernel.org> wrote:
>
>
> On November 7, 2023 3:08:47 PM PST, Mateusz Guzik <mjguzik@gmail.com>
> wrote:
>>On 11/7/23, Kees Cook <keescook@chromium.org> wrote:
>>> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote:
>>>> If the patch which dodges second lookup still somehow appears slower a
>>>> flamegraph or other profile would be nice. I can volunteer to take a
>>>> look at what's going on provided above measurements will be done and
>>>> show funkyness.
>>>
>>> When I looked at this last, it seemed like all the work done in
>>> do_filp_open() (my patch, which moved the lookup earlier) was heavier
>>> than the duplicate filename_lookup().
>>>
>>> What I didn't test was moving the sched_exec() before the mm creation,
>>> which Peter confirmed shouldn't be a problem, but I think that might be
>>> only a tiny benefit, if at all.
>>>
>>> If you can do some comparisons, that would be great; it always takes me
>>> a fair bit of time to get set up for flame graph generation, etc. :)
>>>
>>
>>So I spawned *one* process executing one statocally linked binary in a
>>loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c .
>>
>>The profile is definitely not what I expected:
>>   5.85%  [kernel]           [k] asm_exc_page_fault
>>   5.84%  [kernel]           [k] __pv_queued_spin_lock_slowpath
>>[snip]
>>
>>I'm going to have to recompile with lock profiling, meanwhile
>>according to bpftrace
>>(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] =
>> count(); }')
>>top hits would be:
>>
>>@[
>>    __pv_queued_spin_lock_slowpath+1
>>    _raw_spin_lock+37
>>    __schedule+192
>>    schedule_idle+38
>>    do_idle+366
>>    cpu_startup_entry+38
>>    start_secondary+282
>>    secondary_startup_64_no_verify+381
>>]: 181
>>@[
>>    __pv_queued_spin_lock_slowpath+1
>>    _raw_spin_lock_irq+43
>>    wait_for_completion+141
>>    stop_one_cpu+127
>>    sched_exec+165
>
> There's the suspicious sched_exec() I was talking about! :)
>
> I think it needs to be moved, and perhaps _later_ instead of earlier?
> Hmm...
>

I'm getting around 3.4k execs/s. However, if I "taskset -c 3
./static-doexec 1" the number goes up to about 9.5k and lock
contention disappears from the profile. So off hand looks like the
task is walking around the box when it perhaps could be avoided -- it
is idle apart from running the test. Again this is going to require a
serious look instead of ad hoc pokes.

Side note I actually read your patch this time around instead of
skimming through it and assuming it did what I thought.

do_filp_open is of course very expensive and kmalloc + kfree are slow.
On top of it deallocating a file object even after a failed open was
very expensive due to delegation to task_work (recently fixed).

What I claim should be clear-cut faster is that lookup as in the
original patch and only messing with file allocation et al if it
succeeds.
Kees Cook Nov. 8, 2023, 7:25 p.m. UTC | #17
On Wed, Nov 08, 2023 at 01:03:33AM +0100, Mateusz Guzik wrote:
> [...]
> >>@[
> >>    __pv_queued_spin_lock_slowpath+1
> >>    _raw_spin_lock_irq+43
> >>    wait_for_completion+141
> >>    stop_one_cpu+127
> >>    sched_exec+165
> >
> > There's the suspicious sched_exec() I was talking about! :)
> >
> > I think it needs to be moved, and perhaps _later_ instead of earlier?
> > Hmm...
> >
> 
> I'm getting around 3.4k execs/s. However, if I "taskset -c 3
> ./static-doexec 1" the number goes up to about 9.5k and lock
> contention disappears from the profile. So off hand looks like the
> task is walking around the box when it perhaps could be avoided -- it
> is idle apart from running the test. Again this is going to require a
> serious look instead of ad hoc pokes.

Hm, that is pretty interesting. I'll see if I can go find the original
rationale for adding sched_exec() in there...

> Side note I actually read your patch this time around instead of
> skimming through it and assuming it did what I thought.
> 
> do_filp_open is of course very expensive and kmalloc + kfree are slow.
> On top of it deallocating a file object even after a failed open was
> very expensive due to delegation to task_work (recently fixed).
> 
> What I claim should be clear-cut faster is that lookup as in the
> original patch and only messing with file allocation et al if it
> succeeds.

I'm less familiar with the VFS guts here -- I'm open to alternatives! :)
Kees Cook Nov. 8, 2023, 7:31 p.m. UTC | #18
On Wed, Nov 08, 2023 at 01:03:33AM +0100, Mateusz Guzik wrote:
> On 11/8/23, Kees Cook <kees@kernel.org> wrote:
> >
> >
> > On November 7, 2023 3:08:47 PM PST, Mateusz Guzik <mjguzik@gmail.com>
> > wrote:
> >>On 11/7/23, Kees Cook <keescook@chromium.org> wrote:
> >>> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote:
> >>>> If the patch which dodges second lookup still somehow appears slower a
> >>>> flamegraph or other profile would be nice. I can volunteer to take a
> >>>> look at what's going on provided above measurements will be done and
> >>>> show funkyness.
> >>>
> >>> When I looked at this last, it seemed like all the work done in
> >>> do_filp_open() (my patch, which moved the lookup earlier) was heavier
> >>> than the duplicate filename_lookup().
> >>>
> >>> What I didn't test was moving the sched_exec() before the mm creation,
> >>> which Peter confirmed shouldn't be a problem, but I think that might be
> >>> only a tiny benefit, if at all.
> >>>
> >>> If you can do some comparisons, that would be great; it always takes me
> >>> a fair bit of time to get set up for flame graph generation, etc. :)
> >>>
> >>
> >>So I spawned *one* process executing one statocally linked binary in a
> >>loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c .
> >>
> >>The profile is definitely not what I expected:
> >>   5.85%  [kernel]           [k] asm_exc_page_fault
> >>   5.84%  [kernel]           [k] __pv_queued_spin_lock_slowpath
> >>[snip]
> >>
> >>I'm going to have to recompile with lock profiling, meanwhile
> >>according to bpftrace
> >>(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] =
> >> count(); }')
> >>top hits would be:
> >>
> >>@[
> >>    __pv_queued_spin_lock_slowpath+1
> >>    _raw_spin_lock+37
> >>    __schedule+192
> >>    schedule_idle+38
> >>    do_idle+366
> >>    cpu_startup_entry+38
> >>    start_secondary+282
> >>    secondary_startup_64_no_verify+381
> >>]: 181
> >>@[
> >>    __pv_queued_spin_lock_slowpath+1
> >>    _raw_spin_lock_irq+43
> >>    wait_for_completion+141
> >>    stop_one_cpu+127
> >>    sched_exec+165
> >
> > There's the suspicious sched_exec() I was talking about! :)
> >
> > I think it needs to be moved, and perhaps _later_ instead of earlier?
> > Hmm...
> >
> 
> I'm getting around 3.4k execs/s. However, if I "taskset -c 3
> ./static-doexec 1" the number goes up to about 9.5k and lock
> contention disappears from the profile. So off hand looks like the
> task is walking around the box when it perhaps could be avoided -- it
> is idle apart from running the test. Again this is going to require a
> serious look instead of ad hoc pokes.

Peter, is this something you can speak to? It seems like execve() forces
a change in running CPU. Is this really something we want to be doing?
Or is there some better way to keep it on the same CPU unless there is
contention?

-Kees
Mateusz Guzik Nov. 8, 2023, 7:35 p.m. UTC | #19
On 11/8/23, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Nov 08, 2023 at 01:03:33AM +0100, Mateusz Guzik wrote:
>> On 11/8/23, Kees Cook <kees@kernel.org> wrote:
>> >
>> >
>> > On November 7, 2023 3:08:47 PM PST, Mateusz Guzik <mjguzik@gmail.com>
>> > wrote:
>> >>On 11/7/23, Kees Cook <keescook@chromium.org> wrote:
>> >>> On Tue, Nov 07, 2023 at 10:23:16PM +0100, Mateusz Guzik wrote:
>> >>>> If the patch which dodges second lookup still somehow appears slower
>> >>>> a
>> >>>> flamegraph or other profile would be nice. I can volunteer to take a
>> >>>> look at what's going on provided above measurements will be done and
>> >>>> show funkyness.
>> >>>
>> >>> When I looked at this last, it seemed like all the work done in
>> >>> do_filp_open() (my patch, which moved the lookup earlier) was heavier
>> >>> than the duplicate filename_lookup().
>> >>>
>> >>> What I didn't test was moving the sched_exec() before the mm
>> >>> creation,
>> >>> which Peter confirmed shouldn't be a problem, but I think that might
>> >>> be
>> >>> only a tiny benefit, if at all.
>> >>>
>> >>> If you can do some comparisons, that would be great; it always takes
>> >>> me
>> >>> a fair bit of time to get set up for flame graph generation, etc. :)
>> >>>
>> >>
>> >>So I spawned *one* process executing one statocally linked binary in a
>> >>loop, test case from http://apollo.backplane.com/DFlyMisc/doexec.c .
>> >>
>> >>The profile is definitely not what I expected:
>> >>   5.85%  [kernel]           [k] asm_exc_page_fault
>> >>   5.84%  [kernel]           [k] __pv_queued_spin_lock_slowpath
>> >>[snip]
>> >>
>> >>I'm going to have to recompile with lock profiling, meanwhile
>> >>according to bpftrace
>> >>(bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] =
>> >> count(); }')
>> >>top hits would be:
>> >>
>> >>@[
>> >>    __pv_queued_spin_lock_slowpath+1
>> >>    _raw_spin_lock+37
>> >>    __schedule+192
>> >>    schedule_idle+38
>> >>    do_idle+366
>> >>    cpu_startup_entry+38
>> >>    start_secondary+282
>> >>    secondary_startup_64_no_verify+381
>> >>]: 181
>> >>@[
>> >>    __pv_queued_spin_lock_slowpath+1
>> >>    _raw_spin_lock_irq+43
>> >>    wait_for_completion+141
>> >>    stop_one_cpu+127
>> >>    sched_exec+165
>> >
>> > There's the suspicious sched_exec() I was talking about! :)
>> >
>> > I think it needs to be moved, and perhaps _later_ instead of earlier?
>> > Hmm...
>> >
>>
>> I'm getting around 3.4k execs/s. However, if I "taskset -c 3
>> ./static-doexec 1" the number goes up to about 9.5k and lock
>> contention disappears from the profile. So off hand looks like the
>> task is walking around the box when it perhaps could be avoided -- it
>> is idle apart from running the test. Again this is going to require a
>> serious look instead of ad hoc pokes.
>
> Peter, is this something you can speak to? It seems like execve() forces
> a change in running CPU. Is this really something we want to be doing?
> Or is there some better way to keep it on the same CPU unless there is
> contention?
>

sched_exec causes migration only for only few % of execs in the bench,
but when it does happen there is tons of overhead elsewhere.

I expect real programs which get past execve will be prone to
migrating anyway, regardless of what sched_exec is doing.

That is to say, while sched_exec buggering off here would be nice, I
think for real-world wins the thing to investigate is the overhead
which comes from migration to begin with.
Eric W. Biederman Nov. 9, 2023, 12:17 a.m. UTC | #20
Mateusz Guzik <mjguzik@gmail.com> writes:

> On 11/8/23, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Nov 08, 2023 at 01:03:33AM +0100, Mateusz Guzik wrote:
>>> I'm getting around 3.4k execs/s. However, if I "taskset -c 3
>>> ./static-doexec 1" the number goes up to about 9.5k and lock
>>> contention disappears from the profile. So off hand looks like the
>>> task is walking around the box when it perhaps could be avoided -- it
>>> is idle apart from running the test. Again this is going to require a
>>> serious look instead of ad hoc pokes.
>>
>> Peter, is this something you can speak to? It seems like execve() forces
>> a change in running CPU. Is this really something we want to be doing?
>> Or is there some better way to keep it on the same CPU unless there is
>> contention?
>>
>
> sched_exec causes migration only for only few % of execs in the bench,
> but when it does happen there is tons of overhead elsewhere.
>
> I expect real programs which get past execve will be prone to
> migrating anyway, regardless of what sched_exec is doing.
>
> That is to say, while sched_exec buggering off here would be nice, I
> think for real-world wins the thing to investigate is the overhead
> which comes from migration to begin with.

I have a vague memory that the idea is that there is a point during exec
when it should be much less expensive than normal to allow migration
between cpus because all of the old state has gone away.

Assuming that is the rationale, if we are getting lock contention
then either there is a global lock in there, or there is the potential
to pick a less expensive location within exec.


Just to confirm my memory I dug a little deeper and I found the original
commit that added sched_exec (in tglx's git tree of the bit keeper
history).

commit f01419fd6d4e5b32fef19d206bc3550cc04567a9
Author: Martin J. Bligh <mbligh@aracnet.com>
Date:   Wed Jan 15 19:46:10 2003 -0800

    [PATCH] (2/3) Initial load balancing
    
    Patch from Michael Hohnbaum
    
    This adds a hook, sched_balance_exec(), to the exec code, to make it
    place the exec'ed task on the least loaded queue. We have less state
    to move at exec time than fork time, so this is the cheapest point
    to cross-node migrate. Experience in Dynix/PTX and testing on Linux
    has confirmed that this is the cheapest time to move tasks between nodes.
    
    It also macro-wraps changes to nr_running, to allow us to keep track of
    per-node nr_running as well. Again, no impact on non-NUMA machines.


Eric
Mateusz Guzik Nov. 9, 2023, 12:21 p.m. UTC | #21
On 11/9/23, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Mateusz Guzik <mjguzik@gmail.com> writes:
>> sched_exec causes migration only for only few % of execs in the bench,
>> but when it does happen there is tons of overhead elsewhere.
>>
>> I expect real programs which get past execve will be prone to
>> migrating anyway, regardless of what sched_exec is doing.
>>
>> That is to say, while sched_exec buggering off here would be nice, I
>> think for real-world wins the thing to investigate is the overhead
>> which comes from migration to begin with.
>
> I have a vague memory that the idea is that there is a point during exec
> when it should be much less expensive than normal to allow migration
> between cpus because all of the old state has gone away.
>
> Assuming that is the rationale, if we are getting lock contention
> then either there is a global lock in there, or there is the potential
> to pick a less expensive location within exec.
>

Given the commit below I think the term "migration cost" is overloaded here.

By migration cost in my previous mail I meant the immediate cost
(stop_one_cpu and so on), but also the aftermath -- for example tlb
flushes on another CPU when tearing down your now-defunct mm after you
switched.

For testing purposes I verified commenting out sched_exec and not
using taskset still gives me about 9.5k ops/s.

I 100% agree should the task be moved between NUMA domains, it makes
sense to do it when it has the smallest footprint. I don't know what
the original patch did, the current code just picks a CPU and migrates
to it, regardless of NUMA considerations. I will note that the goal
would still be achieved by comparing domains and doing nothing if they
match.

I think this would be nice to fix, but it is definitely not a big
deal. I guess the question is to Peter Zijlstra if this sounds
reasonable.

> Just to confirm my memory I dug a little deeper and I found the original
> commit that added sched_exec (in tglx's git tree of the bit keeper
> history).
>
> commit f01419fd6d4e5b32fef19d206bc3550cc04567a9
> Author: Martin J. Bligh <mbligh@aracnet.com>
> Date:   Wed Jan 15 19:46:10 2003 -0800
>
>     [PATCH] (2/3) Initial load balancing
>
>     Patch from Michael Hohnbaum
>
>     This adds a hook, sched_balance_exec(), to the exec code, to make it
>     place the exec'ed task on the least loaded queue. We have less state
>     to move at exec time than fork time, so this is the cheapest point
>     to cross-node migrate. Experience in Dynix/PTX and testing on Linux
>     has confirmed that this is the cheapest time to move tasks between
> nodes.
>
>     It also macro-wraps changes to nr_running, to allow us to keep track of
>     per-node nr_running as well. Again, no impact on non-NUMA machines.
>
>
> Eric
>
>
Eric W. Biederman Nov. 10, 2023, 5:26 a.m. UTC | #22
Mateusz Guzik <mjguzik@gmail.com> writes:

> On 11/9/23, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Mateusz Guzik <mjguzik@gmail.com> writes:
>>> sched_exec causes migration only for only few % of execs in the bench,
>>> but when it does happen there is tons of overhead elsewhere.
>>>
>>> I expect real programs which get past execve will be prone to
>>> migrating anyway, regardless of what sched_exec is doing.
>>>
>>> That is to say, while sched_exec buggering off here would be nice, I
>>> think for real-world wins the thing to investigate is the overhead
>>> which comes from migration to begin with.
>>
>> I have a vague memory that the idea is that there is a point during exec
>> when it should be much less expensive than normal to allow migration
>> between cpus because all of the old state has gone away.
>>
>> Assuming that is the rationale, if we are getting lock contention
>> then either there is a global lock in there, or there is the potential
>> to pick a less expensive location within exec.
>>
>
> Given the commit below I think the term "migration cost" is overloaded here.
>
> By migration cost in my previous mail I meant the immediate cost
> (stop_one_cpu and so on), but also the aftermath -- for example tlb
> flushes on another CPU when tearing down your now-defunct mm after you
> switched.
>
> For testing purposes I verified commenting out sched_exec and not
> using taskset still gives me about 9.5k ops/s.
>
> I 100% agree should the task be moved between NUMA domains, it makes
> sense to do it when it has the smallest footprint. I don't know what
> the original patch did, the current code just picks a CPU and migrates
> to it, regardless of NUMA considerations. I will note that the goal
> would still be achieved by comparing domains and doing nothing if they
> match.
>
> I think this would be nice to fix, but it is definitely not a big
> deal. I guess the question is to Peter Zijlstra if this sounds
> reasonable.

Perhaps I misread the trace. My point was simply that the sched_exec
seemed to be causing lock contention because what was on one cpu is
now on another cpu, and we are now getting cross cpu lock ping-pongs.

If the sched_exec is causing exec to cause cross cpu lock ping-pongs,
then we can move sched_exec to a better place within exec.  It has
already happened once, shortly after it was introduced.

Ultimately we want the sched_exec to be in the cheapest place within
exec that we can find.

Eric
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 9a5ca7b82bfc..fe786aeb2f1b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1881,6 +1881,16 @@  static int do_execveat_common(int fd, struct filename *filename,
 	if (IS_ERR(filename))
 		return PTR_ERR(filename);
 
+	/* Fast-path ENOENT for $PATH search failures, before we alloc an mm or
+	 * parse arguments. */
+	if (fd == AT_FDCWD && flags == 0 && filename->name[0] == '/') {
+		struct path path;
+		retval = filename_lookup(AT_FDCWD, filename, 0, &path, NULL);
+		if (retval == -ENOENT)
+			goto out_ret;
+		path_put(&path);
+	}
+
 	/*
 	 * We move the actual failure in case of RLIMIT_NPROC excess from
 	 * set*uid() to execve() because too many poorly written programs