[v2] fs: try an opportunistic lookup for O_CREAT opens too

Message ID	20240806-openfast-v2-1-42da45981811@kernel.org (mailing list archive)
State	New
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E52F518D659; Tue, 6 Aug 2024 14:32:19 +0000 (UTC) From: Jeff Layton <jlayton@kernel.org> Date: Tue, 06 Aug 2024 10:32:17 -0400 Subject: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <20240806-openfast-v2-1-42da45981811@kernel.org> To: Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>, Andrew Morton <akpm@linux-foundation.org> Cc: Mateusz Guzik <mjguzik@gmail.com>, Josef Bacik <josef@toxicpanda.com>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Jeff Layton <jlayton@kernel.org>
Series	[v2] fs: try an opportunistic lookup for O_CREAT opens too \| expand [v2] fs: try an opportunistic lookup for O_CREAT opens too

Jeff Layton Aug. 6, 2024, 2:32 p.m. UTC

Today, when opening a file we'll typically do a fast lookup, but if
O_CREAT is set, the kernel always takes the exclusive inode lock. I
assume this was done with the expectation that O_CREAT means that we
always expect to do the create, but that's often not the case. Many
programs set O_CREAT even in scenarios where the file already exists.

This patch rearranges the pathwalk-for-open code to also attempt a
fast_lookup in certain O_CREAT cases. If a positive dentry is found, the
inode_lock can be avoided altogether, and if auditing isn't enabled, it
can stay in rcuwalk mode for the last step_into.

One notable exception that is hopefully temporary: if we're doing an
rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the
dentry in that case is more expensive than taking the i_rwsem for now.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Here's a revised patch that does a fast_lookup in the O_CREAT codepath
too. The main difference here is that if a positive dentry is found and
audit_dummy_context is true, then we keep the walk lazy for the last
component, which avoids having to take any locks on the parent (just
like with non-O_CREAT opens).

The testcase below runs in about 18s on v6.10 (on an 80 CPU machine).
With this patch, it runs in about 1s:

 #define _GNU_SOURCE 1
 #include <stdio.h>
 #include <unistd.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <sys/wait.h>

 #define PROCS           70
 #define LOOPS           500000

static int openloop(int tnum)
{
	char *file;
	int i, ret;

       	ret = asprintf(&file, "./testfile%d", tnum);
	if (ret < 0) {
		printf("asprintf failed for proc %d", tnum);
		return 1;
	}

	for (i = 0; i < LOOPS; ++i) {
		int fd = open(file, O_RDWR|O_CREAT, 0644);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
	}
	unlink(file);
	free(file);
	return 0;
}

int main(int argc, char **argv) {
	pid_t kids[PROCS];
	int i, ret = 0;

	for (i = 0; i < PROCS; ++i) {
		kids[i] = fork();
		if (kids[i] > 0)
			return openloop(i);
		if (kids[i] < 0)
			perror("fork");
	}

	for (i = 0; i < PROCS; ++i) {
		int ret2;

		if (kids[i] > 0) {
			wait(&ret2);
			if (ret2 != 0)
				ret = ret2;
		}
	}
	return ret;
}
---
Changes in v2:
- drop the lockref patch since Mateusz is working on a better approach
- add trailing_slashes helper function
- add a lookup_fast_for_open helper function
- make lookup_fast_for_open skip the lookup if auditing is enabled
- if we find a positive dentry and auditing is disabled, don't unlazy
- Link to v1: https://lore.kernel.org/r/20240802-openfast-v1-0-a1cff2a33063@kernel.org
---
 fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 53 insertions(+), 9 deletions(-)


---
base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
change-id: 20240723-openfast-ac49a7b6ade2

Best regards,

Mateusz Guzik Aug. 6, 2024, 3:25 p.m. UTC | #1

On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> Today, when opening a file we'll typically do a fast lookup, but if
> O_CREAT is set, the kernel always takes the exclusive inode lock. I
> assume this was done with the expectation that O_CREAT means that we
> always expect to do the create, but that's often not the case. Many
> programs set O_CREAT even in scenarios where the file already exists.
>
> This patch rearranges the pathwalk-for-open code to also attempt a
> fast_lookup in certain O_CREAT cases. If a positive dentry is found, the
> inode_lock can be avoided altogether, and if auditing isn't enabled, it
> can stay in rcuwalk mode for the last step_into.
>
> One notable exception that is hopefully temporary: if we're doing an
> rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the
> dentry in that case is more expensive than taking the i_rwsem for now.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> Here's a revised patch that does a fast_lookup in the O_CREAT codepath
> too. The main difference here is that if a positive dentry is found and
> audit_dummy_context is true, then we keep the walk lazy for the last
> component, which avoids having to take any locks on the parent (just
> like with non-O_CREAT opens).
>
> The testcase below runs in about 18s on v6.10 (on an 80 CPU machine).
> With this patch, it runs in about 1s:
>

I don't have an opinion on the patch.

If your kernel does not use apparmor and the patch manages to dodge
refing the parent, then indeed this should be fully deserialized just
like non-creat opens.

Instead of the hand-rolled benchmark may I interest you in using
will-it-scale instead? Notably it reports the achieved rate once per
second, so you can check if there is funky business going on between
reruns, gives the cpu the time to kick off turbo boost if applicable
etc.

I would bench with that myself, but I temporarily don't have handy
access to bigger hw. Even so, the below is completely optional and
perhaps more of a suggestion for the future :)

I hacked up the test case based on tests/open1.c.

git clone https://github.com/antonblanchard/will-it-scale.git

For example plop into tests/opencreate1.c && gmake &&
./opencreate1_processes -t 70:

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>

char *testcase_description = "Separate file open/close + O_CREAT";

#define template        "/tmp/willitscale.XXXXXX"
static char (*tmpfiles)[sizeof(template)];
static unsigned long local_nr_tasks;

void testcase_prepare(unsigned long nr_tasks)
{
        int i;
        tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template)
* nr_tasks);
        assert(tmpfiles);

        for (i = 0; i < nr_tasks; i++) {
                strcpy(tmpfiles[i], template);
                char *tmpfile = tmpfiles[i];
                int fd = mkstemp(tmpfile);

                assert(fd >= 0);
                close(fd);
        }

        local_nr_tasks = nr_tasks;
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
        char *tmpfile = tmpfiles[nr];

        while (1) {
                int fd = open(tmpfile, O_RDWR | O_CREAT, 0600);
                assert(fd >= 0);
                close(fd);

                (*iterations)++;
        }
}

void testcase_cleanup(void)
{
        int i;
        for (i = 0; i < local_nr_tasks; i++) {
                unlink(tmpfiles[i]);
        }
        free(tmpfiles);
}



> ---
>  fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 53 insertions(+), 9 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1e05a0f3f04d..2d716fb114c9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>         return ERR_PTR(error);
>  }
>
> +static inline bool trailing_slashes(struct nameidata *nd)
> +{
> +       return (bool)nd->last.name[nd->last.len];
> +}
> +
> +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> +{
> +       struct dentry *dentry;
> +
> +       if (open_flag & O_CREAT) {
> +               /* Don't bother on an O_EXCL create */
> +               if (open_flag & O_EXCL)
> +                       return NULL;
> +
> +               /*
> +                * FIXME: If auditing is enabled, then we'll have to unlazy to
> +                * use the dentry. For now, don't do this, since it shifts
> +                * contention from parent's i_rwsem to its d_lockref spinlock.
> +                * Reconsider this once dentry refcounting handles heavy
> +                * contention better.
> +                */
> +               if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> +                       return NULL;
> +       }
> +
> +       if (trailing_slashes(nd))
> +               nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> +
> +       dentry = lookup_fast(nd);
> +
> +       if (open_flag & O_CREAT) {
> +               /* Discard negative dentries. Need inode_lock to do the create */
> +               if (dentry && !dentry->d_inode) {
> +                       if (!(nd->flags & LOOKUP_RCU))
> +                               dput(dentry);
> +                       dentry = NULL;
> +               }
> +       }
> +       return dentry;
> +}
> +
>  static const char *open_last_lookups(struct nameidata *nd,
>                    struct file *file, const struct open_flags *op)
>  {
> @@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd,
>                 return handle_dots(nd, nd->last_type);
>         }
>
> +       /* We _can_ be in RCU mode here */
> +       dentry = lookup_fast_for_open(nd, open_flag);
> +       if (IS_ERR(dentry))
> +               return ERR_CAST(dentry);
> +
>         if (!(open_flag & O_CREAT)) {
> -               if (nd->last.name[nd->last.len])
> -                       nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> -               /* we _can_ be in RCU mode here */
> -               dentry = lookup_fast(nd);
> -               if (IS_ERR(dentry))
> -                       return ERR_CAST(dentry);
>                 if (likely(dentry))
>                         goto finish_lookup;
>
>                 if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
>                         return ERR_PTR(-ECHILD);
>         } else {
> -               /* create side of things */
>                 if (nd->flags & LOOKUP_RCU) {
> +                       /* can stay in rcuwalk if not auditing */
> +                       if (dentry && audit_dummy_context())
> +                               goto check_slashes;
>                         if (!try_to_unlazy(nd))
>                                 return ERR_PTR(-ECHILD);
>                 }
>                 audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> -               /* trailing slashes? */
> -               if (unlikely(nd->last.name[nd->last.len]))
> +check_slashes:
> +               if (trailing_slashes(nd))
>                         return ERR_PTR(-EISDIR);
> +               if (dentry)
> +                       goto finish_lookup;
>         }
>
>         if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
>
> ---
> base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
> change-id: 20240723-openfast-ac49a7b6ade2
>
> Best regards,
> --
> Jeff Layton <jlayton@kernel.org>
>

Jeff Layton Aug. 6, 2024, 4:17 p.m. UTC | #2

On Tue, 2024-08-06 at 17:25 +0200, Mateusz Guzik wrote:
> On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org>
> wrote:
> > 
> > Today, when opening a file we'll typically do a fast lookup, but if
> > O_CREAT is set, the kernel always takes the exclusive inode lock. I
> > assume this was done with the expectation that O_CREAT means that
> > we
> > always expect to do the create, but that's often not the case. Many
> > programs set O_CREAT even in scenarios where the file already
> > exists.
> > 
> > This patch rearranges the pathwalk-for-open code to also attempt a
> > fast_lookup in certain O_CREAT cases. If a positive dentry is
> > found, the
> > inode_lock can be avoided altogether, and if auditing isn't
> > enabled, it
> > can stay in rcuwalk mode for the last step_into.
> > 
> > One notable exception that is hopefully temporary: if we're doing
> > an
> > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing
> > the
> > dentry in that case is more expensive than taking the i_rwsem for
> > now.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > Here's a revised patch that does a fast_lookup in the O_CREAT
> > codepath
> > too. The main difference here is that if a positive dentry is found
> > and
> > audit_dummy_context is true, then we keep the walk lazy for the
> > last
> > component, which avoids having to take any locks on the parent
> > (just
> > like with non-O_CREAT opens).
> > 
> > The testcase below runs in about 18s on v6.10 (on an 80 CPU
> > machine).
> > With this patch, it runs in about 1s:
> > 
> 
> I don't have an opinion on the patch.
> 
> If your kernel does not use apparmor and the patch manages to dodge
> refing the parent, then indeed this should be fully deserialized just
> like non-creat opens.
> 

Yep. Pity that auditing will slow things down, but them's the breaks...

> Instead of the hand-rolled benchmark may I interest you in using
> will-it-scale instead? Notably it reports the achieved rate once per
> second, so you can check if there is funky business going on between
> reruns, gives the cpu the time to kick off turbo boost if applicable
> etc.
> 
> I would bench with that myself, but I temporarily don't have handy
> access to bigger hw. Even so, the below is completely optional and
> perhaps more of a suggestion for the future :)
> 
> I hacked up the test case based on tests/open1.c.
> 
> git clone https://github.com/antonblanchard/will-it-scale.git
> 
> For example plop into tests/opencreate1.c && gmake &&
> ./opencreate1_processes -t 70:
> 
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <assert.h>
> #include <string.h>
> 
> char *testcase_description = "Separate file open/close + O_CREAT";
> 
> #define template        "/tmp/willitscale.XXXXXX"
> static char (*tmpfiles)[sizeof(template)];
> static unsigned long local_nr_tasks;
> 
> void testcase_prepare(unsigned long nr_tasks)
> {
>         int i;
>         tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template)
> * nr_tasks);
>         assert(tmpfiles);
> 
>         for (i = 0; i < nr_tasks; i++) {
>                 strcpy(tmpfiles[i], template);
>                 char *tmpfile = tmpfiles[i];
>                 int fd = mkstemp(tmpfile);
> 
>                 assert(fd >= 0);
>                 close(fd);
>         }
> 
>         local_nr_tasks = nr_tasks;
> }
> 
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
>         char *tmpfile = tmpfiles[nr];
> 
>         while (1) {
>                 int fd = open(tmpfile, O_RDWR | O_CREAT, 0600);
>                 assert(fd >= 0);
>                 close(fd);
> 
>                 (*iterations)++;
>         }
> }
> 
> void testcase_cleanup(void)
> {
>         int i;
>         for (i = 0; i < local_nr_tasks; i++) {
>                 unlink(tmpfiles[i]);
>         }
>         free(tmpfiles);
> }
> 
> 

Good suggestion and thanks for the testcase. With v6.10 kernel, I'm
seeing numbers like this at -t 70:

min:4873 max:11510 total:418915
min:4884 max:10598 total:408848
min:3704 max:12371 total:467658
min:2842 max:11792 total:418239
min:2966 max:11511 total:414144
min:4756 max:11381 total:413137
min:4557 max:10789 total:404628
min:4780 max:11125 total:413349
min:4757 max:11156 total:405963

...with the patched kernel, things are significantly faster:

min:265865 max:508909 total:21464553                                  
min:263252 max:500084 total:21242190                                  
min:263989 max:504929 total:21396968                                  
min:263343 max:505852 total:21346829                                  
min:263023 max:507303 total:21410217                                  
min:263420 max:506593 total:21426307                                  
min:259556 max:494529 total:20927169                                  
min:264451 max:508967 total:21433676                                  
min:263486 max:509460 total:21399874                                  
min:263906 max:507400 total:21393015                                  

I can get some fancier plots if anyone is interested, but the benefit
of this patch seems pretty clear.

Mateusz Guzik Aug. 6, 2024, 4:42 p.m. UTC | #3

On Tue, Aug 6, 2024 at 6:17 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2024-08-06 at 17:25 +0200, Mateusz Guzik wrote:
> > On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org>
> > wrote:
> > >
> > > Today, when opening a file we'll typically do a fast lookup, but if
> > > O_CREAT is set, the kernel always takes the exclusive inode lock. I
> > > assume this was done with the expectation that O_CREAT means that
> > > we
> > > always expect to do the create, but that's often not the case. Many
> > > programs set O_CREAT even in scenarios where the file already
> > > exists.
> > >
> > > This patch rearranges the pathwalk-for-open code to also attempt a
> > > fast_lookup in certain O_CREAT cases. If a positive dentry is
> > > found, the
> > > inode_lock can be avoided altogether, and if auditing isn't
> > > enabled, it
> > > can stay in rcuwalk mode for the last step_into.
> > >
> > > One notable exception that is hopefully temporary: if we're doing
> > > an
> > > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing
> > > the
> > > dentry in that case is more expensive than taking the i_rwsem for
> > > now.
> > >
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > Here's a revised patch that does a fast_lookup in the O_CREAT
> > > codepath
> > > too. The main difference here is that if a positive dentry is found
> > > and
> > > audit_dummy_context is true, then we keep the walk lazy for the
> > > last
> > > component, which avoids having to take any locks on the parent
> > > (just
> > > like with non-O_CREAT opens).
> > >
> > > The testcase below runs in about 18s on v6.10 (on an 80 CPU
> > > machine).
> > > With this patch, it runs in about 1s:
> > >
> >
> > I don't have an opinion on the patch.
> >
> > If your kernel does not use apparmor and the patch manages to dodge
> > refing the parent, then indeed this should be fully deserialized just
> > like non-creat opens.
> >
>
> Yep. Pity that auditing will slow things down, but them's the breaks...
>
> > Instead of the hand-rolled benchmark may I interest you in using
> > will-it-scale instead? Notably it reports the achieved rate once per
> > second, so you can check if there is funky business going on between
> > reruns, gives the cpu the time to kick off turbo boost if applicable
> > etc.
> >
> > I would bench with that myself, but I temporarily don't have handy
> > access to bigger hw. Even so, the below is completely optional and
> > perhaps more of a suggestion for the future :)
> >
> > I hacked up the test case based on tests/open1.c.
> >
> > git clone https://github.com/antonblanchard/will-it-scale.git
> >
> > For example plop into tests/opencreate1.c && gmake &&
> > ./opencreate1_processes -t 70:
> >
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <assert.h>
> > #include <string.h>
> >
> > char *testcase_description = "Separate file open/close + O_CREAT";
> >
> > #define template        "/tmp/willitscale.XXXXXX"
> > static char (*tmpfiles)[sizeof(template)];
> > static unsigned long local_nr_tasks;
> >
> > void testcase_prepare(unsigned long nr_tasks)
> > {
> >         int i;
> >         tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template)
> > * nr_tasks);
> >         assert(tmpfiles);
> >
> >         for (i = 0; i < nr_tasks; i++) {
> >                 strcpy(tmpfiles[i], template);
> >                 char *tmpfile = tmpfiles[i];
> >                 int fd = mkstemp(tmpfile);
> >
> >                 assert(fd >= 0);
> >                 close(fd);
> >         }
> >
> >         local_nr_tasks = nr_tasks;
> > }
> >
> > void testcase(unsigned long long *iterations, unsigned long nr)
> > {
> >         char *tmpfile = tmpfiles[nr];
> >
> >         while (1) {
> >                 int fd = open(tmpfile, O_RDWR | O_CREAT, 0600);
> >                 assert(fd >= 0);
> >                 close(fd);
> >
> >                 (*iterations)++;
> >         }
> > }
> >
> > void testcase_cleanup(void)
> > {
> >         int i;
> >         for (i = 0; i < local_nr_tasks; i++) {
> >                 unlink(tmpfiles[i]);
> >         }
> >         free(tmpfiles);
> > }
> >
> >
>
> Good suggestion and thanks for the testcase. With v6.10 kernel, I'm
> seeing numbers like this at -t 70:
>
> min:4873 max:11510 total:418915
> min:4884 max:10598 total:408848
> min:3704 max:12371 total:467658
> min:2842 max:11792 total:418239
> min:2966 max:11511 total:414144
> min:4756 max:11381 total:413137
> min:4557 max:10789 total:404628
> min:4780 max:11125 total:413349
> min:4757 max:11156 total:405963
>
> ...with the patched kernel, things are significantly faster:
>
> min:265865 max:508909 total:21464553
> min:263252 max:500084 total:21242190
> min:263989 max:504929 total:21396968
> min:263343 max:505852 total:21346829
> min:263023 max:507303 total:21410217
> min:263420 max:506593 total:21426307
> min:259556 max:494529 total:20927169
> min:264451 max:508967 total:21433676
> min:263486 max:509460 total:21399874
> min:263906 max:507400 total:21393015
>
> I can get some fancier plots if anyone is interested, but the benefit
> of this patch seems pretty clear.

In the commit message I would replace that handrolled bench thing with +/-:
<quote>
70-way open(.., O_RDWR | O_CREAT) calls on already existing files, one
per-worker:
before: 467658
after:  21464553 (+4589%)
</quote>

I guess it would make sense to check if unlink1_processes is doing
fine if it's not too much hassle.

overall pretty decent win :P

Andi Kleen Aug. 6, 2024, 7:11 p.m. UTC | #4

Mateusz Guzik <mjguzik@gmail.com> writes:
>
> I would bench with that myself, but I temporarily don't have handy
> access to bigger hw. Even so, the below is completely optional and
> perhaps more of a suggestion for the future :)
>
> I hacked up the test case based on tests/open1.c.

Don't you need two test cases? One where the file exists and one
where it doesn't. Because the "doesn't exist" will likely be slower
than before because it will do the lookups twice,
and it will likely even slow single threaded.

I assume the penalty will also depend on the number of entries
in the path.

That all seem to be an important considerations in judging the benefits
of the patch.

-Andi

Mateusz Guzik Aug. 6, 2024, 7:22 p.m. UTC | #5

On Tue, Aug 6, 2024 at 9:11 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
> >
> > I would bench with that myself, but I temporarily don't have handy
> > access to bigger hw. Even so, the below is completely optional and
> > perhaps more of a suggestion for the future :)
> >
> > I hacked up the test case based on tests/open1.c.
>
> Don't you need two test cases? One where the file exists and one
> where it doesn't. Because the "doesn't exist" will likely be slower
> than before because it will do the lookups twice,
> and it will likely even slow single threaded.
>
> I assume the penalty will also depend on the number of entries
> in the path.
>
> That all seem to be an important considerations in judging the benefits
> of the patch.
>

This is why I suggested separately running "unlink1" which is
guaranteed to create a file every time -- all iterations will fail the
proposed fast path.

Unless you meant a mixed variant where only some of the threads create
files. Perhaps worthwhile to add, not hard to do (one can switch the
mode based on passed worker number).

Jeff Layton Aug. 6, 2024, 7:26 p.m. UTC | #6

On Tue, 2024-08-06 at 12:11 -0700, Andi Kleen wrote:
> Mateusz Guzik <mjguzik@gmail.com> writes:
> > 
> > I would bench with that myself, but I temporarily don't have handy
> > access to bigger hw. Even so, the below is completely optional and
> > perhaps more of a suggestion for the future :)
> > 
> > I hacked up the test case based on tests/open1.c.
> 
> Don't you need two test cases? One where the file exists and one
> where it doesn't. Because the "doesn't exist" will likely be slower
> than before because it will do the lookups twice,
> and it will likely even slow single threaded.
> 
> I assume the penalty will also depend on the number of entries
> in the path.
> 
> That all seem to be an important considerations in judging the benefits
> of the patch.
> 

Definitely.

FWIW, I did test a single threaded (bespoke) test case that did a bunch
of O_CREAT opens, closes and then unlinks. I didn't measure any
discernable difference with this patch. My conclusion from that was
that the extra lockless lookup should be cheap.

That said, this could show a difference if you have rather long hash
chains that need to be walked completely, and you have to actually do
the create every time. In practice though, time spent under the
inode_lock and doing the create tends to dominate in that case, so I
*think* this should still be worthwhile.

I'll plan to add a test like that to will_it_scale unless Mateusz beats
me to it. A long soak in linux-next is probably also justified with
this patch.

Jeff Layton Aug. 6, 2024, 7:51 p.m. UTC | #7

On Tue, 2024-08-06 at 10:32 -0400, Jeff Layton wrote:
> Today, when opening a file we'll typically do a fast lookup, but if
> O_CREAT is set, the kernel always takes the exclusive inode lock. I
> assume this was done with the expectation that O_CREAT means that we
> always expect to do the create, but that's often not the case. Many
> programs set O_CREAT even in scenarios where the file already exists.
> 
> This patch rearranges the pathwalk-for-open code to also attempt a
> fast_lookup in certain O_CREAT cases. If a positive dentry is found, the
> inode_lock can be avoided altogether, and if auditing isn't enabled, it
> can stay in rcuwalk mode for the last step_into.
> 
> One notable exception that is hopefully temporary: if we're doing an
> rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the
> dentry in that case is more expensive than taking the i_rwsem for now.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> Here's a revised patch that does a fast_lookup in the O_CREAT codepath
> too. The main difference here is that if a positive dentry is found and
> audit_dummy_context is true, then we keep the walk lazy for the last
> component, which avoids having to take any locks on the parent (just
> like with non-O_CREAT opens).
> 
> The testcase below runs in about 18s on v6.10 (on an 80 CPU machine).
> With this patch, it runs in about 1s:
> 
>  #define _GNU_SOURCE 1
>  #include <stdio.h>
>  #include <unistd.h>
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <stdlib.h>
>  #include <sys/wait.h>
> 
>  #define PROCS           70
>  #define LOOPS           500000
> 
> static int openloop(int tnum)
> {
> 	char *file;
> 	int i, ret;
> 
>        	ret = asprintf(&file, "./testfile%d", tnum);
> 	if (ret < 0) {
> 		printf("asprintf failed for proc %d", tnum);
> 		return 1;
> 	}
> 
> 	for (i = 0; i < LOOPS; ++i) {
> 		int fd = open(file, O_RDWR|O_CREAT, 0644);
> 
> 		if (fd < 0) {
> 			perror("open");
> 			return 1;
> 		}
> 		close(fd);
> 	}
> 	unlink(file);
> 	free(file);
> 	return 0;
> }
> 
> int main(int argc, char **argv) {
> 	pid_t kids[PROCS];
> 	int i, ret = 0;
> 
> 	for (i = 0; i < PROCS; ++i) {
> 		kids[i] = fork();
> 		if (kids[i] > 0)
> 			return openloop(i);
> 		if (kids[i] < 0)
> 			perror("fork");
> 	}
> 
> 	for (i = 0; i < PROCS; ++i) {
> 		int ret2;
> 
> 		if (kids[i] > 0) {
> 			wait(&ret2);
> 			if (ret2 != 0)
> 				ret = ret2;
> 		}
> 	}
> 	return ret;
> }
> ---
> Changes in v2:
> - drop the lockref patch since Mateusz is working on a better approach
> - add trailing_slashes helper function
> - add a lookup_fast_for_open helper function
> - make lookup_fast_for_open skip the lookup if auditing is enabled
> - if we find a positive dentry and auditing is disabled, don't unlazy
> - Link to v1: https://lore.kernel.org/r/20240802-openfast-v1-0-a1cff2a33063@kernel.org
> ---
>  fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 53 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 1e05a0f3f04d..2d716fb114c9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  	return ERR_PTR(error);
>  }
>  
> +static inline bool trailing_slashes(struct nameidata *nd)
> +{
> +	return (bool)nd->last.name[nd->last.len];
> +}
> +
> +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> +{
> +	struct dentry *dentry;
> +
> +	if (open_flag & O_CREAT) {
> +		/* Don't bother on an O_EXCL create */
> +		if (open_flag & O_EXCL)
> +			return NULL;
> +
> +		/*
> +		 * FIXME: If auditing is enabled, then we'll have to unlazy to
> +		 * use the dentry. For now, don't do this, since it shifts
> +		 * contention from parent's i_rwsem to its d_lockref spinlock.
> +		 * Reconsider this once dentry refcounting handles heavy
> +		 * contention better.
> +		 */
> +		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> +			return NULL;
> +	}
> +
> +	if (trailing_slashes(nd))
> +		nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> +
> +	dentry = lookup_fast(nd);

Self-NAK on this patch. We have to test for IS_ERR on the returned
dentry here. I'll send a v3 along after I've retested it.

> +
> +	if (open_flag & O_CREAT) {
> +		/* Discard negative dentries. Need inode_lock to do the create */
> +		if (dentry && !dentry->d_inode) {
> +			if (!(nd->flags & LOOKUP_RCU))
> +				dput(dentry);
> +			dentry = NULL;
> +		}
> +	}
> +	return dentry;
> +}
> +
>  static const char *open_last_lookups(struct nameidata *nd,
>  		   struct file *file, const struct open_flags *op)
>  {
> @@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd,
>  		return handle_dots(nd, nd->last_type);
>  	}
>  
> +	/* We _can_ be in RCU mode here */
> +	dentry = lookup_fast_for_open(nd, open_flag);
> +	if (IS_ERR(dentry))
> +		return ERR_CAST(dentry);
> +
>  	if (!(open_flag & O_CREAT)) {
> -		if (nd->last.name[nd->last.len])
> -			nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> -		/* we _can_ be in RCU mode here */
> -		dentry = lookup_fast(nd);
> -		if (IS_ERR(dentry))
> -			return ERR_CAST(dentry);
>  		if (likely(dentry))
>  			goto finish_lookup;
>  
>  		if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
>  			return ERR_PTR(-ECHILD);
>  	} else {
> -		/* create side of things */
>  		if (nd->flags & LOOKUP_RCU) {
> +			/* can stay in rcuwalk if not auditing */
> +			if (dentry && audit_dummy_context())
> +				goto check_slashes;
>  			if (!try_to_unlazy(nd))
>  				return ERR_PTR(-ECHILD);
>  		}
>  		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> -		/* trailing slashes? */
> -		if (unlikely(nd->last.name[nd->last.len]))
> +check_slashes:
> +		if (trailing_slashes(nd))
>  			return ERR_PTR(-EISDIR);
> +		if (dentry)
> +			goto finish_lookup;
>  	}
>  
>  	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
> 
> ---
> base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
> change-id: 20240723-openfast-ac49a7b6ade2
> 
> Best regards,

Mateusz Guzik Aug. 6, 2024, 8:03 p.m. UTC | #8

On Tue, Aug 6, 2024 at 9:26 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2024-08-06 at 12:11 -0700, Andi Kleen wrote:
> > Mateusz Guzik <mjguzik@gmail.com> writes:
> > >
> > > I would bench with that myself, but I temporarily don't have handy
> > > access to bigger hw. Even so, the below is completely optional and
> > > perhaps more of a suggestion for the future :)
> > >
> > > I hacked up the test case based on tests/open1.c.
> >
> > Don't you need two test cases? One where the file exists and one
> > where it doesn't. Because the "doesn't exist" will likely be slower
> > than before because it will do the lookups twice,
> > and it will likely even slow single threaded.
> >
> > I assume the penalty will also depend on the number of entries
> > in the path.
> >
> > That all seem to be an important considerations in judging the benefits
> > of the patch.
> >
>
> Definitely.
>
> FWIW, I did test a single threaded (bespoke) test case that did a bunch
> of O_CREAT opens, closes and then unlinks. I didn't measure any
> discernable difference with this patch. My conclusion from that was
> that the extra lockless lookup should be cheap.
>
> That said, this could show a difference if you have rather long hash
> chains that need to be walked completely, and you have to actually do
> the create every time. In practice though, time spent under the
> inode_lock and doing the create tends to dominate in that case, so I
> *think* this should still be worthwhile.
>
> I'll plan to add a test like that to will_it_scale unless Mateusz beats
> me to it. A long soak in linux-next is probably also justified with
> this patch.

Well if we are really going there I have to point out a couple of
three things concerning single-threaded performance from entering the
kernel to getting back with a file descriptor.

The headline is that there is a lot of single-threaded perf left on
the table and modulo a significant hit in terms of cycles it is going
to be hard to measure a meaningful result.

Before I get to the vfs layer, there is a significant loss in the
memory allocator because of memcg -- it takes several irq off/on trips
for every alloc (needed to grab struct file *). I have a plan what to
do with it (handle stuff with local cmpxchg (note no lock prefix)),
which I'm trying to get around to. Apart from that you may note the
allocator fast path performs a 16-byte cmpxchg, which is again dog
slow and executes twice (once for the file obj, another time for the
namei buffer). Someone(tm) should patch it up and I have some vague
ideas, but 0 idea when I can take a serious stab.

As for vfs, just from atomics perspective:
- opening a results in a spurious ref/unref cycle on the dentry, i
posted a patch [1] to sort it out (maybe Al will write his own
instead). single-threaded it gives about +5% to open/close rate
- there is an avoidable smp_mb in do_dentry_open, posted a patch [2]
and it got acked. i did not bother benching it
- someone sneaked in atomic refcount management to "struct filename"
to appease io_uring vs audit. this results in another atomic added to
every single path lookup. I have an idea what to do there
- opening for write (which you do with O_CREAT) calls
mnt_get_write_access which contains another smp_mb. this probably can
be sorted out the same way percpu semaphores got taken care of
- __legitimize_mnt has yet another smp_mb to synchro against unmount.
this *probably* can be sorted out with synchronize_rcu, but some care
is needed to not induce massive delays on unmount
- credential management is also done with atomics (so that's get + put
for the open/close cycle), $elsewhere I implemented a scheme where the
local count is cached in the thread itself and only rolled up on
credential change. this whacks the problem and is something i'm
planning on proposing at some point

and more

That is to say I completely agree this does add extra work, but the
kernel is not in shape where true single-threaded impact can be
sensibly measured. I do find it prudent to validate nothing happened
to somehow floor scalability of the case which does need to create
stuff though.

[1] https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20240806172846.886570-1-mjguzik@gmail.com/

Jeff Layton Aug. 6, 2024, 8:42 p.m. UTC | #9

On Tue, 2024-08-06 at 21:22 +0200, Mateusz Guzik wrote:
> On Tue, Aug 6, 2024 at 9:11 PM Andi Kleen <ak@linux.intel.com> wrote:
> > 
> > Mateusz Guzik <mjguzik@gmail.com> writes:
> > > 
> > > I would bench with that myself, but I temporarily don't have
> > > handy
> > > access to bigger hw. Even so, the below is completely optional
> > > and
> > > perhaps more of a suggestion for the future :)
> > > 
> > > I hacked up the test case based on tests/open1.c.
> > 
> > Don't you need two test cases? One where the file exists and one
> > where it doesn't. Because the "doesn't exist" will likely be slower
> > than before because it will do the lookups twice,
> > and it will likely even slow single threaded.
> > 
> > I assume the penalty will also depend on the number of entries
> > in the path.
> > 
> > That all seem to be an important considerations in judging the
> > benefits
> > of the patch.
> > 
> 
> This is why I suggested separately running "unlink1" which is
> guaranteed to create a file every time -- all iterations will fail
> the
> proposed fast path.
> 
> Unless you meant a mixed variant where only some of the threads
> create
> files. Perhaps worthwhile to add, not hard to do (one can switch the
> mode based on passed worker number).
> 

Well...

    # ./unlink1_processes -t 70 -s 100

    average:
    v6.10:		114455
    v6.10 + patch:	149513

I suspect what's happening here is that this patch relieves contention
for the inode_lock and that allows the unlinks to proceed faster.

Running it with a single process though:

    average:
    v6.10:		200106
    v6.10 + patch:	199188

So, ~.4% degradation there? That doesn't seem too bad given the gain in
the threaded test.

Andi Kleen Aug. 6, 2024, 8:47 p.m. UTC | #10

> Before I get to the vfs layer, there is a significant loss in the
> memory allocator because of memcg -- it takes several irq off/on trips
> for every alloc (needed to grab struct file *). I have a plan what to
> do with it (handle stuff with local cmpxchg (note no lock prefix)),
> which I'm trying to get around to. Apart from that you may note the
> allocator fast path performs a 16-byte cmpxchg, which is again dog
> slow and executes twice (once for the file obj, another time for the
> namei buffer). Someone(tm) should patch it up and I have some vague
> ideas, but 0 idea when I can take a serious stab.

I just LBR sampled it on my skylake and it doesn't look
particularly slow. You see the whole massive block including CMPXCHG16 
gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely
a missing cache line.

    kmem_cache_free:
        ffffffff9944ce20                        nop %edi, %edx
        ffffffff9944ce24                        nopl  %eax, (%rax,%rax,1)
        ffffffff9944ce29                        pushq  %rbp
        ffffffff9944ce2a                        mov %rdi, %rdx
        ffffffff9944ce2d                        mov %rsp, %rbp
        ffffffff9944ce30                        pushq  %r15
        ffffffff9944ce32                        pushq  %r14
        ffffffff9944ce34                        pushq  %r13
        ffffffff9944ce36                        pushq  %r12
        ffffffff9944ce38                        mov $0x80000000, %r12d
        ffffffff9944ce3e                        pushq  %rbx
        ffffffff9944ce3f                        mov %rsi, %rbx
        ffffffff9944ce42                        and $0xfffffffffffffff0, %rsp
        ffffffff9944ce46                        sub $0x10, %rsp
        ffffffff9944ce4a                        movq  %gs:0x28, %rax
        ffffffff9944ce53                        movq  %rax, 0x8(%rsp)
        ffffffff9944ce58                        xor %eax, %eax
        ffffffff9944ce5a                        add %rsi, %r12
        ffffffff9944ce5d                        jb 0xffffffff9944d1ea
        ffffffff9944ce63                        mov $0xffffffff80000000, %rax
        ffffffff9944ce6a                        xor %r13d, %r13d
        ffffffff9944ce6d                        subq  0x17b068c(%rip), %rax
        ffffffff9944ce74                        add %r12, %rax
        ffffffff9944ce77                        shr $0xc, %rax
        ffffffff9944ce7b                        shl $0x6, %rax
        ffffffff9944ce7f                        addq  0x17b066a(%rip), %rax
        ffffffff9944ce86                        movq  0x8(%rax), %rcx
        ffffffff9944ce8a                        test $0x1, %cl
        ffffffff9944ce8d                        jnz 0xffffffff9944d15c
        ffffffff9944ce93                        nopl  %eax, (%rax,%rax,1)
        ffffffff9944ce98                        movq  (%rax), %rcx
        ffffffff9944ce9b                        and $0x8, %ch
        ffffffff9944ce9e                        jz 0xffffffff9944cfea
        ffffffff9944cea4                        test %rax, %rax
        ffffffff9944cea7                        jz 0xffffffff9944cfea
        ffffffff9944cead                        movq  0x8(%rax), %r14
        ffffffff9944ceb1                        test %r14, %r14
        ffffffff9944ceb4                        jz 0xffffffff9944cfac
        ffffffff9944ceba                        cmp %r14, %rdx
        ffffffff9944cebd                        jnz 0xffffffff9944d165
        ffffffff9944cec3                        test %r14, %r14
        ffffffff9944cec6                        jz 0xffffffff9944cfac
        ffffffff9944cecc                        movq  0x8(%rbp), %r15
        ffffffff9944ced0                        nopl  %eax, (%rax,%rax,1)
        ffffffff9944ced5                        movq  0x1fe5134(%rip), %rax
        ffffffff9944cedc                        test %r13, %r13
        ffffffff9944cedf                        jnz 0xffffffff9944ceef
        ffffffff9944cee1                        mov $0xffffffff80000000, %rax
        ffffffff9944cee8                        subq  0x17b0611(%rip), %rax
        ffffffff9944ceef                        add %rax, %r12
        ffffffff9944cef2                        shr $0xc, %r12
        ffffffff9944cef6                        shl $0x6, %r12
        ffffffff9944cefa                        addq  0x17b05ef(%rip), %r12
        ffffffff9944cf01                        movq  0x8(%r12), %rax
        ffffffff9944cf06                        mov %r12, %r13
        ffffffff9944cf09                        test $0x1, %al
        ffffffff9944cf0b                        jnz 0xffffffff9944d1b1
        ffffffff9944cf11                        nopl  %eax, (%rax,%rax,1)
        ffffffff9944cf16                        movq  (%r13), %rax
        ffffffff9944cf1a                        movq  %rbx, (%rsp)
        ffffffff9944cf1e                        test $0x8, %ah
        ffffffff9944cf21                        mov $0x0, %eax
        ffffffff9944cf26                        cmovz %rax, %r13
        ffffffff9944cf2a                        data16 nop
        ffffffff9944cf2c                        movq  0x38(%r13), %r8
        ffffffff9944cf30                        cmp $0x3, %r8
        ffffffff9944cf34                        jnbe 0xffffffff9944d1ca
        ffffffff9944cf3a                        nopl  %eax, (%rax,%rax,1)
        ffffffff9944cf3f                        movq  0x23d6f72(%rip), %rax
        ffffffff9944cf46                        mov %rbx, %rdx
        ffffffff9944cf49                        sub %rax, %rdx
        ffffffff9944cf4c                        cmp $0x1fffff, %rdx
        ffffffff9944cf53                        jbe 0xffffffff9944d03a
        ffffffff9944cf59                        movq  (%r14), %rax
        ffffffff9944cf5c                        addq  %gs:0x66bccab4(%rip), %rax
        ffffffff9944cf64                        movq  0x8(%rax), %rdx
        ffffffff9944cf68                        cmpq  %r13, 0x10(%rax)
        ffffffff9944cf6c                        jnz 0xffffffff9944d192
        ffffffff9944cf72                        movl  0x28(%r14), %ecx
        ffffffff9944cf76                        movq  (%rax), %rax
        ffffffff9944cf79                        add %rbx, %rcx
        ffffffff9944cf7c                        cmp %rbx, %rax
        ffffffff9944cf7f                        jz 0xffffffff9944d1ba
        ffffffff9944cf85                        movq  0xb8(%r14), %rsi
        ffffffff9944cf8c                        mov %rcx, %rdi
        ffffffff9944cf8f                        bswap %rdi
        ffffffff9944cf92                        xor %rax, %rsi
        ffffffff9944cf95                        xor %rdi, %rsi
        ffffffff9944cf98                        movq  %rsi, (%rcx)
        ffffffff9944cf9b                        leaq  0x2000(%rdx), %rcx
        ffffffff9944cfa2                        movq  (%r14), %rsi
        ffffffff9944cfa5                        cmpxchg16bx  %gs:(%rsi)
        ffffffff9944cfaa                        jnz 0xffffffff9944cf59
        ffffffff9944cfac                        movq  0x8(%rsp), %rax
        ffffffff9944cfb1                        subq  %gs:0x28, %rax
        ffffffff9944cfba                        jnz 0xffffffff9944d1fc
        ffffffff9944cfc0                        leaq  -0x28(%rbp), %rsp
        ffffffff9944cfc4                        popq  %rbx
        ffffffff9944cfc5                        popq  %r12
        ffffffff9944cfc7                        popq  %r13
        ffffffff9944cfc9                        popq  %r14
        ffffffff9944cfcb                        popq  %r15
        ffffffff9944cfcd                        popq  %rbp
        ffffffff9944cfce                        retq                            # PRED 38 cycles [126] 2.74 IPC    <-------------

Christian Brauner Aug. 7, 2024, 2:26 p.m. UTC | #11

> +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> +{
> +	struct dentry *dentry;
> +
> +	if (open_flag & O_CREAT) {
> +		/* Don't bother on an O_EXCL create */
> +		if (open_flag & O_EXCL)
> +			return NULL;
> +
> +		/*
> +		 * FIXME: If auditing is enabled, then we'll have to unlazy to
> +		 * use the dentry. For now, don't do this, since it shifts
> +		 * contention from parent's i_rwsem to its d_lockref spinlock.
> +		 * Reconsider this once dentry refcounting handles heavy
> +		 * contention better.
> +		 */
> +		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> +			return NULL;

Hm, the audit_inode() on the parent is done independent of whether the
file was actually created or not. But the audit_inode() on the file
itself is only done when it was actually created. Imho, there's no need
to do audit_inode() on the parent when we immediately find that file
already existed. If we accept that then this makes the change a lot
simpler.

The inconsistency would partially remain though. When the file doesn't
exist audit_inode() on the parent is called but by the time we've
grabbed the inode lock someone else might already have created the file
and then again we wouldn't audit_inode() on the file but we would have
on the parent.

I think that's fine. But if that's bothersome the more aggressive thing
to do would be to pull that audit_inode() on the parent further down
after we created the file. Imho, that should be fine?...

See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
for a completely untested draft of what I mean.

Jeff Layton Aug. 7, 2024, 2:36 p.m. UTC | #12

On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > +{
> > +	struct dentry *dentry;
> > +
> > +	if (open_flag & O_CREAT) {
> > +		/* Don't bother on an O_EXCL create */
> > +		if (open_flag & O_EXCL)
> > +			return NULL;
> > +
> > +		/*
> > +		 * FIXME: If auditing is enabled, then we'll have to unlazy to
> > +		 * use the dentry. For now, don't do this, since it shifts
> > +		 * contention from parent's i_rwsem to its d_lockref spinlock.
> > +		 * Reconsider this once dentry refcounting handles heavy
> > +		 * contention better.
> > +		 */
> > +		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > +			return NULL;
> 
> Hm, the audit_inode() on the parent is done independent of whether the
> file was actually created or not. But the audit_inode() on the file
> itself is only done when it was actually created. Imho, there's no need
> to do audit_inode() on the parent when we immediately find that file
> already existed. If we accept that then this makes the change a lot
> simpler.
> 
> The inconsistency would partially remain though. When the file doesn't
> exist audit_inode() on the parent is called but by the time we've
> grabbed the inode lock someone else might already have created the file
> and then again we wouldn't audit_inode() on the file but we would have
> on the parent.
> 
> I think that's fine. But if that's bothersome the more aggressive thing
> to do would be to pull that audit_inode() on the parent further down
> after we created the file. Imho, that should be fine?...
> 
> See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> for a completely untested draft of what I mean.

Yeah, that's a lot simpler. That said, my experience when I've worked
with audit in the past is that people who are using it are _very_
sensitive to changes of when records get emitted or not. I don't like
this, because I think the rules here are ad-hoc and somewhat arbitrary,
but keeping everything working exactly the same has been my MO whenever
I have to work in there.

If a certain access pattern suddenly generates a different set of
records (or some are missing, as would be in this case), we might get
bug reports about this. I'm ok with simplifying this code in the way
you suggest, but we may want to do it in a patch on top of mine, to
make it simple to revert later if that becomes necessary.

Christian Brauner Aug. 8, 2024, 10:36 a.m. UTC | #13

On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > +{
> > > +	struct dentry *dentry;
> > > +
> > > +	if (open_flag & O_CREAT) {
> > > +		/* Don't bother on an O_EXCL create */
> > > +		if (open_flag & O_EXCL)
> > > +			return NULL;
> > > +
> > > +		/*
> > > +		 * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > +		 * use the dentry. For now, don't do this, since it shifts
> > > +		 * contention from parent's i_rwsem to its d_lockref spinlock.
> > > +		 * Reconsider this once dentry refcounting handles heavy
> > > +		 * contention better.
> > > +		 */
> > > +		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > +			return NULL;
> > 
> > Hm, the audit_inode() on the parent is done independent of whether the
> > file was actually created or not. But the audit_inode() on the file
> > itself is only done when it was actually created. Imho, there's no need
> > to do audit_inode() on the parent when we immediately find that file
> > already existed. If we accept that then this makes the change a lot
> > simpler.
> > 
> > The inconsistency would partially remain though. When the file doesn't
> > exist audit_inode() on the parent is called but by the time we've
> > grabbed the inode lock someone else might already have created the file
> > and then again we wouldn't audit_inode() on the file but we would have
> > on the parent.
> > 
> > I think that's fine. But if that's bothersome the more aggressive thing
> > to do would be to pull that audit_inode() on the parent further down
> > after we created the file. Imho, that should be fine?...
> > 
> > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > for a completely untested draft of what I mean.
> 
> Yeah, that's a lot simpler. That said, my experience when I've worked
> with audit in the past is that people who are using it are _very_
> sensitive to changes of when records get emitted or not. I don't like
> this, because I think the rules here are ad-hoc and somewhat arbitrary,
> but keeping everything working exactly the same has been my MO whenever
> I have to work in there.
> 
> If a certain access pattern suddenly generates a different set of
> records (or some are missing, as would be in this case), we might get
> bug reports about this. I'm ok with simplifying this code in the way
> you suggest, but we may want to do it in a patch on top of mine, to
> make it simple to revert later if that becomes necessary.

Fwiw, even with the rearranged checks in v3 of the patch audit records
will be dropped because we may find a positive dentry but the path may
have trailing slashes. At that point we just return without audit
whereas before we always would've done that audit.

Honestly, we should move that audit event as right now it's just really
weird and see if that works. Otherwise the change is somewhat horrible
complicating the already convoluted logic even more.

So I'm appending the patches that I have on top of your patch in
vfs.misc. Can you (other as well ofc) take a look and tell me whether
that's not breaking anything completely other than later audit events?

Jeff Layton Aug. 8, 2024, 10:54 a.m. UTC | #14

On Thu, 2024-08-08 at 12:36 +0200, Christian Brauner wrote:
> On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > +{
> > > > +	struct dentry *dentry;
> > > > +
> > > > +	if (open_flag & O_CREAT) {
> > > > +		/* Don't bother on an O_EXCL create */
> > > > +		if (open_flag & O_EXCL)
> > > > +			return NULL;
> > > > +
> > > > +		/*
> > > > +		 * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > +		 * use the dentry. For now, don't do this, since it shifts
> > > > +		 * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > +		 * Reconsider this once dentry refcounting handles heavy
> > > > +		 * contention better.
> > > > +		 */
> > > > +		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > +			return NULL;
> > > 
> > > Hm, the audit_inode() on the parent is done independent of whether the
> > > file was actually created or not. But the audit_inode() on the file
> > > itself is only done when it was actually created. Imho, there's no need
> > > to do audit_inode() on the parent when we immediately find that file
> > > already existed. If we accept that then this makes the change a lot
> > > simpler.
> > > 
> > > The inconsistency would partially remain though. When the file doesn't
> > > exist audit_inode() on the parent is called but by the time we've
> > > grabbed the inode lock someone else might already have created the file
> > > and then again we wouldn't audit_inode() on the file but we would have
> > > on the parent.
> > > 
> > > I think that's fine. But if that's bothersome the more aggressive thing
> > > to do would be to pull that audit_inode() on the parent further down
> > > after we created the file. Imho, that should be fine?...
> > > 
> > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > for a completely untested draft of what I mean.
> > 
> > Yeah, that's a lot simpler. That said, my experience when I've worked
> > with audit in the past is that people who are using it are _very_
> > sensitive to changes of when records get emitted or not. I don't like
> > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > but keeping everything working exactly the same has been my MO whenever
> > I have to work in there.
> > 
> > If a certain access pattern suddenly generates a different set of
> > records (or some are missing, as would be in this case), we might get
> > bug reports about this. I'm ok with simplifying this code in the way
> > you suggest, but we may want to do it in a patch on top of mine, to
> > make it simple to revert later if that becomes necessary.
> 
> Fwiw, even with the rearranged checks in v3 of the patch audit records
> will be dropped because we may find a positive dentry but the path may
> have trailing slashes. At that point we just return without audit
> whereas before we always would've done that audit.
> 

I don't think so. v3 has it drop out of rcuwalk and do the audit_inode
call in the case of trailing slashes. I took great pains here to ensure
that if we emitted a record before that we still do it after. Do let me
know if I missed a case though.

> Honestly, we should move that audit event as right now it's just really
> weird and see if that works. Otherwise the change is somewhat horrible
> complicating the already convoluted logic even more.
> 
> So I'm appending the patches that I have on top of your patch in
> vfs.misc. Can you (other as well ofc) take a look and tell me whether
> that's not breaking anything completely other than later audit events?

That all looks fine to me. I'm not a heavy user of audit, but the
change you've made seems sane to me. Hopefully no one will notice.

I'll plan to do some testing with it later today.

Thanks!

Christian Brauner Aug. 8, 2024, 11:18 a.m. UTC | #15

> I don't think so. v3 has it drop out of rcuwalk and do the audit_inode
> call in the case of trailing slashes. I took great pains here to ensure
> that if we emitted a record before that we still do it after. Do let me
> know if I missed a case though.

Ah, I missed that this was predicated on audit not being enabled. Sorry
about the noise...

Jan Kara Aug. 8, 2024, 5:11 p.m. UTC | #16

On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > +{
> > > > +	struct dentry *dentry;
> > > > +
> > > > +	if (open_flag & O_CREAT) {
> > > > +		/* Don't bother on an O_EXCL create */
> > > > +		if (open_flag & O_EXCL)
> > > > +			return NULL;
> > > > +
> > > > +		/*
> > > > +		 * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > +		 * use the dentry. For now, don't do this, since it shifts
> > > > +		 * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > +		 * Reconsider this once dentry refcounting handles heavy
> > > > +		 * contention better.
> > > > +		 */
> > > > +		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > +			return NULL;
> > > 
> > > Hm, the audit_inode() on the parent is done independent of whether the
> > > file was actually created or not. But the audit_inode() on the file
> > > itself is only done when it was actually created. Imho, there's no need
> > > to do audit_inode() on the parent when we immediately find that file
> > > already existed. If we accept that then this makes the change a lot
> > > simpler.
> > > 
> > > The inconsistency would partially remain though. When the file doesn't
> > > exist audit_inode() on the parent is called but by the time we've
> > > grabbed the inode lock someone else might already have created the file
> > > and then again we wouldn't audit_inode() on the file but we would have
> > > on the parent.
> > > 
> > > I think that's fine. But if that's bothersome the more aggressive thing
> > > to do would be to pull that audit_inode() on the parent further down
> > > after we created the file. Imho, that should be fine?...
> > > 
> > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > for a completely untested draft of what I mean.
> > 
> > Yeah, that's a lot simpler. That said, my experience when I've worked
> > with audit in the past is that people who are using it are _very_
> > sensitive to changes of when records get emitted or not. I don't like
> > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > but keeping everything working exactly the same has been my MO whenever
> > I have to work in there.
> > 
> > If a certain access pattern suddenly generates a different set of
> > records (or some are missing, as would be in this case), we might get
> > bug reports about this. I'm ok with simplifying this code in the way
> > you suggest, but we may want to do it in a patch on top of mine, to
> > make it simple to revert later if that becomes necessary.
> 
> Fwiw, even with the rearranged checks in v3 of the patch audit records
> will be dropped because we may find a positive dentry but the path may
> have trailing slashes. At that point we just return without audit
> whereas before we always would've done that audit.
> 
> Honestly, we should move that audit event as right now it's just really
> weird and see if that works. Otherwise the change is somewhat horrible
> complicating the already convoluted logic even more.
> 
> So I'm appending the patches that I have on top of your patch in
> vfs.misc. Can you (other as well ofc) take a look and tell me whether
> that's not breaking anything completely other than later audit events?

The changes look good as far as I'm concerned but let me CC audit guys if
they have some thoughts regarding the change in generating audit event for
the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
event for the parent when we are failing open due to trailing slashes in
the pathname? Essentially we are speaking about moving:

	audit_inode(nd->name, dir, AUDIT_INODE_PARENT);

from open_last_lookups() into lookup_open().

								Honza

Paul Moore Aug. 8, 2024, 9:12 p.m. UTC | #17

On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > +{
> > > > > +       struct dentry *dentry;
> > > > > +
> > > > > +       if (open_flag & O_CREAT) {
> > > > > +               /* Don't bother on an O_EXCL create */
> > > > > +               if (open_flag & O_EXCL)
> > > > > +                       return NULL;
> > > > > +
> > > > > +               /*
> > > > > +                * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > +                * use the dentry. For now, don't do this, since it shifts
> > > > > +                * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > +                * Reconsider this once dentry refcounting handles heavy
> > > > > +                * contention better.
> > > > > +                */
> > > > > +               if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > +                       return NULL;
> > > >
> > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > file was actually created or not. But the audit_inode() on the file
> > > > itself is only done when it was actually created. Imho, there's no need
> > > > to do audit_inode() on the parent when we immediately find that file
> > > > already existed. If we accept that then this makes the change a lot
> > > > simpler.
> > > >
> > > > The inconsistency would partially remain though. When the file doesn't
> > > > exist audit_inode() on the parent is called but by the time we've
> > > > grabbed the inode lock someone else might already have created the file
> > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > on the parent.
> > > >
> > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > to do would be to pull that audit_inode() on the parent further down
> > > > after we created the file. Imho, that should be fine?...
> > > >
> > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > for a completely untested draft of what I mean.
> > >
> > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > with audit in the past is that people who are using it are _very_
> > > sensitive to changes of when records get emitted or not. I don't like
> > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > but keeping everything working exactly the same has been my MO whenever
> > > I have to work in there.
> > >
> > > If a certain access pattern suddenly generates a different set of
> > > records (or some are missing, as would be in this case), we might get
> > > bug reports about this. I'm ok with simplifying this code in the way
> > > you suggest, but we may want to do it in a patch on top of mine, to
> > > make it simple to revert later if that becomes necessary.
> >
> > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > will be dropped because we may find a positive dentry but the path may
> > have trailing slashes. At that point we just return without audit
> > whereas before we always would've done that audit.
> >
> > Honestly, we should move that audit event as right now it's just really
> > weird and see if that works. Otherwise the change is somewhat horrible
> > complicating the already convoluted logic even more.
> >
> > So I'm appending the patches that I have on top of your patch in
> > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > that's not breaking anything completely other than later audit events?
>
> The changes look good as far as I'm concerned but let me CC audit guys if
> they have some thoughts regarding the change in generating audit event for
> the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> event for the parent when we are failing open due to trailing slashes in
> the pathname? Essentially we are speaking about moving:
>
>         audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
>
> from open_last_lookups() into lookup_open().

Thanks for adding the audit mailing list to the CC, Jan.  I would ask
for others to do the same when discussing changes that could impact
audit (similar requests for the LSM framework, SELinux, etc.).

The inode/path logging in audit is ... something.  I have a
longstanding todo item to go revisit the audit inode logging, both to
fix some known bugs, and see what we can improve (I'm guessing quite a
bit).  Unfortunately, there is always something else which is burning
a little bit hotter and I haven't been able to get to it yet.

The general idea with audit is that you want to record the information
both on success and failure.  It's easy to understand the success
case, as it is a record of what actually happened on the system, but
you also want to record the failure case as it can provide some
insight on what a process/user is attempting to do, and that can be
very important for certain classes of users.  I haven't dug into the
patches in Christian's tree, but in general I think Jeff's guidance
about not changing what is recorded in the audit log is probably good
advice (there will surely be exceptions to that, but it's still good
guidance).

Jeff Layton Aug. 8, 2024, 11:43 p.m. UTC | #18

On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > +{
> > > > > > +       struct dentry *dentry;
> > > > > > +
> > > > > > +       if (open_flag & O_CREAT) {
> > > > > > +               /* Don't bother on an O_EXCL create */
> > > > > > +               if (open_flag & O_EXCL)
> > > > > > +                       return NULL;
> > > > > > +
> > > > > > +               /*
> > > > > > +                * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > +                * use the dentry. For now, don't do this, since it shifts
> > > > > > +                * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > +                * Reconsider this once dentry refcounting handles heavy
> > > > > > +                * contention better.
> > > > > > +                */
> > > > > > +               if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > +                       return NULL;
> > > > > 
> > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > file was actually created or not. But the audit_inode() on the file
> > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > already existed. If we accept that then this makes the change a lot
> > > > > simpler.
> > > > > 
> > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > grabbed the inode lock someone else might already have created the file
> > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > on the parent.
> > > > > 
> > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > after we created the file. Imho, that should be fine?...
> > > > > 
> > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > for a completely untested draft of what I mean.
> > > > 
> > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > with audit in the past is that people who are using it are _very_
> > > > sensitive to changes of when records get emitted or not. I don't like
> > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > but keeping everything working exactly the same has been my MO whenever
> > > > I have to work in there.
> > > > 
> > > > If a certain access pattern suddenly generates a different set of
> > > > records (or some are missing, as would be in this case), we might get
> > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > make it simple to revert later if that becomes necessary.
> > > 
> > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > will be dropped because we may find a positive dentry but the path may
> > > have trailing slashes. At that point we just return without audit
> > > whereas before we always would've done that audit.
> > > 
> > > Honestly, we should move that audit event as right now it's just really
> > > weird and see if that works. Otherwise the change is somewhat horrible
> > > complicating the already convoluted logic even more.
> > > 
> > > So I'm appending the patches that I have on top of your patch in
> > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > that's not breaking anything completely other than later audit events?
> > 
> > The changes look good as far as I'm concerned but let me CC audit guys if
> > they have some thoughts regarding the change in generating audit event for
> > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > event for the parent when we are failing open due to trailing slashes in
> > the pathname? Essentially we are speaking about moving:
> > 
> >         audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > 
> > from open_last_lookups() into lookup_open().
> 
> Thanks for adding the audit mailing list to the CC, Jan.  I would ask
> for others to do the same when discussing changes that could impact
> audit (similar requests for the LSM framework, SELinux, etc.).
> 
> The inode/path logging in audit is ... something.  I have a
> longstanding todo item to go revisit the audit inode logging, both to
> fix some known bugs, and see what we can improve (I'm guessing quite a
> bit).  Unfortunately, there is always something else which is burning
> a little bit hotter and I haven't been able to get to it yet.
> 

It is "something" alright. The audit logging just happens at strange
and inconvenient times vs. what else we're trying to do wrt pathwalking
and such. In particular here, the fact __audit_inode can block is what
really sucks.

Since we're discussing it...

ISTM that the inode/path logging here is something like a tracepoint.
In particular, we're looking to record a specific set of information at
specific points in the code. One of the big differences between them
however is that tracepoints don't block.  The catch is that we can't
just drop messages if we run out of audit logging space, so that would
have to be handled reasonably.

I wonder if we could leverage the tracepoint infrastructure to help us
record the necessary info somehow? Copy the records into a specific
ring buffer, and then copy them out to the audit infrastructure in
task_work?

I don't have any concrete ideas here, but the path/inode audit code has
been a burden for a while now and it'd be good to think about how we
could do this better.

> The general idea with audit is that you want to record the information
> both on success and failure.  It's easy to understand the success
> case, as it is a record of what actually happened on the system, but
> you also want to record the failure case as it can provide some
> insight on what a process/user is attempting to do, and that can be
> very important for certain classes of users.  I haven't dug into the
> patches in Christian's tree, but in general I think Jeff's guidance
> about not changing what is recorded in the audit log is probably good
> advice (there will surely be exceptions to that, but it's still good
> guidance).
> 

In this particular case, the question is:

Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
file, just because O_CREAT was set? We don't emit such a record when
opening without O_CREAT set.

Paul Moore Aug. 9, 2024, 12:28 a.m. UTC | #19

On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > > +{
> > > > > > > +       struct dentry *dentry;
> > > > > > > +
> > > > > > > +       if (open_flag & O_CREAT) {
> > > > > > > +               /* Don't bother on an O_EXCL create */
> > > > > > > +               if (open_flag & O_EXCL)
> > > > > > > +                       return NULL;
> > > > > > > +
> > > > > > > +               /*
> > > > > > > +                * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > > +                * use the dentry. For now, don't do this, since it shifts
> > > > > > > +                * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > > +                * Reconsider this once dentry refcounting handles heavy
> > > > > > > +                * contention better.
> > > > > > > +                */
> > > > > > > +               if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > > +                       return NULL;
> > > > > >
> > > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > > file was actually created or not. But the audit_inode() on the file
> > > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > > already existed. If we accept that then this makes the change a lot
> > > > > > simpler.
> > > > > >
> > > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > > grabbed the inode lock someone else might already have created the file
> > > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > > on the parent.
> > > > > >
> > > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > > after we created the file. Imho, that should be fine?...
> > > > > >
> > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > for a completely untested draft of what I mean.
> > > > >
> > > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > > with audit in the past is that people who are using it are _very_
> > > > > sensitive to changes of when records get emitted or not. I don't like
> > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > > but keeping everything working exactly the same has been my MO whenever
> > > > > I have to work in there.
> > > > >
> > > > > If a certain access pattern suddenly generates a different set of
> > > > > records (or some are missing, as would be in this case), we might get
> > > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > > make it simple to revert later if that becomes necessary.
> > > >
> > > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > > will be dropped because we may find a positive dentry but the path may
> > > > have trailing slashes. At that point we just return without audit
> > > > whereas before we always would've done that audit.
> > > >
> > > > Honestly, we should move that audit event as right now it's just really
> > > > weird and see if that works. Otherwise the change is somewhat horrible
> > > > complicating the already convoluted logic even more.
> > > >
> > > > So I'm appending the patches that I have on top of your patch in
> > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > > that's not breaking anything completely other than later audit events?
> > >
> > > The changes look good as far as I'm concerned but let me CC audit guys if
> > > they have some thoughts regarding the change in generating audit event for
> > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > > event for the parent when we are failing open due to trailing slashes in
> > > the pathname? Essentially we are speaking about moving:
> > >
> > >         audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > >
> > > from open_last_lookups() into lookup_open().
> >
> > Thanks for adding the audit mailing list to the CC, Jan.  I would ask
> > for others to do the same when discussing changes that could impact
> > audit (similar requests for the LSM framework, SELinux, etc.).
> >
> > The inode/path logging in audit is ... something.  I have a
> > longstanding todo item to go revisit the audit inode logging, both to
> > fix some known bugs, and see what we can improve (I'm guessing quite a
> > bit).  Unfortunately, there is always something else which is burning
> > a little bit hotter and I haven't been able to get to it yet.
> >
>
> It is "something" alright. The audit logging just happens at strange
> and inconvenient times vs. what else we're trying to do wrt pathwalking
> and such. In particular here, the fact __audit_inode can block is what
> really sucks.
>
> Since we're discussing it...
>
> ISTM that the inode/path logging here is something like a tracepoint.
> In particular, we're looking to record a specific set of information at
> specific points in the code. One of the big differences between them
> however is that tracepoints don't block.  The catch is that we can't
> just drop messages if we run out of audit logging space, so that would
> have to be handled reasonably.

Yes, the buffer allocation is the tricky bit.  Audit does preallocate
some structs for tracking names which ideally should handle the vast
majority of the cases, but yes, we need something to handle all of the
corner cases too without having to resort to audit_panic().

> I wonder if we could leverage the tracepoint infrastructure to help us
> record the necessary info somehow? Copy the records into a specific
> ring buffer, and then copy them out to the audit infrastructure in
> task_work?

I believe using task_work will cause a number of challenges for the
audit subsystem as we try to bring everything together into a single
audit event.  We've had a lot of problems with io_uring doing similar
things, some of which are still unresolved.

> I don't have any concrete ideas here, but the path/inode audit code has
> been a burden for a while now and it'd be good to think about how we
> could do this better.

I've got some grand ideas on how to cut down on a lot of our
allocations and string generation in the critical path, not just with
the inodes, but with audit records in general.  Sadly I just haven't
had the time to get to any of it.

> > The general idea with audit is that you want to record the information
> > both on success and failure.  It's easy to understand the success
> > case, as it is a record of what actually happened on the system, but
> > you also want to record the failure case as it can provide some
> > insight on what a process/user is attempting to do, and that can be
> > very important for certain classes of users.  I haven't dug into the
> > patches in Christian's tree, but in general I think Jeff's guidance
> > about not changing what is recorded in the audit log is probably good
> > advice (there will surely be exceptions to that, but it's still good
> > guidance).
> >
>
> In this particular case, the question is:
>
> Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
> file, just because O_CREAT was set? We don't emit such a record when
> opening without O_CREAT set.

I'm not as current on the third-party security requirements as I used
to be, but I do know that oftentimes when a file is created the parent
directory is an important bit of information to have in the audit log.

Jeff Layton Aug. 9, 2024, 12:33 a.m. UTC | #20

On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote:
> On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > > > +{
> > > > > > > > +       struct dentry *dentry;
> > > > > > > > +
> > > > > > > > +       if (open_flag & O_CREAT) {
> > > > > > > > +               /* Don't bother on an O_EXCL create */
> > > > > > > > +               if (open_flag & O_EXCL)
> > > > > > > > +                       return NULL;
> > > > > > > > +
> > > > > > > > +               /*
> > > > > > > > +                * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > > > +                * use the dentry. For now, don't do this, since it shifts
> > > > > > > > +                * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > > > +                * Reconsider this once dentry refcounting handles heavy
> > > > > > > > +                * contention better.
> > > > > > > > +                */
> > > > > > > > +               if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > > > +                       return NULL;
> > > > > > > 
> > > > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > > > file was actually created or not. But the audit_inode() on the file
> > > > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > > > already existed. If we accept that then this makes the change a lot
> > > > > > > simpler.
> > > > > > > 
> > > > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > > > grabbed the inode lock someone else might already have created the file
> > > > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > > > on the parent.
> > > > > > > 
> > > > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > > > after we created the file. Imho, that should be fine?...
> > > > > > > 
> > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > > for a completely untested draft of what I mean.
> > > > > > 
> > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > > > with audit in the past is that people who are using it are _very_
> > > > > > sensitive to changes of when records get emitted or not. I don't like
> > > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > > > but keeping everything working exactly the same has been my MO whenever
> > > > > > I have to work in there.
> > > > > > 
> > > > > > If a certain access pattern suddenly generates a different set of
> > > > > > records (or some are missing, as would be in this case), we might get
> > > > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > > > make it simple to revert later if that becomes necessary.
> > > > > 
> > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > > > will be dropped because we may find a positive dentry but the path may
> > > > > have trailing slashes. At that point we just return without audit
> > > > > whereas before we always would've done that audit.
> > > > > 
> > > > > Honestly, we should move that audit event as right now it's just really
> > > > > weird and see if that works. Otherwise the change is somewhat horrible
> > > > > complicating the already convoluted logic even more.
> > > > > 
> > > > > So I'm appending the patches that I have on top of your patch in
> > > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > > > that's not breaking anything completely other than later audit events?
> > > > 
> > > > The changes look good as far as I'm concerned but let me CC audit guys if
> > > > they have some thoughts regarding the change in generating audit event for
> > > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > > > event for the parent when we are failing open due to trailing slashes in
> > > > the pathname? Essentially we are speaking about moving:
> > > > 
> > > >         audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > > > 
> > > > from open_last_lookups() into lookup_open().
> > > 
> > > Thanks for adding the audit mailing list to the CC, Jan.  I would ask
> > > for others to do the same when discussing changes that could impact
> > > audit (similar requests for the LSM framework, SELinux, etc.).
> > > 
> > > The inode/path logging in audit is ... something.  I have a
> > > longstanding todo item to go revisit the audit inode logging, both to
> > > fix some known bugs, and see what we can improve (I'm guessing quite a
> > > bit).  Unfortunately, there is always something else which is burning
> > > a little bit hotter and I haven't been able to get to it yet.
> > > 
> > 
> > It is "something" alright. The audit logging just happens at strange
> > and inconvenient times vs. what else we're trying to do wrt pathwalking
> > and such. In particular here, the fact __audit_inode can block is what
> > really sucks.
> > 
> > Since we're discussing it...
> > 
> > ISTM that the inode/path logging here is something like a tracepoint.
> > In particular, we're looking to record a specific set of information at
> > specific points in the code. One of the big differences between them
> > however is that tracepoints don't block.  The catch is that we can't
> > just drop messages if we run out of audit logging space, so that would
> > have to be handled reasonably.
> 
> Yes, the buffer allocation is the tricky bit.  Audit does preallocate
> some structs for tracking names which ideally should handle the vast
> majority of the cases, but yes, we need something to handle all of the
> corner cases too without having to resort to audit_panic().
> 
> > I wonder if we could leverage the tracepoint infrastructure to help us
> > record the necessary info somehow? Copy the records into a specific
> > ring buffer, and then copy them out to the audit infrastructure in
> > task_work?
> 
> I believe using task_work will cause a number of challenges for the
> audit subsystem as we try to bring everything together into a single
> audit event.  We've had a lot of problems with io_uring doing similar
> things, some of which are still unresolved.
> 
> > I don't have any concrete ideas here, but the path/inode audit code has
> > been a burden for a while now and it'd be good to think about how we
> > could do this better.
> 
> I've got some grand ideas on how to cut down on a lot of our
> allocations and string generation in the critical path, not just with
> the inodes, but with audit records in general.  Sadly I just haven't
> had the time to get to any of it.
> 
> > > The general idea with audit is that you want to record the information
> > > both on success and failure.  It's easy to understand the success
> > > case, as it is a record of what actually happened on the system, but
> > > you also want to record the failure case as it can provide some
> > > insight on what a process/user is attempting to do, and that can be
> > > very important for certain classes of users.  I haven't dug into the
> > > patches in Christian's tree, but in general I think Jeff's guidance
> > > about not changing what is recorded in the audit log is probably good
> > > advice (there will surely be exceptions to that, but it's still good
> > > guidance).
> > > 
> > 
> > In this particular case, the question is:
> > 
> > Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
> > file, just because O_CREAT was set? We don't emit such a record when
> > opening without O_CREAT set.
> 
> I'm not as current on the third-party security requirements as I used
> to be, but I do know that oftentimes when a file is created the parent
> directory is an important bit of information to have in the audit log.
> 

Right. We'd still have that here since we have to unlazy to actually
create the file.

The question here is about the case where O_CREAT is set, but the file
already exists. Nothing is being created in that case, so do we need to
emit an audit record for the parent?

Paul Moore Aug. 9, 2024, 1:22 a.m. UTC | #21

On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote:
> > On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > > > > +{
> > > > > > > > > +       struct dentry *dentry;
> > > > > > > > > +
> > > > > > > > > +       if (open_flag & O_CREAT) {
> > > > > > > > > +               /* Don't bother on an O_EXCL create */
> > > > > > > > > +               if (open_flag & O_EXCL)
> > > > > > > > > +                       return NULL;
> > > > > > > > > +
> > > > > > > > > +               /*
> > > > > > > > > +                * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > > > > +                * use the dentry. For now, don't do this, since it shifts
> > > > > > > > > +                * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > > > > +                * Reconsider this once dentry refcounting handles heavy
> > > > > > > > > +                * contention better.
> > > > > > > > > +                */
> > > > > > > > > +               if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > > > > +                       return NULL;
> > > > > > > >
> > > > > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > > > > file was actually created or not. But the audit_inode() on the file
> > > > > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > > > > already existed. If we accept that then this makes the change a lot
> > > > > > > > simpler.
> > > > > > > >
> > > > > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > > > > grabbed the inode lock someone else might already have created the file
> > > > > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > > > > on the parent.
> > > > > > > >
> > > > > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > > > > after we created the file. Imho, that should be fine?...
> > > > > > > >
> > > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > > > for a completely untested draft of what I mean.
> > > > > > >
> > > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > > > > with audit in the past is that people who are using it are _very_
> > > > > > > sensitive to changes of when records get emitted or not. I don't like
> > > > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > > > > but keeping everything working exactly the same has been my MO whenever
> > > > > > > I have to work in there.
> > > > > > >
> > > > > > > If a certain access pattern suddenly generates a different set of
> > > > > > > records (or some are missing, as would be in this case), we might get
> > > > > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > > > > make it simple to revert later if that becomes necessary.
> > > > > >
> > > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > > > > will be dropped because we may find a positive dentry but the path may
> > > > > > have trailing slashes. At that point we just return without audit
> > > > > > whereas before we always would've done that audit.
> > > > > >
> > > > > > Honestly, we should move that audit event as right now it's just really
> > > > > > weird and see if that works. Otherwise the change is somewhat horrible
> > > > > > complicating the already convoluted logic even more.
> > > > > >
> > > > > > So I'm appending the patches that I have on top of your patch in
> > > > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > > > > that's not breaking anything completely other than later audit events?
> > > > >
> > > > > The changes look good as far as I'm concerned but let me CC audit guys if
> > > > > they have some thoughts regarding the change in generating audit event for
> > > > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > > > > event for the parent when we are failing open due to trailing slashes in
> > > > > the pathname? Essentially we are speaking about moving:
> > > > >
> > > > >         audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > > > >
> > > > > from open_last_lookups() into lookup_open().
> > > >
> > > > Thanks for adding the audit mailing list to the CC, Jan.  I would ask
> > > > for others to do the same when discussing changes that could impact
> > > > audit (similar requests for the LSM framework, SELinux, etc.).
> > > >
> > > > The inode/path logging in audit is ... something.  I have a
> > > > longstanding todo item to go revisit the audit inode logging, both to
> > > > fix some known bugs, and see what we can improve (I'm guessing quite a
> > > > bit).  Unfortunately, there is always something else which is burning
> > > > a little bit hotter and I haven't been able to get to it yet.
> > > >
> > >
> > > It is "something" alright. The audit logging just happens at strange
> > > and inconvenient times vs. what else we're trying to do wrt pathwalking
> > > and such. In particular here, the fact __audit_inode can block is what
> > > really sucks.
> > >
> > > Since we're discussing it...
> > >
> > > ISTM that the inode/path logging here is something like a tracepoint.
> > > In particular, we're looking to record a specific set of information at
> > > specific points in the code. One of the big differences between them
> > > however is that tracepoints don't block.  The catch is that we can't
> > > just drop messages if we run out of audit logging space, so that would
> > > have to be handled reasonably.
> >
> > Yes, the buffer allocation is the tricky bit.  Audit does preallocate
> > some structs for tracking names which ideally should handle the vast
> > majority of the cases, but yes, we need something to handle all of the
> > corner cases too without having to resort to audit_panic().
> >
> > > I wonder if we could leverage the tracepoint infrastructure to help us
> > > record the necessary info somehow? Copy the records into a specific
> > > ring buffer, and then copy them out to the audit infrastructure in
> > > task_work?
> >
> > I believe using task_work will cause a number of challenges for the
> > audit subsystem as we try to bring everything together into a single
> > audit event.  We've had a lot of problems with io_uring doing similar
> > things, some of which are still unresolved.
> >
> > > I don't have any concrete ideas here, but the path/inode audit code has
> > > been a burden for a while now and it'd be good to think about how we
> > > could do this better.
> >
> > I've got some grand ideas on how to cut down on a lot of our
> > allocations and string generation in the critical path, not just with
> > the inodes, but with audit records in general.  Sadly I just haven't
> > had the time to get to any of it.
> >
> > > > The general idea with audit is that you want to record the information
> > > > both on success and failure.  It's easy to understand the success
> > > > case, as it is a record of what actually happened on the system, but
> > > > you also want to record the failure case as it can provide some
> > > > insight on what a process/user is attempting to do, and that can be
> > > > very important for certain classes of users.  I haven't dug into the
> > > > patches in Christian's tree, but in general I think Jeff's guidance
> > > > about not changing what is recorded in the audit log is probably good
> > > > advice (there will surely be exceptions to that, but it's still good
> > > > guidance).
> > > >
> > >
> > > In this particular case, the question is:
> > >
> > > Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
> > > file, just because O_CREAT was set? We don't emit such a record when
> > > opening without O_CREAT set.
> >
> > I'm not as current on the third-party security requirements as I used
> > to be, but I do know that oftentimes when a file is created the parent
> > directory is an important bit of information to have in the audit log.
> >
>
> Right. We'd still have that here since we have to unlazy to actually
> create the file.
>
> The question here is about the case where O_CREAT is set, but the file
> already exists. Nothing is being created in that case, so do we need to
> emit an audit record for the parent?

As long as the full path information is present in the existing file's
audit record it should be okay.

Jeff Layton Aug. 9, 2024, 2:21 p.m. UTC | #22

On Thu, 2024-08-08 at 21:22 -0400, Paul Moore wrote:
> On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org>
> wrote:
> > On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote:
> > > On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org>
> > > wrote:
> > > > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > > > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton
> > > > > > > wrote:
> > > > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner
> > > > > > > > wrote:
> > > > > > > > > > +static struct dentry *lookup_fast_for_open(struct
> > > > > > > > > > nameidata *nd, int open_flag)
> > > > > > > > > > +{
> > > > > > > > > > +       struct dentry *dentry;
> > > > > > > > > > +
> > > > > > > > > > +       if (open_flag & O_CREAT) {
> > > > > > > > > > +               /* Don't bother on an O_EXCL create
> > > > > > > > > > */
> > > > > > > > > > +               if (open_flag & O_EXCL)
> > > > > > > > > > +                       return NULL;
> > > > > > > > > > +
> > > > > > > > > > +               /*
> > > > > > > > > > +                * FIXME: If auditing is enabled,
> > > > > > > > > > then we'll have to unlazy to
> > > > > > > > > > +                * use the dentry. For now, don't
> > > > > > > > > > do this, since it shifts
> > > > > > > > > > +                * contention from parent's i_rwsem
> > > > > > > > > > to its d_lockref spinlock.
> > > > > > > > > > +                * Reconsider this once dentry
> > > > > > > > > > refcounting handles heavy
> > > > > > > > > > +                * contention better.
> > > > > > > > > > +                */
> > > > > > > > > > +               if ((nd->flags & LOOKUP_RCU) &&
> > > > > > > > > > !audit_dummy_context())
> > > > > > > > > > +                       return NULL;
> > > > > > > > > 
> > > > > > > > > Hm, the audit_inode() on the parent is done
> > > > > > > > > independent of whether the
> > > > > > > > > file was actually created or not. But the
> > > > > > > > > audit_inode() on the file
> > > > > > > > > itself is only done when it was actually created.
> > > > > > > > > Imho, there's no need
> > > > > > > > > to do audit_inode() on the parent when we immediately
> > > > > > > > > find that file
> > > > > > > > > already existed. If we accept that then this makes
> > > > > > > > > the change a lot
> > > > > > > > > simpler.
> > > > > > > > > 
> > > > > > > > > The inconsistency would partially remain though. When
> > > > > > > > > the file doesn't
> > > > > > > > > exist audit_inode() on the parent is called but by
> > > > > > > > > the time we've
> > > > > > > > > grabbed the inode lock someone else might already
> > > > > > > > > have created the file
> > > > > > > > > and then again we wouldn't audit_inode() on the file
> > > > > > > > > but we would have
> > > > > > > > > on the parent.
> > > > > > > > > 
> > > > > > > > > I think that's fine. But if that's bothersome the
> > > > > > > > > more aggressive thing
> > > > > > > > > to do would be to pull that audit_inode() on the
> > > > > > > > > parent further down
> > > > > > > > > after we created the file. Imho, that should be
> > > > > > > > > fine?...
> > > > > > > > > 
> > > > > > > > > See
> > > > > > > > > https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > > > > for a completely untested draft of what I mean.
> > > > > > > > 
> > > > > > > > Yeah, that's a lot simpler. That said, my experience
> > > > > > > > when I've worked
> > > > > > > > with audit in the past is that people who are using it
> > > > > > > > are _very_
> > > > > > > > sensitive to changes of when records get emitted or
> > > > > > > > not. I don't like
> > > > > > > > this, because I think the rules here are ad-hoc and
> > > > > > > > somewhat arbitrary,
> > > > > > > > but keeping everything working exactly the same has
> > > > > > > > been my MO whenever
> > > > > > > > I have to work in there.
> > > > > > > > 
> > > > > > > > If a certain access pattern suddenly generates a
> > > > > > > > different set of
> > > > > > > > records (or some are missing, as would be in this
> > > > > > > > case), we might get
> > > > > > > > bug reports about this. I'm ok with simplifying this
> > > > > > > > code in the way
> > > > > > > > you suggest, but we may want to do it in a patch on top
> > > > > > > > of mine, to
> > > > > > > > make it simple to revert later if that becomes
> > > > > > > > necessary.
> > > > > > > 
> > > > > > > Fwiw, even with the rearranged checks in v3 of the patch
> > > > > > > audit records
> > > > > > > will be dropped because we may find a positive dentry but
> > > > > > > the path may
> > > > > > > have trailing slashes. At that point we just return
> > > > > > > without audit
> > > > > > > whereas before we always would've done that audit.
> > > > > > > 
> > > > > > > Honestly, we should move that audit event as right now
> > > > > > > it's just really
> > > > > > > weird and see if that works. Otherwise the change is
> > > > > > > somewhat horrible
> > > > > > > complicating the already convoluted logic even more.
> > > > > > > 
> > > > > > > So I'm appending the patches that I have on top of your
> > > > > > > patch in
> > > > > > > vfs.misc. Can you (other as well ofc) take a look and
> > > > > > > tell me whether
> > > > > > > that's not breaking anything completely other than later
> > > > > > > audit events?
> > > > > > 
> > > > > > The changes look good as far as I'm concerned but let me CC
> > > > > > audit guys if
> > > > > > they have some thoughts regarding the change in generating
> > > > > > audit event for
> > > > > > the parent. Paul, does it matter if open(O_CREAT) doesn't
> > > > > > generate audit
> > > > > > event for the parent when we are failing open due to
> > > > > > trailing slashes in
> > > > > > the pathname? Essentially we are speaking about moving:
> > > > > > 
> > > > > >         audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > > > > > 
> > > > > > from open_last_lookups() into lookup_open().
> > > > > 
> > > > > Thanks for adding the audit mailing list to the CC, Jan.  I
> > > > > would ask
> > > > > for others to do the same when discussing changes that could
> > > > > impact
> > > > > audit (similar requests for the LSM framework, SELinux,
> > > > > etc.).
> > > > > 
> > > > > The inode/path logging in audit is ... something.  I have a
> > > > > longstanding todo item to go revisit the audit inode logging,
> > > > > both to
> > > > > fix some known bugs, and see what we can improve (I'm
> > > > > guessing quite a
> > > > > bit).  Unfortunately, there is always something else which is
> > > > > burning
> > > > > a little bit hotter and I haven't been able to get to it yet.
> > > > > 
> > > > 
> > > > It is "something" alright. The audit logging just happens at
> > > > strange
> > > > and inconvenient times vs. what else we're trying to do wrt
> > > > pathwalking
> > > > and such. In particular here, the fact __audit_inode can block
> > > > is what
> > > > really sucks.
> > > > 
> > > > Since we're discussing it...
> > > > 
> > > > ISTM that the inode/path logging here is something like a
> > > > tracepoint.
> > > > In particular, we're looking to record a specific set of
> > > > information at
> > > > specific points in the code. One of the big differences between
> > > > them
> > > > however is that tracepoints don't block.  The catch is that we
> > > > can't
> > > > just drop messages if we run out of audit logging space, so
> > > > that would
> > > > have to be handled reasonably.
> > > 
> > > Yes, the buffer allocation is the tricky bit.  Audit does
> > > preallocate
> > > some structs for tracking names which ideally should handle the
> > > vast
> > > majority of the cases, but yes, we need something to handle all
> > > of the
> > > corner cases too without having to resort to audit_panic().
> > > 
> > > > I wonder if we could leverage the tracepoint infrastructure to
> > > > help us
> > > > record the necessary info somehow? Copy the records into a
> > > > specific
> > > > ring buffer, and then copy them out to the audit infrastructure
> > > > in
> > > > task_work?
> > > 
> > > I believe using task_work will cause a number of challenges for
> > > the
> > > audit subsystem as we try to bring everything together into a
> > > single
> > > audit event.  We've had a lot of problems with io_uring doing
> > > similar
> > > things, some of which are still unresolved.
> > > 
> > > > I don't have any concrete ideas here, but the path/inode audit
> > > > code has
> > > > been a burden for a while now and it'd be good to think about
> > > > how we
> > > > could do this better.
> > > 
> > > I've got some grand ideas on how to cut down on a lot of our
> > > allocations and string generation in the critical path, not just
> > > with
> > > the inodes, but with audit records in general.  Sadly I just
> > > haven't
> > > had the time to get to any of it.
> > > 
> > > > > The general idea with audit is that you want to record the
> > > > > information
> > > > > both on success and failure.  It's easy to understand the
> > > > > success
> > > > > case, as it is a record of what actually happened on the
> > > > > system, but
> > > > > you also want to record the failure case as it can provide
> > > > > some
> > > > > insight on what a process/user is attempting to do, and that
> > > > > can be
> > > > > very important for certain classes of users.  I haven't dug
> > > > > into the
> > > > > patches in Christian's tree, but in general I think Jeff's
> > > > > guidance
> > > > > about not changing what is recorded in the audit log is
> > > > > probably good
> > > > > advice (there will surely be exceptions to that, but it's
> > > > > still good
> > > > > guidance).
> > > > > 
> > > > 
> > > > In this particular case, the question is:
> > > > 
> > > > Do we need to emit a AUDIT_INODE_PARENT record when opening an
> > > > existing
> > > > file, just because O_CREAT was set? We don't emit such a record
> > > > when
> > > > opening without O_CREAT set.
> > > 
> > > I'm not as current on the third-party security requirements as I
> > > used
> > > to be, but I do know that oftentimes when a file is created the
> > > parent
> > > directory is an important bit of information to have in the audit
> > > log.
> > > 
> > 
> > Right. We'd still have that here since we have to unlazy to
> > actually
> > create the file.
> > 
> > The question here is about the case where O_CREAT is set, but the
> > file
> > already exists. Nothing is being created in that case, so do we
> > need to
> > emit an audit record for the parent?
> 
> As long as the full path information is present in the existing
> file's
> audit record it should be okay.
> 

O_CREAT is ignored when the dentry already exists, so doing the same
thing that we do when O_CREAT isn't set seems reasonable.

We do call this in do_open, which would apply in this case:

        if (!(file->f_mode & FMODE_CREATED))
                audit_inode(nd->name, nd->path.dentry, 0);

That should have the necessary path info. If that's the case, then I
think Christian's cleanup series on top of mine should be OK. I think
that the only thing that would be missing is the AUDIT_INODE_PARENT
record for the directory in the case where the dentry already exists,
which should be superfluous.

ISTR that Red Hat has a pretty extensive testsuite for audit. We might
want to get them to run their tests on Christian's changes to be sure
there are no surprises, if they are amenable.

Paul Moore Aug. 11, 2024, 9:52 p.m. UTC | #23

On Fri, Aug 9, 2024 at 10:21 AM Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2024-08-08 at 21:22 -0400, Paul Moore wrote:
> > On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org>
> > wrote:

...

> > > The question here is about the case where O_CREAT is set, but the
> > > file
> > > already exists. Nothing is being created in that case, so do we
> > > need to
> > > emit an audit record for the parent?
> >
> > As long as the full path information is present in the existing
> > file's
> > audit record it should be okay.
>
> O_CREAT is ignored when the dentry already exists, so doing the same
> thing that we do when O_CREAT isn't set seems reasonable.
>
> We do call this in do_open, which would apply in this case:
>
>         if (!(file->f_mode & FMODE_CREATED))
>                 audit_inode(nd->name, nd->path.dentry, 0);
>
> That should have the necessary path info. If that's the case, then I
> think Christian's cleanup series on top of mine should be OK. I think
> that the only thing that would be missing is the AUDIT_INODE_PARENT
> record for the directory in the case where the dentry already exists,
> which should be superfluous.
>
> ISTR that Red Hat has a pretty extensive testsuite for audit. We might
> want to get them to run their tests on Christian's changes to be sure
> there are no surprises, if they are amenable.

I believe you are thinking of the audit-test project, which started as
a community effort to develop a test suite suitable for the popular
Common Criteria protection profiles of the time (of which auditing was
an important requirement).  Unfortunately, after a couple rounds of
certifications I couldn't get the various companies involved to
continue to maintain the public test suite anymore, so everyone went
their own way with private forks.  I have no idea if the community
project still works, but someone at IBM/RH should have a recent~ish
version that either runs on a modern system or is close to it.  FWIW,
the dead/dormant community project can be found at the link below
(yes, that is a sf.net link):

https://sourceforge.net/projects/audit-test

It's been a while since I've been at RH, so I'm not sure if my test/QA
contacts are still there, but if you don't have a contact at IBM/RH
Jeff let me know and I can try to reach out.

Al Viro Aug. 14, 2024, 2:18 a.m. UTC | #24

On Tue, Aug 06, 2024 at 03:51:35PM -0400, Jeff Layton wrote:

> > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > +{
> > +	struct dentry *dentry;
> > +
> > +	if (open_flag & O_CREAT) {
> > +		/* Don't bother on an O_EXCL create */
> > +		if (open_flag & O_EXCL)
> > +			return NULL;
> > +
> > +		/*
> > +		 * FIXME: If auditing is enabled, then we'll have to unlazy to
> > +		 * use the dentry. For now, don't do this, since it shifts
> > +		 * contention from parent's i_rwsem to its d_lockref spinlock.
> > +		 * Reconsider this once dentry refcounting handles heavy
> > +		 * contention better.
> > +		 */
> > +		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > +			return NULL;
> > +	}
> > +
> > +	if (trailing_slashes(nd))
> > +		nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> > +
> > +	dentry = lookup_fast(nd);
> 
> Self-NAK on this patch. We have to test for IS_ERR on the returned
> dentry here. I'll send a v3 along after I've retested it.

That's not the only problem; your "is it negative" test is inherently
racy in RCU mode.  IOW, what is positive at the time you get here can
bloody well go negative immediately afterwards.  Hit that with
O_CREAT and you've got a bogus ENOENT...

Al Viro Aug. 14, 2024, 2:40 a.m. UTC | #25

On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote:

> That's not the only problem; your "is it negative" test is inherently
> racy in RCU mode.  IOW, what is positive at the time you get here can
> bloody well go negative immediately afterwards.  Hit that with
> O_CREAT and you've got a bogus ENOENT...

Hmm...  OTOH, in that case you end up in step_into(), which will do the
right thing...

	How well does that series survive NFS client regression tests?
That's where I'd expect potentially subtle shite, what with short-circuited
->d_revalidate() on the final pathwalk step in open()...

Jeff Layton Aug. 14, 2024, 11:48 a.m. UTC | #26

On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote:
> On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote:
> 
> > That's not the only problem; your "is it negative" test is inherently
> > racy in RCU mode.  IOW, what is positive at the time you get here can
> > bloody well go negative immediately afterwards.  Hit that with
> > O_CREAT and you've got a bogus ENOENT...
> 
> Hmm...  OTOH, in that case you end up in step_into(), which will do the
> right thing...
> 
> 	How well does that series survive NFS client regression tests?
> That's where I'd expect potentially subtle shite, what with short-circuited
> ->d_revalidate() on the final pathwalk step in open()...

Christian took in my v3 patch which is a bit different from this one.
It seems to be doing fine in testing with NFS and otherwise.

I don't think we short-circuit the d_revalidate though, do we? That
version calls lookup_fast on the last component which should
d_revalidate the last dentry before returning it.

Christian Brauner Aug. 14, 2024, 12:40 p.m. UTC | #27

> Christian took in my v3 patch which is a bit different from this one.
> It seems to be doing fine in testing with NFS and otherwise.

Every branch gets tested with nfs fstests (in addition to the usual
suspects):

Failures: generic/732
Failed 1 of 600 tests

And that just fails because it's missing your 4fd042e0465c
("generic/732: don't run it on NFS")

Al Viro Aug. 14, 2024, 3:42 p.m. UTC | #28

On Wed, Aug 14, 2024 at 07:48:17AM -0400, Jeff Layton wrote:
> On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote:
> > On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote:
> > 
> > > That's not the only problem; your "is it negative" test is inherently
> > > racy in RCU mode.  IOW, what is positive at the time you get here can
> > > bloody well go negative immediately afterwards.  Hit that with
> > > O_CREAT and you've got a bogus ENOENT...
> > 
> > Hmm...  OTOH, in that case you end up in step_into(), which will do the
> > right thing...
> > 
> > 	How well does that series survive NFS client regression tests?
> > That's where I'd expect potentially subtle shite, what with short-circuited
> > ->d_revalidate() on the final pathwalk step in open()...
> 
> Christian took in my v3 patch which is a bit different from this one.
> It seems to be doing fine in testing with NFS and otherwise.
> 
> I don't think we short-circuit the d_revalidate though, do we? That
> version calls lookup_fast on the last component which should
> d_revalidate the last dentry before returning it.

It's not about a skipped call of ->d_revalidate(); it's about the NFS
(especially NFS4) dances inside ->d_revalidate(), where it tries to
cut down on roundtrips where possible.  The interplay with ->atomic_open()
and ->open() is subtle and I'm not sure that we do not depend upon the
details of ->i_rwsem locking by fs/namei.c in there - proof of correctness
used to be rather convoluted there, especially wrt the unhashing and
rehashing aliases.

I'm not saying that your changes break things in there, but that's one
area where I would look for trouble.  NFS has fairly extensive regression
tests, and it would be a good idea to beat that patchset with those.

Al Viro Aug. 14, 2024, 3:44 p.m. UTC | #29

On Wed, Aug 14, 2024 at 02:40:23PM +0200, Christian Brauner wrote:
> > Christian took in my v3 patch which is a bit different from this one.
> > It seems to be doing fine in testing with NFS and otherwise.
> 
> Every branch gets tested with nfs fstests (in addition to the usual
> suspects):
> 
> Failures: generic/732
> Failed 1 of 600 tests
> 
> And that just fails because it's missing your 4fd042e0465c
> ("generic/732: don't run it on NFS")

connectathon would be more interesting...

Jeff Layton Aug. 14, 2024, 4:46 p.m. UTC | #30

On Wed, 2024-08-14 at 16:42 +0100, Al Viro wrote:
> On Wed, Aug 14, 2024 at 07:48:17AM -0400, Jeff Layton wrote:
> > On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote:
> > > On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote:
> > > 
> > > > That's not the only problem; your "is it negative" test is
> > > > inherently
> > > > racy in RCU mode.  IOW, what is positive at the time you get
> > > > here can
> > > > bloody well go negative immediately afterwards.  Hit that with
> > > > O_CREAT and you've got a bogus ENOENT...
> > > 
> > > Hmm...  OTOH, in that case you end up in step_into(), which will
> > > do the
> > > right thing...
> > > 
> > > 	How well does that series survive NFS client regression
> > > tests?
> > > That's where I'd expect potentially subtle shite, what with
> > > short-circuited
> > > ->d_revalidate() on the final pathwalk step in open()...
> > 
> > Christian took in my v3 patch which is a bit different from this
> > one.
> > It seems to be doing fine in testing with NFS and otherwise.
> > 
> > I don't think we short-circuit the d_revalidate though, do we? That
> > version calls lookup_fast on the last component which should
> > d_revalidate the last dentry before returning it.
> 
> It's not about a skipped call of ->d_revalidate(); it's about the NFS
> (especially NFS4) dances inside ->d_revalidate(), where it tries to
> cut down on roundtrips where possible.  The interplay with -
> >atomic_open()
> and ->open() is subtle and I'm not sure that we do not depend upon
> the
> details of ->i_rwsem locking by fs/namei.c in there - proof of
> correctness
> used to be rather convoluted there, especially wrt the unhashing and
> rehashing aliases.
> 
> I'm not saying that your changes break things in there, but that's
> one
> area where I would look for trouble.  NFS has fairly extensive
> regression
> tests, and it would be a good idea to beat that patchset with those.

I've already run a bunch of NFS tests on it and it seems to be OK so
far, but I'll keep testing it. My take:

Opening an extant file with O_CREAT set should behave the same as with
O_CREAT not set.

I did crawl through NFS's d_revalidate functions. There are a couple of
places that check for O_CREAT, but they didn't seem to depend on the
i_rwsem or any particular locking.

Please do let me know if you see anything I missed though.

Thanks,

Mateusz Guzik Aug. 15, 2024, 3:07 p.m. UTC | #31

On Tue, Aug 6, 2024 at 10:47 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Before I get to the vfs layer, there is a significant loss in the
> > memory allocator because of memcg -- it takes several irq off/on trips
> > for every alloc (needed to grab struct file *). I have a plan what to
> > do with it (handle stuff with local cmpxchg (note no lock prefix)),
> > which I'm trying to get around to. Apart from that you may note the
> > allocator fast path performs a 16-byte cmpxchg, which is again dog
> > slow and executes twice (once for the file obj, another time for the
> > namei buffer). Someone(tm) should patch it up and I have some vague
> > ideas, but 0 idea when I can take a serious stab.
>
> I just LBR sampled it on my skylake and it doesn't look
> particularly slow. You see the whole massive block including CMPXCHG16
> gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely
> a missing cache line.
>
>     kmem_cache_free:
>         ffffffff9944ce20                        nop %edi, %edx
>         ffffffff9944ce24                        nopl  %eax, (%rax,%rax,1)
>         ffffffff9944ce29                        pushq  %rbp
>         ffffffff9944ce2a                        mov %rdi, %rdx
>         ffffffff9944ce2d                        mov %rsp, %rbp
>         ffffffff9944ce30                        pushq  %r15
>         ffffffff9944ce32                        pushq  %r14
>         ffffffff9944ce34                        pushq  %r13
>         ffffffff9944ce36                        pushq  %r12
>         ffffffff9944ce38                        mov $0x80000000, %r12d
>         ffffffff9944ce3e                        pushq  %rbx
>         ffffffff9944ce3f                        mov %rsi, %rbx
>         ffffffff9944ce42                        and $0xfffffffffffffff0, %rsp
>         ffffffff9944ce46                        sub $0x10, %rsp
>         ffffffff9944ce4a                        movq  %gs:0x28, %rax
>         ffffffff9944ce53                        movq  %rax, 0x8(%rsp)
>         ffffffff9944ce58                        xor %eax, %eax
>         ffffffff9944ce5a                        add %rsi, %r12
>         ffffffff9944ce5d                        jb 0xffffffff9944d1ea
>         ffffffff9944ce63                        mov $0xffffffff80000000, %rax
>         ffffffff9944ce6a                        xor %r13d, %r13d
>         ffffffff9944ce6d                        subq  0x17b068c(%rip), %rax
>         ffffffff9944ce74                        add %r12, %rax
>         ffffffff9944ce77                        shr $0xc, %rax
>         ffffffff9944ce7b                        shl $0x6, %rax
>         ffffffff9944ce7f                        addq  0x17b066a(%rip), %rax
>         ffffffff9944ce86                        movq  0x8(%rax), %rcx
>         ffffffff9944ce8a                        test $0x1, %cl
>         ffffffff9944ce8d                        jnz 0xffffffff9944d15c
>         ffffffff9944ce93                        nopl  %eax, (%rax,%rax,1)
>         ffffffff9944ce98                        movq  (%rax), %rcx
>         ffffffff9944ce9b                        and $0x8, %ch
>         ffffffff9944ce9e                        jz 0xffffffff9944cfea
>         ffffffff9944cea4                        test %rax, %rax
>         ffffffff9944cea7                        jz 0xffffffff9944cfea
>         ffffffff9944cead                        movq  0x8(%rax), %r14
>         ffffffff9944ceb1                        test %r14, %r14
>         ffffffff9944ceb4                        jz 0xffffffff9944cfac
>         ffffffff9944ceba                        cmp %r14, %rdx
>         ffffffff9944cebd                        jnz 0xffffffff9944d165
>         ffffffff9944cec3                        test %r14, %r14
>         ffffffff9944cec6                        jz 0xffffffff9944cfac
>         ffffffff9944cecc                        movq  0x8(%rbp), %r15
>         ffffffff9944ced0                        nopl  %eax, (%rax,%rax,1)
>         ffffffff9944ced5                        movq  0x1fe5134(%rip), %rax
>         ffffffff9944cedc                        test %r13, %r13
>         ffffffff9944cedf                        jnz 0xffffffff9944ceef
>         ffffffff9944cee1                        mov $0xffffffff80000000, %rax
>         ffffffff9944cee8                        subq  0x17b0611(%rip), %rax
>         ffffffff9944ceef                        add %rax, %r12
>         ffffffff9944cef2                        shr $0xc, %r12
>         ffffffff9944cef6                        shl $0x6, %r12
>         ffffffff9944cefa                        addq  0x17b05ef(%rip), %r12
>         ffffffff9944cf01                        movq  0x8(%r12), %rax
>         ffffffff9944cf06                        mov %r12, %r13
>         ffffffff9944cf09                        test $0x1, %al
>         ffffffff9944cf0b                        jnz 0xffffffff9944d1b1
>         ffffffff9944cf11                        nopl  %eax, (%rax,%rax,1)
>         ffffffff9944cf16                        movq  (%r13), %rax
>         ffffffff9944cf1a                        movq  %rbx, (%rsp)
>         ffffffff9944cf1e                        test $0x8, %ah
>         ffffffff9944cf21                        mov $0x0, %eax
>         ffffffff9944cf26                        cmovz %rax, %r13
>         ffffffff9944cf2a                        data16 nop
>         ffffffff9944cf2c                        movq  0x38(%r13), %r8
>         ffffffff9944cf30                        cmp $0x3, %r8
>         ffffffff9944cf34                        jnbe 0xffffffff9944d1ca
>         ffffffff9944cf3a                        nopl  %eax, (%rax,%rax,1)
>         ffffffff9944cf3f                        movq  0x23d6f72(%rip), %rax
>         ffffffff9944cf46                        mov %rbx, %rdx
>         ffffffff9944cf49                        sub %rax, %rdx
>         ffffffff9944cf4c                        cmp $0x1fffff, %rdx
>         ffffffff9944cf53                        jbe 0xffffffff9944d03a
>         ffffffff9944cf59                        movq  (%r14), %rax
>         ffffffff9944cf5c                        addq  %gs:0x66bccab4(%rip), %rax
>         ffffffff9944cf64                        movq  0x8(%rax), %rdx
>         ffffffff9944cf68                        cmpq  %r13, 0x10(%rax)
>         ffffffff9944cf6c                        jnz 0xffffffff9944d192
>         ffffffff9944cf72                        movl  0x28(%r14), %ecx
>         ffffffff9944cf76                        movq  (%rax), %rax
>         ffffffff9944cf79                        add %rbx, %rcx
>         ffffffff9944cf7c                        cmp %rbx, %rax
>         ffffffff9944cf7f                        jz 0xffffffff9944d1ba
>         ffffffff9944cf85                        movq  0xb8(%r14), %rsi
>         ffffffff9944cf8c                        mov %rcx, %rdi
>         ffffffff9944cf8f                        bswap %rdi
>         ffffffff9944cf92                        xor %rax, %rsi
>         ffffffff9944cf95                        xor %rdi, %rsi
>         ffffffff9944cf98                        movq  %rsi, (%rcx)
>         ffffffff9944cf9b                        leaq  0x2000(%rdx), %rcx
>         ffffffff9944cfa2                        movq  (%r14), %rsi
>         ffffffff9944cfa5                        cmpxchg16bx  %gs:(%rsi)
>         ffffffff9944cfaa                        jnz 0xffffffff9944cf59
>         ffffffff9944cfac                        movq  0x8(%rsp), %rax
>         ffffffff9944cfb1                        subq  %gs:0x28, %rax
>         ffffffff9944cfba                        jnz 0xffffffff9944d1fc
>         ffffffff9944cfc0                        leaq  -0x28(%rbp), %rsp
>         ffffffff9944cfc4                        popq  %rbx
>         ffffffff9944cfc5                        popq  %r12
>         ffffffff9944cfc7                        popq  %r13
>         ffffffff9944cfc9                        popq  %r14
>         ffffffff9944cfcb                        popq  %r15
>         ffffffff9944cfcd                        popq  %rbp
>         ffffffff9944cfce                        retq                            # PRED 38 cycles [126] 2.74 IPC    <-------------

Sorry for late reply, my test box was temporarily unavailable and then
I forgot about this e-mail :)

I don't have a good scientific test(tm) and I don't think coming up
with one is warranted at the moment.

But to illustrate, I slapped together a test case for will-it-scale
where I either cmpxchg8 or 16 in a loop. No lock prefix on these.

On Sapphire Rapids I see well over twice the throughput for the 8-byte variant:

# ./cmpxchg8_processes
warmup
min:481465497 max:481465497 total:481465497
min:464439645 max:464439645 total:464439645
min:461884735 max:461884735 total:461884735
min:460850043 max:460850043 total:460850043
min:461066452 max:461066452 total:461066452
min:463984473 max:463984473 total:463984473
measurement
min:461317703 max:461317703 total:461317703
min:458608942 max:458608942 total:458608942
min:460846336 max:460846336 total:460846336
[snip]

# ./cmpxchg16b_processes
warmup
min:205207128 max:205207128 total:205207128
min:205010535 max:205010535 total:205010535
min:204877781 max:204877781 total:204877781
min:204163814 max:204163814 total:204163814
min:204392000 max:204392000 total:204392000
min:204094222 max:204094222 total:204094222
measurement
min:204243282 max:204243282 total:204243282
min:204136589 max:204136589 total:204136589
min:203504119 max:203504119 total:203504119

So I would say trying it out in a real alloc is worth looking at.

Of course the 16-byte variant is not used just for kicks, so going to
8 bytes is more involved than just replacing the instruction.

The current code follows the standard idea on how to deal with the ABA
problem -- apart from replacing a pointer you validate this is what
you thought by checking the counter in the same instruction.

I note that in the kernel we can do better, but I don't have have all
kinks worked out yet. The core idea builds on the fact that we can
cheaply detect a pending alloc on the same cpu and should a
conflicting free be executing from an interrupt, it can instead add
the returning buffer to a different list and the aba problem
disappears. Should the alloc fast path fail to find a free buffer, it
can disable interrupts an take a look at the fallback list.

Christian Brauner Aug. 16, 2024, 8:34 a.m. UTC | #32

On Wed, Aug 14, 2024 at 04:44:19PM GMT, Al Viro wrote:
> On Wed, Aug 14, 2024 at 02:40:23PM +0200, Christian Brauner wrote:
> > > Christian took in my v3 patch which is a bit different from this one.
> > > It seems to be doing fine in testing with NFS and otherwise.
> > 
> > Every branch gets tested with nfs fstests (in addition to the usual
> > suspects):
> > 
> > Failures: generic/732
> > Failed 1 of 600 tests
> > 
> > And that just fails because it's missing your 4fd042e0465c
> > ("generic/732: don't run it on NFS")
> 
> connectathon would be more interesting...

Fwiw, I wasted almost a day to get more testing done here. But that
connectathon thing at [1] just failed to compile on anything reasonably
new and with errors that indicate that certain apis it uses have been
deprecated for a very long time.

In any case, I went out of my way and found that LTP has stresstests for
NFS for creation, open, mkdir, unlink etc and I added them to testing
now.

[1]: https://github.com/dkruchinin/cthon-nfs-tests

[v2] fs: try an opportunistic lookup for O_CREAT opens too

Commit Message

Comments

Patch