Message ID | 20240806-openfast-v2-1-42da45981811@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] fs: try an opportunistic lookup for O_CREAT opens too | expand |
On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org> wrote: > > Today, when opening a file we'll typically do a fast lookup, but if > O_CREAT is set, the kernel always takes the exclusive inode lock. I > assume this was done with the expectation that O_CREAT means that we > always expect to do the create, but that's often not the case. Many > programs set O_CREAT even in scenarios where the file already exists. > > This patch rearranges the pathwalk-for-open code to also attempt a > fast_lookup in certain O_CREAT cases. If a positive dentry is found, the > inode_lock can be avoided altogether, and if auditing isn't enabled, it > can stay in rcuwalk mode for the last step_into. > > One notable exception that is hopefully temporary: if we're doing an > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the > dentry in that case is more expensive than taking the i_rwsem for now. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > Here's a revised patch that does a fast_lookup in the O_CREAT codepath > too. The main difference here is that if a positive dentry is found and > audit_dummy_context is true, then we keep the walk lazy for the last > component, which avoids having to take any locks on the parent (just > like with non-O_CREAT opens). > > The testcase below runs in about 18s on v6.10 (on an 80 CPU machine). > With this patch, it runs in about 1s: > I don't have an opinion on the patch. If your kernel does not use apparmor and the patch manages to dodge refing the parent, then indeed this should be fully deserialized just like non-creat opens. Instead of the hand-rolled benchmark may I interest you in using will-it-scale instead? Notably it reports the achieved rate once per second, so you can check if there is funky business going on between reruns, gives the cpu the time to kick off turbo boost if applicable etc. I would bench with that myself, but I temporarily don't have handy access to bigger hw. Even so, the below is completely optional and perhaps more of a suggestion for the future :) I hacked up the test case based on tests/open1.c. git clone https://github.com/antonblanchard/will-it-scale.git For example plop into tests/opencreate1.c && gmake && ./opencreate1_processes -t 70: #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <assert.h> #include <string.h> char *testcase_description = "Separate file open/close + O_CREAT"; #define template "/tmp/willitscale.XXXXXX" static char (*tmpfiles)[sizeof(template)]; static unsigned long local_nr_tasks; void testcase_prepare(unsigned long nr_tasks) { int i; tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template) * nr_tasks); assert(tmpfiles); for (i = 0; i < nr_tasks; i++) { strcpy(tmpfiles[i], template); char *tmpfile = tmpfiles[i]; int fd = mkstemp(tmpfile); assert(fd >= 0); close(fd); } local_nr_tasks = nr_tasks; } void testcase(unsigned long long *iterations, unsigned long nr) { char *tmpfile = tmpfiles[nr]; while (1) { int fd = open(tmpfile, O_RDWR | O_CREAT, 0600); assert(fd >= 0); close(fd); (*iterations)++; } } void testcase_cleanup(void) { int i; for (i = 0; i < local_nr_tasks; i++) { unlink(tmpfiles[i]); } free(tmpfiles); } > --- > fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------- > 1 file changed, 53 insertions(+), 9 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 1e05a0f3f04d..2d716fb114c9 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > return ERR_PTR(error); > } > > +static inline bool trailing_slashes(struct nameidata *nd) > +{ > + return (bool)nd->last.name[nd->last.len]; > +} > + > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > +{ > + struct dentry *dentry; > + > + if (open_flag & O_CREAT) { > + /* Don't bother on an O_EXCL create */ > + if (open_flag & O_EXCL) > + return NULL; > + > + /* > + * FIXME: If auditing is enabled, then we'll have to unlazy to > + * use the dentry. For now, don't do this, since it shifts > + * contention from parent's i_rwsem to its d_lockref spinlock. > + * Reconsider this once dentry refcounting handles heavy > + * contention better. > + */ > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > + return NULL; > + } > + > + if (trailing_slashes(nd)) > + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > + > + dentry = lookup_fast(nd); > + > + if (open_flag & O_CREAT) { > + /* Discard negative dentries. Need inode_lock to do the create */ > + if (dentry && !dentry->d_inode) { > + if (!(nd->flags & LOOKUP_RCU)) > + dput(dentry); > + dentry = NULL; > + } > + } > + return dentry; > +} > + > static const char *open_last_lookups(struct nameidata *nd, > struct file *file, const struct open_flags *op) > { > @@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd, > return handle_dots(nd, nd->last_type); > } > > + /* We _can_ be in RCU mode here */ > + dentry = lookup_fast_for_open(nd, open_flag); > + if (IS_ERR(dentry)) > + return ERR_CAST(dentry); > + > if (!(open_flag & O_CREAT)) { > - if (nd->last.name[nd->last.len]) > - nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > - /* we _can_ be in RCU mode here */ > - dentry = lookup_fast(nd); > - if (IS_ERR(dentry)) > - return ERR_CAST(dentry); > if (likely(dentry)) > goto finish_lookup; > > if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU)) > return ERR_PTR(-ECHILD); > } else { > - /* create side of things */ > if (nd->flags & LOOKUP_RCU) { > + /* can stay in rcuwalk if not auditing */ > + if (dentry && audit_dummy_context()) > + goto check_slashes; > if (!try_to_unlazy(nd)) > return ERR_PTR(-ECHILD); > } > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > - /* trailing slashes? */ > - if (unlikely(nd->last.name[nd->last.len])) > +check_slashes: > + if (trailing_slashes(nd)) > return ERR_PTR(-EISDIR); > + if (dentry) > + goto finish_lookup; > } > > if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { > > --- > base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd > change-id: 20240723-openfast-ac49a7b6ade2 > > Best regards, > -- > Jeff Layton <jlayton@kernel.org> >
On Tue, 2024-08-06 at 17:25 +0200, Mateusz Guzik wrote: > On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org> > wrote: > > > > Today, when opening a file we'll typically do a fast lookup, but if > > O_CREAT is set, the kernel always takes the exclusive inode lock. I > > assume this was done with the expectation that O_CREAT means that > > we > > always expect to do the create, but that's often not the case. Many > > programs set O_CREAT even in scenarios where the file already > > exists. > > > > This patch rearranges the pathwalk-for-open code to also attempt a > > fast_lookup in certain O_CREAT cases. If a positive dentry is > > found, the > > inode_lock can be avoided altogether, and if auditing isn't > > enabled, it > > can stay in rcuwalk mode for the last step_into. > > > > One notable exception that is hopefully temporary: if we're doing > > an > > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing > > the > > dentry in that case is more expensive than taking the i_rwsem for > > now. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > Here's a revised patch that does a fast_lookup in the O_CREAT > > codepath > > too. The main difference here is that if a positive dentry is found > > and > > audit_dummy_context is true, then we keep the walk lazy for the > > last > > component, which avoids having to take any locks on the parent > > (just > > like with non-O_CREAT opens). > > > > The testcase below runs in about 18s on v6.10 (on an 80 CPU > > machine). > > With this patch, it runs in about 1s: > > > > I don't have an opinion on the patch. > > If your kernel does not use apparmor and the patch manages to dodge > refing the parent, then indeed this should be fully deserialized just > like non-creat opens. > Yep. Pity that auditing will slow things down, but them's the breaks... > Instead of the hand-rolled benchmark may I interest you in using > will-it-scale instead? Notably it reports the achieved rate once per > second, so you can check if there is funky business going on between > reruns, gives the cpu the time to kick off turbo boost if applicable > etc. > > I would bench with that myself, but I temporarily don't have handy > access to bigger hw. Even so, the below is completely optional and > perhaps more of a suggestion for the future :) > > I hacked up the test case based on tests/open1.c. > > git clone https://github.com/antonblanchard/will-it-scale.git > > For example plop into tests/opencreate1.c && gmake && > ./opencreate1_processes -t 70: > > #include <stdlib.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <assert.h> > #include <string.h> > > char *testcase_description = "Separate file open/close + O_CREAT"; > > #define template "/tmp/willitscale.XXXXXX" > static char (*tmpfiles)[sizeof(template)]; > static unsigned long local_nr_tasks; > > void testcase_prepare(unsigned long nr_tasks) > { > int i; > tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template) > * nr_tasks); > assert(tmpfiles); > > for (i = 0; i < nr_tasks; i++) { > strcpy(tmpfiles[i], template); > char *tmpfile = tmpfiles[i]; > int fd = mkstemp(tmpfile); > > assert(fd >= 0); > close(fd); > } > > local_nr_tasks = nr_tasks; > } > > void testcase(unsigned long long *iterations, unsigned long nr) > { > char *tmpfile = tmpfiles[nr]; > > while (1) { > int fd = open(tmpfile, O_RDWR | O_CREAT, 0600); > assert(fd >= 0); > close(fd); > > (*iterations)++; > } > } > > void testcase_cleanup(void) > { > int i; > for (i = 0; i < local_nr_tasks; i++) { > unlink(tmpfiles[i]); > } > free(tmpfiles); > } > > Good suggestion and thanks for the testcase. With v6.10 kernel, I'm seeing numbers like this at -t 70: min:4873 max:11510 total:418915 min:4884 max:10598 total:408848 min:3704 max:12371 total:467658 min:2842 max:11792 total:418239 min:2966 max:11511 total:414144 min:4756 max:11381 total:413137 min:4557 max:10789 total:404628 min:4780 max:11125 total:413349 min:4757 max:11156 total:405963 ...with the patched kernel, things are significantly faster: min:265865 max:508909 total:21464553 min:263252 max:500084 total:21242190 min:263989 max:504929 total:21396968 min:263343 max:505852 total:21346829 min:263023 max:507303 total:21410217 min:263420 max:506593 total:21426307 min:259556 max:494529 total:20927169 min:264451 max:508967 total:21433676 min:263486 max:509460 total:21399874 min:263906 max:507400 total:21393015 I can get some fancier plots if anyone is interested, but the benefit of this patch seems pretty clear.
On Tue, Aug 6, 2024 at 6:17 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Tue, 2024-08-06 at 17:25 +0200, Mateusz Guzik wrote: > > On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@kernel.org> > > wrote: > > > > > > Today, when opening a file we'll typically do a fast lookup, but if > > > O_CREAT is set, the kernel always takes the exclusive inode lock. I > > > assume this was done with the expectation that O_CREAT means that > > > we > > > always expect to do the create, but that's often not the case. Many > > > programs set O_CREAT even in scenarios where the file already > > > exists. > > > > > > This patch rearranges the pathwalk-for-open code to also attempt a > > > fast_lookup in certain O_CREAT cases. If a positive dentry is > > > found, the > > > inode_lock can be avoided altogether, and if auditing isn't > > > enabled, it > > > can stay in rcuwalk mode for the last step_into. > > > > > > One notable exception that is hopefully temporary: if we're doing > > > an > > > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing > > > the > > > dentry in that case is more expensive than taking the i_rwsem for > > > now. > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > --- > > > Here's a revised patch that does a fast_lookup in the O_CREAT > > > codepath > > > too. The main difference here is that if a positive dentry is found > > > and > > > audit_dummy_context is true, then we keep the walk lazy for the > > > last > > > component, which avoids having to take any locks on the parent > > > (just > > > like with non-O_CREAT opens). > > > > > > The testcase below runs in about 18s on v6.10 (on an 80 CPU > > > machine). > > > With this patch, it runs in about 1s: > > > > > > > I don't have an opinion on the patch. > > > > If your kernel does not use apparmor and the patch manages to dodge > > refing the parent, then indeed this should be fully deserialized just > > like non-creat opens. > > > > Yep. Pity that auditing will slow things down, but them's the breaks... > > > Instead of the hand-rolled benchmark may I interest you in using > > will-it-scale instead? Notably it reports the achieved rate once per > > second, so you can check if there is funky business going on between > > reruns, gives the cpu the time to kick off turbo boost if applicable > > etc. > > > > I would bench with that myself, but I temporarily don't have handy > > access to bigger hw. Even so, the below is completely optional and > > perhaps more of a suggestion for the future :) > > > > I hacked up the test case based on tests/open1.c. > > > > git clone https://github.com/antonblanchard/will-it-scale.git > > > > For example plop into tests/opencreate1.c && gmake && > > ./opencreate1_processes -t 70: > > > > #include <stdlib.h> > > #include <unistd.h> > > #include <sys/types.h> > > #include <sys/stat.h> > > #include <fcntl.h> > > #include <assert.h> > > #include <string.h> > > > > char *testcase_description = "Separate file open/close + O_CREAT"; > > > > #define template "/tmp/willitscale.XXXXXX" > > static char (*tmpfiles)[sizeof(template)]; > > static unsigned long local_nr_tasks; > > > > void testcase_prepare(unsigned long nr_tasks) > > { > > int i; > > tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template) > > * nr_tasks); > > assert(tmpfiles); > > > > for (i = 0; i < nr_tasks; i++) { > > strcpy(tmpfiles[i], template); > > char *tmpfile = tmpfiles[i]; > > int fd = mkstemp(tmpfile); > > > > assert(fd >= 0); > > close(fd); > > } > > > > local_nr_tasks = nr_tasks; > > } > > > > void testcase(unsigned long long *iterations, unsigned long nr) > > { > > char *tmpfile = tmpfiles[nr]; > > > > while (1) { > > int fd = open(tmpfile, O_RDWR | O_CREAT, 0600); > > assert(fd >= 0); > > close(fd); > > > > (*iterations)++; > > } > > } > > > > void testcase_cleanup(void) > > { > > int i; > > for (i = 0; i < local_nr_tasks; i++) { > > unlink(tmpfiles[i]); > > } > > free(tmpfiles); > > } > > > > > > Good suggestion and thanks for the testcase. With v6.10 kernel, I'm > seeing numbers like this at -t 70: > > min:4873 max:11510 total:418915 > min:4884 max:10598 total:408848 > min:3704 max:12371 total:467658 > min:2842 max:11792 total:418239 > min:2966 max:11511 total:414144 > min:4756 max:11381 total:413137 > min:4557 max:10789 total:404628 > min:4780 max:11125 total:413349 > min:4757 max:11156 total:405963 > > ...with the patched kernel, things are significantly faster: > > min:265865 max:508909 total:21464553 > min:263252 max:500084 total:21242190 > min:263989 max:504929 total:21396968 > min:263343 max:505852 total:21346829 > min:263023 max:507303 total:21410217 > min:263420 max:506593 total:21426307 > min:259556 max:494529 total:20927169 > min:264451 max:508967 total:21433676 > min:263486 max:509460 total:21399874 > min:263906 max:507400 total:21393015 > > I can get some fancier plots if anyone is interested, but the benefit > of this patch seems pretty clear. In the commit message I would replace that handrolled bench thing with +/-: <quote> 70-way open(.., O_RDWR | O_CREAT) calls on already existing files, one per-worker: before: 467658 after: 21464553 (+4589%) </quote> I guess it would make sense to check if unlink1_processes is doing fine if it's not too much hassle. overall pretty decent win :P
Mateusz Guzik <mjguzik@gmail.com> writes: > > I would bench with that myself, but I temporarily don't have handy > access to bigger hw. Even so, the below is completely optional and > perhaps more of a suggestion for the future :) > > I hacked up the test case based on tests/open1.c. Don't you need two test cases? One where the file exists and one where it doesn't. Because the "doesn't exist" will likely be slower than before because it will do the lookups twice, and it will likely even slow single threaded. I assume the penalty will also depend on the number of entries in the path. That all seem to be an important considerations in judging the benefits of the patch. -Andi
On Tue, Aug 6, 2024 at 9:11 PM Andi Kleen <ak@linux.intel.com> wrote: > > Mateusz Guzik <mjguzik@gmail.com> writes: > > > > I would bench with that myself, but I temporarily don't have handy > > access to bigger hw. Even so, the below is completely optional and > > perhaps more of a suggestion for the future :) > > > > I hacked up the test case based on tests/open1.c. > > Don't you need two test cases? One where the file exists and one > where it doesn't. Because the "doesn't exist" will likely be slower > than before because it will do the lookups twice, > and it will likely even slow single threaded. > > I assume the penalty will also depend on the number of entries > in the path. > > That all seem to be an important considerations in judging the benefits > of the patch. > This is why I suggested separately running "unlink1" which is guaranteed to create a file every time -- all iterations will fail the proposed fast path. Unless you meant a mixed variant where only some of the threads create files. Perhaps worthwhile to add, not hard to do (one can switch the mode based on passed worker number).
On Tue, 2024-08-06 at 12:11 -0700, Andi Kleen wrote: > Mateusz Guzik <mjguzik@gmail.com> writes: > > > > I would bench with that myself, but I temporarily don't have handy > > access to bigger hw. Even so, the below is completely optional and > > perhaps more of a suggestion for the future :) > > > > I hacked up the test case based on tests/open1.c. > > Don't you need two test cases? One where the file exists and one > where it doesn't. Because the "doesn't exist" will likely be slower > than before because it will do the lookups twice, > and it will likely even slow single threaded. > > I assume the penalty will also depend on the number of entries > in the path. > > That all seem to be an important considerations in judging the benefits > of the patch. > Definitely. FWIW, I did test a single threaded (bespoke) test case that did a bunch of O_CREAT opens, closes and then unlinks. I didn't measure any discernable difference with this patch. My conclusion from that was that the extra lockless lookup should be cheap. That said, this could show a difference if you have rather long hash chains that need to be walked completely, and you have to actually do the create every time. In practice though, time spent under the inode_lock and doing the create tends to dominate in that case, so I *think* this should still be worthwhile. I'll plan to add a test like that to will_it_scale unless Mateusz beats me to it. A long soak in linux-next is probably also justified with this patch.
On Tue, 2024-08-06 at 10:32 -0400, Jeff Layton wrote: > Today, when opening a file we'll typically do a fast lookup, but if > O_CREAT is set, the kernel always takes the exclusive inode lock. I > assume this was done with the expectation that O_CREAT means that we > always expect to do the create, but that's often not the case. Many > programs set O_CREAT even in scenarios where the file already exists. > > This patch rearranges the pathwalk-for-open code to also attempt a > fast_lookup in certain O_CREAT cases. If a positive dentry is found, the > inode_lock can be avoided altogether, and if auditing isn't enabled, it > can stay in rcuwalk mode for the last step_into. > > One notable exception that is hopefully temporary: if we're doing an > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the > dentry in that case is more expensive than taking the i_rwsem for now. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > Here's a revised patch that does a fast_lookup in the O_CREAT codepath > too. The main difference here is that if a positive dentry is found and > audit_dummy_context is true, then we keep the walk lazy for the last > component, which avoids having to take any locks on the parent (just > like with non-O_CREAT opens). > > The testcase below runs in about 18s on v6.10 (on an 80 CPU machine). > With this patch, it runs in about 1s: > > #define _GNU_SOURCE 1 > #include <stdio.h> > #include <unistd.h> > #include <errno.h> > #include <fcntl.h> > #include <stdlib.h> > #include <sys/wait.h> > > #define PROCS 70 > #define LOOPS 500000 > > static int openloop(int tnum) > { > char *file; > int i, ret; > > ret = asprintf(&file, "./testfile%d", tnum); > if (ret < 0) { > printf("asprintf failed for proc %d", tnum); > return 1; > } > > for (i = 0; i < LOOPS; ++i) { > int fd = open(file, O_RDWR|O_CREAT, 0644); > > if (fd < 0) { > perror("open"); > return 1; > } > close(fd); > } > unlink(file); > free(file); > return 0; > } > > int main(int argc, char **argv) { > pid_t kids[PROCS]; > int i, ret = 0; > > for (i = 0; i < PROCS; ++i) { > kids[i] = fork(); > if (kids[i] > 0) > return openloop(i); > if (kids[i] < 0) > perror("fork"); > } > > for (i = 0; i < PROCS; ++i) { > int ret2; > > if (kids[i] > 0) { > wait(&ret2); > if (ret2 != 0) > ret = ret2; > } > } > return ret; > } > --- > Changes in v2: > - drop the lockref patch since Mateusz is working on a better approach > - add trailing_slashes helper function > - add a lookup_fast_for_open helper function > - make lookup_fast_for_open skip the lookup if auditing is enabled > - if we find a positive dentry and auditing is disabled, don't unlazy > - Link to v1: https://lore.kernel.org/r/20240802-openfast-v1-0-a1cff2a33063@kernel.org > --- > fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------- > 1 file changed, 53 insertions(+), 9 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 1e05a0f3f04d..2d716fb114c9 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > return ERR_PTR(error); > } > > +static inline bool trailing_slashes(struct nameidata *nd) > +{ > + return (bool)nd->last.name[nd->last.len]; > +} > + > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > +{ > + struct dentry *dentry; > + > + if (open_flag & O_CREAT) { > + /* Don't bother on an O_EXCL create */ > + if (open_flag & O_EXCL) > + return NULL; > + > + /* > + * FIXME: If auditing is enabled, then we'll have to unlazy to > + * use the dentry. For now, don't do this, since it shifts > + * contention from parent's i_rwsem to its d_lockref spinlock. > + * Reconsider this once dentry refcounting handles heavy > + * contention better. > + */ > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > + return NULL; > + } > + > + if (trailing_slashes(nd)) > + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > + > + dentry = lookup_fast(nd); Self-NAK on this patch. We have to test for IS_ERR on the returned dentry here. I'll send a v3 along after I've retested it. > + > + if (open_flag & O_CREAT) { > + /* Discard negative dentries. Need inode_lock to do the create */ > + if (dentry && !dentry->d_inode) { > + if (!(nd->flags & LOOKUP_RCU)) > + dput(dentry); > + dentry = NULL; > + } > + } > + return dentry; > +} > + > static const char *open_last_lookups(struct nameidata *nd, > struct file *file, const struct open_flags *op) > { > @@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd, > return handle_dots(nd, nd->last_type); > } > > + /* We _can_ be in RCU mode here */ > + dentry = lookup_fast_for_open(nd, open_flag); > + if (IS_ERR(dentry)) > + return ERR_CAST(dentry); > + > if (!(open_flag & O_CREAT)) { > - if (nd->last.name[nd->last.len]) > - nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > - /* we _can_ be in RCU mode here */ > - dentry = lookup_fast(nd); > - if (IS_ERR(dentry)) > - return ERR_CAST(dentry); > if (likely(dentry)) > goto finish_lookup; > > if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU)) > return ERR_PTR(-ECHILD); > } else { > - /* create side of things */ > if (nd->flags & LOOKUP_RCU) { > + /* can stay in rcuwalk if not auditing */ > + if (dentry && audit_dummy_context()) > + goto check_slashes; > if (!try_to_unlazy(nd)) > return ERR_PTR(-ECHILD); > } > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > - /* trailing slashes? */ > - if (unlikely(nd->last.name[nd->last.len])) > +check_slashes: > + if (trailing_slashes(nd)) > return ERR_PTR(-EISDIR); > + if (dentry) > + goto finish_lookup; > } > > if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { > > --- > base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd > change-id: 20240723-openfast-ac49a7b6ade2 > > Best regards,
On Tue, Aug 6, 2024 at 9:26 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Tue, 2024-08-06 at 12:11 -0700, Andi Kleen wrote: > > Mateusz Guzik <mjguzik@gmail.com> writes: > > > > > > I would bench with that myself, but I temporarily don't have handy > > > access to bigger hw. Even so, the below is completely optional and > > > perhaps more of a suggestion for the future :) > > > > > > I hacked up the test case based on tests/open1.c. > > > > Don't you need two test cases? One where the file exists and one > > where it doesn't. Because the "doesn't exist" will likely be slower > > than before because it will do the lookups twice, > > and it will likely even slow single threaded. > > > > I assume the penalty will also depend on the number of entries > > in the path. > > > > That all seem to be an important considerations in judging the benefits > > of the patch. > > > > Definitely. > > FWIW, I did test a single threaded (bespoke) test case that did a bunch > of O_CREAT opens, closes and then unlinks. I didn't measure any > discernable difference with this patch. My conclusion from that was > that the extra lockless lookup should be cheap. > > That said, this could show a difference if you have rather long hash > chains that need to be walked completely, and you have to actually do > the create every time. In practice though, time spent under the > inode_lock and doing the create tends to dominate in that case, so I > *think* this should still be worthwhile. > > I'll plan to add a test like that to will_it_scale unless Mateusz beats > me to it. A long soak in linux-next is probably also justified with > this patch. Well if we are really going there I have to point out a couple of three things concerning single-threaded performance from entering the kernel to getting back with a file descriptor. The headline is that there is a lot of single-threaded perf left on the table and modulo a significant hit in terms of cycles it is going to be hard to measure a meaningful result. Before I get to the vfs layer, there is a significant loss in the memory allocator because of memcg -- it takes several irq off/on trips for every alloc (needed to grab struct file *). I have a plan what to do with it (handle stuff with local cmpxchg (note no lock prefix)), which I'm trying to get around to. Apart from that you may note the allocator fast path performs a 16-byte cmpxchg, which is again dog slow and executes twice (once for the file obj, another time for the namei buffer). Someone(tm) should patch it up and I have some vague ideas, but 0 idea when I can take a serious stab. As for vfs, just from atomics perspective: - opening a results in a spurious ref/unref cycle on the dentry, i posted a patch [1] to sort it out (maybe Al will write his own instead). single-threaded it gives about +5% to open/close rate - there is an avoidable smp_mb in do_dentry_open, posted a patch [2] and it got acked. i did not bother benching it - someone sneaked in atomic refcount management to "struct filename" to appease io_uring vs audit. this results in another atomic added to every single path lookup. I have an idea what to do there - opening for write (which you do with O_CREAT) calls mnt_get_write_access which contains another smp_mb. this probably can be sorted out the same way percpu semaphores got taken care of - __legitimize_mnt has yet another smp_mb to synchro against unmount. this *probably* can be sorted out with synchronize_rcu, but some care is needed to not induce massive delays on unmount - credential management is also done with atomics (so that's get + put for the open/close cycle), $elsewhere I implemented a scheme where the local count is cached in the thread itself and only rolled up on credential change. this whacks the problem and is something i'm planning on proposing at some point and more That is to say I completely agree this does add extra work, but the kernel is not in shape where true single-threaded impact can be sensibly measured. I do find it prudent to validate nothing happened to somehow floor scalability of the case which does need to create stuff though. [1] https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@gmail.com/ [2] https://lore.kernel.org/linux-fsdevel/20240806172846.886570-1-mjguzik@gmail.com/
On Tue, 2024-08-06 at 21:22 +0200, Mateusz Guzik wrote: > On Tue, Aug 6, 2024 at 9:11 PM Andi Kleen <ak@linux.intel.com> wrote: > > > > Mateusz Guzik <mjguzik@gmail.com> writes: > > > > > > I would bench with that myself, but I temporarily don't have > > > handy > > > access to bigger hw. Even so, the below is completely optional > > > and > > > perhaps more of a suggestion for the future :) > > > > > > I hacked up the test case based on tests/open1.c. > > > > Don't you need two test cases? One where the file exists and one > > where it doesn't. Because the "doesn't exist" will likely be slower > > than before because it will do the lookups twice, > > and it will likely even slow single threaded. > > > > I assume the penalty will also depend on the number of entries > > in the path. > > > > That all seem to be an important considerations in judging the > > benefits > > of the patch. > > > > This is why I suggested separately running "unlink1" which is > guaranteed to create a file every time -- all iterations will fail > the > proposed fast path. > > Unless you meant a mixed variant where only some of the threads > create > files. Perhaps worthwhile to add, not hard to do (one can switch the > mode based on passed worker number). > Well... # ./unlink1_processes -t 70 -s 100 average: v6.10: 114455 v6.10 + patch: 149513 I suspect what's happening here is that this patch relieves contention for the inode_lock and that allows the unlinks to proceed faster. Running it with a single process though: average: v6.10: 200106 v6.10 + patch: 199188 So, ~.4% degradation there? That doesn't seem too bad given the gain in the threaded test.
> Before I get to the vfs layer, there is a significant loss in the > memory allocator because of memcg -- it takes several irq off/on trips > for every alloc (needed to grab struct file *). I have a plan what to > do with it (handle stuff with local cmpxchg (note no lock prefix)), > which I'm trying to get around to. Apart from that you may note the > allocator fast path performs a 16-byte cmpxchg, which is again dog > slow and executes twice (once for the file obj, another time for the > namei buffer). Someone(tm) should patch it up and I have some vague > ideas, but 0 idea when I can take a serious stab. I just LBR sampled it on my skylake and it doesn't look particularly slow. You see the whole massive block including CMPXCHG16 gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely a missing cache line. kmem_cache_free: ffffffff9944ce20 nop %edi, %edx ffffffff9944ce24 nopl %eax, (%rax,%rax,1) ffffffff9944ce29 pushq %rbp ffffffff9944ce2a mov %rdi, %rdx ffffffff9944ce2d mov %rsp, %rbp ffffffff9944ce30 pushq %r15 ffffffff9944ce32 pushq %r14 ffffffff9944ce34 pushq %r13 ffffffff9944ce36 pushq %r12 ffffffff9944ce38 mov $0x80000000, %r12d ffffffff9944ce3e pushq %rbx ffffffff9944ce3f mov %rsi, %rbx ffffffff9944ce42 and $0xfffffffffffffff0, %rsp ffffffff9944ce46 sub $0x10, %rsp ffffffff9944ce4a movq %gs:0x28, %rax ffffffff9944ce53 movq %rax, 0x8(%rsp) ffffffff9944ce58 xor %eax, %eax ffffffff9944ce5a add %rsi, %r12 ffffffff9944ce5d jb 0xffffffff9944d1ea ffffffff9944ce63 mov $0xffffffff80000000, %rax ffffffff9944ce6a xor %r13d, %r13d ffffffff9944ce6d subq 0x17b068c(%rip), %rax ffffffff9944ce74 add %r12, %rax ffffffff9944ce77 shr $0xc, %rax ffffffff9944ce7b shl $0x6, %rax ffffffff9944ce7f addq 0x17b066a(%rip), %rax ffffffff9944ce86 movq 0x8(%rax), %rcx ffffffff9944ce8a test $0x1, %cl ffffffff9944ce8d jnz 0xffffffff9944d15c ffffffff9944ce93 nopl %eax, (%rax,%rax,1) ffffffff9944ce98 movq (%rax), %rcx ffffffff9944ce9b and $0x8, %ch ffffffff9944ce9e jz 0xffffffff9944cfea ffffffff9944cea4 test %rax, %rax ffffffff9944cea7 jz 0xffffffff9944cfea ffffffff9944cead movq 0x8(%rax), %r14 ffffffff9944ceb1 test %r14, %r14 ffffffff9944ceb4 jz 0xffffffff9944cfac ffffffff9944ceba cmp %r14, %rdx ffffffff9944cebd jnz 0xffffffff9944d165 ffffffff9944cec3 test %r14, %r14 ffffffff9944cec6 jz 0xffffffff9944cfac ffffffff9944cecc movq 0x8(%rbp), %r15 ffffffff9944ced0 nopl %eax, (%rax,%rax,1) ffffffff9944ced5 movq 0x1fe5134(%rip), %rax ffffffff9944cedc test %r13, %r13 ffffffff9944cedf jnz 0xffffffff9944ceef ffffffff9944cee1 mov $0xffffffff80000000, %rax ffffffff9944cee8 subq 0x17b0611(%rip), %rax ffffffff9944ceef add %rax, %r12 ffffffff9944cef2 shr $0xc, %r12 ffffffff9944cef6 shl $0x6, %r12 ffffffff9944cefa addq 0x17b05ef(%rip), %r12 ffffffff9944cf01 movq 0x8(%r12), %rax ffffffff9944cf06 mov %r12, %r13 ffffffff9944cf09 test $0x1, %al ffffffff9944cf0b jnz 0xffffffff9944d1b1 ffffffff9944cf11 nopl %eax, (%rax,%rax,1) ffffffff9944cf16 movq (%r13), %rax ffffffff9944cf1a movq %rbx, (%rsp) ffffffff9944cf1e test $0x8, %ah ffffffff9944cf21 mov $0x0, %eax ffffffff9944cf26 cmovz %rax, %r13 ffffffff9944cf2a data16 nop ffffffff9944cf2c movq 0x38(%r13), %r8 ffffffff9944cf30 cmp $0x3, %r8 ffffffff9944cf34 jnbe 0xffffffff9944d1ca ffffffff9944cf3a nopl %eax, (%rax,%rax,1) ffffffff9944cf3f movq 0x23d6f72(%rip), %rax ffffffff9944cf46 mov %rbx, %rdx ffffffff9944cf49 sub %rax, %rdx ffffffff9944cf4c cmp $0x1fffff, %rdx ffffffff9944cf53 jbe 0xffffffff9944d03a ffffffff9944cf59 movq (%r14), %rax ffffffff9944cf5c addq %gs:0x66bccab4(%rip), %rax ffffffff9944cf64 movq 0x8(%rax), %rdx ffffffff9944cf68 cmpq %r13, 0x10(%rax) ffffffff9944cf6c jnz 0xffffffff9944d192 ffffffff9944cf72 movl 0x28(%r14), %ecx ffffffff9944cf76 movq (%rax), %rax ffffffff9944cf79 add %rbx, %rcx ffffffff9944cf7c cmp %rbx, %rax ffffffff9944cf7f jz 0xffffffff9944d1ba ffffffff9944cf85 movq 0xb8(%r14), %rsi ffffffff9944cf8c mov %rcx, %rdi ffffffff9944cf8f bswap %rdi ffffffff9944cf92 xor %rax, %rsi ffffffff9944cf95 xor %rdi, %rsi ffffffff9944cf98 movq %rsi, (%rcx) ffffffff9944cf9b leaq 0x2000(%rdx), %rcx ffffffff9944cfa2 movq (%r14), %rsi ffffffff9944cfa5 cmpxchg16bx %gs:(%rsi) ffffffff9944cfaa jnz 0xffffffff9944cf59 ffffffff9944cfac movq 0x8(%rsp), %rax ffffffff9944cfb1 subq %gs:0x28, %rax ffffffff9944cfba jnz 0xffffffff9944d1fc ffffffff9944cfc0 leaq -0x28(%rbp), %rsp ffffffff9944cfc4 popq %rbx ffffffff9944cfc5 popq %r12 ffffffff9944cfc7 popq %r13 ffffffff9944cfc9 popq %r14 ffffffff9944cfcb popq %r15 ffffffff9944cfcd popq %rbp ffffffff9944cfce retq # PRED 38 cycles [126] 2.74 IPC <-------------
> +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > +{ > + struct dentry *dentry; > + > + if (open_flag & O_CREAT) { > + /* Don't bother on an O_EXCL create */ > + if (open_flag & O_EXCL) > + return NULL; > + > + /* > + * FIXME: If auditing is enabled, then we'll have to unlazy to > + * use the dentry. For now, don't do this, since it shifts > + * contention from parent's i_rwsem to its d_lockref spinlock. > + * Reconsider this once dentry refcounting handles heavy > + * contention better. > + */ > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > + return NULL; Hm, the audit_inode() on the parent is done independent of whether the file was actually created or not. But the audit_inode() on the file itself is only done when it was actually created. Imho, there's no need to do audit_inode() on the parent when we immediately find that file already existed. If we accept that then this makes the change a lot simpler. The inconsistency would partially remain though. When the file doesn't exist audit_inode() on the parent is called but by the time we've grabbed the inode lock someone else might already have created the file and then again we wouldn't audit_inode() on the file but we would have on the parent. I think that's fine. But if that's bothersome the more aggressive thing to do would be to pull that audit_inode() on the parent further down after we created the file. Imho, that should be fine?... See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads for a completely untested draft of what I mean.
On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > +{ > > + struct dentry *dentry; > > + > > + if (open_flag & O_CREAT) { > > + /* Don't bother on an O_EXCL create */ > > + if (open_flag & O_EXCL) > > + return NULL; > > + > > + /* > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > + * use the dentry. For now, don't do this, since it shifts > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > + * Reconsider this once dentry refcounting handles heavy > > + * contention better. > > + */ > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > + return NULL; > > Hm, the audit_inode() on the parent is done independent of whether the > file was actually created or not. But the audit_inode() on the file > itself is only done when it was actually created. Imho, there's no need > to do audit_inode() on the parent when we immediately find that file > already existed. If we accept that then this makes the change a lot > simpler. > > The inconsistency would partially remain though. When the file doesn't > exist audit_inode() on the parent is called but by the time we've > grabbed the inode lock someone else might already have created the file > and then again we wouldn't audit_inode() on the file but we would have > on the parent. > > I think that's fine. But if that's bothersome the more aggressive thing > to do would be to pull that audit_inode() on the parent further down > after we created the file. Imho, that should be fine?... > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > for a completely untested draft of what I mean. Yeah, that's a lot simpler. That said, my experience when I've worked with audit in the past is that people who are using it are _very_ sensitive to changes of when records get emitted or not. I don't like this, because I think the rules here are ad-hoc and somewhat arbitrary, but keeping everything working exactly the same has been my MO whenever I have to work in there. If a certain access pattern suddenly generates a different set of records (or some are missing, as would be in this case), we might get bug reports about this. I'm ok with simplifying this code in the way you suggest, but we may want to do it in a patch on top of mine, to make it simple to revert later if that becomes necessary.
On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote: > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > > +{ > > > + struct dentry *dentry; > > > + > > > + if (open_flag & O_CREAT) { > > > + /* Don't bother on an O_EXCL create */ > > > + if (open_flag & O_EXCL) > > > + return NULL; > > > + > > > + /* > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > > + * use the dentry. For now, don't do this, since it shifts > > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > > + * Reconsider this once dentry refcounting handles heavy > > > + * contention better. > > > + */ > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > > + return NULL; > > > > Hm, the audit_inode() on the parent is done independent of whether the > > file was actually created or not. But the audit_inode() on the file > > itself is only done when it was actually created. Imho, there's no need > > to do audit_inode() on the parent when we immediately find that file > > already existed. If we accept that then this makes the change a lot > > simpler. > > > > The inconsistency would partially remain though. When the file doesn't > > exist audit_inode() on the parent is called but by the time we've > > grabbed the inode lock someone else might already have created the file > > and then again we wouldn't audit_inode() on the file but we would have > > on the parent. > > > > I think that's fine. But if that's bothersome the more aggressive thing > > to do would be to pull that audit_inode() on the parent further down > > after we created the file. Imho, that should be fine?... > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > for a completely untested draft of what I mean. > > Yeah, that's a lot simpler. That said, my experience when I've worked > with audit in the past is that people who are using it are _very_ > sensitive to changes of when records get emitted or not. I don't like > this, because I think the rules here are ad-hoc and somewhat arbitrary, > but keeping everything working exactly the same has been my MO whenever > I have to work in there. > > If a certain access pattern suddenly generates a different set of > records (or some are missing, as would be in this case), we might get > bug reports about this. I'm ok with simplifying this code in the way > you suggest, but we may want to do it in a patch on top of mine, to > make it simple to revert later if that becomes necessary. Fwiw, even with the rearranged checks in v3 of the patch audit records will be dropped because we may find a positive dentry but the path may have trailing slashes. At that point we just return without audit whereas before we always would've done that audit. Honestly, we should move that audit event as right now it's just really weird and see if that works. Otherwise the change is somewhat horrible complicating the already convoluted logic even more. So I'm appending the patches that I have on top of your patch in vfs.misc. Can you (other as well ofc) take a look and tell me whether that's not breaking anything completely other than later audit events?
On Thu, 2024-08-08 at 12:36 +0200, Christian Brauner wrote: > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote: > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > > > +{ > > > > + struct dentry *dentry; > > > > + > > > > + if (open_flag & O_CREAT) { > > > > + /* Don't bother on an O_EXCL create */ > > > > + if (open_flag & O_EXCL) > > > > + return NULL; > > > > + > > > > + /* > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > > > + * use the dentry. For now, don't do this, since it shifts > > > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > > > + * Reconsider this once dentry refcounting handles heavy > > > > + * contention better. > > > > + */ > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > > > + return NULL; > > > > > > Hm, the audit_inode() on the parent is done independent of whether the > > > file was actually created or not. But the audit_inode() on the file > > > itself is only done when it was actually created. Imho, there's no need > > > to do audit_inode() on the parent when we immediately find that file > > > already existed. If we accept that then this makes the change a lot > > > simpler. > > > > > > The inconsistency would partially remain though. When the file doesn't > > > exist audit_inode() on the parent is called but by the time we've > > > grabbed the inode lock someone else might already have created the file > > > and then again we wouldn't audit_inode() on the file but we would have > > > on the parent. > > > > > > I think that's fine. But if that's bothersome the more aggressive thing > > > to do would be to pull that audit_inode() on the parent further down > > > after we created the file. Imho, that should be fine?... > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > > for a completely untested draft of what I mean. > > > > Yeah, that's a lot simpler. That said, my experience when I've worked > > with audit in the past is that people who are using it are _very_ > > sensitive to changes of when records get emitted or not. I don't like > > this, because I think the rules here are ad-hoc and somewhat arbitrary, > > but keeping everything working exactly the same has been my MO whenever > > I have to work in there. > > > > If a certain access pattern suddenly generates a different set of > > records (or some are missing, as would be in this case), we might get > > bug reports about this. I'm ok with simplifying this code in the way > > you suggest, but we may want to do it in a patch on top of mine, to > > make it simple to revert later if that becomes necessary. > > Fwiw, even with the rearranged checks in v3 of the patch audit records > will be dropped because we may find a positive dentry but the path may > have trailing slashes. At that point we just return without audit > whereas before we always would've done that audit. > I don't think so. v3 has it drop out of rcuwalk and do the audit_inode call in the case of trailing slashes. I took great pains here to ensure that if we emitted a record before that we still do it after. Do let me know if I missed a case though. > Honestly, we should move that audit event as right now it's just really > weird and see if that works. Otherwise the change is somewhat horrible > complicating the already convoluted logic even more. > > So I'm appending the patches that I have on top of your patch in > vfs.misc. Can you (other as well ofc) take a look and tell me whether > that's not breaking anything completely other than later audit events? That all looks fine to me. I'm not a heavy user of audit, but the change you've made seems sane to me. Hopefully no one will notice. I'll plan to do some testing with it later today. Thanks!
> I don't think so. v3 has it drop out of rcuwalk and do the audit_inode > call in the case of trailing slashes. I took great pains here to ensure > that if we emitted a record before that we still do it after. Do let me > know if I missed a case though. Ah, I missed that this was predicated on audit not being enabled. Sorry about the noise...
On Thu 08-08-24 12:36:07, Christian Brauner wrote: > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote: > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > > > +{ > > > > + struct dentry *dentry; > > > > + > > > > + if (open_flag & O_CREAT) { > > > > + /* Don't bother on an O_EXCL create */ > > > > + if (open_flag & O_EXCL) > > > > + return NULL; > > > > + > > > > + /* > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > > > + * use the dentry. For now, don't do this, since it shifts > > > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > > > + * Reconsider this once dentry refcounting handles heavy > > > > + * contention better. > > > > + */ > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > > > + return NULL; > > > > > > Hm, the audit_inode() on the parent is done independent of whether the > > > file was actually created or not. But the audit_inode() on the file > > > itself is only done when it was actually created. Imho, there's no need > > > to do audit_inode() on the parent when we immediately find that file > > > already existed. If we accept that then this makes the change a lot > > > simpler. > > > > > > The inconsistency would partially remain though. When the file doesn't > > > exist audit_inode() on the parent is called but by the time we've > > > grabbed the inode lock someone else might already have created the file > > > and then again we wouldn't audit_inode() on the file but we would have > > > on the parent. > > > > > > I think that's fine. But if that's bothersome the more aggressive thing > > > to do would be to pull that audit_inode() on the parent further down > > > after we created the file. Imho, that should be fine?... > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > > for a completely untested draft of what I mean. > > > > Yeah, that's a lot simpler. That said, my experience when I've worked > > with audit in the past is that people who are using it are _very_ > > sensitive to changes of when records get emitted or not. I don't like > > this, because I think the rules here are ad-hoc and somewhat arbitrary, > > but keeping everything working exactly the same has been my MO whenever > > I have to work in there. > > > > If a certain access pattern suddenly generates a different set of > > records (or some are missing, as would be in this case), we might get > > bug reports about this. I'm ok with simplifying this code in the way > > you suggest, but we may want to do it in a patch on top of mine, to > > make it simple to revert later if that becomes necessary. > > Fwiw, even with the rearranged checks in v3 of the patch audit records > will be dropped because we may find a positive dentry but the path may > have trailing slashes. At that point we just return without audit > whereas before we always would've done that audit. > > Honestly, we should move that audit event as right now it's just really > weird and see if that works. Otherwise the change is somewhat horrible > complicating the already convoluted logic even more. > > So I'm appending the patches that I have on top of your patch in > vfs.misc. Can you (other as well ofc) take a look and tell me whether > that's not breaking anything completely other than later audit events? The changes look good as far as I'm concerned but let me CC audit guys if they have some thoughts regarding the change in generating audit event for the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit event for the parent when we are failing open due to trailing slashes in the pathname? Essentially we are speaking about moving: audit_inode(nd->name, dir, AUDIT_INODE_PARENT); from open_last_lookups() into lookup_open(). Honza
On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote: > On Thu 08-08-24 12:36:07, Christian Brauner wrote: > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote: > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > > > > +{ > > > > > + struct dentry *dentry; > > > > > + > > > > > + if (open_flag & O_CREAT) { > > > > > + /* Don't bother on an O_EXCL create */ > > > > > + if (open_flag & O_EXCL) > > > > > + return NULL; > > > > > + > > > > > + /* > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > > > > + * use the dentry. For now, don't do this, since it shifts > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > > > > + * Reconsider this once dentry refcounting handles heavy > > > > > + * contention better. > > > > > + */ > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > > > > + return NULL; > > > > > > > > Hm, the audit_inode() on the parent is done independent of whether the > > > > file was actually created or not. But the audit_inode() on the file > > > > itself is only done when it was actually created. Imho, there's no need > > > > to do audit_inode() on the parent when we immediately find that file > > > > already existed. If we accept that then this makes the change a lot > > > > simpler. > > > > > > > > The inconsistency would partially remain though. When the file doesn't > > > > exist audit_inode() on the parent is called but by the time we've > > > > grabbed the inode lock someone else might already have created the file > > > > and then again we wouldn't audit_inode() on the file but we would have > > > > on the parent. > > > > > > > > I think that's fine. But if that's bothersome the more aggressive thing > > > > to do would be to pull that audit_inode() on the parent further down > > > > after we created the file. Imho, that should be fine?... > > > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > > > for a completely untested draft of what I mean. > > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked > > > with audit in the past is that people who are using it are _very_ > > > sensitive to changes of when records get emitted or not. I don't like > > > this, because I think the rules here are ad-hoc and somewhat arbitrary, > > > but keeping everything working exactly the same has been my MO whenever > > > I have to work in there. > > > > > > If a certain access pattern suddenly generates a different set of > > > records (or some are missing, as would be in this case), we might get > > > bug reports about this. I'm ok with simplifying this code in the way > > > you suggest, but we may want to do it in a patch on top of mine, to > > > make it simple to revert later if that becomes necessary. > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records > > will be dropped because we may find a positive dentry but the path may > > have trailing slashes. At that point we just return without audit > > whereas before we always would've done that audit. > > > > Honestly, we should move that audit event as right now it's just really > > weird and see if that works. Otherwise the change is somewhat horrible > > complicating the already convoluted logic even more. > > > > So I'm appending the patches that I have on top of your patch in > > vfs.misc. Can you (other as well ofc) take a look and tell me whether > > that's not breaking anything completely other than later audit events? > > The changes look good as far as I'm concerned but let me CC audit guys if > they have some thoughts regarding the change in generating audit event for > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit > event for the parent when we are failing open due to trailing slashes in > the pathname? Essentially we are speaking about moving: > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > > from open_last_lookups() into lookup_open(). Thanks for adding the audit mailing list to the CC, Jan. I would ask for others to do the same when discussing changes that could impact audit (similar requests for the LSM framework, SELinux, etc.). The inode/path logging in audit is ... something. I have a longstanding todo item to go revisit the audit inode logging, both to fix some known bugs, and see what we can improve (I'm guessing quite a bit). Unfortunately, there is always something else which is burning a little bit hotter and I haven't been able to get to it yet. The general idea with audit is that you want to record the information both on success and failure. It's easy to understand the success case, as it is a record of what actually happened on the system, but you also want to record the failure case as it can provide some insight on what a process/user is attempting to do, and that can be very important for certain classes of users. I haven't dug into the patches in Christian's tree, but in general I think Jeff's guidance about not changing what is recorded in the audit log is probably good advice (there will surely be exceptions to that, but it's still good guidance).
On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote: > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 08-08-24 12:36:07, Christian Brauner wrote: > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote: > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > > > > > +{ > > > > > > + struct dentry *dentry; > > > > > > + > > > > > > + if (open_flag & O_CREAT) { > > > > > > + /* Don't bother on an O_EXCL create */ > > > > > > + if (open_flag & O_EXCL) > > > > > > + return NULL; > > > > > > + > > > > > > + /* > > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > > > > > + * use the dentry. For now, don't do this, since it shifts > > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > > > > > + * Reconsider this once dentry refcounting handles heavy > > > > > > + * contention better. > > > > > > + */ > > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > > > > > + return NULL; > > > > > > > > > > Hm, the audit_inode() on the parent is done independent of whether the > > > > > file was actually created or not. But the audit_inode() on the file > > > > > itself is only done when it was actually created. Imho, there's no need > > > > > to do audit_inode() on the parent when we immediately find that file > > > > > already existed. If we accept that then this makes the change a lot > > > > > simpler. > > > > > > > > > > The inconsistency would partially remain though. When the file doesn't > > > > > exist audit_inode() on the parent is called but by the time we've > > > > > grabbed the inode lock someone else might already have created the file > > > > > and then again we wouldn't audit_inode() on the file but we would have > > > > > on the parent. > > > > > > > > > > I think that's fine. But if that's bothersome the more aggressive thing > > > > > to do would be to pull that audit_inode() on the parent further down > > > > > after we created the file. Imho, that should be fine?... > > > > > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > > > > for a completely untested draft of what I mean. > > > > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked > > > > with audit in the past is that people who are using it are _very_ > > > > sensitive to changes of when records get emitted or not. I don't like > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary, > > > > but keeping everything working exactly the same has been my MO whenever > > > > I have to work in there. > > > > > > > > If a certain access pattern suddenly generates a different set of > > > > records (or some are missing, as would be in this case), we might get > > > > bug reports about this. I'm ok with simplifying this code in the way > > > > you suggest, but we may want to do it in a patch on top of mine, to > > > > make it simple to revert later if that becomes necessary. > > > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records > > > will be dropped because we may find a positive dentry but the path may > > > have trailing slashes. At that point we just return without audit > > > whereas before we always would've done that audit. > > > > > > Honestly, we should move that audit event as right now it's just really > > > weird and see if that works. Otherwise the change is somewhat horrible > > > complicating the already convoluted logic even more. > > > > > > So I'm appending the patches that I have on top of your patch in > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether > > > that's not breaking anything completely other than later audit events? > > > > The changes look good as far as I'm concerned but let me CC audit guys if > > they have some thoughts regarding the change in generating audit event for > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit > > event for the parent when we are failing open due to trailing slashes in > > the pathname? Essentially we are speaking about moving: > > > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > > > > from open_last_lookups() into lookup_open(). > > Thanks for adding the audit mailing list to the CC, Jan. I would ask > for others to do the same when discussing changes that could impact > audit (similar requests for the LSM framework, SELinux, etc.). > > The inode/path logging in audit is ... something. I have a > longstanding todo item to go revisit the audit inode logging, both to > fix some known bugs, and see what we can improve (I'm guessing quite a > bit). Unfortunately, there is always something else which is burning > a little bit hotter and I haven't been able to get to it yet. > It is "something" alright. The audit logging just happens at strange and inconvenient times vs. what else we're trying to do wrt pathwalking and such. In particular here, the fact __audit_inode can block is what really sucks. Since we're discussing it... ISTM that the inode/path logging here is something like a tracepoint. In particular, we're looking to record a specific set of information at specific points in the code. One of the big differences between them however is that tracepoints don't block. The catch is that we can't just drop messages if we run out of audit logging space, so that would have to be handled reasonably. I wonder if we could leverage the tracepoint infrastructure to help us record the necessary info somehow? Copy the records into a specific ring buffer, and then copy them out to the audit infrastructure in task_work? I don't have any concrete ideas here, but the path/inode audit code has been a burden for a while now and it'd be good to think about how we could do this better. > The general idea with audit is that you want to record the information > both on success and failure. It's easy to understand the success > case, as it is a record of what actually happened on the system, but > you also want to record the failure case as it can provide some > insight on what a process/user is attempting to do, and that can be > very important for certain classes of users. I haven't dug into the > patches in Christian's tree, but in general I think Jeff's guidance > about not changing what is recorded in the audit log is probably good > advice (there will surely be exceptions to that, but it's still good > guidance). > In this particular case, the question is: Do we need to emit a AUDIT_INODE_PARENT record when opening an existing file, just because O_CREAT was set? We don't emit such a record when opening without O_CREAT set.
On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote: > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote: > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote: > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote: > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote: > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > > > > > > +{ > > > > > > > + struct dentry *dentry; > > > > > > > + > > > > > > > + if (open_flag & O_CREAT) { > > > > > > > + /* Don't bother on an O_EXCL create */ > > > > > > > + if (open_flag & O_EXCL) > > > > > > > + return NULL; > > > > > > > + > > > > > > > + /* > > > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > > > > > > + * use the dentry. For now, don't do this, since it shifts > > > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > > > > > > + * Reconsider this once dentry refcounting handles heavy > > > > > > > + * contention better. > > > > > > > + */ > > > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > > > > > > + return NULL; > > > > > > > > > > > > Hm, the audit_inode() on the parent is done independent of whether the > > > > > > file was actually created or not. But the audit_inode() on the file > > > > > > itself is only done when it was actually created. Imho, there's no need > > > > > > to do audit_inode() on the parent when we immediately find that file > > > > > > already existed. If we accept that then this makes the change a lot > > > > > > simpler. > > > > > > > > > > > > The inconsistency would partially remain though. When the file doesn't > > > > > > exist audit_inode() on the parent is called but by the time we've > > > > > > grabbed the inode lock someone else might already have created the file > > > > > > and then again we wouldn't audit_inode() on the file but we would have > > > > > > on the parent. > > > > > > > > > > > > I think that's fine. But if that's bothersome the more aggressive thing > > > > > > to do would be to pull that audit_inode() on the parent further down > > > > > > after we created the file. Imho, that should be fine?... > > > > > > > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > > > > > for a completely untested draft of what I mean. > > > > > > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked > > > > > with audit in the past is that people who are using it are _very_ > > > > > sensitive to changes of when records get emitted or not. I don't like > > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary, > > > > > but keeping everything working exactly the same has been my MO whenever > > > > > I have to work in there. > > > > > > > > > > If a certain access pattern suddenly generates a different set of > > > > > records (or some are missing, as would be in this case), we might get > > > > > bug reports about this. I'm ok with simplifying this code in the way > > > > > you suggest, but we may want to do it in a patch on top of mine, to > > > > > make it simple to revert later if that becomes necessary. > > > > > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records > > > > will be dropped because we may find a positive dentry but the path may > > > > have trailing slashes. At that point we just return without audit > > > > whereas before we always would've done that audit. > > > > > > > > Honestly, we should move that audit event as right now it's just really > > > > weird and see if that works. Otherwise the change is somewhat horrible > > > > complicating the already convoluted logic even more. > > > > > > > > So I'm appending the patches that I have on top of your patch in > > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether > > > > that's not breaking anything completely other than later audit events? > > > > > > The changes look good as far as I'm concerned but let me CC audit guys if > > > they have some thoughts regarding the change in generating audit event for > > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit > > > event for the parent when we are failing open due to trailing slashes in > > > the pathname? Essentially we are speaking about moving: > > > > > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > > > > > > from open_last_lookups() into lookup_open(). > > > > Thanks for adding the audit mailing list to the CC, Jan. I would ask > > for others to do the same when discussing changes that could impact > > audit (similar requests for the LSM framework, SELinux, etc.). > > > > The inode/path logging in audit is ... something. I have a > > longstanding todo item to go revisit the audit inode logging, both to > > fix some known bugs, and see what we can improve (I'm guessing quite a > > bit). Unfortunately, there is always something else which is burning > > a little bit hotter and I haven't been able to get to it yet. > > > > It is "something" alright. The audit logging just happens at strange > and inconvenient times vs. what else we're trying to do wrt pathwalking > and such. In particular here, the fact __audit_inode can block is what > really sucks. > > Since we're discussing it... > > ISTM that the inode/path logging here is something like a tracepoint. > In particular, we're looking to record a specific set of information at > specific points in the code. One of the big differences between them > however is that tracepoints don't block. The catch is that we can't > just drop messages if we run out of audit logging space, so that would > have to be handled reasonably. Yes, the buffer allocation is the tricky bit. Audit does preallocate some structs for tracking names which ideally should handle the vast majority of the cases, but yes, we need something to handle all of the corner cases too without having to resort to audit_panic(). > I wonder if we could leverage the tracepoint infrastructure to help us > record the necessary info somehow? Copy the records into a specific > ring buffer, and then copy them out to the audit infrastructure in > task_work? I believe using task_work will cause a number of challenges for the audit subsystem as we try to bring everything together into a single audit event. We've had a lot of problems with io_uring doing similar things, some of which are still unresolved. > I don't have any concrete ideas here, but the path/inode audit code has > been a burden for a while now and it'd be good to think about how we > could do this better. I've got some grand ideas on how to cut down on a lot of our allocations and string generation in the critical path, not just with the inodes, but with audit records in general. Sadly I just haven't had the time to get to any of it. > > The general idea with audit is that you want to record the information > > both on success and failure. It's easy to understand the success > > case, as it is a record of what actually happened on the system, but > > you also want to record the failure case as it can provide some > > insight on what a process/user is attempting to do, and that can be > > very important for certain classes of users. I haven't dug into the > > patches in Christian's tree, but in general I think Jeff's guidance > > about not changing what is recorded in the audit log is probably good > > advice (there will surely be exceptions to that, but it's still good > > guidance). > > > > In this particular case, the question is: > > Do we need to emit a AUDIT_INODE_PARENT record when opening an existing > file, just because O_CREAT was set? We don't emit such a record when > opening without O_CREAT set. I'm not as current on the third-party security requirements as I used to be, but I do know that oftentimes when a file is created the parent directory is an important bit of information to have in the audit log.
On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote: > On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote: > > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote: > > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote: > > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote: > > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > > > > > > > +{ > > > > > > > > + struct dentry *dentry; > > > > > > > > + > > > > > > > > + if (open_flag & O_CREAT) { > > > > > > > > + /* Don't bother on an O_EXCL create */ > > > > > > > > + if (open_flag & O_EXCL) > > > > > > > > + return NULL; > > > > > > > > + > > > > > > > > + /* > > > > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > > > > > > > + * use the dentry. For now, don't do this, since it shifts > > > > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > > > > > > > + * Reconsider this once dentry refcounting handles heavy > > > > > > > > + * contention better. > > > > > > > > + */ > > > > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > > > > > > > + return NULL; > > > > > > > > > > > > > > Hm, the audit_inode() on the parent is done independent of whether the > > > > > > > file was actually created or not. But the audit_inode() on the file > > > > > > > itself is only done when it was actually created. Imho, there's no need > > > > > > > to do audit_inode() on the parent when we immediately find that file > > > > > > > already existed. If we accept that then this makes the change a lot > > > > > > > simpler. > > > > > > > > > > > > > > The inconsistency would partially remain though. When the file doesn't > > > > > > > exist audit_inode() on the parent is called but by the time we've > > > > > > > grabbed the inode lock someone else might already have created the file > > > > > > > and then again we wouldn't audit_inode() on the file but we would have > > > > > > > on the parent. > > > > > > > > > > > > > > I think that's fine. But if that's bothersome the more aggressive thing > > > > > > > to do would be to pull that audit_inode() on the parent further down > > > > > > > after we created the file. Imho, that should be fine?... > > > > > > > > > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > > > > > > for a completely untested draft of what I mean. > > > > > > > > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked > > > > > > with audit in the past is that people who are using it are _very_ > > > > > > sensitive to changes of when records get emitted or not. I don't like > > > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary, > > > > > > but keeping everything working exactly the same has been my MO whenever > > > > > > I have to work in there. > > > > > > > > > > > > If a certain access pattern suddenly generates a different set of > > > > > > records (or some are missing, as would be in this case), we might get > > > > > > bug reports about this. I'm ok with simplifying this code in the way > > > > > > you suggest, but we may want to do it in a patch on top of mine, to > > > > > > make it simple to revert later if that becomes necessary. > > > > > > > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records > > > > > will be dropped because we may find a positive dentry but the path may > > > > > have trailing slashes. At that point we just return without audit > > > > > whereas before we always would've done that audit. > > > > > > > > > > Honestly, we should move that audit event as right now it's just really > > > > > weird and see if that works. Otherwise the change is somewhat horrible > > > > > complicating the already convoluted logic even more. > > > > > > > > > > So I'm appending the patches that I have on top of your patch in > > > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether > > > > > that's not breaking anything completely other than later audit events? > > > > > > > > The changes look good as far as I'm concerned but let me CC audit guys if > > > > they have some thoughts regarding the change in generating audit event for > > > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit > > > > event for the parent when we are failing open due to trailing slashes in > > > > the pathname? Essentially we are speaking about moving: > > > > > > > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > > > > > > > > from open_last_lookups() into lookup_open(). > > > > > > Thanks for adding the audit mailing list to the CC, Jan. I would ask > > > for others to do the same when discussing changes that could impact > > > audit (similar requests for the LSM framework, SELinux, etc.). > > > > > > The inode/path logging in audit is ... something. I have a > > > longstanding todo item to go revisit the audit inode logging, both to > > > fix some known bugs, and see what we can improve (I'm guessing quite a > > > bit). Unfortunately, there is always something else which is burning > > > a little bit hotter and I haven't been able to get to it yet. > > > > > > > It is "something" alright. The audit logging just happens at strange > > and inconvenient times vs. what else we're trying to do wrt pathwalking > > and such. In particular here, the fact __audit_inode can block is what > > really sucks. > > > > Since we're discussing it... > > > > ISTM that the inode/path logging here is something like a tracepoint. > > In particular, we're looking to record a specific set of information at > > specific points in the code. One of the big differences between them > > however is that tracepoints don't block. The catch is that we can't > > just drop messages if we run out of audit logging space, so that would > > have to be handled reasonably. > > Yes, the buffer allocation is the tricky bit. Audit does preallocate > some structs for tracking names which ideally should handle the vast > majority of the cases, but yes, we need something to handle all of the > corner cases too without having to resort to audit_panic(). > > > I wonder if we could leverage the tracepoint infrastructure to help us > > record the necessary info somehow? Copy the records into a specific > > ring buffer, and then copy them out to the audit infrastructure in > > task_work? > > I believe using task_work will cause a number of challenges for the > audit subsystem as we try to bring everything together into a single > audit event. We've had a lot of problems with io_uring doing similar > things, some of which are still unresolved. > > > I don't have any concrete ideas here, but the path/inode audit code has > > been a burden for a while now and it'd be good to think about how we > > could do this better. > > I've got some grand ideas on how to cut down on a lot of our > allocations and string generation in the critical path, not just with > the inodes, but with audit records in general. Sadly I just haven't > had the time to get to any of it. > > > > The general idea with audit is that you want to record the information > > > both on success and failure. It's easy to understand the success > > > case, as it is a record of what actually happened on the system, but > > > you also want to record the failure case as it can provide some > > > insight on what a process/user is attempting to do, and that can be > > > very important for certain classes of users. I haven't dug into the > > > patches in Christian's tree, but in general I think Jeff's guidance > > > about not changing what is recorded in the audit log is probably good > > > advice (there will surely be exceptions to that, but it's still good > > > guidance). > > > > > > > In this particular case, the question is: > > > > Do we need to emit a AUDIT_INODE_PARENT record when opening an existing > > file, just because O_CREAT was set? We don't emit such a record when > > opening without O_CREAT set. > > I'm not as current on the third-party security requirements as I used > to be, but I do know that oftentimes when a file is created the parent > directory is an important bit of information to have in the audit log. > Right. We'd still have that here since we have to unlazy to actually create the file. The question here is about the case where O_CREAT is set, but the file already exists. Nothing is being created in that case, so do we need to emit an audit record for the parent?
On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org> wrote: > On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote: > > On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> wrote: > > > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote: > > > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote: > > > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote: > > > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote: > > > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote: > > > > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > > > > > > > > +{ > > > > > > > > > + struct dentry *dentry; > > > > > > > > > + > > > > > > > > > + if (open_flag & O_CREAT) { > > > > > > > > > + /* Don't bother on an O_EXCL create */ > > > > > > > > > + if (open_flag & O_EXCL) > > > > > > > > > + return NULL; > > > > > > > > > + > > > > > > > > > + /* > > > > > > > > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > > > > > > > > + * use the dentry. For now, don't do this, since it shifts > > > > > > > > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > > > > > > > > + * Reconsider this once dentry refcounting handles heavy > > > > > > > > > + * contention better. > > > > > > > > > + */ > > > > > > > > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > > > > > > > > + return NULL; > > > > > > > > > > > > > > > > Hm, the audit_inode() on the parent is done independent of whether the > > > > > > > > file was actually created or not. But the audit_inode() on the file > > > > > > > > itself is only done when it was actually created. Imho, there's no need > > > > > > > > to do audit_inode() on the parent when we immediately find that file > > > > > > > > already existed. If we accept that then this makes the change a lot > > > > > > > > simpler. > > > > > > > > > > > > > > > > The inconsistency would partially remain though. When the file doesn't > > > > > > > > exist audit_inode() on the parent is called but by the time we've > > > > > > > > grabbed the inode lock someone else might already have created the file > > > > > > > > and then again we wouldn't audit_inode() on the file but we would have > > > > > > > > on the parent. > > > > > > > > > > > > > > > > I think that's fine. But if that's bothersome the more aggressive thing > > > > > > > > to do would be to pull that audit_inode() on the parent further down > > > > > > > > after we created the file. Imho, that should be fine?... > > > > > > > > > > > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > > > > > > > for a completely untested draft of what I mean. > > > > > > > > > > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked > > > > > > > with audit in the past is that people who are using it are _very_ > > > > > > > sensitive to changes of when records get emitted or not. I don't like > > > > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary, > > > > > > > but keeping everything working exactly the same has been my MO whenever > > > > > > > I have to work in there. > > > > > > > > > > > > > > If a certain access pattern suddenly generates a different set of > > > > > > > records (or some are missing, as would be in this case), we might get > > > > > > > bug reports about this. I'm ok with simplifying this code in the way > > > > > > > you suggest, but we may want to do it in a patch on top of mine, to > > > > > > > make it simple to revert later if that becomes necessary. > > > > > > > > > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records > > > > > > will be dropped because we may find a positive dentry but the path may > > > > > > have trailing slashes. At that point we just return without audit > > > > > > whereas before we always would've done that audit. > > > > > > > > > > > > Honestly, we should move that audit event as right now it's just really > > > > > > weird and see if that works. Otherwise the change is somewhat horrible > > > > > > complicating the already convoluted logic even more. > > > > > > > > > > > > So I'm appending the patches that I have on top of your patch in > > > > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether > > > > > > that's not breaking anything completely other than later audit events? > > > > > > > > > > The changes look good as far as I'm concerned but let me CC audit guys if > > > > > they have some thoughts regarding the change in generating audit event for > > > > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit > > > > > event for the parent when we are failing open due to trailing slashes in > > > > > the pathname? Essentially we are speaking about moving: > > > > > > > > > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > > > > > > > > > > from open_last_lookups() into lookup_open(). > > > > > > > > Thanks for adding the audit mailing list to the CC, Jan. I would ask > > > > for others to do the same when discussing changes that could impact > > > > audit (similar requests for the LSM framework, SELinux, etc.). > > > > > > > > The inode/path logging in audit is ... something. I have a > > > > longstanding todo item to go revisit the audit inode logging, both to > > > > fix some known bugs, and see what we can improve (I'm guessing quite a > > > > bit). Unfortunately, there is always something else which is burning > > > > a little bit hotter and I haven't been able to get to it yet. > > > > > > > > > > It is "something" alright. The audit logging just happens at strange > > > and inconvenient times vs. what else we're trying to do wrt pathwalking > > > and such. In particular here, the fact __audit_inode can block is what > > > really sucks. > > > > > > Since we're discussing it... > > > > > > ISTM that the inode/path logging here is something like a tracepoint. > > > In particular, we're looking to record a specific set of information at > > > specific points in the code. One of the big differences between them > > > however is that tracepoints don't block. The catch is that we can't > > > just drop messages if we run out of audit logging space, so that would > > > have to be handled reasonably. > > > > Yes, the buffer allocation is the tricky bit. Audit does preallocate > > some structs for tracking names which ideally should handle the vast > > majority of the cases, but yes, we need something to handle all of the > > corner cases too without having to resort to audit_panic(). > > > > > I wonder if we could leverage the tracepoint infrastructure to help us > > > record the necessary info somehow? Copy the records into a specific > > > ring buffer, and then copy them out to the audit infrastructure in > > > task_work? > > > > I believe using task_work will cause a number of challenges for the > > audit subsystem as we try to bring everything together into a single > > audit event. We've had a lot of problems with io_uring doing similar > > things, some of which are still unresolved. > > > > > I don't have any concrete ideas here, but the path/inode audit code has > > > been a burden for a while now and it'd be good to think about how we > > > could do this better. > > > > I've got some grand ideas on how to cut down on a lot of our > > allocations and string generation in the critical path, not just with > > the inodes, but with audit records in general. Sadly I just haven't > > had the time to get to any of it. > > > > > > The general idea with audit is that you want to record the information > > > > both on success and failure. It's easy to understand the success > > > > case, as it is a record of what actually happened on the system, but > > > > you also want to record the failure case as it can provide some > > > > insight on what a process/user is attempting to do, and that can be > > > > very important for certain classes of users. I haven't dug into the > > > > patches in Christian's tree, but in general I think Jeff's guidance > > > > about not changing what is recorded in the audit log is probably good > > > > advice (there will surely be exceptions to that, but it's still good > > > > guidance). > > > > > > > > > > In this particular case, the question is: > > > > > > Do we need to emit a AUDIT_INODE_PARENT record when opening an existing > > > file, just because O_CREAT was set? We don't emit such a record when > > > opening without O_CREAT set. > > > > I'm not as current on the third-party security requirements as I used > > to be, but I do know that oftentimes when a file is created the parent > > directory is an important bit of information to have in the audit log. > > > > Right. We'd still have that here since we have to unlazy to actually > create the file. > > The question here is about the case where O_CREAT is set, but the file > already exists. Nothing is being created in that case, so do we need to > emit an audit record for the parent? As long as the full path information is present in the existing file's audit record it should be okay.
On Thu, 2024-08-08 at 21:22 -0400, Paul Moore wrote: > On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org> > wrote: > > On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote: > > > On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@kernel.org> > > > wrote: > > > > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote: > > > > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote: > > > > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton > > > > > > > wrote: > > > > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner > > > > > > > > wrote: > > > > > > > > > > +static struct dentry *lookup_fast_for_open(struct > > > > > > > > > > nameidata *nd, int open_flag) > > > > > > > > > > +{ > > > > > > > > > > + struct dentry *dentry; > > > > > > > > > > + > > > > > > > > > > + if (open_flag & O_CREAT) { > > > > > > > > > > + /* Don't bother on an O_EXCL create > > > > > > > > > > */ > > > > > > > > > > + if (open_flag & O_EXCL) > > > > > > > > > > + return NULL; > > > > > > > > > > + > > > > > > > > > > + /* > > > > > > > > > > + * FIXME: If auditing is enabled, > > > > > > > > > > then we'll have to unlazy to > > > > > > > > > > + * use the dentry. For now, don't > > > > > > > > > > do this, since it shifts > > > > > > > > > > + * contention from parent's i_rwsem > > > > > > > > > > to its d_lockref spinlock. > > > > > > > > > > + * Reconsider this once dentry > > > > > > > > > > refcounting handles heavy > > > > > > > > > > + * contention better. > > > > > > > > > > + */ > > > > > > > > > > + if ((nd->flags & LOOKUP_RCU) && > > > > > > > > > > !audit_dummy_context()) > > > > > > > > > > + return NULL; > > > > > > > > > > > > > > > > > > Hm, the audit_inode() on the parent is done > > > > > > > > > independent of whether the > > > > > > > > > file was actually created or not. But the > > > > > > > > > audit_inode() on the file > > > > > > > > > itself is only done when it was actually created. > > > > > > > > > Imho, there's no need > > > > > > > > > to do audit_inode() on the parent when we immediately > > > > > > > > > find that file > > > > > > > > > already existed. If we accept that then this makes > > > > > > > > > the change a lot > > > > > > > > > simpler. > > > > > > > > > > > > > > > > > > The inconsistency would partially remain though. When > > > > > > > > > the file doesn't > > > > > > > > > exist audit_inode() on the parent is called but by > > > > > > > > > the time we've > > > > > > > > > grabbed the inode lock someone else might already > > > > > > > > > have created the file > > > > > > > > > and then again we wouldn't audit_inode() on the file > > > > > > > > > but we would have > > > > > > > > > on the parent. > > > > > > > > > > > > > > > > > > I think that's fine. But if that's bothersome the > > > > > > > > > more aggressive thing > > > > > > > > > to do would be to pull that audit_inode() on the > > > > > > > > > parent further down > > > > > > > > > after we created the file. Imho, that should be > > > > > > > > > fine?... > > > > > > > > > > > > > > > > > > See > > > > > > > > > https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads > > > > > > > > > for a completely untested draft of what I mean. > > > > > > > > > > > > > > > > Yeah, that's a lot simpler. That said, my experience > > > > > > > > when I've worked > > > > > > > > with audit in the past is that people who are using it > > > > > > > > are _very_ > > > > > > > > sensitive to changes of when records get emitted or > > > > > > > > not. I don't like > > > > > > > > this, because I think the rules here are ad-hoc and > > > > > > > > somewhat arbitrary, > > > > > > > > but keeping everything working exactly the same has > > > > > > > > been my MO whenever > > > > > > > > I have to work in there. > > > > > > > > > > > > > > > > If a certain access pattern suddenly generates a > > > > > > > > different set of > > > > > > > > records (or some are missing, as would be in this > > > > > > > > case), we might get > > > > > > > > bug reports about this. I'm ok with simplifying this > > > > > > > > code in the way > > > > > > > > you suggest, but we may want to do it in a patch on top > > > > > > > > of mine, to > > > > > > > > make it simple to revert later if that becomes > > > > > > > > necessary. > > > > > > > > > > > > > > Fwiw, even with the rearranged checks in v3 of the patch > > > > > > > audit records > > > > > > > will be dropped because we may find a positive dentry but > > > > > > > the path may > > > > > > > have trailing slashes. At that point we just return > > > > > > > without audit > > > > > > > whereas before we always would've done that audit. > > > > > > > > > > > > > > Honestly, we should move that audit event as right now > > > > > > > it's just really > > > > > > > weird and see if that works. Otherwise the change is > > > > > > > somewhat horrible > > > > > > > complicating the already convoluted logic even more. > > > > > > > > > > > > > > So I'm appending the patches that I have on top of your > > > > > > > patch in > > > > > > > vfs.misc. Can you (other as well ofc) take a look and > > > > > > > tell me whether > > > > > > > that's not breaking anything completely other than later > > > > > > > audit events? > > > > > > > > > > > > The changes look good as far as I'm concerned but let me CC > > > > > > audit guys if > > > > > > they have some thoughts regarding the change in generating > > > > > > audit event for > > > > > > the parent. Paul, does it matter if open(O_CREAT) doesn't > > > > > > generate audit > > > > > > event for the parent when we are failing open due to > > > > > > trailing slashes in > > > > > > the pathname? Essentially we are speaking about moving: > > > > > > > > > > > > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > > > > > > > > > > > > from open_last_lookups() into lookup_open(). > > > > > > > > > > Thanks for adding the audit mailing list to the CC, Jan. I > > > > > would ask > > > > > for others to do the same when discussing changes that could > > > > > impact > > > > > audit (similar requests for the LSM framework, SELinux, > > > > > etc.). > > > > > > > > > > The inode/path logging in audit is ... something. I have a > > > > > longstanding todo item to go revisit the audit inode logging, > > > > > both to > > > > > fix some known bugs, and see what we can improve (I'm > > > > > guessing quite a > > > > > bit). Unfortunately, there is always something else which is > > > > > burning > > > > > a little bit hotter and I haven't been able to get to it yet. > > > > > > > > > > > > > It is "something" alright. The audit logging just happens at > > > > strange > > > > and inconvenient times vs. what else we're trying to do wrt > > > > pathwalking > > > > and such. In particular here, the fact __audit_inode can block > > > > is what > > > > really sucks. > > > > > > > > Since we're discussing it... > > > > > > > > ISTM that the inode/path logging here is something like a > > > > tracepoint. > > > > In particular, we're looking to record a specific set of > > > > information at > > > > specific points in the code. One of the big differences between > > > > them > > > > however is that tracepoints don't block. The catch is that we > > > > can't > > > > just drop messages if we run out of audit logging space, so > > > > that would > > > > have to be handled reasonably. > > > > > > Yes, the buffer allocation is the tricky bit. Audit does > > > preallocate > > > some structs for tracking names which ideally should handle the > > > vast > > > majority of the cases, but yes, we need something to handle all > > > of the > > > corner cases too without having to resort to audit_panic(). > > > > > > > I wonder if we could leverage the tracepoint infrastructure to > > > > help us > > > > record the necessary info somehow? Copy the records into a > > > > specific > > > > ring buffer, and then copy them out to the audit infrastructure > > > > in > > > > task_work? > > > > > > I believe using task_work will cause a number of challenges for > > > the > > > audit subsystem as we try to bring everything together into a > > > single > > > audit event. We've had a lot of problems with io_uring doing > > > similar > > > things, some of which are still unresolved. > > > > > > > I don't have any concrete ideas here, but the path/inode audit > > > > code has > > > > been a burden for a while now and it'd be good to think about > > > > how we > > > > could do this better. > > > > > > I've got some grand ideas on how to cut down on a lot of our > > > allocations and string generation in the critical path, not just > > > with > > > the inodes, but with audit records in general. Sadly I just > > > haven't > > > had the time to get to any of it. > > > > > > > > The general idea with audit is that you want to record the > > > > > information > > > > > both on success and failure. It's easy to understand the > > > > > success > > > > > case, as it is a record of what actually happened on the > > > > > system, but > > > > > you also want to record the failure case as it can provide > > > > > some > > > > > insight on what a process/user is attempting to do, and that > > > > > can be > > > > > very important for certain classes of users. I haven't dug > > > > > into the > > > > > patches in Christian's tree, but in general I think Jeff's > > > > > guidance > > > > > about not changing what is recorded in the audit log is > > > > > probably good > > > > > advice (there will surely be exceptions to that, but it's > > > > > still good > > > > > guidance). > > > > > > > > > > > > > In this particular case, the question is: > > > > > > > > Do we need to emit a AUDIT_INODE_PARENT record when opening an > > > > existing > > > > file, just because O_CREAT was set? We don't emit such a record > > > > when > > > > opening without O_CREAT set. > > > > > > I'm not as current on the third-party security requirements as I > > > used > > > to be, but I do know that oftentimes when a file is created the > > > parent > > > directory is an important bit of information to have in the audit > > > log. > > > > > > > Right. We'd still have that here since we have to unlazy to > > actually > > create the file. > > > > The question here is about the case where O_CREAT is set, but the > > file > > already exists. Nothing is being created in that case, so do we > > need to > > emit an audit record for the parent? > > As long as the full path information is present in the existing > file's > audit record it should be okay. > O_CREAT is ignored when the dentry already exists, so doing the same thing that we do when O_CREAT isn't set seems reasonable. We do call this in do_open, which would apply in this case: if (!(file->f_mode & FMODE_CREATED)) audit_inode(nd->name, nd->path.dentry, 0); That should have the necessary path info. If that's the case, then I think Christian's cleanup series on top of mine should be OK. I think that the only thing that would be missing is the AUDIT_INODE_PARENT record for the directory in the case where the dentry already exists, which should be superfluous. ISTR that Red Hat has a pretty extensive testsuite for audit. We might want to get them to run their tests on Christian's changes to be sure there are no surprises, if they are amenable.
On Fri, Aug 9, 2024 at 10:21 AM Jeff Layton <jlayton@kernel.org> wrote: > On Thu, 2024-08-08 at 21:22 -0400, Paul Moore wrote: > > On Thu, Aug 8, 2024 at 8:33 PM Jeff Layton <jlayton@kernel.org> > > wrote: ... > > > The question here is about the case where O_CREAT is set, but the > > > file > > > already exists. Nothing is being created in that case, so do we > > > need to > > > emit an audit record for the parent? > > > > As long as the full path information is present in the existing > > file's > > audit record it should be okay. > > O_CREAT is ignored when the dentry already exists, so doing the same > thing that we do when O_CREAT isn't set seems reasonable. > > We do call this in do_open, which would apply in this case: > > if (!(file->f_mode & FMODE_CREATED)) > audit_inode(nd->name, nd->path.dentry, 0); > > That should have the necessary path info. If that's the case, then I > think Christian's cleanup series on top of mine should be OK. I think > that the only thing that would be missing is the AUDIT_INODE_PARENT > record for the directory in the case where the dentry already exists, > which should be superfluous. > > ISTR that Red Hat has a pretty extensive testsuite for audit. We might > want to get them to run their tests on Christian's changes to be sure > there are no surprises, if they are amenable. I believe you are thinking of the audit-test project, which started as a community effort to develop a test suite suitable for the popular Common Criteria protection profiles of the time (of which auditing was an important requirement). Unfortunately, after a couple rounds of certifications I couldn't get the various companies involved to continue to maintain the public test suite anymore, so everyone went their own way with private forks. I have no idea if the community project still works, but someone at IBM/RH should have a recent~ish version that either runs on a modern system or is close to it. FWIW, the dead/dormant community project can be found at the link below (yes, that is a sf.net link): https://sourceforge.net/projects/audit-test It's been a while since I've been at RH, so I'm not sure if my test/QA contacts are still there, but if you don't have a contact at IBM/RH Jeff let me know and I can try to reach out.
On Tue, Aug 06, 2024 at 03:51:35PM -0400, Jeff Layton wrote: > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > > +{ > > + struct dentry *dentry; > > + > > + if (open_flag & O_CREAT) { > > + /* Don't bother on an O_EXCL create */ > > + if (open_flag & O_EXCL) > > + return NULL; > > + > > + /* > > + * FIXME: If auditing is enabled, then we'll have to unlazy to > > + * use the dentry. For now, don't do this, since it shifts > > + * contention from parent's i_rwsem to its d_lockref spinlock. > > + * Reconsider this once dentry refcounting handles heavy > > + * contention better. > > + */ > > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > > + return NULL; > > + } > > + > > + if (trailing_slashes(nd)) > > + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > > + > > + dentry = lookup_fast(nd); > > Self-NAK on this patch. We have to test for IS_ERR on the returned > dentry here. I'll send a v3 along after I've retested it. That's not the only problem; your "is it negative" test is inherently racy in RCU mode. IOW, what is positive at the time you get here can bloody well go negative immediately afterwards. Hit that with O_CREAT and you've got a bogus ENOENT...
On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote: > That's not the only problem; your "is it negative" test is inherently > racy in RCU mode. IOW, what is positive at the time you get here can > bloody well go negative immediately afterwards. Hit that with > O_CREAT and you've got a bogus ENOENT... Hmm... OTOH, in that case you end up in step_into(), which will do the right thing... How well does that series survive NFS client regression tests? That's where I'd expect potentially subtle shite, what with short-circuited ->d_revalidate() on the final pathwalk step in open()...
On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote: > On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote: > > > That's not the only problem; your "is it negative" test is inherently > > racy in RCU mode. IOW, what is positive at the time you get here can > > bloody well go negative immediately afterwards. Hit that with > > O_CREAT and you've got a bogus ENOENT... > > Hmm... OTOH, in that case you end up in step_into(), which will do the > right thing... > > How well does that series survive NFS client regression tests? > That's where I'd expect potentially subtle shite, what with short-circuited > ->d_revalidate() on the final pathwalk step in open()... Christian took in my v3 patch which is a bit different from this one. It seems to be doing fine in testing with NFS and otherwise. I don't think we short-circuit the d_revalidate though, do we? That version calls lookup_fast on the last component which should d_revalidate the last dentry before returning it.
> Christian took in my v3 patch which is a bit different from this one. > It seems to be doing fine in testing with NFS and otherwise. Every branch gets tested with nfs fstests (in addition to the usual suspects): Failures: generic/732 Failed 1 of 600 tests And that just fails because it's missing your 4fd042e0465c ("generic/732: don't run it on NFS")
On Wed, Aug 14, 2024 at 07:48:17AM -0400, Jeff Layton wrote: > On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote: > > On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote: > > > > > That's not the only problem; your "is it negative" test is inherently > > > racy in RCU mode. IOW, what is positive at the time you get here can > > > bloody well go negative immediately afterwards. Hit that with > > > O_CREAT and you've got a bogus ENOENT... > > > > Hmm... OTOH, in that case you end up in step_into(), which will do the > > right thing... > > > > How well does that series survive NFS client regression tests? > > That's where I'd expect potentially subtle shite, what with short-circuited > > ->d_revalidate() on the final pathwalk step in open()... > > Christian took in my v3 patch which is a bit different from this one. > It seems to be doing fine in testing with NFS and otherwise. > > I don't think we short-circuit the d_revalidate though, do we? That > version calls lookup_fast on the last component which should > d_revalidate the last dentry before returning it. It's not about a skipped call of ->d_revalidate(); it's about the NFS (especially NFS4) dances inside ->d_revalidate(), where it tries to cut down on roundtrips where possible. The interplay with ->atomic_open() and ->open() is subtle and I'm not sure that we do not depend upon the details of ->i_rwsem locking by fs/namei.c in there - proof of correctness used to be rather convoluted there, especially wrt the unhashing and rehashing aliases. I'm not saying that your changes break things in there, but that's one area where I would look for trouble. NFS has fairly extensive regression tests, and it would be a good idea to beat that patchset with those.
On Wed, Aug 14, 2024 at 02:40:23PM +0200, Christian Brauner wrote: > > Christian took in my v3 patch which is a bit different from this one. > > It seems to be doing fine in testing with NFS and otherwise. > > Every branch gets tested with nfs fstests (in addition to the usual > suspects): > > Failures: generic/732 > Failed 1 of 600 tests > > And that just fails because it's missing your 4fd042e0465c > ("generic/732: don't run it on NFS") connectathon would be more interesting...
On Wed, 2024-08-14 at 16:42 +0100, Al Viro wrote: > On Wed, Aug 14, 2024 at 07:48:17AM -0400, Jeff Layton wrote: > > On Wed, 2024-08-14 at 03:40 +0100, Al Viro wrote: > > > On Wed, Aug 14, 2024 at 03:18:17AM +0100, Al Viro wrote: > > > > > > > That's not the only problem; your "is it negative" test is > > > > inherently > > > > racy in RCU mode. IOW, what is positive at the time you get > > > > here can > > > > bloody well go negative immediately afterwards. Hit that with > > > > O_CREAT and you've got a bogus ENOENT... > > > > > > Hmm... OTOH, in that case you end up in step_into(), which will > > > do the > > > right thing... > > > > > > How well does that series survive NFS client regression > > > tests? > > > That's where I'd expect potentially subtle shite, what with > > > short-circuited > > > ->d_revalidate() on the final pathwalk step in open()... > > > > Christian took in my v3 patch which is a bit different from this > > one. > > It seems to be doing fine in testing with NFS and otherwise. > > > > I don't think we short-circuit the d_revalidate though, do we? That > > version calls lookup_fast on the last component which should > > d_revalidate the last dentry before returning it. > > It's not about a skipped call of ->d_revalidate(); it's about the NFS > (especially NFS4) dances inside ->d_revalidate(), where it tries to > cut down on roundtrips where possible. The interplay with - > >atomic_open() > and ->open() is subtle and I'm not sure that we do not depend upon > the > details of ->i_rwsem locking by fs/namei.c in there - proof of > correctness > used to be rather convoluted there, especially wrt the unhashing and > rehashing aliases. > > I'm not saying that your changes break things in there, but that's > one > area where I would look for trouble. NFS has fairly extensive > regression > tests, and it would be a good idea to beat that patchset with those. I've already run a bunch of NFS tests on it and it seems to be OK so far, but I'll keep testing it. My take: Opening an extant file with O_CREAT set should behave the same as with O_CREAT not set. I did crawl through NFS's d_revalidate functions. There are a couple of places that check for O_CREAT, but they didn't seem to depend on the i_rwsem or any particular locking. Please do let me know if you see anything I missed though. Thanks,
On Tue, Aug 6, 2024 at 10:47 PM Andi Kleen <ak@linux.intel.com> wrote: > > > Before I get to the vfs layer, there is a significant loss in the > > memory allocator because of memcg -- it takes several irq off/on trips > > for every alloc (needed to grab struct file *). I have a plan what to > > do with it (handle stuff with local cmpxchg (note no lock prefix)), > > which I'm trying to get around to. Apart from that you may note the > > allocator fast path performs a 16-byte cmpxchg, which is again dog > > slow and executes twice (once for the file obj, another time for the > > namei buffer). Someone(tm) should patch it up and I have some vague > > ideas, but 0 idea when I can take a serious stab. > > I just LBR sampled it on my skylake and it doesn't look > particularly slow. You see the whole massive block including CMPXCHG16 > gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely > a missing cache line. > > kmem_cache_free: > ffffffff9944ce20 nop %edi, %edx > ffffffff9944ce24 nopl %eax, (%rax,%rax,1) > ffffffff9944ce29 pushq %rbp > ffffffff9944ce2a mov %rdi, %rdx > ffffffff9944ce2d mov %rsp, %rbp > ffffffff9944ce30 pushq %r15 > ffffffff9944ce32 pushq %r14 > ffffffff9944ce34 pushq %r13 > ffffffff9944ce36 pushq %r12 > ffffffff9944ce38 mov $0x80000000, %r12d > ffffffff9944ce3e pushq %rbx > ffffffff9944ce3f mov %rsi, %rbx > ffffffff9944ce42 and $0xfffffffffffffff0, %rsp > ffffffff9944ce46 sub $0x10, %rsp > ffffffff9944ce4a movq %gs:0x28, %rax > ffffffff9944ce53 movq %rax, 0x8(%rsp) > ffffffff9944ce58 xor %eax, %eax > ffffffff9944ce5a add %rsi, %r12 > ffffffff9944ce5d jb 0xffffffff9944d1ea > ffffffff9944ce63 mov $0xffffffff80000000, %rax > ffffffff9944ce6a xor %r13d, %r13d > ffffffff9944ce6d subq 0x17b068c(%rip), %rax > ffffffff9944ce74 add %r12, %rax > ffffffff9944ce77 shr $0xc, %rax > ffffffff9944ce7b shl $0x6, %rax > ffffffff9944ce7f addq 0x17b066a(%rip), %rax > ffffffff9944ce86 movq 0x8(%rax), %rcx > ffffffff9944ce8a test $0x1, %cl > ffffffff9944ce8d jnz 0xffffffff9944d15c > ffffffff9944ce93 nopl %eax, (%rax,%rax,1) > ffffffff9944ce98 movq (%rax), %rcx > ffffffff9944ce9b and $0x8, %ch > ffffffff9944ce9e jz 0xffffffff9944cfea > ffffffff9944cea4 test %rax, %rax > ffffffff9944cea7 jz 0xffffffff9944cfea > ffffffff9944cead movq 0x8(%rax), %r14 > ffffffff9944ceb1 test %r14, %r14 > ffffffff9944ceb4 jz 0xffffffff9944cfac > ffffffff9944ceba cmp %r14, %rdx > ffffffff9944cebd jnz 0xffffffff9944d165 > ffffffff9944cec3 test %r14, %r14 > ffffffff9944cec6 jz 0xffffffff9944cfac > ffffffff9944cecc movq 0x8(%rbp), %r15 > ffffffff9944ced0 nopl %eax, (%rax,%rax,1) > ffffffff9944ced5 movq 0x1fe5134(%rip), %rax > ffffffff9944cedc test %r13, %r13 > ffffffff9944cedf jnz 0xffffffff9944ceef > ffffffff9944cee1 mov $0xffffffff80000000, %rax > ffffffff9944cee8 subq 0x17b0611(%rip), %rax > ffffffff9944ceef add %rax, %r12 > ffffffff9944cef2 shr $0xc, %r12 > ffffffff9944cef6 shl $0x6, %r12 > ffffffff9944cefa addq 0x17b05ef(%rip), %r12 > ffffffff9944cf01 movq 0x8(%r12), %rax > ffffffff9944cf06 mov %r12, %r13 > ffffffff9944cf09 test $0x1, %al > ffffffff9944cf0b jnz 0xffffffff9944d1b1 > ffffffff9944cf11 nopl %eax, (%rax,%rax,1) > ffffffff9944cf16 movq (%r13), %rax > ffffffff9944cf1a movq %rbx, (%rsp) > ffffffff9944cf1e test $0x8, %ah > ffffffff9944cf21 mov $0x0, %eax > ffffffff9944cf26 cmovz %rax, %r13 > ffffffff9944cf2a data16 nop > ffffffff9944cf2c movq 0x38(%r13), %r8 > ffffffff9944cf30 cmp $0x3, %r8 > ffffffff9944cf34 jnbe 0xffffffff9944d1ca > ffffffff9944cf3a nopl %eax, (%rax,%rax,1) > ffffffff9944cf3f movq 0x23d6f72(%rip), %rax > ffffffff9944cf46 mov %rbx, %rdx > ffffffff9944cf49 sub %rax, %rdx > ffffffff9944cf4c cmp $0x1fffff, %rdx > ffffffff9944cf53 jbe 0xffffffff9944d03a > ffffffff9944cf59 movq (%r14), %rax > ffffffff9944cf5c addq %gs:0x66bccab4(%rip), %rax > ffffffff9944cf64 movq 0x8(%rax), %rdx > ffffffff9944cf68 cmpq %r13, 0x10(%rax) > ffffffff9944cf6c jnz 0xffffffff9944d192 > ffffffff9944cf72 movl 0x28(%r14), %ecx > ffffffff9944cf76 movq (%rax), %rax > ffffffff9944cf79 add %rbx, %rcx > ffffffff9944cf7c cmp %rbx, %rax > ffffffff9944cf7f jz 0xffffffff9944d1ba > ffffffff9944cf85 movq 0xb8(%r14), %rsi > ffffffff9944cf8c mov %rcx, %rdi > ffffffff9944cf8f bswap %rdi > ffffffff9944cf92 xor %rax, %rsi > ffffffff9944cf95 xor %rdi, %rsi > ffffffff9944cf98 movq %rsi, (%rcx) > ffffffff9944cf9b leaq 0x2000(%rdx), %rcx > ffffffff9944cfa2 movq (%r14), %rsi > ffffffff9944cfa5 cmpxchg16bx %gs:(%rsi) > ffffffff9944cfaa jnz 0xffffffff9944cf59 > ffffffff9944cfac movq 0x8(%rsp), %rax > ffffffff9944cfb1 subq %gs:0x28, %rax > ffffffff9944cfba jnz 0xffffffff9944d1fc > ffffffff9944cfc0 leaq -0x28(%rbp), %rsp > ffffffff9944cfc4 popq %rbx > ffffffff9944cfc5 popq %r12 > ffffffff9944cfc7 popq %r13 > ffffffff9944cfc9 popq %r14 > ffffffff9944cfcb popq %r15 > ffffffff9944cfcd popq %rbp > ffffffff9944cfce retq # PRED 38 cycles [126] 2.74 IPC <------------- Sorry for late reply, my test box was temporarily unavailable and then I forgot about this e-mail :) I don't have a good scientific test(tm) and I don't think coming up with one is warranted at the moment. But to illustrate, I slapped together a test case for will-it-scale where I either cmpxchg8 or 16 in a loop. No lock prefix on these. On Sapphire Rapids I see well over twice the throughput for the 8-byte variant: # ./cmpxchg8_processes warmup min:481465497 max:481465497 total:481465497 min:464439645 max:464439645 total:464439645 min:461884735 max:461884735 total:461884735 min:460850043 max:460850043 total:460850043 min:461066452 max:461066452 total:461066452 min:463984473 max:463984473 total:463984473 measurement min:461317703 max:461317703 total:461317703 min:458608942 max:458608942 total:458608942 min:460846336 max:460846336 total:460846336 [snip] # ./cmpxchg16b_processes warmup min:205207128 max:205207128 total:205207128 min:205010535 max:205010535 total:205010535 min:204877781 max:204877781 total:204877781 min:204163814 max:204163814 total:204163814 min:204392000 max:204392000 total:204392000 min:204094222 max:204094222 total:204094222 measurement min:204243282 max:204243282 total:204243282 min:204136589 max:204136589 total:204136589 min:203504119 max:203504119 total:203504119 So I would say trying it out in a real alloc is worth looking at. Of course the 16-byte variant is not used just for kicks, so going to 8 bytes is more involved than just replacing the instruction. The current code follows the standard idea on how to deal with the ABA problem -- apart from replacing a pointer you validate this is what you thought by checking the counter in the same instruction. I note that in the kernel we can do better, but I don't have have all kinks worked out yet. The core idea builds on the fact that we can cheaply detect a pending alloc on the same cpu and should a conflicting free be executing from an interrupt, it can instead add the returning buffer to a different list and the aba problem disappears. Should the alloc fast path fail to find a free buffer, it can disable interrupts an take a look at the fallback list.
On Wed, Aug 14, 2024 at 04:44:19PM GMT, Al Viro wrote: > On Wed, Aug 14, 2024 at 02:40:23PM +0200, Christian Brauner wrote: > > > Christian took in my v3 patch which is a bit different from this one. > > > It seems to be doing fine in testing with NFS and otherwise. > > > > Every branch gets tested with nfs fstests (in addition to the usual > > suspects): > > > > Failures: generic/732 > > Failed 1 of 600 tests > > > > And that just fails because it's missing your 4fd042e0465c > > ("generic/732: don't run it on NFS") > > connectathon would be more interesting... Fwiw, I wasted almost a day to get more testing done here. But that connectathon thing at [1] just failed to compile on anything reasonably new and with errors that indicate that certain apis it uses have been deprecated for a very long time. In any case, I went out of my way and found that LTP has stresstests for NFS for creation, open, mkdir, unlink etc and I added them to testing now. [1]: https://github.com/dkruchinin/cthon-nfs-tests
diff --git a/fs/namei.c b/fs/namei.c index 1e05a0f3f04d..2d716fb114c9 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, return ERR_PTR(error); } +static inline bool trailing_slashes(struct nameidata *nd) +{ + return (bool)nd->last.name[nd->last.len]; +} + +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) +{ + struct dentry *dentry; + + if (open_flag & O_CREAT) { + /* Don't bother on an O_EXCL create */ + if (open_flag & O_EXCL) + return NULL; + + /* + * FIXME: If auditing is enabled, then we'll have to unlazy to + * use the dentry. For now, don't do this, since it shifts + * contention from parent's i_rwsem to its d_lockref spinlock. + * Reconsider this once dentry refcounting handles heavy + * contention better. + */ + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) + return NULL; + } + + if (trailing_slashes(nd)) + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; + + dentry = lookup_fast(nd); + + if (open_flag & O_CREAT) { + /* Discard negative dentries. Need inode_lock to do the create */ + if (dentry && !dentry->d_inode) { + if (!(nd->flags & LOOKUP_RCU)) + dput(dentry); + dentry = NULL; + } + } + return dentry; +} + static const char *open_last_lookups(struct nameidata *nd, struct file *file, const struct open_flags *op) { @@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd, return handle_dots(nd, nd->last_type); } + /* We _can_ be in RCU mode here */ + dentry = lookup_fast_for_open(nd, open_flag); + if (IS_ERR(dentry)) + return ERR_CAST(dentry); + if (!(open_flag & O_CREAT)) { - if (nd->last.name[nd->last.len]) - nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; - /* we _can_ be in RCU mode here */ - dentry = lookup_fast(nd); - if (IS_ERR(dentry)) - return ERR_CAST(dentry); if (likely(dentry)) goto finish_lookup; if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU)) return ERR_PTR(-ECHILD); } else { - /* create side of things */ if (nd->flags & LOOKUP_RCU) { + /* can stay in rcuwalk if not auditing */ + if (dentry && audit_dummy_context()) + goto check_slashes; if (!try_to_unlazy(nd)) return ERR_PTR(-ECHILD); } audit_inode(nd->name, dir, AUDIT_INODE_PARENT); - /* trailing slashes? */ - if (unlikely(nd->last.name[nd->last.len])) +check_slashes: + if (trailing_slashes(nd)) return ERR_PTR(-EISDIR); + if (dentry) + goto finish_lookup; } if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. I assume this was done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists. This patch rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether, and if auditing isn't enabled, it can stay in rcuwalk mode for the last step_into. One notable exception that is hopefully temporary: if we're doing an rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the dentry in that case is more expensive than taking the i_rwsem for now. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- Here's a revised patch that does a fast_lookup in the O_CREAT codepath too. The main difference here is that if a positive dentry is found and audit_dummy_context is true, then we keep the walk lazy for the last component, which avoids having to take any locks on the parent (just like with non-O_CREAT opens). The testcase below runs in about 18s on v6.10 (on an 80 CPU machine). With this patch, it runs in about 1s: #define _GNU_SOURCE 1 #include <stdio.h> #include <unistd.h> #include <errno.h> #include <fcntl.h> #include <stdlib.h> #include <sys/wait.h> #define PROCS 70 #define LOOPS 500000 static int openloop(int tnum) { char *file; int i, ret; ret = asprintf(&file, "./testfile%d", tnum); if (ret < 0) { printf("asprintf failed for proc %d", tnum); return 1; } for (i = 0; i < LOOPS; ++i) { int fd = open(file, O_RDWR|O_CREAT, 0644); if (fd < 0) { perror("open"); return 1; } close(fd); } unlink(file); free(file); return 0; } int main(int argc, char **argv) { pid_t kids[PROCS]; int i, ret = 0; for (i = 0; i < PROCS; ++i) { kids[i] = fork(); if (kids[i] > 0) return openloop(i); if (kids[i] < 0) perror("fork"); } for (i = 0; i < PROCS; ++i) { int ret2; if (kids[i] > 0) { wait(&ret2); if (ret2 != 0) ret = ret2; } } return ret; } --- Changes in v2: - drop the lockref patch since Mateusz is working on a better approach - add trailing_slashes helper function - add a lookup_fast_for_open helper function - make lookup_fast_for_open skip the lookup if auditing is enabled - if we find a positive dentry and auditing is disabled, don't unlazy - Link to v1: https://lore.kernel.org/r/20240802-openfast-v1-0-a1cff2a33063@kernel.org --- fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 53 insertions(+), 9 deletions(-) --- base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd change-id: 20240723-openfast-ac49a7b6ade2 Best regards,