Message ID | CAHk-=whJgRDtxTudTQ9HV8BFw5-bBsu+c8Ouwd_PrPqPB6_KEQ@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | generic_permission() optimization | expand |
On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote: > +static inline bool no_acl_inode(struct inode *inode) > +{ > +#ifdef CONFIG_FS_POSIX_ACL > + return likely(!READ_ONCE(inode->i_acl)); > +#else > + return true; > +#endif > +} Hmm... Shouldn't there be || !IS_POSIXACL(inode) in there?
On Wed, 30 Oct 2024 at 20:05, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote: > > > +static inline bool no_acl_inode(struct inode *inode) > > +{ > > +#ifdef CONFIG_FS_POSIX_ACL > > + return likely(!READ_ONCE(inode->i_acl)); > > +#else > > + return true; > > +#endif > > +} > > Hmm... Shouldn't there be || !IS_POSIXACL(inode) in there? I was going for "intentionally minimalistic". IOW, this was meant to be an optimization for a presumed common case, but fall back to the generic code quickly and simply. Put another way: is the !IS_POSIXACL() actually a common situation worth optimizing for? Do real people actually use "noacl"? My gut feel is that it was one of those mistakes that some random odd case is using, but not something worth really optimizing for. But maybe some common situation ends up using it even without "noacl" - like /proc, perhaps? I was kind of hoping that such cases would use 'cache_no_acl()' which makes that inode->i_acl be NULL. Wouldn't that be the right model anyway for !IS_POSIXACL()? Anyway, this is all obviously a "matter of tuning the optimization". And I don't actually know if it's even worth doing in the first place. From just the profiles I looked at, that make_vfsuid() conversion looked like a surprisingly big deal, but obviously this optimistic fast case wouldn't remove all such cases anyway. Linus
On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote: > So call me crazy, but due to entirely unrelated changes (the x86 > barrier_nospec optimization) I was doing some profiles to check that > everything looked fine. > > And just looking at kernel profiles, I noticed that > "generic_permission()" wasn't entirely in the noise. It was right > there along with strncpy_from_user() etc. Which is a bit surprising, > because it should be really cheap to check basic Unix permissions? > > It's all really just "acl_permission_check()" and that code is > actually fairly optimized, except that the whole > > vfsuid = i_uid_into_vfsuid(idmap, inode); > > to check whether we're the owner is *not* cheap. It causes a call to > make_vfsuid(), and it's just messy. I assume you ran your benchmark on baremetal so no containers or idmappings? I find this somewhat suprising. One thing to optimize here independent of your proposal would be to try and __always_inline make_vfsuid(). > > Which made me wonder: we already have code that says "if the Group and > Other permission bits are the same wrt the mask we are checking, don't > bother doing the expensive group checks". > > Why not extend that to "if any of the UGO choices are ok with the > permissions, why bother looking up ownership at all?" > > Now, there is one reason to look up the owner: POSIX ACL's don't > matter to owners, but do matter to others. > > But we could check for the cached case of "no POSIX ACLs" very > cheaply, and only do it for that case. > > IOW, a patch something like the attached. I think this should work (though I find the permission checking model in general to be an unruly mess - that's ignoring the fun hardening bits of it...). Can you send give me a proper SoB patch? Liberal comments in the code would be appreciated. I'd like to put this through LTP and some of the permission tests I've written over the years. > > No, I didn't really check whether it makes any difference at all. But > my gut feel is that a *lot* of permission checks succeed for any user, > with things like system directories are commonly drwxr-xr-x, so if you > want to check read or execute permissions, it really doesn't matter > whether you are the User, the Group, or Other. > > So thus the code does that > > unsigned int all; > all = mode & (mode >> 3); // combine g into o > all &= mode >> 6; // ... and u into o > > so now the low three bits of 'all' are the bits that *every* case has > set. And then > > if (!(mask & ~all & 7)) > return 0; > > basically says "if what we are asking for is not zero in any of those > bits, we're good". > > And it's entirely possible that I'm missing something silly and am > being just stupid. Somebody hit me with the clue bat if so. > > Linus
On Wed, 30 Oct 2024 at 20:42, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I was kind of hoping that such cases would use 'cache_no_acl()' which > makes that inode->i_acl be NULL. Wouldn't that be the right model > anyway for !IS_POSIXACL()? Alternatively, just initialize it to NULL in inode_init_always_gfp(), eg like - inode->i_acl = inode->i_default_acl = ACL_NOT_CACHED; + inode->i_acl = inode->i_default_acl = + (sb->s_flags & SB_POSIXACL) ? ACL_NOT_CACHED : NULL; or similar? Linus
On Thu, 31 Oct 2024 at 03:07, Christian Brauner <brauner@kernel.org> wrote: > > > It's all really just "acl_permission_check()" and that code is > > actually fairly optimized, except that the whole > > > > vfsuid = i_uid_into_vfsuid(idmap, inode); > > > > to check whether we're the owner is *not* cheap. It causes a call to > > make_vfsuid(), and it's just messy. > > I assume you ran your benchmark on baremetal so no containers or > idmappings? I find this somewhat suprising. Yes, I did too, this is the "simple" case for the whole uid mapping case, and I didn't expect it to be noticeable. That said, that function is just called a *LOT*. It's not just for each file opened, it's called for each path component as part of filename lookup, for that may_lookup_inode_permission -> do_inode_permission -> generic_permission -> acl_permission_check path. > One thing to optimize here > independent of your proposal would be to try and __always_inline > make_vfsuid(). Maybe. Part of the cost seems to be the call, but a bigger part seems to be the memory accesses around it with that whole inode->i_sb->s_user_ns chain to load it, and then current->cred->fsuid to compare against the result. Anyway, I'll play around with this a bit more and try to get better profiles. Linus
On Thu, 31 Oct 2024 at 09:04, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Maybe. Part of the cost seems to be the call, but a bigger part seems > to be the memory accesses around it with that whole > inode->i_sb->s_user_ns chain to load it, and then current->cred->fsuid > to compare against the result. > > Anyway, I'll play around with this a bit more and try to get better profiles. Ok, so I've done some more profiles, and yeah, the costs seem to be almost entirely just cache misses. make_vfsuid() shows up on the profile a bit too, but that seems to really be literally just "it's called a lot, and takes some I$ misses". When the profile looks like this: 10.71 │ push %rbx 15.44 │ mov %edx,%eax 7.88 │ mov %rdi,%rbx │ cmp $0xffffffff82532a00,%rdi 9.65 │ ↓ je 3a ... nothing ... 15.00 │ffffffff813493fa: pop %rbx 41.33 │ffffffff813493fb: → jmp ffffffff81bb5460 <__x86_return_thunk> then it really looks like cache misses and some slow indirect branch resolution for 'ret' (that __x86_return_thunk branch is misleading - it is rewritten at runtime, but 'perf report' shows the original object code). So some of it is presumably due to IBRS on this CPU, and that's a big part of make_vfsuid() in this bare-metal non-idmapped case. But the profile does clearly say that in generic_permission(), the problem is just the cache misses on the superblock accesses and all the extra work with the credential check, even when the mapping is the identity mapping. I still get a fair number of calls to make_vfsuid() even with my patch - I guess I'll have to look into why. This is my regular "full build of an already built kernel", which I *expected* to be mainly just a lot of fstat() calls by 'make' which I would have thought almost always hit the fast case. I might have messed something up. Linus
On Thu, Oct 31, 2024 at 08:14:59AM -1000, Linus Torvalds wrote: > On Wed, 30 Oct 2024 at 20:42, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > I was kind of hoping that such cases would use 'cache_no_acl()' which > > makes that inode->i_acl be NULL. Wouldn't that be the right model > > anyway for !IS_POSIXACL()? > > Alternatively, just initialize it to NULL in inode_init_always_gfp(), eg like > > - inode->i_acl = inode->i_default_acl = ACL_NOT_CACHED; > + inode->i_acl = inode->i_default_acl = > + (sb->s_flags & SB_POSIXACL) ? ACL_NOT_CACHED : NULL; IIRC, the reason we do not do that was the possibility of mount -o remount,acl Not that it makes much sense, but it's currently supported and POSIX ACLs don't make much sense to start with... Anyway, patch looks sane; I still think that adding || !IS_POSIXACL(inode) wouldn't hurt (yes, it's a dereference of ->i_sb in case when ->i_acl is ACL_NOT_CACHED, but we are going to dereference it shortly after in case we don't take the fast path. OTOH, that probably matters only for fairly specific loads - massive accesses to procfs and sysfs, mostly.
On Thu, 31 Oct 2024 at 12:02, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I still get a fair number of calls to make_vfsuid() even with my patch > - I guess I'll have to look into why. This is my regular "full build > of an already built kernel", which I *expected* to be mainly just a > lot of fstat() calls by 'make' which I would have thought almost > always hit the fast case. > > I might have messed something up. Added some stats, and on my load (reading email in the web browser, some xterms and running an allmodconfig kernel build), I get about a 45% hit-rate for the fast-case: out of 44M calls to generic_permission(), about 20M hit the fast-case path. So it's noticeable, but not *quite* as noticeable as I would have hoped for. I suspect there are a fair number of owner calls for write, and then because permissions are 0644, the code actually has to check the owner. Linus
On Thu, 31 Oct 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Anyway, patch looks sane; I still think that adding || !IS_POSIXACL(inode) > wouldn't hurt (yes, it's a dereference of ->i_sb in case when ->i_acl > is ACL_NOT_CACHED, but we are going to dereference it shortly after > in case we don't take the fast path. OTOH, that probably matters only > for fairly specific loads - massive accesses to procfs and sysfs, mostly. Yeah, so the reason I'd like to avoid it is exactly that the i_sb accesses are the ones that show up in profiles. So I'd rather start with just the cheap inode-only "ACL is clearly not there" check, and later if we find that the ACL_NOT_CACHED case is problematic do we look at that. Linus
On Thu, 31 Oct 2024 at 12:34, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So I'd rather start with just the cheap inode-only "ACL is clearly not > there" check, and later if we find that the ACL_NOT_CACHED case is > problematic do we look at that. Actually, if I switch the tests around so that I do the permission bit check first, it becomes very natural to just check IS_POSIXACL() at the end (where we're about to go to the slow case, which will be touching i_sb anyway). Plus I can actually improve code generation by not shifting the mode bits down into the low bits, but instead spreading the requested permission bits around. The "spread bits around" becomes a simple constant multiplication with just three bits set, and the compiler will actually generate much better code (you can do it with two consecutive 'lea' instructions). The expression for this ends up looking a bit like line noise, so a comment explaining each step is a good idea. IOW, here's a rewritten patch that does it that way around, and thus deals with IS_POSIXACL() very naturally and seems to generate quite good code for me. Of course, while I actually tested the previous version after sending it out, this new version is entirely untested. Again. Linus
On Thu, Oct 31, 2024 at 03:17:18PM -1000, Linus Torvalds wrote: > On Thu, 31 Oct 2024 at 12:34, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So I'd rather start with just the cheap inode-only "ACL is clearly not > > there" check, and later if we find that the ACL_NOT_CACHED case is > > problematic do we look at that. > > Actually, if I switch the tests around so that I do the permission bit > check first, it becomes very natural to just check IS_POSIXACL() at > the end (where we're about to go to the slow case, which will be > touching i_sb anyway). > > Plus I can actually improve code generation by not shifting the mode > bits down into the low bits, but instead spreading the requested > permission bits around. > > The "spread bits around" becomes a simple constant multiplication with > just three bits set, and the compiler will actually generate much > better code (you can do it with two consecutive 'lea' instructions). > > The expression for this ends up looking a bit like line noise, so a > comment explaining each step is a good idea. > > IOW, here's a rewritten patch that does it that way around, and thus > deals with IS_POSIXACL() very naturally and seems to generate quite > good code for me. Acked-by: Al Viro <viro@zeniv.linux.org.uk>
fs/namei.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/fs/namei.c b/fs/namei.c index 4a4a22a08ac2..6aeabde0ec9f 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -326,6 +326,25 @@ static int check_acl(struct mnt_idmap *idmap, return -EAGAIN; } +/* + * Very quick optimistic "we know we have no ACL's" check. + * + * Note that this is purely for ACL_TYPE_ACCESS, and purely + * for the "we have cached that there are no ACLs" case. + * + * If this returns true, we know there are no ACLs. But if + * it returns false, we might still not have ACLs (it could + * be the is_uncached_acl() case). + */ +static inline bool no_acl_inode(struct inode *inode) +{ +#ifdef CONFIG_FS_POSIX_ACL + return likely(!READ_ONCE(inode->i_acl)); +#else + return true; +#endif +} + /** * acl_permission_check - perform basic UNIX permission checking * @idmap: idmap of the mount the inode was found from @@ -348,6 +367,19 @@ static int acl_permission_check(struct mnt_idmap *idmap, unsigned int mode = inode->i_mode; vfsuid_t vfsuid; + /* + * Common cheap case: everybody has the requested + * rights, and there are no ACLs to check. No need + * to do any owner/group checks. + */ + if (no_acl_inode(inode)) { + unsigned int all; + all = mode & (mode >> 3); // combine g into o + all &= mode >> 6; // ... and u into o + if (!(mask & ~all & 7)) + return 0; + } + /* Are we the owner? If so, ACL's don't matter */ vfsuid = i_uid_into_vfsuid(idmap, inode); if (likely(vfsuid_eq_kuid(vfsuid, current_fsuid()))) {