diff mbox series

generic_permission() optimization

Message ID CAHk-=whJgRDtxTudTQ9HV8BFw5-bBsu+c8Ouwd_PrPqPB6_KEQ@mail.gmail.com (mailing list archive)
State New
Headers show
Series generic_permission() optimization | expand

Commit Message

Linus Torvalds Oct. 31, 2024, 4:16 a.m. UTC
So call me crazy, but due to entirely unrelated changes (the x86
barrier_nospec optimization) I was doing some profiles to check that
everything looked fine.

And just looking at kernel profiles, I noticed that
"generic_permission()" wasn't entirely in the noise. It was right
there along with strncpy_from_user() etc. Which is a bit surprising,
because it should be really cheap to check basic Unix permissions?

It's all really just "acl_permission_check()" and that code is
actually fairly optimized, except that the whole

        vfsuid = i_uid_into_vfsuid(idmap, inode);

to check whether we're the owner is *not* cheap. It causes a call to
make_vfsuid(), and it's just messy.

Which made me wonder: we already have code that says "if the Group and
Other permission bits are the same wrt the mask we are checking, don't
bother doing the expensive group checks".

Why not extend that to "if any of the UGO choices are ok with the
permissions, why bother looking up ownership at all?"

Now, there is one reason to look up the owner: POSIX ACL's don't
matter to owners, but do matter to others.

But we could check for the cached case of "no POSIX ACLs" very
cheaply, and only do it for that case.

IOW, a patch something like the attached.

No, I didn't really check whether it makes any difference at all. But
my gut feel is that a *lot* of permission checks succeed for any user,
with things like system directories are commonly drwxr-xr-x, so if you
want to check read or execute permissions, it really doesn't matter
whether you are the User, the Group, or Other.

So thus the code does that

        unsigned int all;
        all = mode & (mode >> 3); // combine g into o
        all &= mode >> 6;         // ... and u into o

so now the low three bits of 'all' are the bits that *every* case has
set. And then

        if (!(mask & ~all & 7))
                return 0;

basically says "if what we are asking for is not zero in any of those
bits, we're good".

And it's entirely possible that I'm missing something silly and am
being just stupid. Somebody hit me with the clue bat if so.

               Linus

Comments

Al Viro Oct. 31, 2024, 6:05 a.m. UTC | #1
On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote:

> +static inline bool no_acl_inode(struct inode *inode)
> +{
> +#ifdef CONFIG_FS_POSIX_ACL
> +	return likely(!READ_ONCE(inode->i_acl));
> +#else
> +	return true;
> +#endif
> +}

Hmm... Shouldn't there be || !IS_POSIXACL(inode) in there?
Linus Torvalds Oct. 31, 2024, 6:42 a.m. UTC | #2
On Wed, 30 Oct 2024 at 20:05, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote:
>
> > +static inline bool no_acl_inode(struct inode *inode)
> > +{
> > +#ifdef CONFIG_FS_POSIX_ACL
> > +     return likely(!READ_ONCE(inode->i_acl));
> > +#else
> > +     return true;
> > +#endif
> > +}
>
> Hmm... Shouldn't there be || !IS_POSIXACL(inode) in there?

I was going for "intentionally minimalistic". IOW, this was meant to
be an optimization for a presumed common case, but fall back to the
generic code quickly and simply.

Put another way: is the !IS_POSIXACL() actually a common situation
worth optimizing for?

Do real people actually use "noacl"? My gut feel is that it was one of
those mistakes that some random odd case is using, but not something
worth really optimizing for.

But maybe some common situation ends up using it even without "noacl"
- like /proc, perhaps?

I was kind of hoping that such cases would use 'cache_no_acl()' which
makes that inode->i_acl be NULL. Wouldn't that be the right model
anyway for !IS_POSIXACL()?

Anyway, this is all obviously a "matter of tuning the optimization".
And I don't actually know if it's even worth doing in the first place.

From just the profiles I looked at, that make_vfsuid() conversion
looked like a surprisingly big deal, but obviously this optimistic
fast case wouldn't remove all such cases anyway.

               Linus
Christian Brauner Oct. 31, 2024, 1:02 p.m. UTC | #3
On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote:
> So call me crazy, but due to entirely unrelated changes (the x86
> barrier_nospec optimization) I was doing some profiles to check that
> everything looked fine.
> 
> And just looking at kernel profiles, I noticed that
> "generic_permission()" wasn't entirely in the noise. It was right
> there along with strncpy_from_user() etc. Which is a bit surprising,
> because it should be really cheap to check basic Unix permissions?
> 
> It's all really just "acl_permission_check()" and that code is
> actually fairly optimized, except that the whole
> 
>         vfsuid = i_uid_into_vfsuid(idmap, inode);
> 
> to check whether we're the owner is *not* cheap. It causes a call to
> make_vfsuid(), and it's just messy.

I assume you ran your benchmark on baremetal so no containers or
idmappings? I find this somewhat suprising. One thing to optimize here
independent of your proposal would be to try and __always_inline
make_vfsuid().

> 
> Which made me wonder: we already have code that says "if the Group and
> Other permission bits are the same wrt the mask we are checking, don't
> bother doing the expensive group checks".
> 
> Why not extend that to "if any of the UGO choices are ok with the
> permissions, why bother looking up ownership at all?"
> 
> Now, there is one reason to look up the owner: POSIX ACL's don't
> matter to owners, but do matter to others.
> 
> But we could check for the cached case of "no POSIX ACLs" very
> cheaply, and only do it for that case.
> 
> IOW, a patch something like the attached.

I think this should work (though I find the permission checking model in
general to be an unruly mess - that's ignoring the fun hardening bits of
it...).

Can you send give me a proper SoB patch? Liberal comments in the code
would be appreciated. I'd like to put this through LTP and some of the
permission tests I've written over the years.

> 
> No, I didn't really check whether it makes any difference at all. But
> my gut feel is that a *lot* of permission checks succeed for any user,
> with things like system directories are commonly drwxr-xr-x, so if you
> want to check read or execute permissions, it really doesn't matter
> whether you are the User, the Group, or Other.
> 
> So thus the code does that
> 
>         unsigned int all;
>         all = mode & (mode >> 3); // combine g into o
>         all &= mode >> 6;         // ... and u into o
> 
> so now the low three bits of 'all' are the bits that *every* case has
> set. And then
> 
>         if (!(mask & ~all & 7))
>                 return 0;
> 
> basically says "if what we are asking for is not zero in any of those
> bits, we're good".
> 
> And it's entirely possible that I'm missing something silly and am
> being just stupid. Somebody hit me with the clue bat if so.
> 
>                Linus
Linus Torvalds Oct. 31, 2024, 6:14 p.m. UTC | #4
On Wed, 30 Oct 2024 at 20:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I was kind of hoping that such cases would use 'cache_no_acl()' which
> makes that inode->i_acl be NULL. Wouldn't that be the right model
> anyway for !IS_POSIXACL()?

Alternatively, just initialize it to NULL in inode_init_always_gfp(), eg like

-       inode->i_acl = inode->i_default_acl = ACL_NOT_CACHED;
+       inode->i_acl = inode->i_default_acl =
+               (sb->s_flags & SB_POSIXACL) ? ACL_NOT_CACHED : NULL;

or similar?

            Linus
Linus Torvalds Oct. 31, 2024, 7:04 p.m. UTC | #5
On Thu, 31 Oct 2024 at 03:07, Christian Brauner <brauner@kernel.org> wrote:
>
> > It's all really just "acl_permission_check()" and that code is
> > actually fairly optimized, except that the whole
> >
> >         vfsuid = i_uid_into_vfsuid(idmap, inode);
> >
> > to check whether we're the owner is *not* cheap. It causes a call to
> > make_vfsuid(), and it's just messy.
>
> I assume you ran your benchmark on baremetal so no containers or
> idmappings? I find this somewhat suprising.

Yes, I did too, this is the "simple" case for the whole uid mapping
case, and I didn't expect it to be noticeable.

That said, that function is just called a *LOT*. It's not just for
each file opened, it's called for each path component as part of
filename lookup, for that may_lookup_inode_permission ->
do_inode_permission -> generic_permission -> acl_permission_check
path.

>          One thing to optimize here
> independent of your proposal would be to try and __always_inline
> make_vfsuid().

Maybe. Part of the cost seems to be the call, but a bigger part seems
to be the memory accesses around it with that whole
inode->i_sb->s_user_ns chain to load it, and then current->cred->fsuid
to compare against the result.

Anyway, I'll play around with this a bit more and try to get better profiles.

                Linus
Linus Torvalds Oct. 31, 2024, 10:02 p.m. UTC | #6
On Thu, 31 Oct 2024 at 09:04, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Maybe. Part of the cost seems to be the call, but a bigger part seems
> to be the memory accesses around it with that whole
> inode->i_sb->s_user_ns chain to load it, and then current->cred->fsuid
> to compare against the result.
>
> Anyway, I'll play around with this a bit more and try to get better profiles.

Ok, so I've done some more profiles, and yeah, the costs seem to be
almost entirely just cache misses.

make_vfsuid() shows up on the profile a bit too, but that seems to
really be literally just "it's called a lot, and takes some I$
misses".

When the profile looks like this:

 10.71 │      push %rbx
 15.44 │      mov  %edx,%eax
  7.88 │      mov  %rdi,%rbx
       │      cmp  $0xffffffff82532a00,%rdi
  9.65 │    ↓ je   3a
... nothing ...
 15.00 │ffffffff813493fa:   pop  %rbx
 41.33 │ffffffff813493fb: → jmp  ffffffff81bb5460 <__x86_return_thunk>

then it really looks like cache misses and some slow indirect branch
resolution for 'ret' (that __x86_return_thunk branch is misleading -
it is rewritten at runtime, but 'perf report' shows the original
object code).

So some of it is presumably due to IBRS on this CPU, and that's a big
part of make_vfsuid() in this bare-metal non-idmapped case. But the
profile does clearly say that in generic_permission(), the problem is
just the cache misses on the superblock accesses and all the extra
work with the credential check, even when the mapping is the identity
mapping.

I still get a fair number of calls to make_vfsuid() even with my patch
- I guess I'll have to look into why. This is my regular "full build
of an already built kernel", which I *expected* to be mainly just a
lot of fstat() calls by 'make' which I would have thought almost
always hit the fast case.

I might have messed something up.

               Linus
Al Viro Oct. 31, 2024, 10:28 p.m. UTC | #7
On Thu, Oct 31, 2024 at 08:14:59AM -1000, Linus Torvalds wrote:
> On Wed, 30 Oct 2024 at 20:42, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I was kind of hoping that such cases would use 'cache_no_acl()' which
> > makes that inode->i_acl be NULL. Wouldn't that be the right model
> > anyway for !IS_POSIXACL()?
> 
> Alternatively, just initialize it to NULL in inode_init_always_gfp(), eg like
> 
> -       inode->i_acl = inode->i_default_acl = ACL_NOT_CACHED;
> +       inode->i_acl = inode->i_default_acl =
> +               (sb->s_flags & SB_POSIXACL) ? ACL_NOT_CACHED : NULL;

IIRC, the reason we do not do that was the possibility of mount -o remount,acl
Not that it makes much sense, but it's currently supported and POSIX ACLs
don't make much sense to start with...

Anyway, patch looks sane; I still think that adding || !IS_POSIXACL(inode)
wouldn't hurt (yes, it's a dereference of ->i_sb in case when ->i_acl
is ACL_NOT_CACHED, but we are going to dereference it shortly after
in case we don't take the fast path.  OTOH, that probably matters only
for fairly specific loads - massive accesses to procfs and sysfs, mostly.
Linus Torvalds Oct. 31, 2024, 10:31 p.m. UTC | #8
On Thu, 31 Oct 2024 at 12:02, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I still get a fair number of calls to make_vfsuid() even with my patch
> - I guess I'll have to look into why. This is my regular "full build
> of an already built kernel", which I *expected* to be mainly just a
> lot of fstat() calls by 'make' which I would have thought almost
> always hit the fast case.
>
> I might have messed something up.

Added some stats, and on my load (reading email in the web browser,
some xterms and running an allmodconfig kernel build), I get about a
45% hit-rate for the fast-case: out of 44M calls to
generic_permission(), about 20M hit the fast-case path.

So it's noticeable, but not *quite* as noticeable as I would have
hoped for. I suspect there are a fair number of owner calls for write,
and then because permissions are 0644, the code actually has to check
the owner.

              Linus
Linus Torvalds Oct. 31, 2024, 10:34 p.m. UTC | #9
On Thu, 31 Oct 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Anyway, patch looks sane; I still think that adding || !IS_POSIXACL(inode)
> wouldn't hurt (yes, it's a dereference of ->i_sb in case when ->i_acl
> is ACL_NOT_CACHED, but we are going to dereference it shortly after
> in case we don't take the fast path.  OTOH, that probably matters only
> for fairly specific loads - massive accesses to procfs and sysfs, mostly.

Yeah, so the reason I'd like to avoid it is exactly that the i_sb
accesses are the ones that show up in profiles.

So I'd rather start with just the cheap inode-only "ACL is clearly not
there" check, and later if we find that the ACL_NOT_CACHED case is
problematic do we look at that.

           Linus
diff mbox series

Patch

 fs/namei.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 4a4a22a08ac2..6aeabde0ec9f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -326,6 +326,25 @@  static int check_acl(struct mnt_idmap *idmap,
 	return -EAGAIN;
 }
 
+/*
+ * Very quick optimistic "we know we have no ACL's" check.
+ *
+ * Note that this is purely for ACL_TYPE_ACCESS, and purely
+ * for the "we have cached that there are no ACLs" case.
+ *
+ * If this returns true, we know there are no ACLs. But if
+ * it returns false, we might still not have ACLs (it could
+ * be the is_uncached_acl() case).
+ */
+static inline bool no_acl_inode(struct inode *inode)
+{
+#ifdef CONFIG_FS_POSIX_ACL
+	return likely(!READ_ONCE(inode->i_acl));
+#else
+	return true;
+#endif
+}
+
 /**
  * acl_permission_check - perform basic UNIX permission checking
  * @idmap:	idmap of the mount the inode was found from
@@ -348,6 +367,19 @@  static int acl_permission_check(struct mnt_idmap *idmap,
 	unsigned int mode = inode->i_mode;
 	vfsuid_t vfsuid;
 
+	/*
+	 * Common cheap case: everybody has the requested
+	 * rights, and there are no ACLs to check. No need
+	 * to do any owner/group checks.
+	 */
+	if (no_acl_inode(inode)) {
+		unsigned int all;
+		all = mode & (mode >> 3); // combine g into o
+		all &= mode >> 6;	  // ... and u into o
+		if (!(mask & ~all & 7))
+			return 0;
+	}
+
 	/* Are we the owner? If so, ACL's don't matter */
 	vfsuid = i_uid_into_vfsuid(idmap, inode);
 	if (likely(vfsuid_eq_kuid(vfsuid, current_fsuid()))) {