diff mbox series

Looking at profile data once again - avc lookup

Message ID CAHk-=whR4KSGqfEodXMwOMdQBt+V2HHMyz6+CobiydnZE+Vq9Q@mail.gmail.com (mailing list archive)
State Superseded
Delegated to: Paul Moore
Headers show
Series Looking at profile data once again - avc lookup | expand

Commit Message

Linus Torvalds Jan. 28, 2023, 9:15 p.m. UTC
So I happened to do my occasional "let's see what a profile for an
empty kernel build is", which usually highlights some silly thing we
do in pathname lookup (because most of the cost is just 'make' doing
various string handling in user space, and doing variations of
'lstat()' to look at file timestamps).

And, as always, avc_has_perm() and avc_has_perm_noaudit() are pretty
high up there, together with selinux_inode_permission(). It just gets
called a *lot*.

Now, avc_has_perm_noaudit() should be inlined, but it turns out that
it isn't because of some bad behavior of the unlikely path. That part
looks fairly easy to massage away. I'm attaching a largely untested
patch that makes the generated code look a bit better, in case anybody
wants to look at it.

But the real reason for this email is to query about 'selinux_state'.

We pass that around as a pointer quite a bit, to the point where
several function calls have to use stack space for arguments just
because they have more than six arguments. And from what I can tell,
it is *always* just a pointer to 'selinux_state', and never anything
else.

Some cases are obvious:

    git grep -w avc_has_perm |
        grep -v 'avc_has_perm(&selinux_state,'

ie every _single_ callers of 'avc_has_perm()' just passes in a pointer
to that single selinux_state thing.

Some cases are a bit less obvious, like security_get_user_sids(),
which has sel_write_user() do

        struct selinux_state *state = fsi->state;
        ..
                rc = security_sid_to_context(state, sids[i], &newcon, &len);

and then in turn it passes that to avc_has_perm_noaudit(), so it looks
like we might actually have multiple states.

The emphasis here being on "looks like", because it turns out that the
only thing that assigns 'fsi->state' is selinux_fs_info_create(),
which does

        fsi->state = &selinux_state;

oh, look, it's that same &selinux_state again!

In general, I really think there is just one single

        struct selinux_state selinux_state;

declared in security/selinux/hooks.c, and everybody just pointlessly
passes in a pointer to that single thing.

This was all done five years ago by commit aa8e712cee93 ("selinux:
wrap global selinux state") and I don't see the point. It never seems
to have gone anywhere, there's no other state that I can find, and all
it does is add overhead and complexity.

So I'd like to undo that, and go back to the good old days when we
didn't waste time and effort on passing a pointer that always has the
same value as an argument.

Comments? Is there some case I've missed?

               Linus

Comments

Paul Moore Jan. 28, 2023, 10:33 p.m. UTC | #1
On Sat, Jan 28, 2023 at 4:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I happened to do my occasional "let's see what a profile for an
> empty kernel build is", which usually highlights some silly thing we
> do in pathname lookup (because most of the cost is just 'make' doing
> various string handling in user space, and doing variations of
> 'lstat()' to look at file timestamps).
>
> And, as always, avc_has_perm() and avc_has_perm_noaudit() are pretty
> high up there, together with selinux_inode_permission(). It just gets
> called a *lot*.
>
> Now, avc_has_perm_noaudit() should be inlined, but it turns out that
> it isn't because of some bad behavior of the unlikely path. That part
> looks fairly easy to massage away. I'm attaching a largely untested
> patch that makes the generated code look a bit better, in case anybody
> wants to look at it.

I'll take a look, although just a heads-up that I don't generally
merge patches into selinux/next at this point in the -rc cycle unless
they are bug fixes, or some other critical patch; it's likely this
will need to wait until after the upcoming merge window closes.

> But the real reason for this email is to query about 'selinux_state'.
>
> We pass that around as a pointer quite a bit, to the point where
> several function calls have to use stack space for arguments just
> because they have more than six arguments. And from what I can tell,
> it is *always* just a pointer to 'selinux_state', and never anything
> else.

We can, and should, look at those cases where the function parameters
start to spill into the stack space.

> This was all done five years ago by commit aa8e712cee93 ("selinux:
> wrap global selinux state") and I don't see the point. It never seems
> to have gone anywhere, there's no other state that I can find, and all
> it does is add overhead and complexity.
>
> So I'd like to undo that, and go back to the good old days when we
> didn't waste time and effort on passing a pointer that always has the
> same value as an argument.
>
> Comments? Is there some case I've missed?

You're correct in that selinux_state parameters currently always point
back to the single global instance, however there was, and still is, a
point to that patch ... although I will admit it is a long time
coming.  The purpose of this change was to help pave the way for
namespacing SELinux by easing development and helping to reduce
ongoing merge conflicts with the in-development patches.  For a
variety of reasons the namespacing work has languished a bit, but it
hasn't been forgotten and I expect it to gain importance once the LSM
stacking patches land (LSM stacking alone isn't going to solve all of
the problems that many are expecting, it will need to be combined with
LSM-specific namespacing).
Linus Torvalds Jan. 29, 2023, 7:36 p.m. UTC | #2
On Sat, Jan 28, 2023 at 2:33 PM Paul Moore <paul@paul-moore.com> wrote:
>
> I'll take a look, although just a heads-up that I don't generally
> merge patches into selinux/next at this point in the -rc cycle unless
> they are bug fixes, or some other critical patch; it's likely this
> will need to wait until after the upcoming merge window closes.

Yeah, that patch was not some kind of "please apply this urgent fix",
more of a "I'm looking at path walking again, and the selinux code is
more expensive than the *actual* path walk is" heads up.

> > Comments? Is there some case I've missed?
>
> You're correct in that selinux_state parameters currently always point
> back to the single global instance, however there was, and still is, a
> point to that patch ... although I will admit it is a long time
> coming.

Honestly, considering that the selinux code is literally more
expensive than THE REAL WORKLOAD it is checking, I really want people
to take a second look.

If some new feature makes that crazy-expensive thing *worse*, we have issues.

If it's been that way for five years with no progress, and no clear
indication that it's even some high-priority issue that lots of people
are asking for, maybe that should be a big hint.

                 Linus
Paul Moore Jan. 30, 2023, 5:14 p.m. UTC | #3
On Sun, Jan 29, 2023 at 2:37 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Jan 28, 2023 at 2:33 PM Paul Moore <paul@paul-moore.com> wrote:
> >
> > I'll take a look, although just a heads-up that I don't generally
> > merge patches into selinux/next at this point in the -rc cycle unless
> > they are bug fixes, or some other critical patch; it's likely this
> > will need to wait until after the upcoming merge window closes.
>
> Yeah, that patch was not some kind of "please apply this urgent fix",
> more of a "I'm looking at path walking again, and the selinux code is
> more expensive than the *actual* path walk is" heads up.

Yep, just wanted to set expectations so you wouldn't be surprised to
not see this during the upcoming merge window.

> > > Comments? Is there some case I've missed?
> >
> > You're correct in that selinux_state parameters currently always point
> > back to the single global instance, however there was, and still is, a
> > point to that patch ... although I will admit it is a long time
> > coming.
>
> Honestly, considering that the selinux code is literally more
> expensive than THE REAL WORKLOAD it is checking, I really want people
> to take a second look.

WE WILL

> If some new feature makes that crazy-expensive thing *worse*, we have issues.
>
> If it's been that way for five years with no progress, and no clear
> indication that it's even some high-priority issue that lots of people
> are asking for, maybe that should be a big hint.

To be fair, people *are* asking SELinux namespacing, but there are
some very thorny problems that remain unsolved.  However, after the
merge window we should consider moving away from passing the
selinux_state as a parameter and just using it as a global resource.
Paul Moore Jan. 30, 2023, 5:46 p.m. UTC | #4
On Mon, Jan 30, 2023 at 12:14 PM Paul Moore <paul@paul-moore.com> wrote:
> On Sun, Jan 29, 2023 at 2:37 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Sat, Jan 28, 2023 at 2:33 PM Paul Moore <paul@paul-moore.com> wrote:
> > Honestly, considering that the selinux code is literally more
> > expensive than THE REAL WORKLOAD it is checking, I really want people
> > to take a second look.
>
> WE WILL

I should add, do you have any particular test script you use?  If not,
that's fine, I just cobble something together, but I figured if you
had something already it would save me having to remember the details
on the perf tools.
Linus Torvalds Jan. 30, 2023, 6:35 p.m. UTC | #5
On Mon, Jan 30, 2023 at 9:47 AM Paul Moore <paul@paul-moore.com> wrote:
>
> I should add, do you have any particular test script you use?  If not,
> that's fine, I just cobble something together, but I figured if you
> had something already it would save me having to remember the details
> on the perf tools.

So I've done various things over the years, including just writing
special tools that do nothing but a recursive 'stat()' over and over
again over a big tree, just to pinpoint the path lookup costs (big
enough of a tree that you see actual cache effects). Then do that
either single-threaded or multi-threaded to see the locking issues.

But what I keep coming back to is to just have a fully built "make
allmodconfig" tree - which I have _anyway_, and then doing

     perf record -e cycles:pp make -j64

on it. You'll need to do something like

    echo 1 > /proc/sys/kernel/perf_event_paranoid

before starting your profiling session to make it possible to do that
profile as a normal user and get the kernel data.

And then look at the end result with just

    perf report --sort=dso,symbol

which avoids sorting by process, because I don't care _which_ process
does something, I just want to see the kernel symbol table end
results. Press 'k' to zoom into just the kernel profile, and Bob's
your uncle.

You can play with callchain data ("-g"), but I tend to like the plain
flat profile to just see where things are happening. I'll do the call
chain if I then start to look into things like "which caller was the
main reason for that queued_spin_lock_slowpath cost" kinds of things,
but it's not always even necessary.

Unless you have some big kernel debugging options on, what you
normally see for that load would be

 - memset and memcpy (including very much our user-space versions of
it, like clear_page_rep and copy_user_generic_string).

 - depending on number of CPU cores, locking (I *despise*
folio_memcg_lock, but that's not from the pathname lookup, it's the
page fault path, particularly WP faults)

 - page table setup and clearing

 - ... and finally, pathname walking, generally with
selinux_inode_permission and avc_has_perm_*() fairly high up

So it's not like 'make' is dominated by pathname walking - the process
related stuff tends to be higher - but I've ended up using that as a
kernel profile source because it's a real load for me.

Also, most of the time by far is spent in 'make' doing various string
things in user space. Our kernel makefiles tend to have a lot of
symbol expansion etc. I just ignore all the user space stuff.

There are other loads I occasionally look at, but this is basically
the one I always tend to return to because it tends to stress the two
things I personally end up interested in - the VFS layer and the VM
code. I don't really tend to do IO etc.

Put another way: there's nothing _special_ about the above, except for
the "it's a real load that does actually show a few core kernel
areas".

Also, the above is just about the least fancy use of prof you'll ever
see. No events, no special hardware counters for things like cache
misses or anything, just plain old "where does the time go". I do end
up looking at the annotated assembly code (press 'a' on the selected
symbol), but it's worth noting that even with hardware profiling
(Intel: PEBS, AMD: IBS), saying "exactly where did we spend time" is a
pretty ambiguous thing on modern OoO cores - you have to interpret the
data by just seeing lots of it. But usually you can see "hot loop
here", "mispredict there" or "that load is taking cache misses", so
the instruction-level profiles do need to be taken with a huge grain
of salt and some experience with that microarchitecture to really make
sense ot them.

                   Linus
Paul Moore Jan. 30, 2023, 7:54 p.m. UTC | #6
On Mon, Jan 30, 2023 at 1:35 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 30, 2023 at 9:47 AM Paul Moore <paul@paul-moore.com> wrote:
> >
> > I should add, do you have any particular test script you use?  If not,
> > that's fine, I just cobble something together, but I figured if you
> > had something already it would save me having to remember the details
> > on the perf tools.
>
> So I've done various things over the years, including just writing
> special tools that do nothing but a recursive 'stat()' over and over
> again over a big tree, just to pinpoint the path lookup costs (big
> enough of a tree that you see actual cache effects). Then do that
> either single-threaded or multi-threaded to see the locking issues.
>
> But what I keep coming back to is to just have a fully built "make
> allmodconfig" tree - which I have _anyway_, and then doing
>
>      perf record -e cycles:pp make -j64
>

...

> Put another way: there's nothing _special_ about the above, except for
> the "it's a real load that does actually show a few core kernel
> areas".

Thanks.
diff mbox series

Patch

From 50d32ecb066a820146f5dc441185c0e03c11b3e5 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 28 Jan 2023 12:53:26 -0800
Subject: [PATCH] selinux: uninline unlikely parts of avc_has_perm_noaudit()

avc_has_perm_noaudit()is one of those hot functions that end up being
used by almost all filesystem operations (through "avc_has_perm()") and
it's intended to be cheap enough to inline.

However, it turns out that the unlikely parts of it (where it doesn't
find an existing avc node) need a fair amount of stack space for the
automatic replacement node, so if it were to be inlined (at least clang
does not) it would just use stack space unnecessarily.

So split the unlikely part out of it, and mark that part noinline.  That
improves the actual likely part.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 security/selinux/avc.c | 52 +++++++++++++++++++++++++++++++++---------
 1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index 9a43af0ebd7d..85e5af7f856e 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -1112,6 +1112,38 @@  int avc_has_extended_perms(struct selinux_state *state,
 	return rc;
 }
 
+/*
+ * This is the "we have no node" part of avc_has_perm_noaudit(),
+ * which is unlikely and needs extra stack space for the new
+ * node that we generate. So don't inline it.
+ *
+ * NOTE! Called with RCU lock held for reading, and needs to
+ * drop it. And since the RCU lock was for the node lifetime
+ * (which we don't have), we could just drop it early. Except
+ * avc_compute_av() knows it's called under the rcu lock, and
+ * drops and re-takes it. What a crock.
+ *
+ * So we drop it after calling avc_compute_av() (and *inside*
+ * avc_compute_av()).
+ */
+static noinline int avc_perm_nonode(struct selinux_state *state,
+				u32 ssid, u32 tsid,
+				u16 tclass, u32 requested,
+				unsigned int flags,
+				struct av_decision *avd)
+{
+		u32 denied;
+		struct avc_xperms_node xp_node;
+
+		avc_compute_av(state, ssid, tsid, tclass, avd, &xp_node);
+		rcu_read_unlock();
+		denied = requested & ~(avd->allowed);
+		if (unlikely(denied))
+			return avc_denied(state, ssid, tsid, tclass, requested, 0, 0,
+					flags, avd);
+		return 0;
+}
+
 /**
  * avc_has_perm_noaudit - Check permissions but perform no auditing.
  * @state: SELinux state
@@ -1139,10 +1171,8 @@  inline int avc_has_perm_noaudit(struct selinux_state *state,
 				unsigned int flags,
 				struct av_decision *avd)
 {
-	struct avc_node *node;
-	struct avc_xperms_node xp_node;
-	int rc = 0;
 	u32 denied;
+	struct avc_node *node;
 
 	if (WARN_ON(!requested))
 		return -EACCES;
@@ -1151,17 +1181,17 @@  inline int avc_has_perm_noaudit(struct selinux_state *state,
 
 	node = avc_lookup(state->avc, ssid, tsid, tclass);
 	if (unlikely(!node))
-		avc_compute_av(state, ssid, tsid, tclass, avd, &xp_node);
-	else
-		memcpy(avd, &node->ae.avd, sizeof(*avd));
+		return avc_perm_nonode(state, ssid, tsid, tclass, requested, flags, avd);
+
+	denied = requested & ~node->ae.avd.allowed;
+	memcpy(avd, &node->ae.avd, sizeof(*avd));
+	rcu_read_unlock();
 
-	denied = requested & ~(avd->allowed);
 	if (unlikely(denied))
-		rc = avc_denied(state, ssid, tsid, tclass, requested, 0, 0,
-				flags, avd);
+		return avc_denied(state, ssid, tsid, tclass, requested, 0, 0,
+					flags, avd);
 
-	rcu_read_unlock();
-	return rc;
+	return 0;
 }
 
 /**
-- 
2.39.0.rc2.4.g1435931e43