diff mbox series

[v4] proc: Allow pid_revalidate() during LOOKUP_RCU

Message ID 20210104232123.31378-1-stephen.s.brennan@oracle.com (mailing list archive)
State New
Headers show
Series [v4] proc: Allow pid_revalidate() during LOOKUP_RCU | expand

Commit Message

Stephen Brennan Jan. 4, 2021, 11:21 p.m. UTC
The pid_revalidate() function drops from RCU into REF lookup mode. When
many threads are resolving paths within /proc in parallel, this can
result in heavy spinlock contention on d_lockref as each thread tries to
grab a reference to the /proc dentry (and drop it shortly thereafter).

Investigation indicates that it is not necessary to drop RCU in
pid_revalidate(), as no RCU data is modified and the function never
sleeps. So, remove the LOOKUP_RCU check.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---
When running running ~100 parallel instances of "TZ=/etc/localtime ps -fe
>/dev/null" on a 100CPU machine, the %sys utilization reaches 90%, and perf
shows the following code path as being responsible for heavy contention on
the d_lockref spinlock:

      walk_component()
        lookup_fast()
          d_revalidate()
            pid_revalidate() // returns -ECHILD
          unlazy_child()
            lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention

By applying this patch, %sys utilization falls to around 60% under the same
workload. Although this particular workload is a bit contrived, we have seen
some monitoring scripts which produced similarly high %sys time due to this
contention.

Changes in v4:
- Simplify by unconditionally calling pid_update_inode() from pid_revalidate,
  and removing the LOOKUP_RCU check.
Changes in v3:
- Rather than call pid_update_inode() with flags, create
  proc_inode_needs_update() to determine whether the call can be skipped.
- Restore the call to the security hook (see next patch).
Changes in v2:
- Remove get_pid_task_rcu_user() and get_proc_task_rcu(), since they were
  unnecessary.
- Remove the call to security_task_to_inode().

 fs/proc/base.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

Comments

Al Viro Jan. 5, 2021, 5:59 a.m. UTC | #1
On Mon, Jan 04, 2021 at 03:21:22PM -0800, Stephen Brennan wrote:
> The pid_revalidate() function drops from RCU into REF lookup mode. When
> many threads are resolving paths within /proc in parallel, this can
> result in heavy spinlock contention on d_lockref as each thread tries to
> grab a reference to the /proc dentry (and drop it shortly thereafter).
> 
> Investigation indicates that it is not necessary to drop RCU in
> pid_revalidate(), as no RCU data is modified and the function never
> sleeps. So, remove the LOOKUP_RCU check.

Umm...  I'm rather worried about the side effect you are removing here -
you are suddenly exposing a bunch of methods in there to RCU mode.
E.g. is proc_pid_permission() safe with MAY_NOT_BLOCK in the mask?
generic_permission() call in there is fine, but has_pid_permission()
doesn't even see the mask.  Is that thing safe in RCU mode?  AFAICS,
this
static int selinux_ptrace_access_check(struct task_struct *child,
                                     unsigned int mode)
{
        u32 sid = current_sid();
        u32 csid = task_sid(child);

        if (mode & PTRACE_MODE_READ)
                return avc_has_perm(&selinux_state,
                                    sid, csid, SECCLASS_FILE, FILE__READ, NULL);

        return avc_has_perm(&selinux_state,
                            sid, csid, SECCLASS_PROCESS, PROCESS__PTRACE, NULL);
}
is reachable and IIRC avc_has_perm() should *NOT* be called in RCU mode.
If nothing else, audit handling needs care...

And LSM-related stuff is only a part of possible issues here.  It does need
a careful code audit - you are taking a bunch of methods into the conditions
they'd never been tested in.  ->permission(), ->get_link(), ->d_revalidate(),
->d_hash() and ->d_compare() instances for objects that subtree.  The last
two are not there in case of anything in /proc/<pid>, but the first 3 very
much are.
Al Viro Jan. 5, 2021, 4:50 p.m. UTC | #2
On Tue, Jan 05, 2021 at 05:59:35AM +0000, Al Viro wrote:

> Umm...  I'm rather worried about the side effect you are removing here -
> you are suddenly exposing a bunch of methods in there to RCU mode.
> E.g. is proc_pid_permission() safe with MAY_NOT_BLOCK in the mask?
> generic_permission() call in there is fine, but has_pid_permission()
> doesn't even see the mask.  Is that thing safe in RCU mode?  AFAICS,
> this
> static int selinux_ptrace_access_check(struct task_struct *child,
>                                      unsigned int mode)
> {
>         u32 sid = current_sid();
>         u32 csid = task_sid(child);
> 
>         if (mode & PTRACE_MODE_READ)
>                 return avc_has_perm(&selinux_state,
>                                     sid, csid, SECCLASS_FILE, FILE__READ, NULL);
> 
>         return avc_has_perm(&selinux_state,
>                             sid, csid, SECCLASS_PROCESS, PROCESS__PTRACE, NULL);
> }
> is reachable and IIRC avc_has_perm() should *NOT* be called in RCU mode.
> If nothing else, audit handling needs care...
> 
> And LSM-related stuff is only a part of possible issues here.  It does need
> a careful code audit - you are taking a bunch of methods into the conditions
> they'd never been tested in.  ->permission(), ->get_link(), ->d_revalidate(),
> ->d_hash() and ->d_compare() instances for objects that subtree.  The last
> two are not there in case of anything in /proc/<pid>, but the first 3 very
> much are.

FWIW, after looking through the selinux and smack I started to wonder whether
we really need that "return -ECHILD rather than audit and fail" in case of
->inode_permission().

AFAICS, the reason we need it is that dump_common_audit_data() is not safe
in RCU mode with LSM_AUDIT_DATA_DENTRY and even more so - with
LSM_AUDIT_DATA_INODE (d_find_alias() + dput() there, and dput() can bloody well
block).

LSM_AUDIT_DATA_DENTRY is easy to handle - wrap
                audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
into grabbing/dropping a->u.dentry->d_lock and we are done.  And as for
the LSM_AUDIT_DATA_INODE...  How about this:

/*
 * Caller *MUST* hold rcu_read_lock() and be guaranteed that inode itself
 * will be around until that gets dropped.
 */
struct dentry *d_find_alias_rcu(struct inode *inode)
{
	struct hlist_head *l = &inode->i_dentry;
        struct dentry *de = NULL;

	spin_lock(&inode->i_lock);
	// ->i_dentry and ->i_rcu are colocated, but the latter won't be
	// used without having I_FREEING set, which means no aliases left
	if (inode->i_state & I_FREEING) {
		spin_unlock(&inode->i_lock);
		return NULL;
	}
	// we can safely access inode->i_dentry
        if (hlist_empty(p)) {
		spin_unlock(&inode->i_lock);
		return NULL;
	}
	if (S_ISDIR(inode->i_mode)) {
		de = hlist_entry(l->first, struct dentry, d_u.d_alias);
	} else hlist_for_each_entry(de, l, d_u.d_alias) {
		if (d_unhashed(de))
			continue;
		// hashed + nonrcu really shouldn't be possible
		if (WARN_ON(READ_ONCE(de->d_flags) & DCACE_NONRCU))
			continue;
		break;
	}
	spin_unlock(&inode->i_lock);
        return de;
}

and have
        case LSM_AUDIT_DATA_INODE: {
                struct dentry *dentry;
                struct inode *inode;

		rcu_read_lock();
                inode = a->u.inode;
                dentry = d_find_alias_rcu(inode);
                if (dentry) {
                        audit_log_format(ab, " name=");
			spin_lock(&dentry->d_lock);
                        audit_log_untrustedstring(ab,
                                         dentry->d_name.name);
			spin_unlock(&dentry->d_lock);
                }
                audit_log_format(ab, " dev=");
                audit_log_untrustedstring(ab, inode->i_sb->s_id);
                audit_log_format(ab, " ino=%lu", inode->i_ino);
		rcu_read_unlock();
                break;
        }
in dump_common_audit_data().  Would that be enough to stop bothering
with the entire AVC_NONBLOCKING thing or is there anything else
involved?
Al Viro Jan. 5, 2021, 5:45 p.m. UTC | #3
On Tue, Jan 05, 2021 at 04:50:05PM +0000, Al Viro wrote:

> struct dentry *d_find_alias_rcu(struct inode *inode)
> {
> 	struct hlist_head *l = &inode->i_dentry;
>         struct dentry *de = NULL;
> 
> 	spin_lock(&inode->i_lock);
> 	// ->i_dentry and ->i_rcu are colocated, but the latter won't be
> 	// used without having I_FREEING set, which means no aliases left
> 	if (inode->i_state & I_FREEING) {
> 		spin_unlock(&inode->i_lock);
> 		return NULL;
> 	}
> 	// we can safely access inode->i_dentry
>         if (hlist_empty(p)) {

	if (hlist_empty(l)) {

obviously...
Al Viro Jan. 5, 2021, 7:59 p.m. UTC | #4
On Tue, Jan 05, 2021 at 04:50:05PM +0000, Al Viro wrote:

> LSM_AUDIT_DATA_DENTRY is easy to handle - wrap
>                 audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
> into grabbing/dropping a->u.dentry->d_lock and we are done.

Incidentally, LSM_AUDIT_DATA_DENTRY in mainline is *not* safe wrt
rename() - for long-named dentries it is possible to get preempted
in the middle of
                audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
and have the bugger renamed, with old name ending up freed.  The
same goes for LSM_AUDIT_DATA_INODE...

Folks, ->d_name.name is not automatically stable and the memory it
points to is not always guaranteed to last as long as dentry itself does.
In something like ->rename(), ->mkdir(), etc. - sure, we have the parent
->i_rwsem held exclusive and nothing's going to rename dentry under us.
But there's a reason why e.g. d_path() has to be careful.  And there
are selinux and smack hooks using LSM_AUDIT_DATA_DENTRY in locking
environment that does not exclude renames.

AFAICS, that race goes back to 2004: 605303cc340d ("[PATCH] selinux: Add dname
to audit output when a path cannot be generated.") in historical tree is where
its first ancestor appears...

The minimal fix is to grab ->d_lock around these audit_log_untrustedstring()
calls, and IMO that's -stable fodder.  It is a slow path (we are spewing an
audit record, not to mention anything else), so I don't believe it's worth
trying to do anything fancier than that.

How about the following?  If nobody objects, I'll drop it into #fixes and
send a pull request in a few days...

dump_common_audit_data(): fix racy accesses to ->d_name
    
We are not guaranteed the locking environment that would prevent
dentry getting renamed right under us.  And it's possible for
old long name to be freed after rename, leading to UAF here.

Cc: stable@kernel.org # v2.6.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/security/lsm_audit.c b/security/lsm_audit.c
index 7d8026f3f377..a0cd28cd31a8 100644
--- a/security/lsm_audit.c
+++ b/security/lsm_audit.c
@@ -275,7 +275,9 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 		struct inode *inode;
 
 		audit_log_format(ab, " name=");
+		spin_lock(&a->u.dentry->d_lock);
 		audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
+		spin_unlock(&a->u.dentry->d_lock);
 
 		inode = d_backing_inode(a->u.dentry);
 		if (inode) {
@@ -293,8 +295,9 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 		dentry = d_find_alias(inode);
 		if (dentry) {
 			audit_log_format(ab, " name=");
-			audit_log_untrustedstring(ab,
-					 dentry->d_name.name);
+			spin_lock(&dentry->d_lock);
+			audit_log_untrustedstring(ab, dentry->d_name.name);
+			spin_unlock(&dentry->d_lock);
 			dput(dentry);
 		}
 		audit_log_format(ab, " dev=");
Linus Torvalds Jan. 5, 2021, 8:38 p.m. UTC | #5
On Tue, Jan 5, 2021 at 12:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> We are not guaranteed the locking environment that would prevent
> dentry getting renamed right under us.  And it's possible for
> old long name to be freed after rename, leading to UAF here.

This whole thing isn't important enough to get the dentry lock. It's
more of a hint than anything else.

Why isn't the fix to just use READ_ONCE() of the name pointer, and do
it under RCU?

That's what dentry_name() does for the much more complex case of
actually even following parent data for a depth up to 4, much less
just a single name.

So instead of

                       spin_lock(&dentry->d_lock);
                       audit_log_untrustedstring(ab, dentry->d_name.name);
                       spin_unlock(&dentry->d_lock);

why not

                       rcu_read_lock();
                       audit_log_untrustedstring(ab,
READ_ONCE(dentry->d_name.name));
                       rcu_read_unlock();

which looks a lot more in line with the other dentry path functions.

Maybe even have this as part of fs/d_path.c and try to get rid of
magic internal dentry name knowledge from the audit code?

                  Linus
Al Viro Jan. 5, 2021, 9:12 p.m. UTC | #6
On Tue, Jan 05, 2021 at 12:38:31PM -0800, Linus Torvalds wrote:

> This whole thing isn't important enough to get the dentry lock. It's
> more of a hint than anything else.
> 
> Why isn't the fix to just use READ_ONCE() of the name pointer, and do
> it under RCU?

Umm...  Take a look at audit_log_untrustedstring() - it really assumes
that string is not changing under it.  It could be massaged to be
resilent to such changes, and it's not even all that hard (copy the sucker
byte-by-byte, checking them for prohibited characters, with fallback
to hex dump if it finds one), but I really don't want to mess with
that for -stable and TBH I don't see the point - if the system is
spending enough time in spewing into audit for contention and/or
cacheline pingpong to matter, you are FUBAR anyway.

In this case dumber is better; sure, if it was just a string copy
with the accuracy in face of concurrent renames not guaranteed,
I'd be all for "let's see if we can just use %pd printf, or
go for open-coded analogue of such".  But here the lack of
whitespaces and quotes in the output is expected by userland
tools and that's more sensitive than the accuracy...

Again, if there's anybody seriously interested in analogue of
%pd with that (or some other) form of quoting, it could be done.
But I don't think it's a good idea for -stable and it obviously
can be done on top of the minimal race fix.
Stephen Brennan Jan. 5, 2021, 11:25 p.m. UTC | #7
Al Viro <viro@zeniv.linux.org.uk> writes:

> On Tue, Jan 05, 2021 at 04:50:05PM +0000, Al Viro wrote:
>
>> LSM_AUDIT_DATA_DENTRY is easy to handle - wrap
>>                 audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
>> into grabbing/dropping a->u.dentry->d_lock and we are done.
>
> Incidentally, LSM_AUDIT_DATA_DENTRY in mainline is *not* safe wrt
> rename() - for long-named dentries it is possible to get preempted
> in the middle of
>                 audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
> and have the bugger renamed, with old name ending up freed.  The
> same goes for LSM_AUDIT_DATA_INODE...

In the case of proc_pid_permission(), this preemption doesn't seem
possible. We have task_lock() (a spinlock) held by ptrace_may_access()
during this call, so preemption should be disabled:

proc_pid_permission()
  has_pid_permissions()
    ptrace_may_access()
      task_lock()
      __ptrace_may_access()
      | security_ptrace_access_check()
      |   ptrace_access_check -> selinux_ptrace_access_check()
      |     avc_has_perm()
      |       avc_audit() // note that has_pid_permissions() didn't get a
      |                   // flags field to propagate, so flags will not
      |                   // contain MAY_NOT_BLOCK
      |         slow_avc_audit()
      |           common_lsm_audit()
      |             dump_common_audit_data()
      task_unlock()

I understand the issue of d_name.name being freed across a preemption is
more general than proc_pid_permission() (as other callers may have
preemption enabled). However, it seems like there's another issue here.
avc_audit() seems to imply that slow_avc_audit() would sleep:
 
static inline int avc_audit(struct selinux_state *state,
			    u32 ssid, u32 tsid,
			    u16 tclass, u32 requested,
			    struct av_decision *avd,
			    int result,
			    struct common_audit_data *a,
			    int flags)
{
	u32 audited, denied;
	audited = avc_audit_required(requested, avd, result, 0, &denied);
	if (likely(!audited))
		return 0;
	/* fall back to ref-walk if we have to generate audit */
	if (flags & MAY_NOT_BLOCK)
		return -ECHILD;
	return slow_avc_audit(state, ssid, tsid, tclass,
			      requested, audited, denied, result,
			      a);
} 

If there are other cases in here where we might sleep, it would be a
problem to sleep with the task lock held, correct?
Paul Moore Jan. 6, 2021, midnight UTC | #8
On Tue, Jan 5, 2021 at 6:27 PM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
>
> > On Tue, Jan 05, 2021 at 04:50:05PM +0000, Al Viro wrote:
> >
> >> LSM_AUDIT_DATA_DENTRY is easy to handle - wrap
> >>                 audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
> >> into grabbing/dropping a->u.dentry->d_lock and we are done.
> >
> > Incidentally, LSM_AUDIT_DATA_DENTRY in mainline is *not* safe wrt
> > rename() - for long-named dentries it is possible to get preempted
> > in the middle of
> >                 audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
> > and have the bugger renamed, with old name ending up freed.  The
> > same goes for LSM_AUDIT_DATA_INODE...
>
> In the case of proc_pid_permission(), this preemption doesn't seem
> possible. We have task_lock() (a spinlock) held by ptrace_may_access()
> during this call, so preemption should be disabled:
>
> proc_pid_permission()
>   has_pid_permissions()
>     ptrace_may_access()
>       task_lock()
>       __ptrace_may_access()
>       | security_ptrace_access_check()
>       |   ptrace_access_check -> selinux_ptrace_access_check()
>       |     avc_has_perm()
>       |       avc_audit() // note that has_pid_permissions() didn't get a
>       |                   // flags field to propagate, so flags will not
>       |                   // contain MAY_NOT_BLOCK
>       |         slow_avc_audit()
>       |           common_lsm_audit()
>       |             dump_common_audit_data()
>       task_unlock()
>
> I understand the issue of d_name.name being freed across a preemption is
> more general than proc_pid_permission() (as other callers may have
> preemption enabled). However, it seems like there's another issue here.
> avc_audit() seems to imply that slow_avc_audit() would sleep:
>
> static inline int avc_audit(struct selinux_state *state,
>                             u32 ssid, u32 tsid,
>                             u16 tclass, u32 requested,
>                             struct av_decision *avd,
>                             int result,
>                             struct common_audit_data *a,
>                             int flags)
> {
>         u32 audited, denied;
>         audited = avc_audit_required(requested, avd, result, 0, &denied);
>         if (likely(!audited))
>                 return 0;
>         /* fall back to ref-walk if we have to generate audit */
>         if (flags & MAY_NOT_BLOCK)
>                 return -ECHILD;
>         return slow_avc_audit(state, ssid, tsid, tclass,
>                               requested, audited, denied, result,
>                               a);
> }
>
> If there are other cases in here where we might sleep, it would be a
> problem to sleep with the task lock held, correct?

I would expect the problem here to be the currently allocated audit
buffer isn't large enough to hold the full audit record, in which case
it will attempt to expand the buffer by a call to pskb_expand_head() -
don't ask why audit buffers are skbs, it's awful - using a gfp flag
that was established when the buffer was first created.  In this
particular case it is GFP_ATOMIC|__GFP_NOWARN, which I believe should
be safe in that it will not sleep on an allocation miss.

I need to go deal with dinner, so I can't trace the entire path at the
moment, but I believe the potential audit buffer allocation is the
main issue.
Al Viro Jan. 6, 2021, 12:38 a.m. UTC | #9
On Tue, Jan 05, 2021 at 07:00:59PM -0500, Paul Moore wrote:

> > > Incidentally, LSM_AUDIT_DATA_DENTRY in mainline is *not* safe wrt
> > > rename() - for long-named dentries it is possible to get preempted
> > > in the middle of
> > >                 audit_log_untrustedstring(ab, a->u.dentry->d_name.name);
> > > and have the bugger renamed, with old name ending up freed.  The
> > > same goes for LSM_AUDIT_DATA_INODE...
> >
> > In the case of proc_pid_permission(), this preemption doesn't seem
> > possible. We have task_lock() (a spinlock) held by ptrace_may_access()
> > during this call, so preemption should be disabled:
> >
> > proc_pid_permission()
> >   has_pid_permissions()
> >     ptrace_may_access()
> >       task_lock()
> >       __ptrace_may_access()
> >       | security_ptrace_access_check()
> >       |   ptrace_access_check -> selinux_ptrace_access_check()
> >       |     avc_has_perm()

		... which does not hit either LSM_AUDIT_DATA_DENTRY nor
LSM_AUDIT_DATA_INODE.  It's really an unrelated issue.

> > preemption enabled). However, it seems like there's another issue here.
> > avc_audit() seems to imply that slow_avc_audit() would sleep:
> >
> > static inline int avc_audit(struct selinux_state *state,
> >                             u32 ssid, u32 tsid,
> >                             u16 tclass, u32 requested,
> >                             struct av_decision *avd,
> >                             int result,
> >                             struct common_audit_data *a,
> >                             int flags)
> > {
> >         u32 audited, denied;
> >         audited = avc_audit_required(requested, avd, result, 0, &denied);
> >         if (likely(!audited))
> >                 return 0;
> >         /* fall back to ref-walk if we have to generate audit */
> >         if (flags & MAY_NOT_BLOCK)
> >                 return -ECHILD;
> >         return slow_avc_audit(state, ssid, tsid, tclass,
> >                               requested, audited, denied, result,
> >                               a);
> > }
> >
> > If there are other cases in here where we might sleep, it would be a
> > problem to sleep with the task lock held, correct?

It can sleep - with LSM_AUDIT_DATA_INODE, which is precisely what
selinux_inode_permission() is hitting.

> I would expect the problem here to be the currently allocated audit
> buffer isn't large enough to hold the full audit record, in which case
> it will attempt to expand the buffer by a call to pskb_expand_head() -
> don't ask why audit buffers are skbs, it's awful - using a gfp flag
> that was established when the buffer was first created.  In this
> particular case it is GFP_ATOMIC|__GFP_NOWARN, which I believe should
> be safe in that it will not sleep on an allocation miss.
> 
> I need to go deal with dinner, so I can't trace the entire path at the
> moment, but I believe the potential audit buffer allocation is the
> main issue.

Nope.  dput() in dump_common_audit_data(), OTOH, is certainly not
safe.  OTTH, it's not really needed there - see vfs.git #work.audit
for (untested) turning that sucker non-blocking.  I hadn't tried
a followup that would get rid of the entire AVC_NONBLOCKING thing yet,
but I suspect that it should simplify the things in there nicely...
Stephen Brennan Jan. 6, 2021, 12:56 a.m. UTC | #10
Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Jan 04, 2021 at 03:21:22PM -0800, Stephen Brennan wrote:
>> The pid_revalidate() function drops from RCU into REF lookup mode. When
>> many threads are resolving paths within /proc in parallel, this can
>> result in heavy spinlock contention on d_lockref as each thread tries to
>> grab a reference to the /proc dentry (and drop it shortly thereafter).
>> 
>> Investigation indicates that it is not necessary to drop RCU in
>> pid_revalidate(), as no RCU data is modified and the function never
>> sleeps. So, remove the LOOKUP_RCU check.
>
> Umm...  I'm rather worried about the side effect you are removing here -
> you are suddenly exposing a bunch of methods in there to RCU mode.
> E.g. is proc_pid_permission() safe with MAY_NOT_BLOCK in the mask?
> generic_permission() call in there is fine, but has_pid_permission()
> doesn't even see the mask.  Is that thing safe in RCU mode?  AFAICS,
> this
> static int selinux_ptrace_access_check(struct task_struct *child,
>                                      unsigned int mode)
> {
>         u32 sid = current_sid();
>         u32 csid = task_sid(child);
>
>         if (mode & PTRACE_MODE_READ)
>                 return avc_has_perm(&selinux_state,
>                                     sid, csid, SECCLASS_FILE, FILE__READ, NULL);
>
>         return avc_has_perm(&selinux_state,
>                             sid, csid, SECCLASS_PROCESS, PROCESS__PTRACE, NULL);
> }
> is reachable and IIRC avc_has_perm() should *NOT* be called in RCU mode.
> If nothing else, audit handling needs care...
>
> And LSM-related stuff is only a part of possible issues here.  It does need
> a careful code audit - you are taking a bunch of methods into the conditions
> they'd never been tested in.  ->permission(), ->get_link(), ->d_revalidate(),
> ->d_hash() and ->d_compare() instances for objects that subtree.  The last
> two are not there in case of anything in /proc/<pid>, but the first 3 very
> much are.

You're right, this was a major oversight on my part. The main motivation
of this patch is to reduce contention on the /proc dentry, which occurs
directly after d_revalidate() returns -ECHILD the first time in
lookup_fast(). To drop into ref mode, we call unlazy_child(), while
nd->path still refers to /proc and dentry refers to /proc/PID. Grabbing
a reference to /proc is the heart of the contention issue.

But directly after a successful d_revalidate() in lookup_fast(), we
return and go to step_into(), which assigns the /proc/PID dentry to
nd->path. After this point, any unlazy operation will not try to grab
the /proc dentry, resulting in significantly less contention.

So it would already be a significant improvement if we kept this change
to pid_revalidate(), and simply added checks to bail out of each of the
other procfs methods if we're in LOOKUP_RCU. Would that be an acceptable
change for you?
Paul Moore Jan. 6, 2021, 2:43 a.m. UTC | #11
On Tue, Jan 5, 2021 at 7:38 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Jan 05, 2021 at 07:00:59PM -0500, Paul Moore wrote:

...

> > I would expect the problem here to be the currently allocated audit
> > buffer isn't large enough to hold the full audit record, in which case
> > it will attempt to expand the buffer by a call to pskb_expand_head() -
> > don't ask why audit buffers are skbs, it's awful - using a gfp flag
> > that was established when the buffer was first created.  In this
> > particular case it is GFP_ATOMIC|__GFP_NOWARN, which I believe should
> > be safe in that it will not sleep on an allocation miss.
> >
> > I need to go deal with dinner, so I can't trace the entire path at the
> > moment, but I believe the potential audit buffer allocation is the
> > main issue.
>
> Nope.  dput() in dump_common_audit_data(), OTOH, is certainly not
> safe.

My mistake.  My initial reaction is to always assume audit is the
problem; I should have traced everything through before commenting.

> OTTH, it's not really needed there - see vfs.git #work.audit
> for (untested) turning that sucker non-blocking.  I hadn't tried
> a followup that would get rid of the entire AVC_NONBLOCKING thing yet,
> but I suspect that it should simplify the things in there nicely...

It would be nice to be able to get rid of the limitation on when we
can update the AVC and do proper auditing.  I doubt the impact is
anything that anyone notices, but I agree that it should make things
much cleaner.  Thanks Al.
Stephen Brennan Jan. 14, 2021, 10:51 p.m. UTC | #12
Al Viro <viro@zeniv.linux.org.uk> writes:
> OTTH, it's not really needed there - see vfs.git #work.audit
> for (untested) turning that sucker non-blocking.  I hadn't tried
> a followup that would get rid of the entire AVC_NONBLOCKING thing yet,
> but I suspect that it should simplify the things in there nicely...

I went ahead and pulled down this branch and combined it with my
pid_revalidate change. Further, I audited all the inode get_link and
permission() implementations, as well as dentry d_revalidate()
implementations, in fs/proc (more on that below). Together, all these
patches have run stable under a steady high load of concurrent PS
processes on a 104CPU machine for over an hour, and greatly reduced the
%sys utilization which the patch originally addressed. How would you
like to proceed with the #work.audit changes? I could include them in a
v5 of this patch series.

Regarding my audit (ha) of dentry and inode functions in the fs/proc/
directory:

* get_link() receives a NULL dentry pointer when called in RCU mode.
* permission() receives MAY_NOT_BLOCK in the mode parameter when called
  from RCU.
* d_revalidate() receives LOOKUP_RCU in flags.

There were generally three groups I found. Group (1) are functions which
contain a check at the top of the function and return -ECHILD, and so
appear to be trivially RCU safe (although this is by dropping out of RCU
completely). Group (2) are functions which have no explicit check, but
on my audit, I was confident that there were no sleeping function calls,
and thus were RCU safe as is. Group (3) are functions which appeared to
be unsafe for some reason or another.

Group (1):
 proc_ns_get_link()
 proc_pid_get_link()
 map_files_d_revalidate()
 proc_misc_d_revalidate()
 tid_fd_revalidate()

Group (2):
 proc_get_link()
 proc_self_get_link()
 proc_thread_self_get_link()
 proc_fd_permission()

Group (3):
 pid_revalidate()            -- addressed by my patch
 proc_map_files_get_link()
 proc_pid_permission()       -- addressed by Al's work.audit branch

proc_map_files_get_link() calls capable() which ends up calling a
security hook, which can get into the audit guts, and so I marked it as
potentially unsafe, and added a patch to bail out of this function
before the capable() check. However, I doubt this is really necessary.

So to conclude, depending on how Al wants to move forward with the
work.audit branch, I could send a full series with the proposed changes.

Stephen
diff mbox series

Patch

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f52217f432bc..633ef74e8dfd 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1974,19 +1974,18 @@  static int pid_revalidate(struct dentry *dentry, unsigned int flags)
 {
 	struct inode *inode;
 	struct task_struct *task;
+	int ret = 0;
 
-	if (flags & LOOKUP_RCU)
-		return -ECHILD;
-
-	inode = d_inode(dentry);
-	task = get_proc_task(inode);
+	rcu_read_lock();
+	inode = d_inode_rcu(dentry);
+	task = pid_task(proc_pid(inode), PIDTYPE_PID);
 
 	if (task) {
 		pid_update_inode(task, inode);
-		put_task_struct(task);
-		return 1;
+		ret = 1;
 	}
-	return 0;
+	rcu_read_unlock();
+	return ret;
 }
 
 static inline bool proc_inode_is_dead(struct inode *inode)