diff mbox series

[v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks

Message ID 20210517092006.803332-1-omosnace@redhat.com (mailing list archive)
State New
Headers show
Series [v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks | expand

Commit Message

Ondrej Mosnacek May 17, 2021, 9:20 a.m. UTC
Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
lockdown") added an implementation of the locked_down LSM hook to
SELinux, with the aim to restrict which domains are allowed to perform
operations that would breach lockdown.

However, in several places the security_locked_down() hook is called in
situations where the current task isn't doing any action that would
directly breach lockdown, leading to SELinux checks that are basically
bogus.

Since in most of these situations converting the callers such that
security_locked_down() is called in a context where the current task
would be meaningful for SELinux is impossible or very non-trivial (and
could lead to TOCTOU issues for the classic Lockdown LSM
implementation), fix this by modifying the hook to accept a struct cred
pointer as argument, where NULL will be interpreted as a request for a
"global", task-independent lockdown decision only. Then modify SELinux
to ignore calls with cred == NULL.

Since most callers will just want to pass current_cred() as the cred
parameter, rename the hook to security_cred_locked_down() and provide
the original security_locked_down() function as a simple wrapper around
the new hook.

The callers migrated to the new hook, passing NULL as cred:
1. arch/powerpc/xmon/xmon.c
     Here the hook seems to be called from non-task context and is only
     used for redacting some sensitive values from output sent to
     userspace.
2. fs/tracefs/inode.c:tracefs_create_file()
     Here the call is used to prevent creating new tracefs entries when
     the kernel is locked down. Assumes that locking down is one-way -
     i.e. if the hook returns non-zero once, it will never return zero
     again, thus no point in creating these files.
3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common()
     Called when a BPF program calls a helper that could leak kernel
     memory. The task context is not relevant here, since the program
     may very well be run in the context of a different task than the
     consumer of the data.
     See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585
4. net/xfrm/xfrm_user.c:copy_to_user_*()
     Here a cryptographic secret is redacted based on the value returned
     from the hook. There are two possible actions that may lead here:
     a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
        task context is relevant, since the dumped data is sent back to
        the current task.
     b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are
        broadcasted to tasks subscribed to XFRM events - here the
        SELinux check is not meningful as the current task's creds do
        not represent the tasks that could potentially see the secret.
     It really doesn't seem worth it to try to preserve the check in the
     a) case, since the eventual leak can be circumvented anyway via b),
     plus there is no way for the task to indicate that it doesn't care
     about the actual key value, so the check could generate a lot of
     noise.

Improvements-suggested-by: Casey Schaufler <casey@schaufler-ca.com>
Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
---

v2:
- change to a single hook based on suggestions by Casey Schaufler

v1: https://lore.kernel.org/lkml/20210507114048.138933-1-omosnace@redhat.com/

 arch/powerpc/xmon/xmon.c      |  4 ++--
 fs/tracefs/inode.c            |  2 +-
 include/linux/lsm_hook_defs.h |  3 ++-
 include/linux/lsm_hooks.h     |  3 ++-
 include/linux/security.h      | 11 ++++++++---
 kernel/trace/bpf_trace.c      |  4 ++--
 net/xfrm/xfrm_user.c          |  2 +-
 security/lockdown/lockdown.c  |  5 +++--
 security/security.c           |  6 +++---
 security/selinux/hooks.c      | 12 +++++++++---
 10 files changed, 33 insertions(+), 19 deletions(-)

Comments

Michael Ellerman May 17, 2021, 11 a.m. UTC | #1
Ondrej Mosnacek <omosnace@redhat.com> writes:
> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> lockdown") added an implementation of the locked_down LSM hook to
> SELinux, with the aim to restrict which domains are allowed to perform
> operations that would breach lockdown.
>
> However, in several places the security_locked_down() hook is called in
> situations where the current task isn't doing any action that would
> directly breach lockdown, leading to SELinux checks that are basically
> bogus.
>
> Since in most of these situations converting the callers such that
> security_locked_down() is called in a context where the current task
> would be meaningful for SELinux is impossible or very non-trivial (and
> could lead to TOCTOU issues for the classic Lockdown LSM
> implementation), fix this by modifying the hook to accept a struct cred
> pointer as argument, where NULL will be interpreted as a request for a
> "global", task-independent lockdown decision only. Then modify SELinux
> to ignore calls with cred == NULL.
>
> Since most callers will just want to pass current_cred() as the cred
> parameter, rename the hook to security_cred_locked_down() and provide
> the original security_locked_down() function as a simple wrapper around
> the new hook.
>
> The callers migrated to the new hook, passing NULL as cred:
> 1. arch/powerpc/xmon/xmon.c
>      Here the hook seems to be called from non-task context and is only
>      used for redacting some sensitive values from output sent to
>      userspace.

It's hard to follow but it actually disables interactive use of xmon
entirely if lockdown is in confidentiality mode, and disables
modifications of the kernel in integrity mode.

But that's not really that important, the patch looks fine.

Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)

cheers
Ondrej Mosnacek May 26, 2021, 11:44 a.m. UTC | #2
On Mon, May 17, 2021 at 1:00 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
> Ondrej Mosnacek <omosnace@redhat.com> writes:
> > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > lockdown") added an implementation of the locked_down LSM hook to
> > SELinux, with the aim to restrict which domains are allowed to perform
> > operations that would breach lockdown.
> >
> > However, in several places the security_locked_down() hook is called in
> > situations where the current task isn't doing any action that would
> > directly breach lockdown, leading to SELinux checks that are basically
> > bogus.
> >
> > Since in most of these situations converting the callers such that
> > security_locked_down() is called in a context where the current task
> > would be meaningful for SELinux is impossible or very non-trivial (and
> > could lead to TOCTOU issues for the classic Lockdown LSM
> > implementation), fix this by modifying the hook to accept a struct cred
> > pointer as argument, where NULL will be interpreted as a request for a
> > "global", task-independent lockdown decision only. Then modify SELinux
> > to ignore calls with cred == NULL.
> >
> > Since most callers will just want to pass current_cred() as the cred
> > parameter, rename the hook to security_cred_locked_down() and provide
> > the original security_locked_down() function as a simple wrapper around
> > the new hook.
> >
> > The callers migrated to the new hook, passing NULL as cred:
> > 1. arch/powerpc/xmon/xmon.c
> >      Here the hook seems to be called from non-task context and is only
> >      used for redacting some sensitive values from output sent to
> >      userspace.
>
> It's hard to follow but it actually disables interactive use of xmon
> entirely if lockdown is in confidentiality mode, and disables
> modifications of the kernel in integrity mode.
>
> But that's not really that important, the patch looks fine.
>
> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)

Thanks, Michael!

James/Paul, is there anything blocking this patch from being merged?
Especially the BPF case is causing real trouble for people and the
only workaround is to broadly allow lockdown::confidentiality in the
policy.

--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
James Morris May 27, 2021, 4:28 a.m. UTC | #3
On Wed, 26 May 2021, Ondrej Mosnacek wrote:

> Thanks, Michael!
> 
> James/Paul, is there anything blocking this patch from being merged?
> Especially the BPF case is causing real trouble for people and the
> only workaround is to broadly allow lockdown::confidentiality in the
> policy.

It would be good to see more signoffs/reviews, especially from Paul, but 
he is busy with the io_uring stuff.

Let's see if anyone else can look at this in the next couple of days.
Paul Moore May 27, 2021, 2:18 p.m. UTC | #4
On Thu, May 27, 2021 at 12:33 AM James Morris <jmorris@namei.org> wrote:
> On Wed, 26 May 2021, Ondrej Mosnacek wrote:
>
> > Thanks, Michael!
> >
> > James/Paul, is there anything blocking this patch from being merged?
> > Especially the BPF case is causing real trouble for people and the
> > only workaround is to broadly allow lockdown::confidentiality in the
> > policy.
>
> It would be good to see more signoffs/reviews, especially from Paul, but
> he is busy with the io_uring stuff.

Yes, it's been a busy week with various things going on around here.
I looked at the v1 posting but haven't had a chance yet to look at v2;
I promise to get to it today, but it might not happen until later
tonight.

> Let's see if anyone else can look at this in the next couple of days.
Paul Moore May 28, 2021, 1:37 a.m. UTC | #5
On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
>
> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> lockdown") added an implementation of the locked_down LSM hook to
> SELinux, with the aim to restrict which domains are allowed to perform
> operations that would breach lockdown.
>
> However, in several places the security_locked_down() hook is called in
> situations where the current task isn't doing any action that would
> directly breach lockdown, leading to SELinux checks that are basically
> bogus.
>
> Since in most of these situations converting the callers such that
> security_locked_down() is called in a context where the current task
> would be meaningful for SELinux is impossible or very non-trivial (and
> could lead to TOCTOU issues for the classic Lockdown LSM
> implementation), fix this by modifying the hook to accept a struct cred
> pointer as argument, where NULL will be interpreted as a request for a
> "global", task-independent lockdown decision only. Then modify SELinux
> to ignore calls with cred == NULL.

I'm not overly excited about skipping the access check when cred is
NULL.  Based on the description and the little bit that I've dug into
thus far it looks like using SECINITSID_KERNEL as the subject would be
much more appropriate.  *Something* (the kernel in most of the
relevant cases it looks like) is requesting that a potentially
sensitive disclosure be made, and ignoring it seems like the wrong
thing to do.  Leaving the access control intact also provides a nice
avenue to audit these requests should users want to do that.

Those users that generally don't care can grant kernel_t all the
necessary permissions without much policy.

> Since most callers will just want to pass current_cred() as the cred
> parameter, rename the hook to security_cred_locked_down() and provide
> the original security_locked_down() function as a simple wrapper around
> the new hook.

I know you and Casey went back and forth on this in v1, but I agree
with Casey that having two LSM hooks here is a mistake.  I know it
makes backports hard, but spoiler alert: maintaining complex software
over any non-trivial period of time is hard, reeeeally hard sometimes
;)

> The callers migrated to the new hook, passing NULL as cred:
> 1. arch/powerpc/xmon/xmon.c
>      Here the hook seems to be called from non-task context and is only
>      used for redacting some sensitive values from output sent to
>      userspace.

This definitely sounds like kernel_t based on the description above.

> 2. fs/tracefs/inode.c:tracefs_create_file()
>      Here the call is used to prevent creating new tracefs entries when
>      the kernel is locked down. Assumes that locking down is one-way -
>      i.e. if the hook returns non-zero once, it will never return zero
>      again, thus no point in creating these files.

More kernel_t.

> 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common()
>      Called when a BPF program calls a helper that could leak kernel
>      memory. The task context is not relevant here, since the program
>      may very well be run in the context of a different task than the
>      consumer of the data.
>      See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585

The access control check isn't so much who is consuming the data, but
who is requesting a potential violation of a "lockdown", yes?  For
example, the SELinux policy rule for the current lockdown check looks
something like this:

  allow <who> <who> : lockdown { <reason> };

It seems to me that the task context is relevant here and performing
the access control check based on the task's domain is correct.  If we
are also concerned about who has access to this sensitive information
once it has been determined that the task can cause it to be sent, we
should have another check point for that, assuming the access isn't
already covered by another check/hook.

> 4. net/xfrm/xfrm_user.c:copy_to_user_*()
>      Here a cryptographic secret is redacted based on the value returned
>      from the hook. There are two possible actions that may lead here:
>      a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
>         task context is relevant, since the dumped data is sent back to
>         the current task.

If the task context is relevant we should use it.

>      b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are
>         broadcasted to tasks subscribed to XFRM events - here the
>         SELinux check is not meningful as the current task's creds do
>         not represent the tasks that could potentially see the secret.

This looks very similar to the BPF hook discussed above, I believe my
comments above apply here as well.

>      It really doesn't seem worth it to try to preserve the check in the
>      a) case ...

After you've read all of the above I hope you can understand why I
disagree with this.

>      ... since the eventual leak can be circumvented anyway via b)

I don't follow the statement above ... ?  However I'm not sure it
matters much considering my other concerns.

>      plus there is no way for the task to indicate that it doesn't care
>      about the actual key value, so the check could generate a lot of
>      noise.
>
> Improvements-suggested-by: Casey Schaufler <casey@schaufler-ca.com>
> Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
> ---
>
> v2:
> - change to a single hook based on suggestions by Casey Schaufler
>
> v1: https://lore.kernel.org/lkml/20210507114048.138933-1-omosnace@redhat.com/
>
>  arch/powerpc/xmon/xmon.c      |  4 ++--
>  fs/tracefs/inode.c            |  2 +-
>  include/linux/lsm_hook_defs.h |  3 ++-
>  include/linux/lsm_hooks.h     |  3 ++-
>  include/linux/security.h      | 11 ++++++++---
>  kernel/trace/bpf_trace.c      |  4 ++--
>  net/xfrm/xfrm_user.c          |  2 +-
>  security/lockdown/lockdown.c  |  5 +++--
>  security/security.c           |  6 +++---
>  security/selinux/hooks.c      | 12 +++++++++---
>  10 files changed, 33 insertions(+), 19 deletions(-)
Daniel Borkmann May 28, 2021, 7:09 a.m. UTC | #6
On 5/28/21 3:37 AM, Paul Moore wrote:
> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
>>
>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
>> lockdown") added an implementation of the locked_down LSM hook to
>> SELinux, with the aim to restrict which domains are allowed to perform
>> operations that would breach lockdown.
>>
>> However, in several places the security_locked_down() hook is called in
>> situations where the current task isn't doing any action that would
>> directly breach lockdown, leading to SELinux checks that are basically
>> bogus.
>>
>> Since in most of these situations converting the callers such that
>> security_locked_down() is called in a context where the current task
>> would be meaningful for SELinux is impossible or very non-trivial (and
>> could lead to TOCTOU issues for the classic Lockdown LSM
>> implementation), fix this by modifying the hook to accept a struct cred
>> pointer as argument, where NULL will be interpreted as a request for a
>> "global", task-independent lockdown decision only. Then modify SELinux
>> to ignore calls with cred == NULL.
> 
> I'm not overly excited about skipping the access check when cred is
> NULL.  Based on the description and the little bit that I've dug into
> thus far it looks like using SECINITSID_KERNEL as the subject would be
> much more appropriate.  *Something* (the kernel in most of the
> relevant cases it looks like) is requesting that a potentially
> sensitive disclosure be made, and ignoring it seems like the wrong
> thing to do.  Leaving the access control intact also provides a nice
> avenue to audit these requests should users want to do that.

I think the rationale/workaround for ignoring calls with cred == NULL (or the previous
patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his
seen tracing cases:

   i) The audit events that are triggered due to calls to security_locked_down()
      can OOM kill a machine, see below details [0].

  ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
      when presumingly trying to wake up kauditd [1].

How would your suggestion above solve both i) and ii)?

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585 :

   I starting seeing this with F-34. When I run a container that is traced with eBPF
   to record the syscalls it is doing, auditd is flooded with messages like:

   type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality } for
    pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
     scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 tclass=lockdown permissive=0

   This seems to be leading to auditd running out of space in the backlog buffer and
   eventually OOMs the machine.

   auditd running at 99% CPU presumably processing all the messages, eventually I get:
   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
   Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
   Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
   Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014

[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/ :

   Upstream kernel 5.11.0-rc7 and later was found to deadlock during a bpf_probe_read_compat()
   call within a sched_switch tracepoint. The problem is reproducible with the reg_alloc3
   testcase from SystemTap's BPF backend testsuite on x86_64 as well as the runqlat,runqslower
   tools from bcc on ppc64le. Example stack trace from [1]:

   [  730.868702] stack backtrace:
   [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
   [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
   [  730.873278] Call Trace:
   [  730.873770]  dump_stack+0x7f/0xa1
   [  730.874433]  check_noncircular+0xdf/0x100
   [  730.875232]  __lock_acquire+0x1202/0x1e10
   [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
   [  730.876844]  lock_acquire+0xc2/0x3a0
   [  730.877551]  ? __wake_up_common_lock+0x52/0x90
   [  730.878434]  ? lock_acquire+0xc2/0x3a0
   [  730.879186]  ? lock_is_held_type+0xa7/0x120
   [  730.880044]  ? skb_queue_tail+0x1b/0x50
   [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
   [  730.881656]  ? __wake_up_common_lock+0x52/0x90
   [  730.882532]  __wake_up_common_lock+0x52/0x90
   [  730.883375]  audit_log_end+0x5b/0x100
   [  730.884104]  slow_avc_audit+0x69/0x90
   [  730.884836]  avc_has_perm+0x8b/0xb0
   [  730.885532]  selinux_lockdown+0xa5/0xd0
   [  730.886297]  security_locked_down+0x20/0x40
   [  730.887133]  bpf_probe_read_compat+0x66/0xd0
   [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
   [  730.888917]  trace_call_bpf+0xe9/0x240
   [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
   [  730.890579]  perf_trace_sched_switch+0x142/0x180
   [  730.891485]  ? __schedule+0x6d8/0xb20
   [  730.892209]  __schedule+0x6d8/0xb20
   [  730.892899]  schedule+0x5b/0xc0
   [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
   [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
   [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae

>> Since most callers will just want to pass current_cred() as the cred
>> parameter, rename the hook to security_cred_locked_down() and provide
>> the original security_locked_down() function as a simple wrapper around
>> the new hook.
[...]
> 
>> 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common()
>>       Called when a BPF program calls a helper that could leak kernel
>>       memory. The task context is not relevant here, since the program
>>       may very well be run in the context of a different task than the
>>       consumer of the data.
>>       See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585
> 
> The access control check isn't so much who is consuming the data, but
> who is requesting a potential violation of a "lockdown", yes?  For
> example, the SELinux policy rule for the current lockdown check looks
> something like this:
> 
>    allow <who> <who> : lockdown { <reason> };
> 
> It seems to me that the task context is relevant here and performing
> the access control check based on the task's domain is correct.
This doesn't make much sense to me, it's /not/ the task 'requesting a potential
violation of a "lockdown"', but rather the running tracing program which is e.g.
inspecting kernel data structures around the triggered event. If I understood
you correctly, having an 'allow' check on, say, httpd would be rather odd since
things like perf/bcc/bpftrace/systemtap/etc is installing the tracing probe instead.

Meaning, if we would /not/ trace such events (like in the prior mentioned syscall
example), then there is also no call to the security_locked_down() from that same/
unmodified application.

Thanks,
Daniel
Jiri Olsa May 28, 2021, 9:53 a.m. UTC | #7
On Fri, May 28, 2021 at 09:09:57AM +0200, Daniel Borkmann wrote:
> On 5/28/21 3:37 AM, Paul Moore wrote:
> > On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> > > 
> > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > > lockdown") added an implementation of the locked_down LSM hook to
> > > SELinux, with the aim to restrict which domains are allowed to perform
> > > operations that would breach lockdown.
> > > 
> > > However, in several places the security_locked_down() hook is called in
> > > situations where the current task isn't doing any action that would
> > > directly breach lockdown, leading to SELinux checks that are basically
> > > bogus.
> > > 
> > > Since in most of these situations converting the callers such that
> > > security_locked_down() is called in a context where the current task
> > > would be meaningful for SELinux is impossible or very non-trivial (and
> > > could lead to TOCTOU issues for the classic Lockdown LSM
> > > implementation), fix this by modifying the hook to accept a struct cred
> > > pointer as argument, where NULL will be interpreted as a request for a
> > > "global", task-independent lockdown decision only. Then modify SELinux
> > > to ignore calls with cred == NULL.
> > 
> > I'm not overly excited about skipping the access check when cred is
> > NULL.  Based on the description and the little bit that I've dug into
> > thus far it looks like using SECINITSID_KERNEL as the subject would be
> > much more appropriate.  *Something* (the kernel in most of the
> > relevant cases it looks like) is requesting that a potentially
> > sensitive disclosure be made, and ignoring it seems like the wrong
> > thing to do.  Leaving the access control intact also provides a nice
> > avenue to audit these requests should users want to do that.
> 
> I think the rationale/workaround for ignoring calls with cred == NULL (or the previous
> patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his
> seen tracing cases:
> 
>   i) The audit events that are triggered due to calls to security_locked_down()
>      can OOM kill a machine, see below details [0].
> 
>  ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
>      when presumingly trying to wake up kauditd [1].

hi,
I saw the same deadlock, ended up with this sequence:

  rq_lock(rq) -> trace_sched_switch -> bpf_prog -> selinux_lockdown -> audit_log_end -> wake_up_interruptible -> try_to_wake_up -> rq_lock(rq)

problem is that trace_sched_switch already holds rq_lock

I had powerpc server where I could reproduce this easily,
but now for some reason I can't hit the issue anymore

jirka

> 
> How would your suggestion above solve both i) and ii)?
> 
> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585 :
> 
>   I starting seeing this with F-34. When I run a container that is traced with eBPF
>   to record the syscalls it is doing, auditd is flooded with messages like:
> 
>   type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality } for
>    pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
>     scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 tclass=lockdown permissive=0
> 
>   This seems to be leading to auditd running out of space in the backlog buffer and
>   eventually OOMs the machine.
> 
>   auditd running at 99% CPU presumably processing all the messages, eventually I get:
>   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
>   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
>   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
>   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
>   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
>   Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
>   Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
>   Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
>   Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
> 
> [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/ :
> 
>   Upstream kernel 5.11.0-rc7 and later was found to deadlock during a bpf_probe_read_compat()
>   call within a sched_switch tracepoint. The problem is reproducible with the reg_alloc3
>   testcase from SystemTap's BPF backend testsuite on x86_64 as well as the runqlat,runqslower
>   tools from bcc on ppc64le. Example stack trace from [1]:
> 
>   [  730.868702] stack backtrace:
>   [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
>   [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
>   [  730.873278] Call Trace:
>   [  730.873770]  dump_stack+0x7f/0xa1
>   [  730.874433]  check_noncircular+0xdf/0x100
>   [  730.875232]  __lock_acquire+0x1202/0x1e10
>   [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
>   [  730.876844]  lock_acquire+0xc2/0x3a0
>   [  730.877551]  ? __wake_up_common_lock+0x52/0x90
>   [  730.878434]  ? lock_acquire+0xc2/0x3a0
>   [  730.879186]  ? lock_is_held_type+0xa7/0x120
>   [  730.880044]  ? skb_queue_tail+0x1b/0x50
>   [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
>   [  730.881656]  ? __wake_up_common_lock+0x52/0x90
>   [  730.882532]  __wake_up_common_lock+0x52/0x90
>   [  730.883375]  audit_log_end+0x5b/0x100
>   [  730.884104]  slow_avc_audit+0x69/0x90
>   [  730.884836]  avc_has_perm+0x8b/0xb0
>   [  730.885532]  selinux_lockdown+0xa5/0xd0
>   [  730.886297]  security_locked_down+0x20/0x40
>   [  730.887133]  bpf_probe_read_compat+0x66/0xd0
>   [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
>   [  730.888917]  trace_call_bpf+0xe9/0x240
>   [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
>   [  730.890579]  perf_trace_sched_switch+0x142/0x180
>   [  730.891485]  ? __schedule+0x6d8/0xb20
>   [  730.892209]  __schedule+0x6d8/0xb20
>   [  730.892899]  schedule+0x5b/0xc0
>   [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
>   [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
>   [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> > > Since most callers will just want to pass current_cred() as the cred
> > > parameter, rename the hook to security_cred_locked_down() and provide
> > > the original security_locked_down() function as a simple wrapper around
> > > the new hook.
> [...]
> > 
> > > 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common()
> > >       Called when a BPF program calls a helper that could leak kernel
> > >       memory. The task context is not relevant here, since the program
> > >       may very well be run in the context of a different task than the
> > >       consumer of the data.
> > >       See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585
> > 
> > The access control check isn't so much who is consuming the data, but
> > who is requesting a potential violation of a "lockdown", yes?  For
> > example, the SELinux policy rule for the current lockdown check looks
> > something like this:
> > 
> >    allow <who> <who> : lockdown { <reason> };
> > 
> > It seems to me that the task context is relevant here and performing
> > the access control check based on the task's domain is correct.
> This doesn't make much sense to me, it's /not/ the task 'requesting a potential
> violation of a "lockdown"', but rather the running tracing program which is e.g.
> inspecting kernel data structures around the triggered event. If I understood
> you correctly, having an 'allow' check on, say, httpd would be rather odd since
> things like perf/bcc/bpftrace/systemtap/etc is installing the tracing probe instead.
> 
> Meaning, if we would /not/ trace such events (like in the prior mentioned syscall
> example), then there is also no call to the security_locked_down() from that same/
> unmodified application.
> 
> Thanks,
> Daniel
>
Daniel Borkmann May 28, 2021, 9:56 a.m. UTC | #8
On 5/28/21 9:09 AM, Daniel Borkmann wrote:
> On 5/28/21 3:37 AM, Paul Moore wrote:
>> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
>>>
>>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
>>> lockdown") added an implementation of the locked_down LSM hook to
>>> SELinux, with the aim to restrict which domains are allowed to perform
>>> operations that would breach lockdown.
>>>
>>> However, in several places the security_locked_down() hook is called in
>>> situations where the current task isn't doing any action that would
>>> directly breach lockdown, leading to SELinux checks that are basically
>>> bogus.
>>>
>>> Since in most of these situations converting the callers such that
>>> security_locked_down() is called in a context where the current task
>>> would be meaningful for SELinux is impossible or very non-trivial (and
>>> could lead to TOCTOU issues for the classic Lockdown LSM
>>> implementation), fix this by modifying the hook to accept a struct cred
>>> pointer as argument, where NULL will be interpreted as a request for a
>>> "global", task-independent lockdown decision only. Then modify SELinux
>>> to ignore calls with cred == NULL.
>>
>> I'm not overly excited about skipping the access check when cred is
>> NULL.  Based on the description and the little bit that I've dug into
>> thus far it looks like using SECINITSID_KERNEL as the subject would be
>> much more appropriate.  *Something* (the kernel in most of the
>> relevant cases it looks like) is requesting that a potentially
>> sensitive disclosure be made, and ignoring it seems like the wrong
>> thing to do.  Leaving the access control intact also provides a nice
>> avenue to audit these requests should users want to do that.
> 
> I think the rationale/workaround for ignoring calls with cred == NULL (or the previous
> patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his
> seen tracing cases:
> 
>    i) The audit events that are triggered due to calls to security_locked_down()
>       can OOM kill a machine, see below details [0].
> 
>   ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
>       when presumingly trying to wake up kauditd [1].

Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked
at the rest but it's also kind of independent), the attached fix should address both
reported issues, please take a look & test.

Thanks a lot,
Daniel
Jiri Olsa May 28, 2021, 10:16 a.m. UTC | #9
On Fri, May 28, 2021 at 11:56:02AM +0200, Daniel Borkmann wrote:

SNIP

> 
> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked
> at the rest but it's also kind of independent), the attached fix should address both
> reported issues, please take a look & test.
> 
> Thanks a lot,
> Daniel

> From 5893ad528dc0a0a68933b8f2a81b18d3f539660d Mon Sep 17 00:00:00 2001
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Fri, 28 May 2021 09:16:31 +0000
> Subject: [PATCH bpf] bpf, audit, lockdown: Fix bogus SELinux lockdown permission checks
> 
> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> added an implementation of the locked_down LSM hook to SELinux, with the aim
> to restrict which domains are allowed to perform operations that would breach
> lockdown. This is indirectly also getting audit subsystem involved to report
> events. The latter is problematic, as reported by Ondrej and Serhei, since it
> can bring down the whole system via audit:
> 
>   i) The audit events that are triggered due to calls to security_locked_down()
>      can OOM kill a machine, see below details [0].
> 
>  ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
>      when presumingly trying to wake up kauditd [1].
> 
> Fix both at the same time by taking a completely different approach, that is,
> move the check into the program verification phase where we actually retrieve
> the func proto. This also reliably gets the task (current) that is trying to
> install the tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also
> fixes the OOM since we're moving this out of the BPF helpers which can be called
> millions of times per second.

nice idea.. I'll try to reproduce and test

jirka
Jiri Olsa May 28, 2021, 11:47 a.m. UTC | #10
On Fri, May 28, 2021 at 11:56:02AM +0200, Daniel Borkmann wrote:

SNIP

> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked
> at the rest but it's also kind of independent), the attached fix should address both
> reported issues, please take a look & test.
> 
> Thanks a lot,
> Daniel

> From 5893ad528dc0a0a68933b8f2a81b18d3f539660d Mon Sep 17 00:00:00 2001
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Fri, 28 May 2021 09:16:31 +0000
> Subject: [PATCH bpf] bpf, audit, lockdown: Fix bogus SELinux lockdown permission checks
> 
> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> added an implementation of the locked_down LSM hook to SELinux, with the aim
> to restrict which domains are allowed to perform operations that would breach
> lockdown. This is indirectly also getting audit subsystem involved to report
> events. The latter is problematic, as reported by Ondrej and Serhei, since it
> can bring down the whole system via audit:
> 
>   i) The audit events that are triggered due to calls to security_locked_down()
>      can OOM kill a machine, see below details [0].
> 
>  ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
>      when presumingly trying to wake up kauditd [1].
> 
> Fix both at the same time by taking a completely different approach, that is,
> move the check into the program verification phase where we actually retrieve
> the func proto. This also reliably gets the task (current) that is trying to
> install the tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also
> fixes the OOM since we're moving this out of the BPF helpers which can be called
> millions of times per second.
> 
> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:
> 
>   I starting seeing this with F-34. When I run a container that is traced with
>   BPF to record the syscalls it is doing, auditd is flooded with messages like:
> 
>   type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
>     for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
>       scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
>         tclass=lockdown permissive=0
> 
>   This seems to be leading to auditd running out of space in the backlog buffer
>   and eventually OOMs the machine.
> 
>   [...]
>   auditd running at 99% CPU presumably processing all the messages, eventually I get:
>   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
>   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
>   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
>   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
>   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
>   Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
>   Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
>   Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
>   Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
>   [...]
> 
> [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
>     Serhei Makarov says:
> 
>   Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
>   bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
>   is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
>   testsuite on x86_64 as well as the runqlat,runqslower tools from bcc on
>   ppc64le. Example stack trace:
> 
>   [...]
>   [  730.868702] stack backtrace:
>   [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
>   [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
>   [  730.873278] Call Trace:
>   [  730.873770]  dump_stack+0x7f/0xa1
>   [  730.874433]  check_noncircular+0xdf/0x100
>   [  730.875232]  __lock_acquire+0x1202/0x1e10
>   [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
>   [  730.876844]  lock_acquire+0xc2/0x3a0
>   [  730.877551]  ? __wake_up_common_lock+0x52/0x90
>   [  730.878434]  ? lock_acquire+0xc2/0x3a0
>   [  730.879186]  ? lock_is_held_type+0xa7/0x120
>   [  730.880044]  ? skb_queue_tail+0x1b/0x50
>   [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
>   [  730.881656]  ? __wake_up_common_lock+0x52/0x90
>   [  730.882532]  __wake_up_common_lock+0x52/0x90
>   [  730.883375]  audit_log_end+0x5b/0x100
>   [  730.884104]  slow_avc_audit+0x69/0x90
>   [  730.884836]  avc_has_perm+0x8b/0xb0
>   [  730.885532]  selinux_lockdown+0xa5/0xd0
>   [  730.886297]  security_locked_down+0x20/0x40
>   [  730.887133]  bpf_probe_read_compat+0x66/0xd0
>   [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
>   [  730.888917]  trace_call_bpf+0xe9/0x240
>   [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
>   [  730.890579]  perf_trace_sched_switch+0x142/0x180
>   [  730.891485]  ? __schedule+0x6d8/0xb20
>   [  730.892209]  __schedule+0x6d8/0xb20
>   [  730.892899]  schedule+0x5b/0xc0
>   [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
>   [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
>   [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>   [...]
> 
> Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
> Reported-by: Jakub Hrozek <jhrozek@redhat.com>
> Reported-by: Serhei Makarov <smakarov@redhat.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Stephen Smalley <sds@tycho.nsa.gov>
> Cc: Jerome Marchand <jmarchan@redhat.com>
> Cc: Frank Eigler <fche@redhat.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Paul Moore <paul@paul-moore.com>

found the original server and reproduced.. this patch fixes it for me 

Tested-by: Jiri Olsa <jolsa@redhat.com>

thanks,
jirka

> ---
>  kernel/bpf/helpers.c     |  6 ++++--
>  kernel/trace/bpf_trace.c | 36 +++++++++++++-----------------------
>  2 files changed, 17 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 73443498d88f..6f6e090c5310 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1069,11 +1069,13 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>  	case BPF_FUNC_probe_read_user:
>  		return &bpf_probe_read_user_proto;
>  	case BPF_FUNC_probe_read_kernel:
> -		return &bpf_probe_read_kernel_proto;
> +		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
> +		       NULL : &bpf_probe_read_kernel_proto;
>  	case BPF_FUNC_probe_read_user_str:
>  		return &bpf_probe_read_user_str_proto;
>  	case BPF_FUNC_probe_read_kernel_str:
> -		return &bpf_probe_read_kernel_str_proto;
> +		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
> +		       NULL : &bpf_probe_read_kernel_str_proto;
>  	case BPF_FUNC_snprintf_btf:
>  		return &bpf_snprintf_btf_proto;
>  	case BPF_FUNC_snprintf:
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d2d7cf6cfe83..3df43d89d642 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -215,16 +215,10 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = {
>  static __always_inline int
>  bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr)
>  {
> -	int ret = security_locked_down(LOCKDOWN_BPF_READ);
> +	int ret = copy_from_kernel_nofault(dst, unsafe_ptr, size);
>  
>  	if (unlikely(ret < 0))
> -		goto fail;
> -	ret = copy_from_kernel_nofault(dst, unsafe_ptr, size);
> -	if (unlikely(ret < 0))
> -		goto fail;
> -	return ret;
> -fail:
> -	memset(dst, 0, size);
> +		memset(dst, 0, size);
>  	return ret;
>  }
>  
> @@ -246,11 +240,6 @@ const struct bpf_func_proto bpf_probe_read_kernel_proto = {
>  static __always_inline int
>  bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
>  {
> -	int ret = security_locked_down(LOCKDOWN_BPF_READ);
> -
> -	if (unlikely(ret < 0))
> -		goto fail;
> -
>  	/*
>  	 * The strncpy_from_kernel_nofault() call will likely not fill the
>  	 * entire buffer, but that's okay in this circumstance as we're probing
> @@ -260,13 +249,10 @@ bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
>  	 * code altogether don't copy garbage; otherwise length of string
>  	 * is returned that can be used for bpf_perf_event_output() et al.
>  	 */
> -	ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
> -	if (unlikely(ret < 0))
> -		goto fail;
> +	int ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
>  
> -	return ret;
> -fail:
> -	memset(dst, 0, size);
> +	if (unlikely(ret < 0))
> +		memset(dst, 0, size);
>  	return ret;
>  }
>  
> @@ -1011,16 +997,20 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  	case BPF_FUNC_probe_read_user:
>  		return &bpf_probe_read_user_proto;
>  	case BPF_FUNC_probe_read_kernel:
> -		return &bpf_probe_read_kernel_proto;
> +		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
> +		       NULL : &bpf_probe_read_kernel_proto;
>  	case BPF_FUNC_probe_read_user_str:
>  		return &bpf_probe_read_user_str_proto;
>  	case BPF_FUNC_probe_read_kernel_str:
> -		return &bpf_probe_read_kernel_str_proto;
> +		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
> +		       NULL : &bpf_probe_read_kernel_str_proto;
>  #ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
>  	case BPF_FUNC_probe_read:
> -		return &bpf_probe_read_compat_proto;
> +		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
> +		       NULL : &bpf_probe_read_compat_proto;
>  	case BPF_FUNC_probe_read_str:
> -		return &bpf_probe_read_compat_str_proto;
> +		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
> +		       NULL : &bpf_probe_read_compat_str_proto;
>  #endif
>  #ifdef CONFIG_CGROUPS
>  	case BPF_FUNC_get_current_cgroup_id:
> -- 
> 2.27.0
>
Daniel Borkmann May 28, 2021, 11:54 a.m. UTC | #11
On 5/28/21 1:47 PM, Jiri Olsa wrote:
> On Fri, May 28, 2021 at 11:56:02AM +0200, Daniel Borkmann wrote:
> 
>> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked
>> at the rest but it's also kind of independent), the attached fix should address both
>> reported issues, please take a look & test.
>>
>> Thanks a lot,
>> Daniel
> 
>>  From 5893ad528dc0a0a68933b8f2a81b18d3f539660d Mon Sep 17 00:00:00 2001
>> From: Daniel Borkmann <daniel@iogearbox.net>
>> Date: Fri, 28 May 2021 09:16:31 +0000
>> Subject: [PATCH bpf] bpf, audit, lockdown: Fix bogus SELinux lockdown permission checks
>>
>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
>> added an implementation of the locked_down LSM hook to SELinux, with the aim
>> to restrict which domains are allowed to perform operations that would breach
>> lockdown. This is indirectly also getting audit subsystem involved to report
>> events. The latter is problematic, as reported by Ondrej and Serhei, since it
>> can bring down the whole system via audit:
>>
>>    i) The audit events that are triggered due to calls to security_locked_down()
>>       can OOM kill a machine, see below details [0].
>>
>>   ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
>>       when presumingly trying to wake up kauditd [1].
>>
>> Fix both at the same time by taking a completely different approach, that is,
>> move the check into the program verification phase where we actually retrieve
>> the func proto. This also reliably gets the task (current) that is trying to
>> install the tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also
>> fixes the OOM since we're moving this out of the BPF helpers which can be called
>> millions of times per second.
>>
>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:
>>
>>    I starting seeing this with F-34. When I run a container that is traced with
>>    BPF to record the syscalls it is doing, auditd is flooded with messages like:
>>
>>    type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
>>      for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
>>        scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
>>          tclass=lockdown permissive=0
>>
>>    This seems to be leading to auditd running out of space in the backlog buffer
>>    and eventually OOMs the machine.
>>
>>    [...]
>>    auditd running at 99% CPU presumably processing all the messages, eventually I get:
>>    Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
>>    Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
>>    Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
>>    Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
>>    Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
>>    Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
>>    Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
>>    Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
>>    Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
>>    [...]
>>
>> [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
>>      Serhei Makarov says:
>>
>>    Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
>>    bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
>>    is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
>>    testsuite on x86_64 as well as the runqlat,runqslower tools from bcc on
>>    ppc64le. Example stack trace:
>>
>>    [...]
>>    [  730.868702] stack backtrace:
>>    [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
>>    [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
>>    [  730.873278] Call Trace:
>>    [  730.873770]  dump_stack+0x7f/0xa1
>>    [  730.874433]  check_noncircular+0xdf/0x100
>>    [  730.875232]  __lock_acquire+0x1202/0x1e10
>>    [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
>>    [  730.876844]  lock_acquire+0xc2/0x3a0
>>    [  730.877551]  ? __wake_up_common_lock+0x52/0x90
>>    [  730.878434]  ? lock_acquire+0xc2/0x3a0
>>    [  730.879186]  ? lock_is_held_type+0xa7/0x120
>>    [  730.880044]  ? skb_queue_tail+0x1b/0x50
>>    [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
>>    [  730.881656]  ? __wake_up_common_lock+0x52/0x90
>>    [  730.882532]  __wake_up_common_lock+0x52/0x90
>>    [  730.883375]  audit_log_end+0x5b/0x100
>>    [  730.884104]  slow_avc_audit+0x69/0x90
>>    [  730.884836]  avc_has_perm+0x8b/0xb0
>>    [  730.885532]  selinux_lockdown+0xa5/0xd0
>>    [  730.886297]  security_locked_down+0x20/0x40
>>    [  730.887133]  bpf_probe_read_compat+0x66/0xd0
>>    [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
>>    [  730.888917]  trace_call_bpf+0xe9/0x240
>>    [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
>>    [  730.890579]  perf_trace_sched_switch+0x142/0x180
>>    [  730.891485]  ? __schedule+0x6d8/0xb20
>>    [  730.892209]  __schedule+0x6d8/0xb20
>>    [  730.892899]  schedule+0x5b/0xc0
>>    [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
>>    [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
>>    [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>    [...]
>>
>> Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
>> Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
>> Reported-by: Jakub Hrozek <jhrozek@redhat.com>
>> Reported-by: Serhei Makarov <smakarov@redhat.com>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: Stephen Smalley <sds@tycho.nsa.gov>
>> Cc: Jerome Marchand <jmarchan@redhat.com>
>> Cc: Frank Eigler <fche@redhat.com>
>> Cc: Jiri Olsa <jolsa@redhat.com>
>> Cc: Paul Moore <paul@paul-moore.com>
> 
> found the original server and reproduced.. this patch fixes it for me
> 
> Tested-by: Jiri Olsa <jolsa@redhat.com>

Thanks Jiri! If Paul is fine with this as well, I can later route this fix via
bpf tree.

Best,
Daniel
Ondrej Mosnacek May 28, 2021, 1:42 p.m. UTC | #12
(I'm off work today and plan to reply also to Paul's comments next
week, but for now let me at least share a couple quick thoughts on
Daniel's patch.)

On Fri, May 28, 2021 at 11:56 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 5/28/21 9:09 AM, Daniel Borkmann wrote:
> > On 5/28/21 3:37 AM, Paul Moore wrote:
> >> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> >>>
> >>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> >>> lockdown") added an implementation of the locked_down LSM hook to
> >>> SELinux, with the aim to restrict which domains are allowed to perform
> >>> operations that would breach lockdown.
> >>>
> >>> However, in several places the security_locked_down() hook is called in
> >>> situations where the current task isn't doing any action that would
> >>> directly breach lockdown, leading to SELinux checks that are basically
> >>> bogus.
> >>>
> >>> Since in most of these situations converting the callers such that
> >>> security_locked_down() is called in a context where the current task
> >>> would be meaningful for SELinux is impossible or very non-trivial (and
> >>> could lead to TOCTOU issues for the classic Lockdown LSM
> >>> implementation), fix this by modifying the hook to accept a struct cred
> >>> pointer as argument, where NULL will be interpreted as a request for a
> >>> "global", task-independent lockdown decision only. Then modify SELinux
> >>> to ignore calls with cred == NULL.
> >>
> >> I'm not overly excited about skipping the access check when cred is
> >> NULL.  Based on the description and the little bit that I've dug into
> >> thus far it looks like using SECINITSID_KERNEL as the subject would be
> >> much more appropriate.  *Something* (the kernel in most of the
> >> relevant cases it looks like) is requesting that a potentially
> >> sensitive disclosure be made, and ignoring it seems like the wrong
> >> thing to do.  Leaving the access control intact also provides a nice
> >> avenue to audit these requests should users want to do that.
> >
> > I think the rationale/workaround for ignoring calls with cred == NULL (or the previous
> > patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his
> > seen tracing cases:
> >
> >    i) The audit events that are triggered due to calls to security_locked_down()
> >       can OOM kill a machine, see below details [0].
> >
> >   ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
> >       when presumingly trying to wake up kauditd [1].

Actually, I wasn't aware of the deadlock... But calling an LSM hook
[that is backed by a SELinux access check] from within a BPF helper is
calling for all kinds of trouble, so I'm not surprised :)

> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked
> at the rest but it's also kind of independent), the attached fix should address both
> reported issues, please take a look & test.

Thanks, I like this solution, although there are a few gotchas:

1. This patch creates a slight "regression" in that if someone flips
the Lockdown LSM into lockdown mode on runtime, existing (already
loaded) BPF programs will still be able to call the
confidentiality-breaching helpers, while before the lockdown would
apply also to them. Personally, I don't think it's a big deal (and I
bet there are other existing cases where some handle kept from before
lockdown could leak data), but I wanted to mention it in case someone
thinks the opposite.

2. IIUC. when a BPF program is rejected due to lockdown/SELinux, the
kernel will return -EINVAL to userspace (looking at
check_helper_call() in kernel/bpf/verifier.c; didn't have time to look
at other callers...). It would be nicer if the error code from the
security_locked_down() call would be passed through the call chain and
eventually returned to the caller. It should be relatively
straightforward to convert bpf_base_func_proto() to return a PTR_ERR()
instead of NULL on error, but it looks like this would result in quite
a big patch updating all the callers (and callers of callers, etc.)
with a not-so-small chance of missing some NULL check and introducing
a bug... I guess we could live with EINVAL-on-denied in stable kernels
and only have the error path refactoring in -next; I'm not sure...

3. This is a bit of a shot-in-the-dark, but I suppose there might be
some BPF programs that would be able to do something useful also when
the read_kernel helpers return an error, yet the kernel will now
outright refuse to load them (when the lockdown hook returns nonzero).
I have no idea if such BPF programs realistically exist in practice,
but perhaps it would be worth returning some dummy
always-error-returning helper function instead of NULL from
bpf_base_func_proto() when security_locked_down() returns an error.
That would also resolve (2.), basically. (Then there is the question
of what error code to use (because Lockdown LSM uses -EPERM, while
SELinux -EACCESS), but I think always returning -EPERM from such stub
helpers would be a viable choice.)
Steven Rostedt May 28, 2021, 1:58 p.m. UTC | #13
On Mon, 17 May 2021 11:20:06 +0200
Ondrej Mosnacek <omosnace@redhat.com> wrote:

> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index 1261e8b41edb..7edde3fc22f5 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -396,7 +396,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
>  	struct dentry *dentry;
>  	struct inode *inode;
>  
> -	if (security_locked_down(LOCKDOWN_TRACEFS))
> +	if (security_cred_locked_down(NULL, LOCKDOWN_TRACEFS))
>  		return NULL;
>  
>  	if (!(mode & S_IFMT))

Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve
Daniel Borkmann May 28, 2021, 2:20 p.m. UTC | #14
On 5/28/21 3:42 PM, Ondrej Mosnacek wrote:
> (I'm off work today and plan to reply also to Paul's comments next
> week, but for now let me at least share a couple quick thoughts on
> Daniel's patch.)
> 
> On Fri, May 28, 2021 at 11:56 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 5/28/21 9:09 AM, Daniel Borkmann wrote:
>>> On 5/28/21 3:37 AM, Paul Moore wrote:
>>>> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
>>>>>
>>>>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
>>>>> lockdown") added an implementation of the locked_down LSM hook to
>>>>> SELinux, with the aim to restrict which domains are allowed to perform
>>>>> operations that would breach lockdown.
>>>>>
>>>>> However, in several places the security_locked_down() hook is called in
>>>>> situations where the current task isn't doing any action that would
>>>>> directly breach lockdown, leading to SELinux checks that are basically
>>>>> bogus.
>>>>>
>>>>> Since in most of these situations converting the callers such that
>>>>> security_locked_down() is called in a context where the current task
>>>>> would be meaningful for SELinux is impossible or very non-trivial (and
>>>>> could lead to TOCTOU issues for the classic Lockdown LSM
>>>>> implementation), fix this by modifying the hook to accept a struct cred
>>>>> pointer as argument, where NULL will be interpreted as a request for a
>>>>> "global", task-independent lockdown decision only. Then modify SELinux
>>>>> to ignore calls with cred == NULL.
>>>>
>>>> I'm not overly excited about skipping the access check when cred is
>>>> NULL.  Based on the description and the little bit that I've dug into
>>>> thus far it looks like using SECINITSID_KERNEL as the subject would be
>>>> much more appropriate.  *Something* (the kernel in most of the
>>>> relevant cases it looks like) is requesting that a potentially
>>>> sensitive disclosure be made, and ignoring it seems like the wrong
>>>> thing to do.  Leaving the access control intact also provides a nice
>>>> avenue to audit these requests should users want to do that.
>>>
>>> I think the rationale/workaround for ignoring calls with cred == NULL (or the previous
>>> patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his
>>> seen tracing cases:
>>>
>>>     i) The audit events that are triggered due to calls to security_locked_down()
>>>        can OOM kill a machine, see below details [0].
>>>
>>>    ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
>>>        when presumingly trying to wake up kauditd [1].
> 
> Actually, I wasn't aware of the deadlock... But calling an LSM hook
> [that is backed by a SELinux access check] from within a BPF helper is
> calling for all kinds of trouble, so I'm not surprised :)

Fully agree, it's just waiting to blow up in unpredictable ways.. :/

>> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked
>> at the rest but it's also kind of independent), the attached fix should address both
>> reported issues, please take a look & test.
> 
> Thanks, I like this solution, although there are a few gotchas:
> 
> 1. This patch creates a slight "regression" in that if someone flips
> the Lockdown LSM into lockdown mode on runtime, existing (already
> loaded) BPF programs will still be able to call the
> confidentiality-breaching helpers, while before the lockdown would
> apply also to them. Personally, I don't think it's a big deal (and I
> bet there are other existing cases where some handle kept from before
> lockdown could leak data), but I wanted to mention it in case someone
> thinks the opposite.

Yes, right, though this is nothing new either in the sense that there are
plenty of other cases with security_locked_down() that operate this way
e.g. take the open_kcore() for /proc/kcore access or the module_sig_check()
for mod signatures just to pick some random ones, same approach where the
enforcement is happen at open/load time.

> 2. IIUC. when a BPF program is rejected due to lockdown/SELinux, the
> kernel will return -EINVAL to userspace (looking at
> check_helper_call() in kernel/bpf/verifier.c; didn't have time to look
> at other callers...). It would be nicer if the error code from the
> security_locked_down() call would be passed through the call chain and
> eventually returned to the caller. It should be relatively
> straightforward to convert bpf_base_func_proto() to return a PTR_ERR()
> instead of NULL on error, but it looks like this would result in quite
> a big patch updating all the callers (and callers of callers, etc.)
> with a not-so-small chance of missing some NULL check and introducing
> a bug... I guess we could live with EINVAL-on-denied in stable kernels
> and only have the error path refactoring in -next; I'm not sure...

Right, it would return a verifier log entry with reporting to the user that
the prog is attempting to use an unavailable/unknown helper function. We do
have similar return NULL with bpf_capable() and perfmon_capable() checks.
Potentially, we could do PTR_ERR() in future where we tell if it failed due
to missing CAPs, due to lockdown or just due to helper not compiled in..

> 3. This is a bit of a shot-in-the-dark, but I suppose there might be
> some BPF programs that would be able to do something useful also when
> the read_kernel helpers return an error, yet the kernel will now
> outright refuse to load them (when the lockdown hook returns nonzero).
> I have no idea if such BPF programs realistically exist in practice,
> but perhaps it would be worth returning some dummy
> always-error-returning helper function instead of NULL from
> bpf_base_func_proto() when security_locked_down() returns an error.
> That would also resolve (2.), basically. (Then there is the question
> of what error code to use (because Lockdown LSM uses -EPERM, while
> SELinux -EACCESS), but I think always returning -EPERM from such stub
> helpers would be a viable choice.)

It would actually be harder to debug. Returning NULL at verification
time, libbpf, for example, would have a chance to probe for this. See the
feature_probes[] in libbpf's kernel_supports(), so it could provide a
meaningful warning to the user that the tracing functionality is unavailable
on the system. With returning an error from the helper, libbpf cannot check
it.. theoretically, it could but significantly more cumbersome given it
needs to attach the probe somewhere, trigger it, read out the helper result
and pass it back to libbpf user space.. not really feasible. Overall,
moving into func_proto and returning NULL is the much better approach
(and in line with the CAP check enforcement).

Thanks,
Daniel
Paul Moore May 28, 2021, 3:47 p.m. UTC | #15
On Fri, May 28, 2021 at 3:10 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 5/28/21 3:37 AM, Paul Moore wrote:
> > On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> >>
> >> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> >> lockdown") added an implementation of the locked_down LSM hook to
> >> SELinux, with the aim to restrict which domains are allowed to perform
> >> operations that would breach lockdown.
> >>
> >> However, in several places the security_locked_down() hook is called in
> >> situations where the current task isn't doing any action that would
> >> directly breach lockdown, leading to SELinux checks that are basically
> >> bogus.
> >>
> >> Since in most of these situations converting the callers such that
> >> security_locked_down() is called in a context where the current task
> >> would be meaningful for SELinux is impossible or very non-trivial (and
> >> could lead to TOCTOU issues for the classic Lockdown LSM
> >> implementation), fix this by modifying the hook to accept a struct cred
> >> pointer as argument, where NULL will be interpreted as a request for a
> >> "global", task-independent lockdown decision only. Then modify SELinux
> >> to ignore calls with cred == NULL.
> >
> > I'm not overly excited about skipping the access check when cred is
> > NULL.  Based on the description and the little bit that I've dug into
> > thus far it looks like using SECINITSID_KERNEL as the subject would be
> > much more appropriate.  *Something* (the kernel in most of the
> > relevant cases it looks like) is requesting that a potentially
> > sensitive disclosure be made, and ignoring it seems like the wrong
> > thing to do.  Leaving the access control intact also provides a nice
> > avenue to audit these requests should users want to do that.
>
> I think the rationale/workaround for ignoring calls with cred == NULL (or the previous
> patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his
> seen tracing cases:
>
>    i) The audit events that are triggered due to calls to security_locked_down()
>       can OOM kill a machine, see below details [0].
>
>   ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
>       when presumingly trying to wake up kauditd [1].
>
> How would your suggestion above solve both i) and ii)?

First off, a bit of general commentary - I'm not sure if Ondrej was
aware of this, but info like that is good to have in the commit
description.  Perhaps it was in the linked RHBZ but I try not to look
at those when reviewing patches; the commit descriptions must be
self-sufficient since we can't rely on the accessibility or the
lifetime of external references.  It's fine if people want to include
external links in their commits, I would actually even encourage it in
some cases, but the links shouldn't replace a proper description of
the problem and why the proposed solution is The Best Solution.

With that out of the way, it sounds like your issue isn't so much the
access check, but rather the frequency of the access denials and the
resulting audit records in your particular use case.  My initial
reaction is that you might want to understand why you are getting so
many SELinux access denials, your loaded security policy clearly does
not match with your intended use :)  Beyond that, if you want to
basically leave things as-is but quiet the high frequency audit
records that result from these SELinux denials you might want to look
into the SELinux "dontaudit" policy rule, it was created for things
like this.  Some info can be found in The SELinux Notebook, relevant
link below:

* https://github.com/SELinuxProject/selinux-notebook/blob/main/src/avc_rules.md#dontaudit

The deadlock issue that was previously reported remains an open case
as far as I'm concerned; I'm presently occupied trying to sort out a
rather serious issue with respect to io_uring and LSM/audit (plus
general stuff at $DAYJOB) so I haven't had time to investigate this
any further.  Of course anyone else is welcome to dive into it (I
always want to encourage this, especially from "performance people"
who just want to shut it all off), however if the answer is basically
"disable LSM and/or audit checks" you have to know that it is going to
result in a high degree of skepticism from me, so heavy documentation
on why it is The Best Solution would be a very good thing :)  Beyond
that, I think the suggestions above of "why do you have so many policy
denials?" and "have you looked into dontaudit?" are solid places to
look for a solution in your particular case.

> >> Since most callers will just want to pass current_cred() as the cred
> >> parameter, rename the hook to security_cred_locked_down() and provide
> >> the original security_locked_down() function as a simple wrapper around
> >> the new hook.
>
> [...]
> >
> >> 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common()
> >>       Called when a BPF program calls a helper that could leak kernel
> >>       memory. The task context is not relevant here, since the program
> >>       may very well be run in the context of a different task than the
> >>       consumer of the data.
> >>       See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585
> >
> > The access control check isn't so much who is consuming the data, but
> > who is requesting a potential violation of a "lockdown", yes?  For
> > example, the SELinux policy rule for the current lockdown check looks
> > something like this:
> >
> >    allow <who> <who> : lockdown { <reason> };
> >
> > It seems to me that the task context is relevant here and performing
> > the access control check based on the task's domain is correct.
>
> This doesn't make much sense to me, it's /not/ the task 'requesting a potential
> violation of a "lockdown"', but rather the running tracing program which is e.g.
> inspecting kernel data structures around the triggered event. If I understood
> you correctly, having an 'allow' check on, say, httpd would be rather odd since
> things like perf/bcc/bpftrace/systemtap/etc is installing the tracing probe instead.
>
> Meaning, if we would /not/ trace such events (like in the prior mentioned syscall
> example), then there is also no call to the security_locked_down() from that same/
> unmodified application.

My turn to say that you don't make much sense to me :)

Let's reset.

What task_struct is running the BPF tracing program which is calling
into security_locked_down()?  My current feeling is that it is this
context/domain/cred that should be used for the access control check;
in the cases where it is a kernel thread, I think passing NULL is
reasonable, but I think the proper thing for SELinux is to interpret
NULL as kernel_t.
Paul Moore May 28, 2021, 3:54 p.m. UTC | #16
On Fri, May 28, 2021 at 10:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 5/28/21 3:42 PM, Ondrej Mosnacek wrote:
> > (I'm off work today and plan to reply also to Paul's comments next
> > week, but for now let me at least share a couple quick thoughts on
> > Daniel's patch.)

Oooh, I sense some disagreement brewing :)

> > On Fri, May 28, 2021 at 11:56 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 5/28/21 9:09 AM, Daniel Borkmann wrote:
> >>> On 5/28/21 3:37 AM, Paul Moore wrote:
> >>>> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:

...

> >> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked
> >> at the rest but it's also kind of independent), the attached fix should address both
> >> reported issues, please take a look & test.
> >
> > Thanks, I like this solution, although there are a few gotchas:
> >
> > 1. This patch creates a slight "regression" in that if someone flips
> > the Lockdown LSM into lockdown mode on runtime, existing (already
> > loaded) BPF programs will still be able to call the
> > confidentiality-breaching helpers, while before the lockdown would
> > apply also to them. Personally, I don't think it's a big deal (and I
> > bet there are other existing cases where some handle kept from before
> > lockdown could leak data), but I wanted to mention it in case someone
> > thinks the opposite.
>
> Yes, right, though this is nothing new either in the sense that there are
> plenty of other cases with security_locked_down() that operate this way
> e.g. take the open_kcore() for /proc/kcore access or the module_sig_check()
> for mod signatures just to pick some random ones, same approach where the
> enforcement is happen at open/load time.

Another, yes, this is not really a good thing to do.  Also, just
because there are other places that don't really do The Right Thing
doesn't mean that it is okay to also not do The Right Thing here.
It's basically the two-wrongs-don't-make-a-right issue applied to
kernel code.
Daniel Borkmann May 28, 2021, 6:10 p.m. UTC | #17
On 5/28/21 5:47 PM, Paul Moore wrote:
> On Fri, May 28, 2021 at 3:10 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 5/28/21 3:37 AM, Paul Moore wrote:
>>> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
>>>>
>>>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
>>>> lockdown") added an implementation of the locked_down LSM hook to
>>>> SELinux, with the aim to restrict which domains are allowed to perform
>>>> operations that would breach lockdown.
>>>>
>>>> However, in several places the security_locked_down() hook is called in
>>>> situations where the current task isn't doing any action that would
>>>> directly breach lockdown, leading to SELinux checks that are basically
>>>> bogus.
>>>>
>>>> Since in most of these situations converting the callers such that
>>>> security_locked_down() is called in a context where the current task
>>>> would be meaningful for SELinux is impossible or very non-trivial (and
>>>> could lead to TOCTOU issues for the classic Lockdown LSM
>>>> implementation), fix this by modifying the hook to accept a struct cred
>>>> pointer as argument, where NULL will be interpreted as a request for a
>>>> "global", task-independent lockdown decision only. Then modify SELinux
>>>> to ignore calls with cred == NULL.
>>>
>>> I'm not overly excited about skipping the access check when cred is
>>> NULL.  Based on the description and the little bit that I've dug into
>>> thus far it looks like using SECINITSID_KERNEL as the subject would be
>>> much more appropriate.  *Something* (the kernel in most of the
>>> relevant cases it looks like) is requesting that a potentially
>>> sensitive disclosure be made, and ignoring it seems like the wrong
>>> thing to do.  Leaving the access control intact also provides a nice
>>> avenue to audit these requests should users want to do that.
>>
>> I think the rationale/workaround for ignoring calls with cred == NULL (or the previous
>> patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his
>> seen tracing cases:
>>
>>     i) The audit events that are triggered due to calls to security_locked_down()
>>        can OOM kill a machine, see below details [0].
>>
>>    ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
>>        when presumingly trying to wake up kauditd [1].
>>
>> How would your suggestion above solve both i) and ii)?
> 
> First off, a bit of general commentary - I'm not sure if Ondrej was
> aware of this, but info like that is good to have in the commit
> description.  Perhaps it was in the linked RHBZ but I try not to look
> at those when reviewing patches; the commit descriptions must be
> self-sufficient since we can't rely on the accessibility or the
> lifetime of external references.  It's fine if people want to include
> external links in their commits, I would actually even encourage it in
> some cases, but the links shouldn't replace a proper description of
> the problem and why the proposed solution is The Best Solution.
> 
> With that out of the way, it sounds like your issue isn't so much the
> access check, but rather the frequency of the access denials and the
> resulting audit records in your particular use case.  My initial
> reaction is that you might want to understand why you are getting so
> many SELinux access denials, your loaded security policy clearly does
> not match with your intended use :)  Beyond that, if you want to
> basically leave things as-is but quiet the high frequency audit
> records that result from these SELinux denials you might want to look
> into the SELinux "dontaudit" policy rule, it was created for things
> like this.  Some info can be found in The SELinux Notebook, relevant
> link below:
> 
> * https://github.com/SELinuxProject/selinux-notebook/blob/main/src/avc_rules.md#dontaudit
> 
> The deadlock issue that was previously reported remains an open case
> as far as I'm concerned; I'm presently occupied trying to sort out a
> rather serious issue with respect to io_uring and LSM/audit (plus
> general stuff at $DAYJOB) so I haven't had time to investigate this
> any further.  Of course anyone else is welcome to dive into it (I
> always want to encourage this, especially from "performance people"
> who just want to shut it all off), however if the answer is basically
> "disable LSM and/or audit checks" you have to know that it is going to
> result in a high degree of skepticism from me, so heavy documentation
> on why it is The Best Solution would be a very good thing :)  Beyond
> that, I think the suggestions above of "why do you have so many policy
> denials?" and "have you looked into dontaudit?" are solid places to
> look for a solution in your particular case.
> 
>>>> Since most callers will just want to pass current_cred() as the cred
>>>> parameter, rename the hook to security_cred_locked_down() and provide
>>>> the original security_locked_down() function as a simple wrapper around
>>>> the new hook.
>>
>> [...]
>>>
>>>> 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common()
>>>>        Called when a BPF program calls a helper that could leak kernel
>>>>        memory. The task context is not relevant here, since the program
>>>>        may very well be run in the context of a different task than the
>>>>        consumer of the data.
>>>>        See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585
>>>
>>> The access control check isn't so much who is consuming the data, but
>>> who is requesting a potential violation of a "lockdown", yes?  For
>>> example, the SELinux policy rule for the current lockdown check looks
>>> something like this:
>>>
>>>     allow <who> <who> : lockdown { <reason> };
>>>
>>> It seems to me that the task context is relevant here and performing
>>> the access control check based on the task's domain is correct.
>>
>> This doesn't make much sense to me, it's /not/ the task 'requesting a potential
>> violation of a "lockdown"', but rather the running tracing program which is e.g.
>> inspecting kernel data structures around the triggered event. If I understood
>> you correctly, having an 'allow' check on, say, httpd would be rather odd since
>> things like perf/bcc/bpftrace/systemtap/etc is installing the tracing probe instead.
>>
>> Meaning, if we would /not/ trace such events (like in the prior mentioned syscall
>> example), then there is also no call to the security_locked_down() from that same/
>> unmodified application.
> 
> My turn to say that you don't make much sense to me :)
> 
> Let's reset.

Sure, yep, lets shortly take one step back. :)

> What task_struct is running the BPF tracing program which is calling
> into security_locked_down()?  My current feeling is that it is this
> context/domain/cred that should be used for the access control check;
> in the cases where it is a kernel thread, I think passing NULL is
> reasonable, but I think the proper thing for SELinux is to interpret
> NULL as kernel_t.

If this was a typical LSM hook and, say, your app calls into bind(2) where
we then invoke security_socket_bind() and check 'current' task, then I'm all
with you, because this was _explicitly initiated_ by the httpd app, so that
allow/deny policy belongs in the context of httpd.

In the case of tracing, it's different. You install small programs that are
triggered when certain events fire. Random example from bpftrace's README [0],
you want to generate a histogram of syscall counts by program. One-liner is:

   bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

bpftrace then goes and generates a BPF prog from this internally. One way of
doing it could be to call bpf_get_current_task() helper and then access
current->comm via one of bpf_probe_read_kernel{,_str}() helpers. So the
program itself has nothing to do with httpd or any other random app doing
a syscall here. The BPF prog _explicitly initiated_ the lockdown check.
The allow/deny policy belongs in the context of bpftrace: meaning, you want
to grant bpftrace access to use these helpers, but other tracers on the
systems like my_random_tracer not. While this works for prior mentioned
cases of security_locked_down() with open_kcore() for /proc/kcore access
or the module_sig_check(), it is broken for tracing as-is, and the patch
I sent earlier fixes this.

Thanks,
Daniel

   [0] https://github.com/iovisor/bpftrace
Paul Moore May 28, 2021, 10:52 p.m. UTC | #18
On Fri, May 28, 2021 at 2:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 5/28/21 5:47 PM, Paul Moore wrote:
> > Let's reset.
>
> Sure, yep, lets shortly take one step back. :)
>
> > What task_struct is running the BPF tracing program which is calling
> > into security_locked_down()?  My current feeling is that it is this
> > context/domain/cred that should be used for the access control check;
> > in the cases where it is a kernel thread, I think passing NULL is
> > reasonable, but I think the proper thing for SELinux is to interpret
> > NULL as kernel_t.
>
> If this was a typical LSM hook and, say, your app calls into bind(2) where
> we then invoke security_socket_bind() and check 'current' task, then I'm all
> with you, because this was _explicitly initiated_ by the httpd app, so that
> allow/deny policy belongs in the context of httpd.
>
> In the case of tracing, it's different. You install small programs that are
> triggered when certain events fire. Random example from bpftrace's README [0],
> you want to generate a histogram of syscall counts by program. One-liner is:
>
>    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
>
> bpftrace then goes and generates a BPF prog from this internally. One way of
> doing it could be to call bpf_get_current_task() helper and then access
> current->comm via one of bpf_probe_read_kernel{,_str}() helpers. So the
> program itself has nothing to do with httpd or any other random app doing
> a syscall here. The BPF prog _explicitly initiated_ the lockdown check.
> The allow/deny policy belongs in the context of bpftrace: meaning, you want
> to grant bpftrace access to use these helpers, but other tracers on the
> systems like my_random_tracer not. While this works for prior mentioned
> cases of security_locked_down() with open_kcore() for /proc/kcore access
> or the module_sig_check(), it is broken for tracing as-is, and the patch
> I sent earlier fixes this.

Sigh.

Generally it's helpful when someone asks a question if you answer it
directly before going off and answering your own questions.  Listen, I
get it, you wrote a patch and it fixes your problem (you've mentioned
that already) and it's wonderful and all that, but the rest of us
(maybe just me) need to sort this out too and talking past questions
isn't a great way to help us get there (once again, maybe just me).  I
think I can infer an answer from you, but you've made me grumpy now so
I'm not ACK'ing or NACK'ing anything right now; I clearly need to go
spend some time reading through BPF code.  Woo.

>    [0] https://github.com/iovisor/bpftrace
Paul Moore May 29, 2021, 6:48 p.m. UTC | #19
On Fri, May 28, 2021 at 2:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> In the case of tracing, it's different. You install small programs that are
> triggered when certain events fire. Random example from bpftrace's README [0],
> you want to generate a histogram of syscall counts by program. One-liner is:
>
>    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
>
> bpftrace then goes and generates a BPF prog from this internally. One way of
> doing it could be to call bpf_get_current_task() helper and then access
> current->comm via one of bpf_probe_read_kernel{,_str}() helpers ...

I think we can all agree that the BPF tracing is a bit chaotic in the
sense that the tracing programs can be executed in various
places/contexts and that presents some challenges with respect to
access control and auditing.  If you are following the io_uring stuff
that is going on now you can see a little of what is required to make
audit work properly in the various io_uring contexts and that is
relatively small compared to what is possible with BPF tracing.  Of
course this assumes I've managed to understand bpf tracing properly
this morning, and I very well may still be missing points and/or
confused about some of the important details.  Corrections are
welcome.

Daniel's patch side steps that worry by just doing the lockdown
permission check when the BPF program is loaded, but that isn't a
great solution if the policy changes afterward.  I was hoping there
might be some way to perform the permission check as needed, but the
more I look the more that appears to be difficult, if not impossible
(once again, corrections are welcome).

I'm now wondering if the right solution here is to make use of the LSM
notifier mechanism.  I'm not yet entirely sure if this would work from
a BPF perspective, but I could envision the BPF subsystem registering
a LSM notification callback via register_blocking_lsm_notifier(), see
if Infiniband code as an example, and then when the LSM(s) policy
changes the BPF subsystem would get a notification and it could
revalidate the existing BPF programs and take block/remove/whatever
the offending BPF programs.  This obviously requires a few things
which I'm not sure are easily done, or even possible:

1. Somehow the BPF programs would need to be "marked" at
load/verification time with respect to their lockdown requirements so
that decisions can be made later.  Perhaps a flag in bpf_prog_aux?

2. While it looks like it should be possible to iterate over all of
the loaded BPF programs in the LSM notifier callback via
idr_for_each(prog_idr, ...), it is not clear to me if it is possible
to safely remove, or somehow disable, BPF programs once they have been
loaded.  Hopefully the BPF folks can help answer that question.

3. Disabling of BPF programs might be preferable to removing them
entirely on LSM policy changes as it would be possible to make the
lockdown state less restrictive at a future point in time, allowing
for the BPF program to be executed again.  Once again, not sure if
this is even possible.

Related, the lockdown LSM should probably also grow LSM notifier
support similar to selinux_lsm_notifier_avc_callback(), for example
either lock_kernel_down() or lockdown_write() might want to do a
call_blocking_lsm_notifier(LSM_POLICY_CHANGE, NULL) call.
Daniel Borkmann May 31, 2021, 8:24 a.m. UTC | #20
On 5/29/21 8:48 PM, Paul Moore wrote:
[...]
> Daniel's patch side steps that worry by just doing the lockdown
> permission check when the BPF program is loaded, but that isn't a
> great solution if the policy changes afterward.  I was hoping there
> might be some way to perform the permission check as needed, but the
> more I look the more that appears to be difficult, if not impossible
> (once again, corrections are welcome).

Your observation is correct, will try to clarify below a bit.

> I'm now wondering if the right solution here is to make use of the LSM
> notifier mechanism.  I'm not yet entirely sure if this would work from
> a BPF perspective, but I could envision the BPF subsystem registering
> a LSM notification callback via register_blocking_lsm_notifier(), see
> if Infiniband code as an example, and then when the LSM(s) policy
> changes the BPF subsystem would get a notification and it could
> revalidate the existing BPF programs and take block/remove/whatever
> the offending BPF programs.  This obviously requires a few things
> which I'm not sure are easily done, or even possible:
> 
> 1. Somehow the BPF programs would need to be "marked" at
> load/verification time with respect to their lockdown requirements so
> that decisions can be made later.  Perhaps a flag in bpf_prog_aux?
> 
> 2. While it looks like it should be possible to iterate over all of
> the loaded BPF programs in the LSM notifier callback via
> idr_for_each(prog_idr, ...), it is not clear to me if it is possible
> to safely remove, or somehow disable, BPF programs once they have been
> loaded.  Hopefully the BPF folks can help answer that question.
> 
> 3. Disabling of BPF programs might be preferable to removing them
> entirely on LSM policy changes as it would be possible to make the
> lockdown state less restrictive at a future point in time, allowing
> for the BPF program to be executed again.  Once again, not sure if
> this is even possible.

Part of why this gets really complex/impossible is that BPF programs in
the kernel are reference counted from various sides, be it that there
are references from user space to them (fd from application, BPF fs, or
BPF links), hooks where they are attached to as well as tail call maps
where one BPF prog calls into another. There is currently also no global
infra of some sort where you could piggy back to atomically keep track of
all the references in a list or such. And the other thing is that BPF progs
have no ownership that is tied to a specific task after they have been
loaded. Meaning, once they are loaded into the kernel by an application
and attached to a specific hook, they can remain there potentially until
reboot of the node, so lifecycle of the user space application != lifecycle
of the BPF program.

It's maybe best to compare this aspect to kernel modules in the sense that
you have an application that loads it into the kernel (insmod, etc, where
you could also enforce lockdown signature check), but after that, they can
be managed by other entities as well (implicitly refcounted from kernel,
removed by other applications, etc).

My understanding of the lockdown settings are that users have options
to select/enforce a lockdown level of CONFIG_LOCK_DOWN_KERNEL_FORCE_{INTEGRITY,
CONFIDENTIALITY} at compilation time, they have a lockdown={integrity|
confidentiality} boot-time parameter, /sys/kernel/security/lockdown,
and then more fine-grained policy via 59438b46471a ("security,lockdown,selinux:
implement SELinux lockdown"). Once you have set a global policy level,
you cannot revert back to a less strict mode. So the SELinux policy is
specifically tied around tasks to further restrict applications in respect
to the global policy. I presume that would mean for those users that majority
of tasks have the confidentiality option set via SELinux with just a few
necessary using the integrity global policy. So overall the enforcing
option when BPF program is loaded is the only really sensible option to
me given only there we have the valid current task where such policy can
be enforced.

Best,
Daniel
Paul Moore June 1, 2021, 8:47 p.m. UTC | #21
On Mon, May 31, 2021 at 4:24 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 5/29/21 8:48 PM, Paul Moore wrote:
> [...]
> > Daniel's patch side steps that worry by just doing the lockdown
> > permission check when the BPF program is loaded, but that isn't a
> > great solution if the policy changes afterward.  I was hoping there
> > might be some way to perform the permission check as needed, but the
> > more I look the more that appears to be difficult, if not impossible
> > (once again, corrections are welcome).
>
> Your observation is correct, will try to clarify below a bit.
>
> > I'm now wondering if the right solution here is to make use of the LSM
> > notifier mechanism.  I'm not yet entirely sure if this would work from
> > a BPF perspective, but I could envision the BPF subsystem registering
> > a LSM notification callback via register_blocking_lsm_notifier(), see
> > if Infiniband code as an example, and then when the LSM(s) policy
> > changes the BPF subsystem would get a notification and it could
> > revalidate the existing BPF programs and take block/remove/whatever
> > the offending BPF programs.  This obviously requires a few things
> > which I'm not sure are easily done, or even possible:
> >
> > 1. Somehow the BPF programs would need to be "marked" at
> > load/verification time with respect to their lockdown requirements so
> > that decisions can be made later.  Perhaps a flag in bpf_prog_aux?
> >
> > 2. While it looks like it should be possible to iterate over all of
> > the loaded BPF programs in the LSM notifier callback via
> > idr_for_each(prog_idr, ...), it is not clear to me if it is possible
> > to safely remove, or somehow disable, BPF programs once they have been
> > loaded.  Hopefully the BPF folks can help answer that question.
> >
> > 3. Disabling of BPF programs might be preferable to removing them
> > entirely on LSM policy changes as it would be possible to make the
> > lockdown state less restrictive at a future point in time, allowing
> > for the BPF program to be executed again.  Once again, not sure if
> > this is even possible.
>
> Part of why this gets really complex/impossible is that BPF programs in
> the kernel are reference counted from various sides, be it that there
> are references from user space to them (fd from application, BPF fs, or
> BPF links), hooks where they are attached to as well as tail call maps
> where one BPF prog calls into another. There is currently also no global
> infra of some sort where you could piggy back to atomically keep track of
> all the references in a list or such. And the other thing is that BPF progs
> have no ownership that is tied to a specific task after they have been
> loaded. Meaning, once they are loaded into the kernel by an application
> and attached to a specific hook, they can remain there potentially until
> reboot of the node, so lifecycle of the user space application != lifecycle
> of the BPF program.

I don't think the disjoint lifecycle or lack of task ownership is a
deal breaker from a LSM perspective as the LSMs can stash whatever
info they need in the security pointer during the program allocation
hook, e.g. selinux_bpf_prog_alloc() saves the security domain which
allocates/loads the BPF program.

The thing I'm worried about would be the case where a LSM policy
change requires that an existing BPF program be removed or disabled.
I'm guessing based on the refcounting that there is not presently a
clean way to remove a BPF program from the system, but is this
something we could resolve?  If we can't safely remove a BPF program
from the system, can we replace/swap it with an empty/NULL BPF
program?

> It's maybe best to compare this aspect to kernel modules in the sense that
> you have an application that loads it into the kernel (insmod, etc, where
> you could also enforce lockdown signature check), but after that, they can
> be managed by other entities as well (implicitly refcounted from kernel,
> removed by other applications, etc).

Well, I guess we could consider BPF programs as out-of-tree kernel
modules that potentially do very odd and dangerous things, e.g.
performing access control checks *inside* access control checks ...
but yeah, I get your point at a basic level, I just think that
comparing BPF programs to kernel modules is a not-so-great comparison
in general.

> My understanding of the lockdown settings are that users have options
> to select/enforce a lockdown level of CONFIG_LOCK_DOWN_KERNEL_FORCE_{INTEGRITY,
> CONFIDENTIALITY} at compilation time, they have a lockdown={integrity|
> confidentiality} boot-time parameter, /sys/kernel/security/lockdown,
> and then more fine-grained policy via 59438b46471a ("security,lockdown,selinux:
> implement SELinux lockdown"). Once you have set a global policy level,
> you cannot revert back to a less strict mode.

I don't recall there being anything in the SELinux lockdown support
that prevents a newly loaded policy from allowing a change in the
lockdown level, either stricter or more permissive, for a given
domain.  Looking quickly at the code, that still seems to be the case.

The SELinux lockdown access controls function independently of the
global build and runtime lockdown configuration.

> So the SELinux policy is
> specifically tied around tasks to further restrict applications in respect
> to the global policy.

As a reminder, there is no guarantee that both the SELinux and
lockdown LSM are both loaded and active at runtime, it is possible
that only SELinux is active.  If SELinux is the only LSM enforcing
lockdown access controls, there is no global lockdown setting, it is
determined per-domain.

> I presume that would mean for those users that majority
> of tasks have the confidentiality option set via SELinux with just a few
> necessary using the integrity global policy. So overall the enforcing
> option when BPF program is loaded is the only really sensible option to
> me given only there we have the valid current task where such policy can
> be enforced.

--
paul moore
www.paul-moore.com
Daniel Borkmann June 2, 2021, 12:40 p.m. UTC | #22
On 6/1/21 10:47 PM, Paul Moore wrote:
> On Mon, May 31, 2021 at 4:24 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 5/29/21 8:48 PM, Paul Moore wrote:
>> [...]
>>> Daniel's patch side steps that worry by just doing the lockdown
>>> permission check when the BPF program is loaded, but that isn't a
>>> great solution if the policy changes afterward.  I was hoping there
>>> might be some way to perform the permission check as needed, but the
>>> more I look the more that appears to be difficult, if not impossible
>>> (once again, corrections are welcome).
>>
>> Your observation is correct, will try to clarify below a bit.
>>
>>> I'm now wondering if the right solution here is to make use of the LSM
>>> notifier mechanism.  I'm not yet entirely sure if this would work from
>>> a BPF perspective, but I could envision the BPF subsystem registering
>>> a LSM notification callback via register_blocking_lsm_notifier(), see
>>> if Infiniband code as an example, and then when the LSM(s) policy
>>> changes the BPF subsystem would get a notification and it could
>>> revalidate the existing BPF programs and take block/remove/whatever
>>> the offending BPF programs.  This obviously requires a few things
>>> which I'm not sure are easily done, or even possible:
>>>
>>> 1. Somehow the BPF programs would need to be "marked" at
>>> load/verification time with respect to their lockdown requirements so
>>> that decisions can be made later.  Perhaps a flag in bpf_prog_aux?
>>>
>>> 2. While it looks like it should be possible to iterate over all of
>>> the loaded BPF programs in the LSM notifier callback via
>>> idr_for_each(prog_idr, ...), it is not clear to me if it is possible
>>> to safely remove, or somehow disable, BPF programs once they have been
>>> loaded.  Hopefully the BPF folks can help answer that question.
>>>
>>> 3. Disabling of BPF programs might be preferable to removing them
>>> entirely on LSM policy changes as it would be possible to make the
>>> lockdown state less restrictive at a future point in time, allowing
>>> for the BPF program to be executed again.  Once again, not sure if
>>> this is even possible.
>>
>> Part of why this gets really complex/impossible is that BPF programs in
>> the kernel are reference counted from various sides, be it that there
>> are references from user space to them (fd from application, BPF fs, or
>> BPF links), hooks where they are attached to as well as tail call maps
>> where one BPF prog calls into another. There is currently also no global
>> infra of some sort where you could piggy back to atomically keep track of
>> all the references in a list or such. And the other thing is that BPF progs
>> have no ownership that is tied to a specific task after they have been
>> loaded. Meaning, once they are loaded into the kernel by an application
>> and attached to a specific hook, they can remain there potentially until
>> reboot of the node, so lifecycle of the user space application != lifecycle
>> of the BPF program.
> 
> I don't think the disjoint lifecycle or lack of task ownership is a
> deal breaker from a LSM perspective as the LSMs can stash whatever
> info they need in the security pointer during the program allocation
> hook, e.g. selinux_bpf_prog_alloc() saves the security domain which
> allocates/loads the BPF program.
> 
> The thing I'm worried about would be the case where a LSM policy
> change requires that an existing BPF program be removed or disabled.
> I'm guessing based on the refcounting that there is not presently a
> clean way to remove a BPF program from the system, but is this
> something we could resolve?  If we can't safely remove a BPF program
> from the system, can we replace/swap it with an empty/NULL BPF
> program?

Removing progs would somehow mean destroying those references from an
async event and then /safely/ guaranteeing that nothing is accessing
them anymore. But then if policy changes once more where they would
be allowed again we would need to revert back to the original state,
which brings us to your replace/swap question with an empty/null prog.
It's not feasible either, because there are different BPF program types
and they can have different return code semantics that lead to subsequent
actions. If we were to replace them with an empty/NULL program, then
essentially this will get us into an undefined system state given it's
unclear what should be a default policy for each program type, etc.
Just to pick one simple example, outside of tracing, that comes to mind:
say, you attached a program with tc to a given device ingress hook. That
program implements firewalling functionality, and potentially deep down
in that program there is functionality to record/sample packets along
with some meta data. Part of what is exported to the ring buffer to the
user space reader may be a struct net_device field that is otherwise not
available (or at least not yet), hence it's probe-read with mentioned
helpers. If you were now to change the SELinux policy for that tc loader
application, and therefore replace/swap the progs in the kernel that were
loaded with it (given tc's lockdown policy was recorded in their sec blob)
with an empty/NULL program, then either you say allow-all or drop-all,
but either way, you break the firewalling functionality completely by
locking yourself out of the machine or letting everything through. There
is no sane way where we could reason about the context/internals of a
given program where it would be safe to replace with a simple empty/NULL
prog.

Best,
Daniel
Ondrej Mosnacek June 2, 2021, 1:39 p.m. UTC | #23
On Fri, May 28, 2021 at 3:37 AM Paul Moore <paul@paul-moore.com> wrote:
> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> >
> > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > lockdown") added an implementation of the locked_down LSM hook to
> > SELinux, with the aim to restrict which domains are allowed to perform
> > operations that would breach lockdown.
> >
> > However, in several places the security_locked_down() hook is called in
> > situations where the current task isn't doing any action that would
> > directly breach lockdown, leading to SELinux checks that are basically
> > bogus.
> >
> > Since in most of these situations converting the callers such that
> > security_locked_down() is called in a context where the current task
> > would be meaningful for SELinux is impossible or very non-trivial (and
> > could lead to TOCTOU issues for the classic Lockdown LSM
> > implementation), fix this by modifying the hook to accept a struct cred
> > pointer as argument, where NULL will be interpreted as a request for a
> > "global", task-independent lockdown decision only. Then modify SELinux
> > to ignore calls with cred == NULL.
>
> I'm not overly excited about skipping the access check when cred is
> NULL.  Based on the description and the little bit that I've dug into
> thus far it looks like using SECINITSID_KERNEL as the subject would be
> much more appropriate.  *Something* (the kernel in most of the
> relevant cases it looks like) is requesting that a potentially
> sensitive disclosure be made, and ignoring it seems like the wrong
> thing to do.  Leaving the access control intact also provides a nice
> avenue to audit these requests should users want to do that.
>
> Those users that generally don't care can grant kernel_t all the
> necessary permissions without much policy.

Seems kind of pointless to me, but it's a relatively simple change to
do a check against SECINITSID_KERNEL, so I don't mind doing it like
that.

> > Since most callers will just want to pass current_cred() as the cred
> > parameter, rename the hook to security_cred_locked_down() and provide
> > the original security_locked_down() function as a simple wrapper around
> > the new hook.
>
> I know you and Casey went back and forth on this in v1, but I agree
> with Casey that having two LSM hooks here is a mistake.  I know it
> makes backports hard, but spoiler alert: maintaining complex software
> over any non-trivial period of time is hard, reeeeally hard sometimes
> ;)

Do you mean having two slots in lsm_hook_defs.h or also having two
security_*() functions? (It's not clear to me if you're just
reiterating disagreement with v1 or if you dislike the simplified v2
as well.)

> > The callers migrated to the new hook, passing NULL as cred:
> > 1. arch/powerpc/xmon/xmon.c
> >      Here the hook seems to be called from non-task context and is only
> >      used for redacting some sensitive values from output sent to
> >      userspace.
>
> This definitely sounds like kernel_t based on the description above.

Here I'm a little concerned that the hook might be called from some
unusual interrupt, which is not masked by spin_lock_irqsave()... We
ran into this with PMI (Platform Management Interrupt) before, see
commit 5ae5fbd21079 ("powerpc/perf: Fix handling of privilege level
checks in perf interrupt context"). While I can't see anything that
would suggest something like this happening here, the whole thing is
so foreign to me that I'm wary of making assumptions :)

@Michael/PPC devs, can you confirm to us that xmon_is_locked_down() is
only called from normal syscall/interrupt context (as opposed to
something tricky like PMI)?

> > 2. fs/tracefs/inode.c:tracefs_create_file()
> >      Here the call is used to prevent creating new tracefs entries when
> >      the kernel is locked down. Assumes that locking down is one-way -
> >      i.e. if the hook returns non-zero once, it will never return zero
> >      again, thus no point in creating these files.
>
> More kernel_t.

This should be OK.

> > 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common()
> >      Called when a BPF program calls a helper that could leak kernel
> >      memory. The task context is not relevant here, since the program
> >      may very well be run in the context of a different task than the
> >      consumer of the data.
> >      See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585
>
> The access control check isn't so much who is consuming the data, but
> who is requesting a potential violation of a "lockdown", yes?  For
> example, the SELinux policy rule for the current lockdown check looks
> something like this:
>
>   allow <who> <who> : lockdown { <reason> };
>
> It seems to me that the task context is relevant here and performing
> the access control check based on the task's domain is correct.  If we
> are also concerned about who has access to this sensitive information
> once it has been determined that the task can cause it to be sent, we
> should have another check point for that, assuming the access isn't
> already covered by another check/hook.

This case is being discussed further in this thread, so I'm going to
skip it in this reply.

> > 4. net/xfrm/xfrm_user.c:copy_to_user_*()
> >      Here a cryptographic secret is redacted based on the value returned
> >      from the hook. There are two possible actions that may lead here:
> >      a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> >         task context is relevant, since the dumped data is sent back to
> >         the current task.
>
> If the task context is relevant we should use it.

Yes, but as I said it would create an asymmetry with case b), which
I'll expand on below...

> >      b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are
> >         broadcasted to tasks subscribed to XFRM events - here the
> >         SELinux check is not meningful as the current task's creds do
> >         not represent the tasks that could potentially see the secret.
>
> This looks very similar to the BPF hook discussed above, I believe my
> comments above apply here as well.

Using the current task is just logically wrong in this case. The
current task here is just simply deleting an SA that happens to have
some secret value in it. When deleting an SA, a notification is sent
to a group of subscribers (some group of other tasks), which includes
a dump of the secret value. The current task isn't doing any attempt
to breach lockdown, it's just deleting an SA.

It also makes it really awkward to make policy decisions around this.
Suppose that domains A, B, and C need to be able to add/delete SAs and
domains D, E, and F need to receive notifications about changes in
SAs. Then if, say, domain E actually needs to see the secret values in
the notifications, you must grant the confidentiality permission to
all of A, B, C to keep things working. And now you have opened up the
door for A, B, C to do other lockdown-confidentiality stuff, even
though these domains themselves actually don't request/need any
confidential data from the kernel. That's just not logical and you may
actually end up (slightly) worse security-wise than if you just
skipped checking for XFRM secrets altogether, because you need to
allow confidentiality to domains for which it may be excessive.

This is why I talk about the task that gets to see the sensitive
values as the relevant one - because otherwise the semantics of a
given domain having the confidentiality permission granted becomes
very hard to reason about.

> >      It really doesn't seem worth it to try to preserve the check in the
> >      a) case ...
>
> After you've read all of the above I hope you can understand why I
> disagree with this.
>
> >      ... since the eventual leak can be circumvented anyway via b)
>
> I don't follow the statement above ... ?  However I'm not sure it
> matters much considering my other concerns.

What I meant was that if we skip/kernel_t-ize the check in case b)
(for which I don't see a good alternative), then denying
confidentiality perm to a given domain wouldn't prevent it from seeing
the key value, as it could potentially see them by subscribing to SA
modification events. IMO, in that case it's better to just give up on
controlling the SA secrets with SELinux lockdown altogether than to
create some false assumptions of this being covered. You may disagree
and would be willing to implement the partial checking as well if you
insist, but we need to first come to a consensus about case b) before
such discussion becomes relevant, anyway...

Given the yet unresolved discussions around the XFRM and BPF cases, I
plan to respin the patch with just the tracefs and xmon changes and we
can then incrementally address the rest as the individual discussions
come to a consensus.

--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
Paul Moore June 2, 2021, 3:13 p.m. UTC | #24
On Wed, Jun 2, 2021 at 8:40 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 6/1/21 10:47 PM, Paul Moore wrote:
> > The thing I'm worried about would be the case where a LSM policy
> > change requires that an existing BPF program be removed or disabled.
> > I'm guessing based on the refcounting that there is not presently a
> > clean way to remove a BPF program from the system, but is this
> > something we could resolve?  If we can't safely remove a BPF program
> > from the system, can we replace/swap it with an empty/NULL BPF
> > program?
>
> Removing progs would somehow mean destroying those references from an
> async event and then /safely/ guaranteeing that nothing is accessing
> them anymore. But then if policy changes once more where they would
> be allowed again we would need to revert back to the original state,
> which brings us to your replace/swap question with an empty/null prog.
> It's not feasible either, because there are different BPF program types
> and they can have different return code semantics that lead to subsequent
> actions. If we were to replace them with an empty/NULL program, then
> essentially this will get us into an undefined system state given it's
> unclear what should be a default policy for each program type, etc.
> Just to pick one simple example, outside of tracing, that comes to mind:
> say, you attached a program with tc to a given device ingress hook. That
> program implements firewalling functionality, and potentially deep down
> in that program there is functionality to record/sample packets along
> with some meta data. Part of what is exported to the ring buffer to the
> user space reader may be a struct net_device field that is otherwise not
> available (or at least not yet), hence it's probe-read with mentioned
> helpers. If you were now to change the SELinux policy for that tc loader
> application, and therefore replace/swap the progs in the kernel that were
> loaded with it (given tc's lockdown policy was recorded in their sec blob)
> with an empty/NULL program, then either you say allow-all or drop-all,
> but either way, you break the firewalling functionality completely by
> locking yourself out of the machine or letting everything through. There
> is no sane way where we could reason about the context/internals of a
> given program where it would be safe to replace with a simple empty/NULL
> prog.

Help me out here, is your answer that the access check can only be
done at BPF program load time?  That isn't really a solution from a
SELinux perspective as far as I'm concerned.

I understand the ideas I've tossed out aren't practical from a BPF
perspective, but it would be nice if we could find something that does
work.  Surely you BPF folks can think of some way to provide a
runtime, not load time, check?
Paul Moore June 3, 2021, 5:46 p.m. UTC | #25
On Wed, Jun 2, 2021 at 9:40 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> On Fri, May 28, 2021 at 3:37 AM Paul Moore <paul@paul-moore.com> wrote:
> > On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> > >
> > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > > lockdown") added an implementation of the locked_down LSM hook to
> > > SELinux, with the aim to restrict which domains are allowed to perform
> > > operations that would breach lockdown.
> > >
> > > However, in several places the security_locked_down() hook is called in
> > > situations where the current task isn't doing any action that would
> > > directly breach lockdown, leading to SELinux checks that are basically
> > > bogus.
> > >
> > > Since in most of these situations converting the callers such that
> > > security_locked_down() is called in a context where the current task
> > > would be meaningful for SELinux is impossible or very non-trivial (and
> > > could lead to TOCTOU issues for the classic Lockdown LSM
> > > implementation), fix this by modifying the hook to accept a struct cred
> > > pointer as argument, where NULL will be interpreted as a request for a
> > > "global", task-independent lockdown decision only. Then modify SELinux
> > > to ignore calls with cred == NULL.
> >
> > I'm not overly excited about skipping the access check when cred is
> > NULL.  Based on the description and the little bit that I've dug into
> > thus far it looks like using SECINITSID_KERNEL as the subject would be
> > much more appropriate.  *Something* (the kernel in most of the
> > relevant cases it looks like) is requesting that a potentially
> > sensitive disclosure be made, and ignoring it seems like the wrong
> > thing to do.  Leaving the access control intact also provides a nice
> > avenue to audit these requests should users want to do that.
> >
> > Those users that generally don't care can grant kernel_t all the
> > necessary permissions without much policy.
>
> Seems kind of pointless to me, but it's a relatively simple change to
> do a check against SECINITSID_KERNEL, so I don't mind doing it like
> that.

It's not pointless, the granularity isn't as great as one would like,
but it doesn't mean it is pointless.  *Someone* is acting, in this
case it just happens to be the kernel.  It is likely the most admins
and policy developers will not care, but some might, and we should
enable that.

> > > Since most callers will just want to pass current_cred() as the cred
> > > parameter, rename the hook to security_cred_locked_down() and provide
> > > the original security_locked_down() function as a simple wrapper around
> > > the new hook.
> >
> > I know you and Casey went back and forth on this in v1, but I agree
> > with Casey that having two LSM hooks here is a mistake.  I know it
> > makes backports hard, but spoiler alert: maintaining complex software
> > over any non-trivial period of time is hard, reeeeally hard sometimes
> > ;)
>
> Do you mean having two slots in lsm_hook_defs.h or also having two
> security_*() functions? (It's not clear to me if you're just
> reiterating disagreement with v1 or if you dislike the simplified v2
> as well.)

To be clear I don't think there should be two functions for this, just
make whatever changes are necessary to the existing
security_locked_down() LSM hook.  Yes, the backport is hard.  Yes, it
will touch a lot of code.  Yes, those are lame excuses to not do the
right thing.

> > > The callers migrated to the new hook, passing NULL as cred:
> > > 1. arch/powerpc/xmon/xmon.c
> > >      Here the hook seems to be called from non-task context and is only
> > >      used for redacting some sensitive values from output sent to
> > >      userspace.
> >
> > This definitely sounds like kernel_t based on the description above.
>
> Here I'm a little concerned that the hook might be called from some
> unusual interrupt, which is not masked by spin_lock_irqsave()... We
> ran into this with PMI (Platform Management Interrupt) before, see
> commit 5ae5fbd21079 ("powerpc/perf: Fix handling of privilege level
> checks in perf interrupt context"). While I can't see anything that
> would suggest something like this happening here, the whole thing is
> so foreign to me that I'm wary of making assumptions :)
>
> @Michael/PPC devs, can you confirm to us that xmon_is_locked_down() is
> only called from normal syscall/interrupt context (as opposed to
> something tricky like PMI)?

You did submit the code change so I assumed you weren't concerned
about it :)  If it is a bad hook placement that is something else
entirely.

Hopefully we'll get some guidance from the PPC folks.

> > > 4. net/xfrm/xfrm_user.c:copy_to_user_*()
> > >      Here a cryptographic secret is redacted based on the value returned
> > >      from the hook. There are two possible actions that may lead here:
> > >      a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> > >         task context is relevant, since the dumped data is sent back to
> > >         the current task.
> >
> > If the task context is relevant we should use it.
>
> Yes, but as I said it would create an asymmetry with case b), which
> I'll expand on below...
>
> > >      b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are
> > >         broadcasted to tasks subscribed to XFRM events - here the
> > >         SELinux check is not meningful as the current task's creds do
> > >         not represent the tasks that could potentially see the secret.
> >
> > This looks very similar to the BPF hook discussed above, I believe my
> > comments above apply here as well.
>
> Using the current task is just logically wrong in this case. The
> current task here is just simply deleting an SA that happens to have
> some secret value in it. When deleting an SA, a notification is sent
> to a group of subscribers (some group of other tasks), which includes
> a dump of the secret value. The current task isn't doing any attempt
> to breach lockdown, it's just deleting an SA.
>
> It also makes it really awkward to make policy decisions around this.
> Suppose that domains A, B, and C need to be able to add/delete SAs and
> domains D, E, and F need to receive notifications about changes in
> SAs. Then if, say, domain E actually needs to see the secret values in
> the notifications, you must grant the confidentiality permission to
> all of A, B, C to keep things working. And now you have opened up the
> door for A, B, C to do other lockdown-confidentiality stuff, even
> though these domains themselves actually don't request/need any
> confidential data from the kernel. That's just not logical and you may
> actually end up (slightly) worse security-wise than if you just
> skipped checking for XFRM secrets altogether, because you need to
> allow confidentiality to domains for which it may be excessive.

It sounds an awful lot like the lockdown hook is in the wrong spot.
It sounds like it would be a lot better to relocate the hook than
remove it.
Daniel Borkmann June 3, 2021, 6:52 p.m. UTC | #26
On 6/2/21 5:13 PM, Paul Moore wrote:
[...]
> Help me out here, is your answer that the access check can only be
> done at BPF program load time?  That isn't really a solution from a
> SELinux perspective as far as I'm concerned.

That is the current answer. The unfortunate irony is that 59438b46471a
("security,lockdown,selinux: implement SELinux lockdown") broke this in
the first place. W/o the SELinux hook implementation it would have been
working just fine at runtime, but given it's UAPI since quite a while
now, that ship has sailed.

> I understand the ideas I've tossed out aren't practical from a BPF
> perspective, but it would be nice if we could find something that does
> work.  Surely you BPF folks can think of some way to provide a
> runtime, not load time, check?

I did run this entire discussion by both of the other BPF co-maintainers
(Alexei, Andrii, CC'ed) and together we did further brainstorming on the
matter on how we could solve this, but couldn't find a sensible & clean
solution so far.

You could potentially track the programs in the sec blob and iff they have
been JITed fix up the jump targets via text_poke to a dummy handler for
those requiring it and such, but that's just entirely fragile, horrid and
broken.

Given users are actively hitting issues with already released kernels in
the wild, we concluded to fix the majority of the damage caused by commit
59438b46471a [concerning BPF at least, the rest done by Ondrej as I understand]
with the below fix that is shipping to Linus. This is a step in the right
direction for moving things forward regardless. With the hook at load
it's also not doing anything that is off with respect to the remainder
of lockdown hooks, so solving a policy change can be looked at from a
more broader/general scope given same applies to other users, too, iff
it's indeed the case that it turns out to be feasible. Anyway, I've
reflected an overall summary of the discussions also in the commit msg.

Thanks,
Daniel

---

[PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks

Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:

   1) The audit events that are triggered due to calls to security_locked_down()
      can OOM kill a machine, see below details [0].

   2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
      when trying to wake up kauditd, for example, when using trace_sched_switch()
      tracepoint, see details in [1]. Triggering this was not via some hypothetical
      corner case, but with existing tools like runqlat & runqslower from bcc, for
      example, which make use of this tracepoint. Rough call sequence goes like:

      rq_lock(rq) -> -------------------------+
        trace_sched_switch() ->               |
          bpf_prog_xyz() ->                   +-> deadlock
            selinux_lockdown() ->             |
              audit_log_end() ->              |
                wake_up_interruptible() ->    |
                  try_to_wake_up() ->         |
                    rq_lock(rq) --------------+

What's worse is that the intention of 59438b46471a to further restrict lockdown
settings for specific applications in respect to the global lockdown policy is
completely broken for BPF. The SELinux policy rule for the current lockdown check
looks something like this:

   allow <who> <who> : lockdown { <reason> };

However, this doesn't match with the 'current' task where the security_locked_down()
is executed, example: httpd does a syscall. There is a tracing program attached
to the syscall which triggers a BPF program to run, which ends up doing a
bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
the permission check against 'current', that is, httpd in this example. httpd
has literally zero relation to this tracing program, and it would be nonsensical
having to write an SELinux policy rule against httpd to let the tracing helper
pass. The policy in this case needs to be against the entity that is installing
the BPF program. For example, if bpftrace would generate a histogram of syscall
counts by user space application:

   bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

bpftrace would then go and generate a BPF program from this internally. One way
of doing it [for the sake of the example] could be to call bpf_get_current_task()
helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
helpers. So the program itself has nothing to do with httpd or any other random
app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
check. The allow/deny policy belongs in the context of bpftrace: meaning, you
want to grant bpftrace access to use these helpers, but other tracers on the
system like my_random_tracer _not_.

Therefore fix all three issues at the same time by taking a completely different
approach for the security_locked_down() hook, that is, move the check into the
program verification phase where we actually retrieve the BPF func proto. This
also reliably gets the task (current) that is trying to install the BPF tracing
program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
we're moving this out of the BPF helper's fast-path which can be called several
millions of times per second.

The check is then also in line with other security_locked_down() hooks in the
system where the enforcement is performed at open/load time, for example,
open_kcore() for /proc/kcore access or module_sig_check() for module signatures
just to pick few random ones. What's out of scope in the fix as well as in
other security_locked_down() hook locations /outside/ of BPF subsystem is that
if the lockdown policy changes on the fly there is no retrospective action.
This requires a different discussion, potentially complex infrastructure, and
it's also not clear whether this can be solved generically. Either way, it is
out of scope for a suitable stable fix which this one is targeting. Note that
the breakage is specifically on 59438b46471a where it started to rely on 'current'
as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf:
Restrict bpf when kernel lockdown is in confidentiality mode").

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:

   I starting seeing this with F-34. When I run a container that is traced with
   BPF to record the syscalls it is doing, auditd is flooded with messages like:

   type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
     for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
       scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
         tclass=lockdown permissive=0

   This seems to be leading to auditd running out of space in the backlog buffer
   and eventually OOMs the machine.

   [...]
   auditd running at 99% CPU presumably processing all the messages, eventually I get:
   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
   Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
   Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
   Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
   [...]

[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
     Serhei Makarov says:

   Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
   bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
   is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
   testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
   ppc64le. Example stack trace:

   [...]
   [  730.868702] stack backtrace:
   [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
   [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
   [  730.873278] Call Trace:
   [  730.873770]  dump_stack+0x7f/0xa1
   [  730.874433]  check_noncircular+0xdf/0x100
   [  730.875232]  __lock_acquire+0x1202/0x1e10
   [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
   [  730.876844]  lock_acquire+0xc2/0x3a0
   [  730.877551]  ? __wake_up_common_lock+0x52/0x90
   [  730.878434]  ? lock_acquire+0xc2/0x3a0
   [  730.879186]  ? lock_is_held_type+0xa7/0x120
   [  730.880044]  ? skb_queue_tail+0x1b/0x50
   [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
   [  730.881656]  ? __wake_up_common_lock+0x52/0x90
   [  730.882532]  __wake_up_common_lock+0x52/0x90
   [  730.883375]  audit_log_end+0x5b/0x100
   [  730.884104]  slow_avc_audit+0x69/0x90
   [  730.884836]  avc_has_perm+0x8b/0xb0
   [  730.885532]  selinux_lockdown+0xa5/0xd0
   [  730.886297]  security_locked_down+0x20/0x40
   [  730.887133]  bpf_probe_read_compat+0x66/0xd0
   [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
   [  730.888917]  trace_call_bpf+0xe9/0x240
   [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
   [  730.890579]  perf_trace_sched_switch+0x142/0x180
   [  730.891485]  ? __schedule+0x6d8/0xb20
   [  730.892209]  __schedule+0x6d8/0xb20
   [  730.892899]  schedule+0x5b/0xc0
   [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
   [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
   [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
   [...]

Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Reported-by: Jakub Hrozek <jhrozek@redhat.com>
Reported-by: Serhei Makarov <smakarov@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Frank Eigler <fche@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
---
  kernel/bpf/helpers.c     |  7 +++++--
  kernel/trace/bpf_trace.c | 32 ++++++++++++--------------------
  2 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 73443498d88f..a2f1f15ce432 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -14,6 +14,7 @@
  #include <linux/jiffies.h>
  #include <linux/pid_namespace.h>
  #include <linux/proc_ns.h>
+#include <linux/security.h>

  #include "../../lib/kstrtox.h"

@@ -1069,11 +1070,13 @@ bpf_base_func_proto(enum bpf_func_id func_id)
  	case BPF_FUNC_probe_read_user:
  		return &bpf_probe_read_user_proto;
  	case BPF_FUNC_probe_read_kernel:
-		return &bpf_probe_read_kernel_proto;
+		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+		       NULL : &bpf_probe_read_kernel_proto;
  	case BPF_FUNC_probe_read_user_str:
  		return &bpf_probe_read_user_str_proto;
  	case BPF_FUNC_probe_read_kernel_str:
-		return &bpf_probe_read_kernel_str_proto;
+		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+		       NULL : &bpf_probe_read_kernel_str_proto;
  	case BPF_FUNC_snprintf_btf:
  		return &bpf_snprintf_btf_proto;
  	case BPF_FUNC_snprintf:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d2d7cf6cfe83..7a52bc172841 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -215,16 +215,11 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = {
  static __always_inline int
  bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr)
  {
-	int ret = security_locked_down(LOCKDOWN_BPF_READ);
+	int ret;

-	if (unlikely(ret < 0))
-		goto fail;
  	ret = copy_from_kernel_nofault(dst, unsafe_ptr, size);
  	if (unlikely(ret < 0))
-		goto fail;
-	return ret;
-fail:
-	memset(dst, 0, size);
+		memset(dst, 0, size);
  	return ret;
  }

@@ -246,10 +241,7 @@ const struct bpf_func_proto bpf_probe_read_kernel_proto = {
  static __always_inline int
  bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
  {
-	int ret = security_locked_down(LOCKDOWN_BPF_READ);
-
-	if (unlikely(ret < 0))
-		goto fail;
+	int ret;

  	/*
  	 * The strncpy_from_kernel_nofault() call will likely not fill the
@@ -262,11 +254,7 @@ bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
  	 */
  	ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
  	if (unlikely(ret < 0))
-		goto fail;
-
-	return ret;
-fail:
-	memset(dst, 0, size);
+		memset(dst, 0, size);
  	return ret;
  }

@@ -1011,16 +999,20 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  	case BPF_FUNC_probe_read_user:
  		return &bpf_probe_read_user_proto;
  	case BPF_FUNC_probe_read_kernel:
-		return &bpf_probe_read_kernel_proto;
+		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+		       NULL : &bpf_probe_read_kernel_proto;
  	case BPF_FUNC_probe_read_user_str:
  		return &bpf_probe_read_user_str_proto;
  	case BPF_FUNC_probe_read_kernel_str:
-		return &bpf_probe_read_kernel_str_proto;
+		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+		       NULL : &bpf_probe_read_kernel_str_proto;
  #ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
  	case BPF_FUNC_probe_read:
-		return &bpf_probe_read_compat_proto;
+		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+		       NULL : &bpf_probe_read_compat_proto;
  	case BPF_FUNC_probe_read_str:
-		return &bpf_probe_read_compat_str_proto;
+		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+		       NULL : &bpf_probe_read_compat_str_proto;
  #endif
  #ifdef CONFIG_CGROUPS
  	case BPF_FUNC_get_current_cgroup_id:
Paul Moore June 4, 2021, 4:50 a.m. UTC | #27
On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 6/2/21 5:13 PM, Paul Moore wrote:
> [...]
> > Help me out here, is your answer that the access check can only be
> > done at BPF program load time?  That isn't really a solution from a
> > SELinux perspective as far as I'm concerned.
>
> That is the current answer. The unfortunate irony is that 59438b46471a
> ("security,lockdown,selinux: implement SELinux lockdown") broke this in
> the first place. W/o the SELinux hook implementation it would have been
> working just fine at runtime, but given it's UAPI since quite a while
> now, that ship has sailed.

Explaining the other side of the "unfortunate irony ..." comment is
going to take us in a direction that isn't very constructive so I'm
going to skip past that now and simply say that if there was better
cooperation across subsystems, especially with the LSM folks, a lot of
this pain could be mitigated.

... and yes I said "mitigated", I'm not foolish to think the pain
could be avoided entirely ;)

> > I understand the ideas I've tossed out aren't practical from a BPF
> > perspective, but it would be nice if we could find something that does
> > work.  Surely you BPF folks can think of some way to provide a
> > runtime, not load time, check?
>
> I did run this entire discussion by both of the other BPF co-maintainers
> (Alexei, Andrii, CC'ed) and together we did further brainstorming on the
> matter on how we could solve this, but couldn't find a sensible & clean
> solution so far.

Before I jump into the patch below I just want to say that I
appreciate you looking into solutions on the BPF side of things.
However, I voted "no" on this patch previously and since you haven't
really changed it, my "no"/NACK vote remains, at least until we
exhaust a few more options.

> [PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
>
> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> added an implementation of the locked_down LSM hook to SELinux, with the aim
> to restrict which domains are allowed to perform operations that would breach
> lockdown. This is indirectly also getting audit subsystem involved to report
> events. The latter is problematic, as reported by Ondrej and Serhei, since it
> can bring down the whole system via audit:
>
>    1) The audit events that are triggered due to calls to security_locked_down()
>       can OOM kill a machine, see below details [0].
>
>    2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
>       when trying to wake up kauditd, for example, when using trace_sched_switch()
>       tracepoint, see details in [1]. Triggering this was not via some hypothetical
>       corner case, but with existing tools like runqlat & runqslower from bcc, for
>       example, which make use of this tracepoint. Rough call sequence goes like:
>
>       rq_lock(rq) -> -------------------------+
>         trace_sched_switch() ->               |
>           bpf_prog_xyz() ->                   +-> deadlock
>             selinux_lockdown() ->             |
>               audit_log_end() ->              |
>                 wake_up_interruptible() ->    |
>                   try_to_wake_up() ->         |
>                     rq_lock(rq) --------------+

Since BPF is a bit of chaotic nightmare in the sense that it basically
out-of-tree kernel code that can be called from anywhere and do pretty
much anything; it presents quite the challenge for those of us worried
about LSM access controls.

You and the other BPF folks have investigated ways in which BPF might
be able to disable helper functions allowing us to do proper runtime
access checks but haven't been able to make it work, which brings this
patch up yet again.  I'm not a fan of this patch as it basically
allows BPF programs to side-step any changes to the security policy
once the BPF programs have been loaded; this is Not Good.

So let's look at this from a different angle.  Let's look at the two
problems you mention above.

If we start with the runqueue deadlock we see the main problem is that
audit_log_end() pokes the kauditd_wait waitqueue to ensure the
kauditd_thread thread wakes up and processes the audit queue.  The
audit_log_start() function does something similar, but it is
conditional on a number of factors and isn't as likely to be hit.  If
we relocate these kauditd wakeup calls we can remove the deadlock in
trace_sched_switch().  In the case of CONFIG_AUDITSYSCALL=y we can
probably just move the wakeup to __audit_syscall_exit() and in the
case of CONFIG_AUDITSYSCALL=n we can likely just change the
wait_event_freezable() call in kauditd_thread to a
wait_event_freezable_timeout() call with a HZ timeout (the audit
stream will be much less on these systems anyway so a queue overflow
is much less likely).  I'm building a kernel with these changes now, I
should have something to test when I wake up tomorrow morning.  It
might even provide a bit of a performance boost as we would only be
calling a wakeup function once for each syscall.

The other issue is related to security_locked_down() and using the
right subject for the access control check.  As has been pointed out
several times in this thread, the current code uses the current() task
as the subject, which is arguably incorrect for many of the BPF helper
functions.  In the case of BPF, we have talked about using the
credentials of the task which loaded the BPF program instead of
current(), and that does make a certain amount of sense.  Such an
approach should make the security policy easier to develop and
rationalize, leading to a significant decrease in audit records coming
from LSM access denials.  The question is how to implement such a
change.  The current SELinux security_bpf_prog_alloc() hook causes the
newly loaded BPF program to inherit the subject context from the task
which loads the BPF program; if it is possible to reference the
bpf_prog struct, or really just the associated bpf_prog_aux->security
blob, from inside a security_bpf_locked_down() function we use that
subject information to perform the access check.  BPF folks, is there
a way to get that information from within a BPF kernel helper
function?  If it isn't currently possible, could it be made possible
(or something similar)?

If it turns out we can do both of these things (relocated audit
wakeup, bpf_prog reference inside kernel helpers) I think we can
arrive at a fix which both groups can accept.
Daniel Borkmann June 4, 2021, 6:02 p.m. UTC | #28
On 6/4/21 6:50 AM, Paul Moore wrote:
> On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
[...]
>> I did run this entire discussion by both of the other BPF co-maintainers
>> (Alexei, Andrii, CC'ed) and together we did further brainstorming on the
>> matter on how we could solve this, but couldn't find a sensible & clean
>> solution so far.
> 
> Before I jump into the patch below I just want to say that I
> appreciate you looking into solutions on the BPF side of things.
> However, I voted "no" on this patch previously and since you haven't
> really changed it, my "no"/NACK vote remains, at least until we
> exhaust a few more options.

Just to set the record straight, you previously did neither ACK nor NACK it. And
again, as summarized earlier, this patch is _fixing_ the majority of the damage
caused by 59438b46471a for at least the BPF side of things where users run into this,
Ondrej the rest. Everything else can be discussed on top, and so far it seems there
is no clean solution in front of us either, not even speaking of one that is small
and suitable for _stable_ trees. Let me reiterate where we are: it's not that the
original implementation in 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is
in confidentiality mode") was broken, it's that the later added _SELinux_ locked_down
hook implementation that is broken, and not other LSMs. Now you're trying to retrofittingly
ask us for hacks at all costs just because of /a/ broken LSM implementation. Maybe
another avenue is to just swallow the pill and revert 59438b46471a since it made
assumptions that don't work /generally/. And the use case for an application runtime
policy change in particular in case of lockdown where users only have 3 choices is
extremely tiny/rare, if it's not then something is very wrong in your deployment.
Let me ask you this ... are you also planning to address *all* the other cases inside
the kernel? If your answer is no, then this entire discussion is pointless.

>> [PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
>>
>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
>> added an implementation of the locked_down LSM hook to SELinux, with the aim
>> to restrict which domains are allowed to perform operations that would breach
>> lockdown. This is indirectly also getting audit subsystem involved to report
>> events. The latter is problematic, as reported by Ondrej and Serhei, since it
>> can bring down the whole system via audit:
>>
>>     1) The audit events that are triggered due to calls to security_locked_down()
>>        can OOM kill a machine, see below details [0].
>>
>>     2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
>>        when trying to wake up kauditd, for example, when using trace_sched_switch()
>>        tracepoint, see details in [1]. Triggering this was not via some hypothetical
>>        corner case, but with existing tools like runqlat & runqslower from bcc, for
>>        example, which make use of this tracepoint. Rough call sequence goes like:
>>
>>        rq_lock(rq) -> -------------------------+
>>          trace_sched_switch() ->               |
>>            bpf_prog_xyz() ->                   +-> deadlock
>>              selinux_lockdown() ->             |
>>                audit_log_end() ->              |
>>                  wake_up_interruptible() ->    |
>>                    try_to_wake_up() ->         |
>>                      rq_lock(rq) --------------+
> 
> Since BPF is a bit of chaotic nightmare in the sense that it basically
> out-of-tree kernel code that can be called from anywhere and do pretty
> much anything; it presents quite the challenge for those of us worried
> about LSM access controls.

There is no need to generalize ... for those worried, BPF subsystem has LSM access
controls for the syscall since 2017 via afdb09c720b6 ("security: bpf: Add LSM hooks
for bpf object related syscall").

[...]
> So let's look at this from a different angle.  Let's look at the two
> problems you mention above.
> 
> If we start with the runqueue deadlock we see the main problem is that
> audit_log_end() pokes the kauditd_wait waitqueue to ensure the
> kauditd_thread thread wakes up and processes the audit queue.  The
> audit_log_start() function does something similar, but it is
> conditional on a number of factors and isn't as likely to be hit.  If
> we relocate these kauditd wakeup calls we can remove the deadlock in
> trace_sched_switch().  In the case of CONFIG_AUDITSYSCALL=y we can
> probably just move the wakeup to __audit_syscall_exit() and in the
> case of CONFIG_AUDITSYSCALL=n we can likely just change the
> wait_event_freezable() call in kauditd_thread to a
> wait_event_freezable_timeout() call with a HZ timeout (the audit
> stream will be much less on these systems anyway so a queue overflow
> is much less likely).  I'm building a kernel with these changes now, I
> should have something to test when I wake up tomorrow morning.  It
> might even provide a bit of a performance boost as we would only be
> calling a wakeup function once for each syscall.

As other SELinux developers like Ondrej already pointed out to you in
this thread:

   Actually, I wasn't aware of the deadlock... But calling an LSM hook
   [that is backed by a SELinux access check] from within a BPF helper
   is calling for all kinds of trouble, so I'm not surprised

This is _generally_ a bad idea since it will potentially blow up in
random ways. A _simple_ on/off switch like lockdown_is_locked_down()
did was okayish (maybe modulo the pr_notice() which should rather have
been a pr_notice_ratelimited() when this is potentially called Mio
of times per sec worst case), but everything else is, again, just
asking for trouble now and/or in future when folks extend the SELinux
backend implementation or add a locked_down hook to other LSMs. That
59438b46471a was missing it was the first proof of exactly this, and
other LSMs will run into the same. Similarly for relying on 'current'
given it works on /some/ of the security_locked_down() call-sites /but
not others/. No matter from which angle you look at it, calling a LSM
hook from a helper is just plain broken.

> The other issue is related to security_locked_down() and using the
> right subject for the access control check.  As has been pointed out
> several times in this thread, the current code uses the current() task
> as the subject, which is arguably incorrect for many of the BPF helper
> functions.  In the case of BPF, we have talked about using the
> credentials of the task which loaded the BPF program instead of
> current(), and that does make a certain amount of sense.  Such an
> approach should make the security policy easier to develop and
> rationalize, leading to a significant decrease in audit records coming
> from LSM access denials.  The question is how to implement such a
> change.  The current SELinux security_bpf_prog_alloc() hook causes the
> newly loaded BPF program to inherit the subject context from the task
> which loads the BPF program; if it is possible to reference the
> bpf_prog struct, or really just the associated bpf_prog_aux->security
> blob, from inside a security_bpf_locked_down() function we use that
> subject information to perform the access check.  BPF folks, is there
> a way to get that information from within a BPF kernel helper
> function?  If it isn't currently possible, could it be made possible
> (or something similar)?

While this could be a potential avenue, the problem here is that BPF
helpers have neither access to the prog struct nor to bpf_prog_aux. As
I mentioned earlier, potentially you could go and fix up JITed images
for those progs where the credentials of the loading task require this
when policy suddenly changes to a more stricter level. I'm not a fan
of this though because of the fragility involved here.

Again, the problem is not limited to BPF at all. kprobes is doing register-
time hooks which are equivalent to the one of BPF. Anything in run-time
trying to prevent probe_read_kernel by kprobes or BPF is broken by design.

Thanks,
Daniel
Paul Moore June 4, 2021, 11:34 p.m. UTC | #29
On Fri, Jun 4, 2021 at 2:02 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/4/21 6:50 AM, Paul Moore wrote:
> > On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> [...]
> >> I did run this entire discussion by both of the other BPF co-maintainers
> >> (Alexei, Andrii, CC'ed) and together we did further brainstorming on the
> >> matter on how we could solve this, but couldn't find a sensible & clean
> >> solution so far.
> >
> > Before I jump into the patch below I just want to say that I
> > appreciate you looking into solutions on the BPF side of things.
> > However, I voted "no" on this patch previously and since you haven't
> > really changed it, my "no"/NACK vote remains, at least until we
> > exhaust a few more options.
>
> Just to set the record straight, you previously did neither ACK nor NACK it.

I think I said it wasn't a great solution.  Perhaps I should have been
more clear, but I don't like NACK'ing things while we are still
hashing out possible solutions on the lists.

From my perspective NACK'ing is pretty harsh and I try to leave that
as a last resort.

> And
> again, as summarized earlier, this patch is _fixing_ the majority of the damage
> caused by 59438b46471a for at least the BPF side of things where users run into this,
> Ondrej the rest. Everything else can be discussed on top, and so far it seems there
> is no clean solution in front of us either, not even speaking of one that is small
> and suitable for _stable_ trees. Let me reiterate where we are: it's not that the
> original implementation in 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is
> in confidentiality mode") was broken, it's that the later added _SELinux_ locked_down
> hook implementation that is broken, and not other LSMs.

I think there are still options for how to solve this moving forward,
more on that at the end of the email, and I would like to see if we
can chase down some of those ideas first.  Maybe the ideas aren't
practical, or maybe they are practical but not from a -stable
perspective; we/I don't know until we talk about it.  Based on
previous experience I'm afraid to ACK a quick-fix without some
discussion about the proper long-term fix.  If the long-term fix isn't
suitable for -stable, then we can take a smaller fix just for the
stable trees.

> Now you're trying to retrofittingly
> ask us for hacks at all costs just because of /a/ broken LSM implementation.

This is starting to get a little off the rails now isn't it?  Daniel
you've always come across as a reasonable person to me, but phrasing
things like that is inflammatory at best.

Could the SELinux lockdown hooks have been done better, of course, I
don't think we are debating that anymore.  New functionality, checks,
etc. are added to the kernel all the time, and because we're all human
we screw that up on occasion.  The important part is that we come
together to fix it, which is what I think we're trying to do here
(it's what I'm trying to do here).

> Maybe
> another avenue is to just swallow the pill and revert 59438b46471a since it made
> assumptions that don't work /generally/. And the use case for an application runtime
> policy change in particular in case of lockdown where users only have 3 choices is
> extremely tiny/rare, if it's not then something is very wrong in your deployment.
> Let me ask you this ... are you also planning to address *all* the other cases inside
> the kernel? If your answer is no, then this entire discussion is pointless.

Um, yes?  We were talking about that earlier in the thread before the
BPF portions of the fix started to dominate the discussion.  Just
because the BPF portion of the fix has dominated the discussion
recently doesn't mean the other issues aren't going to be addressed.

When stuff is busted, I work to fix it.  I think everyone here does.

> >> [PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
> >>
> >> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> >> added an implementation of the locked_down LSM hook to SELinux, with the aim
> >> to restrict which domains are allowed to perform operations that would breach
> >> lockdown. This is indirectly also getting audit subsystem involved to report
> >> events. The latter is problematic, as reported by Ondrej and Serhei, since it
> >> can bring down the whole system via audit:
> >>
> >>     1) The audit events that are triggered due to calls to security_locked_down()
> >>        can OOM kill a machine, see below details [0].
> >>
> >>     2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
> >>        when trying to wake up kauditd, for example, when using trace_sched_switch()
> >>        tracepoint, see details in [1]. Triggering this was not via some hypothetical
> >>        corner case, but with existing tools like runqlat & runqslower from bcc, for
> >>        example, which make use of this tracepoint. Rough call sequence goes like:
> >>
> >>        rq_lock(rq) -> -------------------------+
> >>          trace_sched_switch() ->               |
> >>            bpf_prog_xyz() ->                   +-> deadlock
> >>              selinux_lockdown() ->             |
> >>                audit_log_end() ->              |
> >>                  wake_up_interruptible() ->    |
> >>                    try_to_wake_up() ->         |
> >>                      rq_lock(rq) --------------+
> >
> > Since BPF is a bit of chaotic nightmare in the sense that it basically
> > out-of-tree kernel code that can be called from anywhere and do pretty
> > much anything; it presents quite the challenge for those of us worried
> > about LSM access controls.
>
> There is no need to generalize ... for those worried, BPF subsystem has LSM access
> controls for the syscall since 2017 via afdb09c720b6 ("security: bpf: Add LSM hooks
> for bpf object related syscall").

We weren't really talking about those hooks in this thread, we're
talking about the security_locked_down() hooks in the BPF helper
functions.  The BPF/LSM hooks you mention above have nothing to do
with that at present, but yes we do have lots of cool LSM hooks that
allow admins to do lots of cool things with access controls.

Anyway, that was a distraction I started in my last email when I
shouldn't have.  If I'm going to call you out for inflammatory
language I should do the same to myself - you have my apology.

> > So let's look at this from a different angle.  Let's look at the two
> > problems you mention above.
> >
> > If we start with the runqueue deadlock we see the main problem is that
> > audit_log_end() pokes the kauditd_wait waitqueue to ensure the
> > kauditd_thread thread wakes up and processes the audit queue.  The
> > audit_log_start() function does something similar, but it is
> > conditional on a number of factors and isn't as likely to be hit.  If
> > we relocate these kauditd wakeup calls we can remove the deadlock in
> > trace_sched_switch().  In the case of CONFIG_AUDITSYSCALL=y we can
> > probably just move the wakeup to __audit_syscall_exit() and in the
> > case of CONFIG_AUDITSYSCALL=n we can likely just change the
> > wait_event_freezable() call in kauditd_thread to a
> > wait_event_freezable_timeout() call with a HZ timeout (the audit
> > stream will be much less on these systems anyway so a queue overflow
> > is much less likely).  I'm building a kernel with these changes now, I
> > should have something to test when I wake up tomorrow morning.  It
> > might even provide a bit of a performance boost as we would only be
> > calling a wakeup function once for each syscall.
>
> As other SELinux developers like Ondrej already pointed out to you in
> this thread:
>
>    Actually, I wasn't aware of the deadlock... But calling an LSM hook
>    [that is backed by a SELinux access check] from within a BPF helper
>    is calling for all kinds of trouble, so I'm not surprised

Skipping past the phrasing of your comment, like any two people Ondrej
and I disagree on some things, and this is one of those things.
Surely the BPF folks never disagree on anything :)

I agree that the current security_locked_down() SELinux hook
implementation has problems, mostly due to the audit code path, but I
think it is something we can fix.  Simply throwing our hands up in the
air in defeat and ripping out the LSM checks in the BPF helpers is
something I see as a last resort and I think we still have one more
potential solution to discuss.

The SELinux hook implementation was only a problem because of the
wakeup call in audit_log_end().  The fact that the hook progressed all
the way to audit_log_end() for the AVC record shows that the bulk of
the hook is fine, we just need to remove the wakeup from
audit_log_end().  To that end, I've been playing with a patch that
does just that, it removes all of the wakeup calls from the
audit_log_start() and audit_log_end() functions, I made some small
tweaks to the patch before I started responding to your email (the
kernel is compiling as I type this) but I expect to post it to the
audit list as a RFC tonight or this weekend for review.

> > The other issue is related to security_locked_down() and using the
> > right subject for the access control check.  As has been pointed out
> > several times in this thread, the current code uses the current() task
> > as the subject, which is arguably incorrect for many of the BPF helper
> > functions.  In the case of BPF, we have talked about using the
> > credentials of the task which loaded the BPF program instead of
> > current(), and that does make a certain amount of sense.  Such an
> > approach should make the security policy easier to develop and
> > rationalize, leading to a significant decrease in audit records coming
> > from LSM access denials.  The question is how to implement such a
> > change.  The current SELinux security_bpf_prog_alloc() hook causes the
> > newly loaded BPF program to inherit the subject context from the task
> > which loads the BPF program; if it is possible to reference the
> > bpf_prog struct, or really just the associated bpf_prog_aux->security
> > blob, from inside a security_bpf_locked_down() function we use that
> > subject information to perform the access check.  BPF folks, is there
> > a way to get that information from within a BPF kernel helper
> > function?  If it isn't currently possible, could it be made possible
> > (or something similar)?
>
> While this could be a potential avenue, the problem here is that BPF
> helpers have neither access to the prog struct nor to bpf_prog_aux. As
> I mentioned earlier, potentially you could go and fix up JITed images
> for those progs where the credentials of the loading task require this
> when policy suddenly changes to a more stricter level. I'm not a fan
> of this though because of the fragility involved here.

FWIW, the JIT hacks sounded awful to me too, I'm not advocating for
that as a solution.  What I'm thinking of is a sort of "bpf_current"
that functions similar to current() but could be used from within the
BPF helpers, and/or the LSMs hooks they call, to reference the
bpf_prog struct of the currently executing BPF program.  Once again,
I'm not a kernel BPF expert, but it seems like we could create a
per-CPU pointer for the bpf_prog struct and assign it as part of the
__BPF_PROG_RUN macro.  The macro already has the bpf_prog pointer so I
wouldn't expect that doing a per-CPU pointer assignment wouldn't be
too costly considering the other things that take place.  Or am I
missing something important?

> Again, the problem is not limited to BPF at all. kprobes is doing register-
> time hooks which are equivalent to the one of BPF. Anything in run-time
> trying to prevent probe_read_kernel by kprobes or BPF is broken by design.

Not being an expert on kprobes I can't really comment on that, but
right now I'm focused on trying to make things work for the BPF
helpers.  I suspect that if we can get the SELinux lockdown
implementation working properly for BPF the solution for kprobes won't
be far off.
Alexei Starovoitov June 5, 2021, 12:08 a.m. UTC | #30
On Fri, Jun 4, 2021 at 4:34 PM Paul Moore <paul@paul-moore.com> wrote:
>
> > Again, the problem is not limited to BPF at all. kprobes is doing register-
> > time hooks which are equivalent to the one of BPF. Anything in run-time
> > trying to prevent probe_read_kernel by kprobes or BPF is broken by design.
>
> Not being an expert on kprobes I can't really comment on that, but
> right now I'm focused on trying to make things work for the BPF
> helpers.  I suspect that if we can get the SELinux lockdown
> implementation working properly for BPF the solution for kprobes won't
> be far off.

Paul,

Both kprobe and bpf can call probe_read_kernel==copy_from_kernel_nofault
from all contexts.
Including NMI. Most of audit_log_* is not acceptable.
Just removing a wakeup is not solving anything.
Audit hooks don't belong in NMI.
Audit design needs memory allocation. Hence it's not suitable
for NMI and hardirq. But kprobes and bpf progs do run just fine there.
BPF, for example, only uses pre-allocated memory.
Casey Schaufler June 5, 2021, 6:10 p.m. UTC | #31
On 6/4/2021 5:08 PM, Alexei Starovoitov wrote:
> On Fri, Jun 4, 2021 at 4:34 PM Paul Moore <paul@paul-moore.com> wrote:
>>> Again, the problem is not limited to BPF at all. kprobes is doing register-
>>> the hooks which are equivalent to the one of BPF. Anything in run-time
>>> trying to prevent probe_read_kernel by kprobes or BPF is broken by design.
>> Not being an expert on kprobes I can't really comment on that, but
>> right now I'm focused on trying to make things work for the BPF
>> helpers.  I suspect that if we can get the SELinux lockdown
>> implementation working properly for BPF the solution for kprobes won't
>> be far off.
> Paul,
>
> Both kprobe and bpf can call probe_read_kernel==copy_from_kernel_nofault
> from all contexts.
> Including NMI. Most of audit_log_* is not acceptable.
> Just removing a wakeup is not solving anything.
> Audit hooks don't belong in NMI.
> Audit design needs memory allocation. Hence it's not suitable
> for NMI and hardirq. But kprobes and bpf progs do run just fine there.
> BPF, for example, only uses pre-allocated memory.

You have fallen into a common fallacy. The fact that the "code runs"
does not assure that the "system works right". In the security world
we face this all the time, often with performance expectations. In this
case the BPF design has failed to accommodate the long standing needs
of audit and SELinux. Shifting the responsibility for these design flaws
to SELinux is inappropriate. Integration of sub-systems is usually the
burden of the newcomer, which in this case is BPF. Paul is doing the
bulk of your work for you. Maybe you could step up to your responsibility
and work with him, not against him.
Linus Torvalds June 5, 2021, 6:17 p.m. UTC | #32
On Sat, Jun 5, 2021 at 11:11 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> You have fallen into a common fallacy. The fact that the "code runs"
> does not assure that the "system works right". In the security world
> we face this all the time, often with performance expectations. In this
> case the BPF design has failed [..]

I think it's the lockdown patches that have failed. They did the wrong
thing, they didn't work,

The report in question is for a regression.

THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.

Honestly, security people need to understand that "not working" is not
a success case of security. It's a failure case.

Yes, "not working" may be secure. But security in that case is *pointless*.

              Linus
Paul Moore June 6, 2021, 1:30 a.m. UTC | #33
On Fri, Jun 4, 2021 at 8:08 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Jun 4, 2021 at 4:34 PM Paul Moore <paul@paul-moore.com> wrote:
> >
> > > Again, the problem is not limited to BPF at all. kprobes is doing register-
> > > time hooks which are equivalent to the one of BPF. Anything in run-time
> > > trying to prevent probe_read_kernel by kprobes or BPF is broken by design.
> >
> > Not being an expert on kprobes I can't really comment on that, but
> > right now I'm focused on trying to make things work for the BPF
> > helpers.  I suspect that if we can get the SELinux lockdown
> > implementation working properly for BPF the solution for kprobes won't
> > be far off.
>
> Paul,

Hi Alexei,

> Both kprobe and bpf can call probe_read_kernel==copy_from_kernel_nofault
> from all contexts.
> Including NMI.

Thanks, that is helpful.  In hindsight it should have been obvious
that kprobe/BPF would offer to insert code into the NMI handlers, but
I don't recall it earlier in the discussion, it's possible I simply
missed the mention.

> Most of audit_log_* is not acceptable.
> Just removing a wakeup is not solving anything.

That's not really fair now is it?  Removing the wakeups in
audit_log_start() and audit_log_end() does solve some problems,
although not all of them (i.e. the NMI problem being the 800lb
gorilla).  Because of the NMI case we're not going to solve the
LSM/audit case anytime soon so it looks like we are going to have to
fall back to the patch Daniel proposed.

Acked-by: Paul Moore <paul@paul-moore.com>

--
paul moore
www.paul-moore.com
Paul Moore June 6, 2021, 2:11 a.m. UTC | #34
On Sat, Jun 5, 2021 at 2:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Jun 5, 2021 at 11:11 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> > You have fallen into a common fallacy. The fact that the "code runs"
> > does not assure that the "system works right". In the security world
> > we face this all the time, often with performance expectations. In this
> > case the BPF design has failed [..]
>
> I think it's the lockdown patches that have failed. They did the wrong
> thing, they didn't work,
>
> The report in question is for a regression.
>
> THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.

To think I was worried we might end this thread without a bit of CAPS
LOCK, whew! :)

I don't think anyone in this discussion, even Casey's last comment,
was denying that there was a problem.  The discussion and the
disagreements were about what a "proper" fix would be, and how one
might implement that fix; of course there were different ideas of
"proper" and implementations vary even when people agree, so things
were a bit of a mess.  If you want to get upset and shouty, I think
there are a few things spread across the subsystems involved that
would be worthy targets, but to say that Casey, myself, or anyone else
who plays under security/ denied the problem in this thread is not
fair, or correct, in my opinion.

> Honestly, security people need to understand that "not working" is not
> a success case of security. It's a failure case.

I can't pretend to know what all of the "security people" are
thinking, but I can say with a good degree of certainty that my goal
is not to crash, panic, kill, or otherwise disable a user's system.
When it comes to things like the LSM hooks, my goal is to try and make
sure we have the right hooks in the right places so that admins and
users have the tools they need to control access to their data and
systems in the way that they choose.  Sometimes this puts us at odds
with other subsystems in the kernel, we saw that in this thread, but
that's to be expected anytime you have competing priorities.  The
important part is that eventually we figure out some way to move
forward, and the fact that we are still all making progress and
putting out new kernel releases is proof that we are finding a way.
That's what matters to me, and if I was forced to guess, I would
imagine that matters quite a lot to most of us here.
Ondrej Mosnacek June 8, 2021, 11:01 a.m. UTC | #35
On Thu, Jun 3, 2021 at 7:46 PM Paul Moore <paul@paul-moore.com> wrote:
> On Wed, Jun 2, 2021 at 9:40 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> > On Fri, May 28, 2021 at 3:37 AM Paul Moore <paul@paul-moore.com> wrote:
[...]
> > > I know you and Casey went back and forth on this in v1, but I agree
> > > with Casey that having two LSM hooks here is a mistake.  I know it
> > > makes backports hard, but spoiler alert: maintaining complex software
> > > over any non-trivial period of time is hard, reeeeally hard sometimes
> > > ;)
> >
> > Do you mean having two slots in lsm_hook_defs.h or also having two
> > security_*() functions? (It's not clear to me if you're just
> > reiterating disagreement with v1 or if you dislike the simplified v2
> > as well.)
>
> To be clear I don't think there should be two functions for this, just
> make whatever changes are necessary to the existing
> security_locked_down() LSM hook.  Yes, the backport is hard.  Yes, it
> will touch a lot of code.  Yes, those are lame excuses to not do the
> right thing.

OK, I guess I'll just go with the bigger patch.

> > > > The callers migrated to the new hook, passing NULL as cred:
> > > > 1. arch/powerpc/xmon/xmon.c
> > > >      Here the hook seems to be called from non-task context and is only
> > > >      used for redacting some sensitive values from output sent to
> > > >      userspace.
> > >
> > > This definitely sounds like kernel_t based on the description above.
> >
> > Here I'm a little concerned that the hook might be called from some
> > unusual interrupt, which is not masked by spin_lock_irqsave()... We
> > ran into this with PMI (Platform Management Interrupt) before, see
> > commit 5ae5fbd21079 ("powerpc/perf: Fix handling of privilege level
> > checks in perf interrupt context"). While I can't see anything that
> > would suggest something like this happening here, the whole thing is
> > so foreign to me that I'm wary of making assumptions :)
> >
> > @Michael/PPC devs, can you confirm to us that xmon_is_locked_down() is
> > only called from normal syscall/interrupt context (as opposed to
> > something tricky like PMI)?
>
> You did submit the code change so I assumed you weren't concerned
> about it :)  If it is a bad hook placement that is something else
> entirely.

What I submitted effectively avoided the SELinux hook to be called, so
I didn't bother checking deeper in what context those hook calls would
occur.

> Hopefully we'll get some guidance from the PPC folks.
>
> > > > 4. net/xfrm/xfrm_user.c:copy_to_user_*()
> > > >      Here a cryptographic secret is redacted based on the value returned
> > > >      from the hook. There are two possible actions that may lead here:
> > > >      a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> > > >         task context is relevant, since the dumped data is sent back to
> > > >         the current task.
> > >
> > > If the task context is relevant we should use it.
> >
> > Yes, but as I said it would create an asymmetry with case b), which
> > I'll expand on below...
> >
> > > >      b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are
> > > >         broadcasted to tasks subscribed to XFRM events - here the
> > > >         SELinux check is not meningful as the current task's creds do
> > > >         not represent the tasks that could potentially see the secret.
> > >
> > > This looks very similar to the BPF hook discussed above, I believe my
> > > comments above apply here as well.
> >
> > Using the current task is just logically wrong in this case. The
> > current task here is just simply deleting an SA that happens to have
> > some secret value in it. When deleting an SA, a notification is sent
> > to a group of subscribers (some group of other tasks), which includes
> > a dump of the secret value. The current task isn't doing any attempt
> > to breach lockdown, it's just deleting an SA.
> >
> > It also makes it really awkward to make policy decisions around this.
> > Suppose that domains A, B, and C need to be able to add/delete SAs and
> > domains D, E, and F need to receive notifications about changes in
> > SAs. Then if, say, domain E actually needs to see the secret values in
> > the notifications, you must grant the confidentiality permission to
> > all of A, B, C to keep things working. And now you have opened up the
> > door for A, B, C to do other lockdown-confidentiality stuff, even
> > though these domains themselves actually don't request/need any
> > confidential data from the kernel. That's just not logical and you may
> > actually end up (slightly) worse security-wise than if you just
> > skipped checking for XFRM secrets altogether, because you need to
> > allow confidentiality to domains for which it may be excessive.
>
> It sounds an awful lot like the lockdown hook is in the wrong spot.
> It sounds like it would be a lot better to relocate the hook than
> remove it.

I don't see how you would solve this by moving the hook. Where do you
want to relocate it? The main obstacle is that the message containing
the SA dump is sent to consumers via a simple netlink broadcast, which
doesn't provide a facility to redact the SA secret on a per-consumer
basis. I can't see any way to make the checks meaningful for SELinux
without a major overhaul of the broadcast logic.

IMO, switching the subject to kernel_t either in both cases or at
least in case b) is the best compromise between usability, security,
and developers' time / code complexity. I don't think controlling
which of the CAP_NET_ADMIN domains can/can't see/leak SA secrets is
worth obsessing about.

--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
Paul Moore June 9, 2021, 2:40 a.m. UTC | #36
On Tue, Jun 8, 2021 at 7:02 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> On Thu, Jun 3, 2021 at 7:46 PM Paul Moore <paul@paul-moore.com> wrote:

...

> > It sounds an awful lot like the lockdown hook is in the wrong spot.
> > It sounds like it would be a lot better to relocate the hook than
> > remove it.
>
> I don't see how you would solve this by moving the hook. Where do you
> want to relocate it?

Wherever it makes sense.  Based on your comments it really sounded
like the hook was in a bad spot and since your approach in a lot of
this had been to remove or disable hooks I wanted to make sure that
relocating the hook was something you had considered.  Thankfully it
sounds like you have considered moving the hook - that's good.

> The main obstacle is that the message containing
> the SA dump is sent to consumers via a simple netlink broadcast, which
> doesn't provide a facility to redact the SA secret on a per-consumer
> basis. I can't see any way to make the checks meaningful for SELinux
> without a major overhaul of the broadcast logic.

Fair enough.
diff mbox series

Patch

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index c8173e92f19d..90992793b4c5 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -299,7 +299,7 @@  static bool xmon_is_locked_down(void)
 	static bool lockdown;
 
 	if (!lockdown) {
-		lockdown = !!security_locked_down(LOCKDOWN_XMON_RW);
+		lockdown = !!security_cred_locked_down(NULL, LOCKDOWN_XMON_RW);
 		if (lockdown) {
 			printf("xmon: Disabled due to kernel lockdown\n");
 			xmon_is_ro = true;
@@ -307,7 +307,7 @@  static bool xmon_is_locked_down(void)
 	}
 
 	if (!xmon_is_ro) {
-		xmon_is_ro = !!security_locked_down(LOCKDOWN_XMON_WR);
+		xmon_is_ro = !!security_cred_locked_down(NULL, LOCKDOWN_XMON_WR);
 		if (xmon_is_ro)
 			printf("xmon: Read-only due to kernel lockdown\n");
 	}
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 1261e8b41edb..7edde3fc22f5 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -396,7 +396,7 @@  struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	struct dentry *dentry;
 	struct inode *inode;
 
-	if (security_locked_down(LOCKDOWN_TRACEFS))
+	if (security_cred_locked_down(NULL, LOCKDOWN_TRACEFS))
 		return NULL;
 
 	if (!(mode & S_IFMT))
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 2adeea44c0d5..0115d7e3db55 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -393,7 +393,8 @@  LSM_HOOK(int, 0, bpf_prog_alloc_security, struct bpf_prog_aux *aux)
 LSM_HOOK(void, LSM_RET_VOID, bpf_prog_free_security, struct bpf_prog_aux *aux)
 #endif /* CONFIG_BPF_SYSCALL */
 
-LSM_HOOK(int, 0, locked_down, enum lockdown_reason what)
+LSM_HOOK(int, 0, cred_locked_down, const struct cred *cred,
+	 enum lockdown_reason what)
 
 #ifdef CONFIG_PERF_EVENTS
 LSM_HOOK(int, 0, perf_event_open, struct perf_event_attr *attr, int type)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 5c4c5c0602cb..2d2d82ffea34 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1539,10 +1539,11 @@ 
  * @bpf_prog_free_security:
  *	Clean up the security information stored inside bpf prog.
  *
- * @locked_down:
+ * @cred_locked_down:
  *     Determine whether a kernel feature that potentially enables arbitrary
  *     code execution in kernel space should be permitted.
  *
+ *     @cred: credential asociated with the operation, or NULL if not applicable
  *     @what: kernel feature being accessed
  *
  * Security hooks for perf events
diff --git a/include/linux/security.h b/include/linux/security.h
index 24eda04221e9..6a609787a03a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -26,6 +26,7 @@ 
 #include <linux/kernel_read_file.h>
 #include <linux/key.h>
 #include <linux/capability.h>
+#include <linux/cred.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/err.h>
@@ -33,7 +34,6 @@ 
 #include <linux/mm.h>
 
 struct linux_binprm;
-struct cred;
 struct rlimit;
 struct kernel_siginfo;
 struct sembuf;
@@ -470,7 +470,7 @@  void security_inode_invalidate_secctx(struct inode *inode);
 int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen);
 int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
 int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
-int security_locked_down(enum lockdown_reason what);
+int security_cred_locked_down(const struct cred *cred, enum lockdown_reason what);
 #else /* CONFIG_SECURITY */
 
 static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1343,12 +1343,17 @@  static inline int security_inode_getsecctx(struct inode *inode, void **ctx, u32
 {
 	return -EOPNOTSUPP;
 }
-static inline int security_locked_down(enum lockdown_reason what)
+static inline int security_cred_locked_down(struct cred *cred, enum lockdown_reason what)
 {
 	return 0;
 }
 #endif	/* CONFIG_SECURITY */
 
+static inline int security_locked_down(enum lockdown_reason what)
+{
+	return security_cred_locked_down(current_cred(), what);
+}
+
 #if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
 int security_post_notification(const struct cred *w_cred,
 			       const struct cred *cred,
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d2d7cf6cfe83..d8a242837c1e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -215,7 +215,7 @@  const struct bpf_func_proto bpf_probe_read_user_str_proto = {
 static __always_inline int
 bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr)
 {
-	int ret = security_locked_down(LOCKDOWN_BPF_READ);
+	int ret = security_cred_locked_down(NULL, LOCKDOWN_BPF_READ);
 
 	if (unlikely(ret < 0))
 		goto fail;
@@ -246,7 +246,7 @@  const struct bpf_func_proto bpf_probe_read_kernel_proto = {
 static __always_inline int
 bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
 {
-	int ret = security_locked_down(LOCKDOWN_BPF_READ);
+	int ret = security_cred_locked_down(NULL, LOCKDOWN_BPF_READ);
 
 	if (unlikely(ret < 0))
 		goto fail;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index f0aecee4d539..5f45848c4ff3 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -851,7 +851,7 @@  static int copy_user_offload(struct xfrm_state_offload *xso, struct sk_buff *skb
 static bool xfrm_redact(void)
 {
 	return IS_ENABLED(CONFIG_SECURITY) &&
-		security_locked_down(LOCKDOWN_XFRM_SECRET);
+		security_cred_locked_down(NULL, LOCKDOWN_XFRM_SECRET);
 }
 
 static int copy_to_user_auth(struct xfrm_algo_auth *auth, struct sk_buff *skb)
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index 87cbdc64d272..2a13c866c22a 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -55,7 +55,8 @@  early_param("lockdown", lockdown_param);
  * lockdown_is_locked_down - Find out if the kernel is locked down
  * @what: Tag to use in notice generated if lockdown is in effect
  */
-static int lockdown_is_locked_down(enum lockdown_reason what)
+static int lockdown_is_locked_down(const struct cred *cred,
+				   enum lockdown_reason what)
 {
 	if (WARN(what >= LOCKDOWN_CONFIDENTIALITY_MAX,
 		 "Invalid lockdown reason"))
@@ -72,7 +73,7 @@  static int lockdown_is_locked_down(enum lockdown_reason what)
 }
 
 static struct security_hook_list lockdown_hooks[] __lsm_ro_after_init = {
-	LSM_HOOK_INIT(locked_down, lockdown_is_locked_down),
+	LSM_HOOK_INIT(cred_locked_down, lockdown_is_locked_down),
 };
 
 static int __init lockdown_lsm_init(void)
diff --git a/security/security.c b/security/security.c
index 0c1c9796e3e4..c987b70bc16c 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2592,11 +2592,11 @@  void security_bpf_prog_free(struct bpf_prog_aux *aux)
 }
 #endif /* CONFIG_BPF_SYSCALL */
 
-int security_locked_down(enum lockdown_reason what)
+int security_cred_locked_down(const struct cred *cred, enum lockdown_reason what)
 {
-	return call_int_hook(locked_down, 0, what);
+	return call_int_hook(cred_locked_down, 0, cred, what);
 }
-EXPORT_SYMBOL(security_locked_down);
+EXPORT_SYMBOL(security_cred_locked_down);
 
 #ifdef CONFIG_PERF_EVENTS
 int security_perf_event_open(struct perf_event_attr *attr, int type)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index eaea837d89d1..c415e3ca4f24 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -7017,10 +7017,10 @@  static void selinux_bpf_prog_free(struct bpf_prog_aux *aux)
 }
 #endif
 
-static int selinux_lockdown(enum lockdown_reason what)
+static int selinux_lockdown(const struct cred *cred, enum lockdown_reason what)
 {
 	struct common_audit_data ad;
-	u32 sid = current_sid();
+	u32 sid;
 	int invalid_reason = (what <= LOCKDOWN_NONE) ||
 			     (what == LOCKDOWN_INTEGRITY_MAX) ||
 			     (what >= LOCKDOWN_CONFIDENTIALITY_MAX);
@@ -7032,6 +7032,12 @@  static int selinux_lockdown(enum lockdown_reason what)
 		return -EINVAL;
 	}
 
+	/* Ignore if there is no relevant cred to check against */
+	if (!cred)
+		return 0;
+
+	sid = cred_sid(cred);
+
 	ad.type = LSM_AUDIT_DATA_LOCKDOWN;
 	ad.u.reason = what;
 
@@ -7353,7 +7359,7 @@  static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(perf_event_write, selinux_perf_event_write),
 #endif
 
-	LSM_HOOK_INIT(locked_down, selinux_lockdown),
+	LSM_HOOK_INIT(cred_locked_down, selinux_lockdown),
 
 	/*
 	 * PUT "CLONING" (ACCESSING + ALLOCATING) HOOKS HERE