[RFC,v19,2/5] security: Add new SHOULD_EXEC_CHECK and SHOULD_EXEC_RESTRICT securebits

Message ID 20240704190137.696169-3-mic@digikod.net (mailing list archive)
State New
Series Script execution control (was O_MAYEXEC)

Commit Message

Mickaël Salaün July 4, 2024, 7:01 p.m. UTC
These new SECBIT_SHOULD_EXEC_CHECK, SECBIT_SHOULD_EXEC_RESTRICT, and
their *_LOCKED counterparts are designed to be set by processes setting
up an execution environment, such as a user session, a container, or a
security sandbox.  Like seccomp filters or Landlock domains, the
securebits are inherited across processes.

When SECBIT_SHOULD_EXEC_CHECK is set, programs interpreting code should
check executable resources with execveat(2) + AT_CHECK (see previous
patch).

When SECBIT_SHOULD_EXEC_RESTRICT is set, a process should only allow
execution of approved resources, if any (see SECBIT_SHOULD_EXEC_CHECK).
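
As a rough illustration of how an interpreter might consult these bits,
here is a minimal sketch.  It assumes the bit positions proposed by this
RFC (8 and 10), which are not part of any released UAPI header, and the
helper names (read_securebits, wants_exec_check, wants_exec_restrict) are
hypothetical.  Only prctl(2) with PR_GET_SECUREBITS is an existing
interface.

```c
#include <stdbool.h>
#include <sys/prctl.h>

/* Bit positions as proposed by this RFC (assumed, not in released headers). */
#ifndef SECBIT_SHOULD_EXEC_CHECK
#define SECBIT_SHOULD_EXEC_CHECK	(1UL << 8)
#endif
#ifndef SECBIT_SHOULD_EXEC_RESTRICT
#define SECBIT_SHOULD_EXEC_RESTRICT	(1UL << 10)
#endif

/* Hypothetical helper: returns the caller's securebits, or -1 on error. */
static inline long read_securebits(void)
{
	return prctl(PR_GET_SECUREBITS, 0, 0, 0, 0);
}

/* True if the environment asks interpreters to check files before use. */
static inline bool wants_exec_check(unsigned long secbits)
{
	return (secbits & SECBIT_SHOULD_EXEC_CHECK) != 0;
}

/* True if the environment asks interpreters to refuse unapproved code. */
static inline bool wants_exec_restrict(unsigned long secbits)
{
	return (secbits & SECBIT_SHOULD_EXEC_RESTRICT) != 0;
}
```

An interpreter would typically call read_securebits() once at startup and
pass the result to the two predicates; what it does with the answers is
its own policy decision.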

For a secure environment, we might also want
SECBIT_SHOULD_EXEC_CHECK_LOCKED and SECBIT_SHOULD_EXEC_RESTRICT_LOCKED
to be set.  For a test environment (e.g. testing on a fleet to identify
potential issues), only the SECBIT_SHOULD_EXEC_CHECK* bits need to be
set, which still makes it possible to identify potential issues (e.g.
with interpreter logs or LSM audit entries).
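
The two profiles above could be expressed as bit masks along these lines.
This is only a sketch: the bit positions are the ones proposed by this
RFC, the function names are hypothetical, and PR_SET_SECUREBITS is
expected to fail with EPERM on kernels that do not carry this series.

```c
#include <sys/prctl.h>

/* Bit positions as proposed by this RFC (assumed, not in released headers). */
#ifndef SECBIT_SHOULD_EXEC_CHECK
#define SECBIT_SHOULD_EXEC_CHECK		(1UL << 8)
#define SECBIT_SHOULD_EXEC_CHECK_LOCKED		(1UL << 9)
#define SECBIT_SHOULD_EXEC_RESTRICT		(1UL << 10)
#define SECBIT_SHOULD_EXEC_RESTRICT_LOCKED	(1UL << 11)
#endif

/* Secure environment: both bits set and locked. */
static inline unsigned long exec_bits_secure(void)
{
	return SECBIT_SHOULD_EXEC_CHECK | SECBIT_SHOULD_EXEC_CHECK_LOCKED |
	       SECBIT_SHOULD_EXEC_RESTRICT | SECBIT_SHOULD_EXEC_RESTRICT_LOCKED;
}

/* Test environment: only the CHECK* bits, so failures can be logged. */
static inline unsigned long exec_bits_test(void)
{
	return SECBIT_SHOULD_EXEC_CHECK | SECBIT_SHOULD_EXEC_CHECK_LOCKED;
}

/*
 * Hypothetical helper: adds the given bits to the caller's securebits.
 * Returns 0 on success, -1 on error (errno set by prctl).
 */
static inline int apply_exec_bits(unsigned long bits)
{
	int cur = prctl(PR_GET_SECUREBITS, 0, 0, 0, 0);

	if (cur < 0)
		return -1;
	return prctl(PR_SET_SECUREBITS, (unsigned long)cur | bits, 0, 0, 0);
}
```

A session manager or container runtime would call
apply_exec_bits(exec_bits_secure()) before executing the first untrusted
program, so that the bits are inherited by the whole process tree.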

It should be noted that unlike other security bits, the
SECBIT_SHOULD_EXEC_CHECK and SECBIT_SHOULD_EXEC_RESTRICT bits are
dedicated to user space willing to restrict itself.  Because of that,
they only make sense in the context of a trusted environment (e.g.
sandbox, container, user session, full system) where the process
changing its behavior (according to these bits) and all its parent
processes are trusted.  Otherwise, any parent process could just execute
its own malicious code (interpreting a script or not), or even enforce a
seccomp filter to mask these bits.

Such a secure environment can be achieved with an appropriate access
control policy (e.g. mount's noexec option, file access rights, LSM
configuration) and an enlightened ld.so checking that libraries are
allowed for execution e.g., to protect against illegitimate use of
LD_PRELOAD.

Scripts may need some changes to deal with untrusted data (e.g. stdin,
environment variables), but that is outside the scope of the kernel.

The only restriction enforced by the kernel is the right to ptrace
another process.  A process is denied the right to ptrace a less
restricted one, unless the tracer has CAP_SYS_PTRACE.  This is mainly a
safeguard to avoid trivial privilege escalations, e.g. by a debugging
process being abused in a confused deputy attack.
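
The ptrace rule can be modelled in user space on plain securebits
values, which may help when reasoning about which tracer/tracee pairs
are accepted.  The sketch below mirrors the logic of the
ptrace_secbits_allowed() helper added by this patch; the function name
ptrace_secbits_allowed_model is hypothetical, the bit positions are the
ones proposed by this RFC, and the CAP_SYS_PTRACE bypass is deliberately
not modelled.

```c
#include <stdbool.h>

/* Bit positions as proposed by this RFC (assumed, not in released headers). */
#ifndef SECBIT_SHOULD_EXEC_CHECK
#define SECBIT_SHOULD_EXEC_CHECK	(1UL << 8)
#define SECBIT_SHOULD_EXEC_RESTRICT	(1UL << 10)
#endif

#define EXEC_UNPRIVILEGED_BITS \
	(SECBIT_SHOULD_EXEC_CHECK | SECBIT_SHOULD_EXEC_RESTRICT)

static inline bool ptrace_secbits_allowed_model(unsigned long tracer,
						unsigned long tracee)
{
	const unsigned long tracer_bits = EXEC_UNPRIVILEGED_BITS & tracer;
	const unsigned long tracee_bits = EXEC_UNPRIVILEGED_BITS & tracee;
	/* A lock only counts when the bit it protects is set. */
	const unsigned long tracer_locked = (tracer_bits << 1) & tracer;
	const unsigned long tracee_locked = (tracee_bits << 1) & tracee;

	/* The tracee must carry every restriction the tracer carries. */
	if ((tracer_bits | tracee_bits) != tracee_bits)
		return false;

	/* The tracee must carry every effective lock the tracer carries. */
	if ((tracer_locked | tracee_locked) != tracee_locked)
		return false;

	return true;
}
```

For example, a tracer with only SECBIT_SHOULD_EXEC_CHECK set is refused
access to a tracee with no bits set, while the reverse pairing is
accepted.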

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20240704190137.696169-3-mic@digikod.net
---

New design since v18:
https://lore.kernel.org/r/20220104155024.48023-3-mic@digikod.net
---
 include/uapi/linux/securebits.h | 56 ++++++++++++++++++++++++++++-
 security/commoncap.c            | 63 ++++++++++++++++++++++++++++-----
 2 files changed, 110 insertions(+), 9 deletions(-)

Comments

Kees Cook July 5, 2024, 12:18 a.m. UTC | #1
On Thu, Jul 04, 2024 at 09:01:34PM +0200, Mickaël Salaün wrote:
> Such a secure environment can be achieved with an appropriate access
> control policy (e.g. mount's noexec option, file access rights, LSM
> configuration) and an enlightened ld.so checking that libraries are
> allowed for execution e.g., to protect against illegitimate use of
> LD_PRELOAD.
> 
> Scripts may need some changes to deal with untrusted data (e.g. stdin,
> environment variables), but that is outside the scope of the kernel.

If the threat model includes an attacker sitting at a shell prompt, we
need to be very careful about how processes perform enforcement. E.g. even
on a locked down system, if an attacker has access to LD_PRELOAD or a
seccomp wrapper (which you both mention here), it would be possible to
run commands where the resulting process is tricked into thinking it
doesn't have the bits set.

But this would be exactly true for calling execveat(): LD_PRELOAD or
seccomp policy could have it just return 0.

While I like AT_CHECK, I do wonder if it's better to do the checks via
open(), as was originally designed with O_MAYEXEC. Because then
enforcement is gated by the kernel -- the process does not get a file
descriptor _at all_, no matter what LD_PRELOAD or seccomp tricks it into
doing.

And this thinking also applies to faccessat() too: if a process can be
tricked into thinking the access check passed, it'll happily interpret
whatever. :( But not being able to open the fd _at all_ when O_MAYEXEC
is being checked seems substantially safer to me...
Mickaël Salaün July 5, 2024, 5:54 p.m. UTC | #2
On Thu, Jul 04, 2024 at 05:18:04PM -0700, Kees Cook wrote:
> On Thu, Jul 04, 2024 at 09:01:34PM +0200, Mickaël Salaün wrote:
> > Such a secure environment can be achieved with an appropriate access
> > control policy (e.g. mount's noexec option, file access rights, LSM
> > configuration) and an enlightened ld.so checking that libraries are
> > allowed for execution e.g., to protect against illegitimate use of
> > LD_PRELOAD.
> > 
> > Scripts may need some changes to deal with untrusted data (e.g. stdin,
> > environment variables), but that is outside the scope of the kernel.
> 
> If the threat model includes an attacker sitting at a shell prompt, we
> need to be very careful about how processes perform enforcement. E.g. even
> on a locked down system, if an attacker has access to LD_PRELOAD or a

LD_PRELOAD should be OK once ld.so is patched to check the
libraries.  We can still imagine a debug library used to bypass security
checks, but in this case the issue would be that this library is
executable in the first place.

> seccomp wrapper (which you both mention here), it would be possible to
> run commands where the resulting process is tricked into thinking it
> doesn't have the bits set.

As explained in the UAPI comments, all parent processes need to be
trusted.  This means that their code is trusted, their seccomp filters
are trusted, and that they are patched, if needed, to check file
executability.

> 
> But this would be exactly true for calling execveat(): LD_PRELOAD or
> seccomp policy could have it just return 0.

If an attacker is allowed/able to load an arbitrary seccomp filter on a
process, we cannot trust this process.

> 
> While I like AT_CHECK, I do wonder if it's better to do the checks via
> open(), as was originally designed with O_MAYEXEC. Because then
> enforcement is gated by the kernel -- the process does not get a file
> descriptor _at all_, no matter what LD_PRELOAD or seccomp tricks it into
> doing.

Being able to check a path name or a file descriptor (with the same
syscall) is more flexible and covers more use cases.  The execveat(2)
interface, including current and future flags, is dedicated to file
execution.  I then think that using execveat(2) for this kind of check
makes more sense, and will easily evolve with this syscall.

> 
> And this thinking also applies to faccessat() too: if a process can be
> tricked into thinking the access check passed, it'll happily interpret
> whatever. :( But not being able to open the fd _at all_ when O_MAYEXEC
> is being checked seems substantially safer to me...

If attackers can filter execveat(2), they can also filter open(2) and
any other syscalls.  In all cases, that would mean an issue in the
security policy.
Kees Cook July 5, 2024, 9:44 p.m. UTC | #3
On Fri, Jul 05, 2024 at 07:54:16PM +0200, Mickaël Salaün wrote:
> On Thu, Jul 04, 2024 at 05:18:04PM -0700, Kees Cook wrote:
> > On Thu, Jul 04, 2024 at 09:01:34PM +0200, Mickaël Salaün wrote:
> > > Such a secure environment can be achieved with an appropriate access
> > > control policy (e.g. mount's noexec option, file access rights, LSM
> > > configuration) and an enlightened ld.so checking that libraries are
> > > allowed for execution e.g., to protect against illegitimate use of
> > > LD_PRELOAD.
> > > 
> > > Scripts may need some changes to deal with untrusted data (e.g. stdin,
> > > environment variables), but that is outside the scope of the kernel.
> > 
> > If the threat model includes an attacker sitting at a shell prompt, we
> > need to be very careful about how processes perform enforcement. E.g. even
> > on a locked down system, if an attacker has access to LD_PRELOAD or a
> 
> LD_PRELOAD should be OK once ld.so is patched to check the
> libraries.  We can still imagine a debug library used to bypass security
> checks, but in this case the issue would be that this library is
> executable in the first place.

Ah yes, that's fair: the shell would discover the malicious library
while using AT_CHECK during resolution of the LD_PRELOAD.

> > seccomp wrapper (which you both mention here), it would be possible to
> > run commands where the resulting process is tricked into thinking it
> > doesn't have the bits set.
> 
> As explained in the UAPI comments, all parent processes need to be
> trusted.  This means that their code is trusted, their seccomp filters
> are trusted, and that they are patched, if needed, to check file
> executability.

But we have launchers that apply arbitrary seccomp policy, e.g. minijail
on Chrome OS, or even systemd on regular distros. In theory, this should
be handled via other ACLs.

> > But this would be exactly true for calling execveat(): LD_PRELOAD or
> > seccomp policy could have it just return 0.
> 
> If an attacker is allowed/able to load an arbitrary seccomp filter on a
> process, we cannot trust this process.
> 
> > 
> > While I like AT_CHECK, I do wonder if it's better to do the checks via
> > open(), as was originally designed with O_MAYEXEC. Because then
> > enforcement is gated by the kernel -- the process does not get a file
> > descriptor _at all_, no matter what LD_PRELOAD or seccomp tricks it into
> > doing.
> 
> Being able to check a path name or a file descriptor (with the same
> syscall) is more flexible and covers more use cases.

If flexibility costs us reliability, I think that flexibility is not
a benefit.

> The execveat(2)
> interface, including current and future flags, is dedicated to file
> execution.  I then think that using execveat(2) for this kind of check
> makes more sense, and will easily evolve with this syscall.

Yeah, I do recognize that it feels much more natural, but I remain
unhappy about how difficult it will become to audit a system for safety
when the check is strictly per-process opt-in, and not enforced by the
kernel for a given process tree. But, I think this may have always been
a fiction in my mind. :)

> > And this thinking also applies to faccessat() too: if a process can be
> > tricked into thinking the access check passed, it'll happily interpret
> > whatever. :( But not being able to open the fd _at all_ when O_MAYEXEC
> > is being checked seems substantially safer to me...
> 
> If attackers can filter execveat(2), they can also filter open(2) and
> any other syscalls.  In all cases, that would mean an issue in the
> security policy.

Hm, as in, make a separate call to open(2) without O_MAYEXEC, and pass
that fd back to the filtered open(2) that did have O_MAYEXEC. Yes, true.

I guess it does become morally equivalent.

Okay. Well, let me ask about usability. Right now, a process will need
to do:

- should I use AT_CHECK? (check secbit)
- if yes: perform execveat(AT_CHECK)

Why not leave the secbit test up to the kernel, and then the program can
just unconditionally call execveat(AT_CHECK)?

Though perhaps the issue here is that an execveat() EINVAL doesn't
tell the program if AT_CHECK is unimplemented or if something else
went wrong, and the secbit prctl() will give the correct signal about
AT_CHECK availability?
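
The two-step flow listed above (test the secbit, then perform the check)
can be sketched as follows.  This is only an illustration: the exec_check_fn
callback is a hypothetical stand-in for the real execveat(2) + AT_CHECK
call, whose flag value is defined by the previous patch and is not
reproduced here, and the bit position is the one proposed by this RFC.
What an interpreter does with a failed check (log it or refuse to run) is
left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Bit position as proposed by this RFC (assumed, not in released headers). */
#ifndef SECBIT_SHOULD_EXEC_CHECK
#define SECBIT_SHOULD_EXEC_CHECK	(1UL << 8)
#endif

enum check_result { CHECK_SKIPPED, CHECK_PASSED, CHECK_FAILED };

/*
 * Hypothetical callback standing in for execveat(2) with AT_CHECK on an
 * already opened file descriptor.  Returns 0 when execution is allowed.
 */
typedef int (*exec_check_fn)(int fd);

static inline enum check_result check_script(unsigned long secbits, int fd,
					     exec_check_fn check)
{
	assert(check != NULL);

	/* Step 1: should I use AT_CHECK at all? */
	if (!(secbits & SECBIT_SHOULD_EXEC_CHECK))
		return CHECK_SKIPPED;

	/* Step 2: perform the check and report its outcome. */
	return check(fd) == 0 ? CHECK_PASSED : CHECK_FAILED;
}

/* Stubs for illustration only; a real caller would wrap the syscall. */
static inline int stub_allow(int fd) { (void)fd; return 0; }
static inline int stub_deny(int fd) { (void)fd; return -1; }
```

Keeping the secbit test in user space means the callback is never invoked
when the environment has not asked for checks, which matches the opt-in
model discussed in this thread.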
Jarkko Sakkinen July 5, 2024, 10:22 p.m. UTC | #4
On Sat Jul 6, 2024 at 12:44 AM EEST, Kees Cook wrote:
> > As explained in the UAPI comments, all parent processes need to be
> > trusted.  This means that their code is trusted, their seccomp filters
> > are trusted, and that they are patched, if needed, to check file
> > executability.
>
> But we have launchers that apply arbitrary seccomp policy, e.g. minijail
> on Chrome OS, or even systemd on regular distros. In theory, this should
> be handled via other ACLs.

Or a regular web browser? AFAIK seccomp filtering was the tool to make
secure browser tabs in the first place.

BR, Jarkko
Mickaël Salaün July 6, 2024, 2:56 p.m. UTC | #5
On Fri, Jul 05, 2024 at 02:44:03PM -0700, Kees Cook wrote:
> On Fri, Jul 05, 2024 at 07:54:16PM +0200, Mickaël Salaün wrote:
> > On Thu, Jul 04, 2024 at 05:18:04PM -0700, Kees Cook wrote:
> > > On Thu, Jul 04, 2024 at 09:01:34PM +0200, Mickaël Salaün wrote:
> > > > Such a secure environment can be achieved with an appropriate access
> > > > control policy (e.g. mount's noexec option, file access rights, LSM
> > > > configuration) and an enlightened ld.so checking that libraries are
> > > > allowed for execution e.g., to protect against illegitimate use of
> > > > LD_PRELOAD.
> > > > 
> > > > Scripts may need some changes to deal with untrusted data (e.g. stdin,
> > > > environment variables), but that is outside the scope of the kernel.
> > > 
> > > If the threat model includes an attacker sitting at a shell prompt, we
> > > need to be very careful about how processes perform enforcement. E.g. even
> > > on a locked down system, if an attacker has access to LD_PRELOAD or a
> > 
> > LD_PRELOAD should be OK once ld.so is patched to check the
> > libraries.  We can still imagine a debug library used to bypass security
> > checks, but in this case the issue would be that this library is
> > executable in the first place.
> 
> Ah yes, that's fair: the shell would discover the malicious library
> while using AT_CHECK during resolution of the LD_PRELOAD.

That's the idea, but it would be checked by ld.so, not the shell.

> 
> > > seccomp wrapper (which you both mention here), it would be possible to
> > > run commands where the resulting process is tricked into thinking it
> > > doesn't have the bits set.
> > 
> > As explained in the UAPI comments, all parent processes need to be
> > trusted.  This means that their code is trusted, their seccomp filters
> > are trusted, and that they are patched, if needed, to check file
> > executability.
> 
> But we have launchers that apply arbitrary seccomp policy, e.g. minijail
> on Chrome OS, or even systemd on regular distros. In theory, this should
> be handled via other ACLs.

Processes running with an untrusted seccomp filter should be considered
untrusted.  It would then make sense for these seccomp filters/programs
to be considered executable code, and then for minijail and systemd to
check them with AT_CHECK (according to the securebits policy).

> 
> > > But this would be exactly true for calling execveat(): LD_PRELOAD or
> > > seccomp policy could have it just return 0.
> > 
> > If an attacker is allowed/able to load an arbitrary seccomp filter on a
> > process, we cannot trust this process.
> > 
> > > 
> > > While I like AT_CHECK, I do wonder if it's better to do the checks via
> > > open(), as was originally designed with O_MAYEXEC. Because then
> > > enforcement is gated by the kernel -- the process does not get a file
> > > descriptor _at all_, no matter what LD_PRELOAD or seccomp tricks it into
> > > doing.
> > 
> > Being able to check a path name or a file descriptor (with the same
> > syscall) is more flexible and covers more use cases.
> 
> If flexibility costs us reliability, I think that flexibility is not
> a benefit.

Well, it's a matter of letting user space do what they think is best,
and I think there are legitimate and safe uses of path names, even if I
agree that this should not be used in most use cases.  Would we want
faccessat2(2) to only take a file descriptor as argument and not a file
path? I don't think so but I'd defer to the VFS maintainers.

Christian, Al, Linus?

Steve, could you share a use case with file paths?

> 
> > The execveat(2)
> > interface, including current and future flags, is dedicated to file
> > execution.  I then think that using execveat(2) for this kind of check
> > makes more sense, and will easily evolve with this syscall.
> 
> Yeah, I do recognize that it feels much more natural, but I remain
> unhappy about how difficult it will become to audit a system for safety
> when the check is strictly per-process opt-in, and not enforced by the
> kernel for a given process tree. But, I think this may have always been
> a fiction in my mind. :)

Hmm, I'm not sure I follow. Securebits are inherited, so they do apply
to a process tree.  And we need the parent processes to be trusted anyway.

> 
> > > And this thinking also applies to faccessat() too: if a process can be
> > > tricked into thinking the access check passed, it'll happily interpret
> > > whatever. :( But not being able to open the fd _at all_ when O_MAYEXEC
> > > is being checked seems substantially safer to me...
> > 
> > If attackers can filter execveat(2), they can also filter open(2) and
> > any other syscalls.  In all cases, that would mean an issue in the
> > security policy.
> 
> Hm, as in, make a separate call to open(2) without O_MAYEXEC, and pass
> that fd back to the filtered open(2) that did have O_MAYEXEC. Yes, true.
> 
> I guess it does become morally equivalent.
> 
> Okay. Well, let me ask about usability. Right now, a process will need
> to do:
> 
> - should I use AT_CHECK? (check secbit)
> - if yes: perform execveat(AT_CHECK)
> 
> Why not leave the secbit test up to the kernel, and then the program can
> just unconditionally call execveat(AT_CHECK)?

That was kind of the approach of the previous patch series and Linus
wanted the new interface to follow the kernel semantics.  Enforcing this
kind of restriction will always be the duty of user space anyway, so I
think it's simpler (i.e. no mix of policy definition, access check, and
policy enforcement, but a standalone execveat feature), more flexible,
and it fully delegates the policy enforcement to user space instead of
trying to enforce some part in the kernel which would only give the
illusion of security/policy enforcement.

> 
> Though perhaps the issue here is that an execveat() EINVAL doesn't
> tell the program if AT_CHECK is unimplemented or if something else
> went wrong, and the secbit prctl() will give the correct signal about
> AT_CHECK availability?

This kind of check could indeed help to identify the issue.
Mickaël Salaün July 6, 2024, 2:56 p.m. UTC | #6
On Sat, Jul 06, 2024 at 01:22:06AM +0300, Jarkko Sakkinen wrote:
> On Sat Jul 6, 2024 at 12:44 AM EEST, Kees Cook wrote:
> > > As explained in the UAPI comments, all parent processes need to be
> > > trusted.  This means that their code is trusted, their seccomp filters
> > > are trusted, and that they are patched, if needed, to check file
> > > executability.
> >
> > But we have launchers that apply arbitrary seccomp policy, e.g. minijail
> > on Chrome OS, or even systemd on regular distros. In theory, this should
> > be handled via other ACLs.
> 
> Or a regular web browser? AFAIK seccomp filtering was the tool to make
> secure browser tabs in the first place.

Yes, and that's OK.  Web browsers embed their own seccomp filters and
they are then as trusted as the browser code.
Jarkko Sakkinen July 6, 2024, 5:28 p.m. UTC | #7
On Sat Jul 6, 2024 at 5:56 PM EEST, Mickaël Salaün wrote:
> On Sat, Jul 06, 2024 at 01:22:06AM +0300, Jarkko Sakkinen wrote:
> > On Sat Jul 6, 2024 at 12:44 AM EEST, Kees Cook wrote:
> > > > As explained in the UAPI comments, all parent processes need to be
> > > > trusted.  This means that their code is trusted, their seccomp filters
> > > > are trusted, and that they are patched, if needed, to check file
> > > > executability.
> > >
> > > But we have launchers that apply arbitrary seccomp policy, e.g. minijail
> > > on Chrome OS, or even systemd on regular distros. In theory, this should
> > > be handled via other ACLs.
> > 
> > Or a regular web browser? AFAIK seccomp filtering was the tool to make
> > secure browser tabs in the first place.
>
> Yes, and that's OK.  Web browsers embed their own seccomp filters and
> they are then as trusted as the browser code.

I'd recommend slicing the technical detail off the cover letter, as long
as those details are in the commit messages.

Then, in the cover letter I'd go through maybe two familiar scenarios,
showing how they interact with this functionality.

A desktop web browser could be perhaps one of them...

BR, Jarkko
Patch

diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h
index d6d98877ff1a..3fdb0382718b 100644
--- a/include/uapi/linux/securebits.h
+++ b/include/uapi/linux/securebits.h
@@ -52,10 +52,64 @@ 
 #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \
 			(issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED))
 
+/*
+ * When SECBIT_SHOULD_EXEC_CHECK is set, a process should check all executable
+ * files with execveat(2) + AT_CHECK.  However, such check should only be
+ * performed if all to-be-executed code only comes from regular files.  For
+ * instance, if a script interpreter is called with both a script snippet as
+ * argument and a regular file, the interpreter should not check any file.
+ * Doing otherwise would mislead the kernel to think that only the script file
+ * is being executed, which could for instance lead to unexpected permission
+ * change and break current use cases.
+ *
+ * This secure bit may be set by user session managers, service managers,
+ * container runtimes, sandboxer tools...  Except for test environments, the
+ * related SECBIT_SHOULD_EXEC_CHECK_LOCKED bit should also be set.
+ *
+ * Ptracing another process is denied if the tracer has SECBIT_SHOULD_EXEC_CHECK
+ * but not the tracee.  SECBIT_SHOULD_EXEC_CHECK_LOCKED is also checked.
+ */
+#define SECURE_SHOULD_EXEC_CHECK		8
+#define SECURE_SHOULD_EXEC_CHECK_LOCKED		9  /* make bit-8 immutable */
+
+#define SECBIT_SHOULD_EXEC_CHECK (issecure_mask(SECURE_SHOULD_EXEC_CHECK))
+#define SECBIT_SHOULD_EXEC_CHECK_LOCKED \
+			(issecure_mask(SECURE_SHOULD_EXEC_CHECK_LOCKED))
+
+/*
+ * When SECBIT_SHOULD_EXEC_RESTRICT is set, a process should only allow
+ * execution of approved files, if any (see SECBIT_SHOULD_EXEC_CHECK).  For
+ * instance, script interpreters called with a script snippet as argument
+ * should always deny such execution if SECBIT_SHOULD_EXEC_RESTRICT is set.
+ * However, if a script interpreter is called with both
+ * SECBIT_SHOULD_EXEC_CHECK and SECBIT_SHOULD_EXEC_RESTRICT, they should
+ * interpret the provided script files if no unchecked code is also provided
+ * (e.g. directly as argument).
+ *
+ * This secure bit may be set by user session managers, service managers,
+ * container runtimes, sandboxer tools...  Except for test environments, the
+ * related SECBIT_SHOULD_EXEC_RESTRICT_LOCKED bit should also be set.
+ *
+ * Ptracing another process is denied if the tracer has
+ * SECBIT_SHOULD_EXEC_RESTRICT but not the tracee.
+ * SECBIT_SHOULD_EXEC_RESTRICT_LOCKED is also checked.
+ */
+#define SECURE_SHOULD_EXEC_RESTRICT		10
+#define SECURE_SHOULD_EXEC_RESTRICT_LOCKED	11  /* make bit-10 immutable */
+
+#define SECBIT_SHOULD_EXEC_RESTRICT (issecure_mask(SECURE_SHOULD_EXEC_RESTRICT))
+#define SECBIT_SHOULD_EXEC_RESTRICT_LOCKED \
+			(issecure_mask(SECURE_SHOULD_EXEC_RESTRICT_LOCKED))
+
 #define SECURE_ALL_BITS		(issecure_mask(SECURE_NOROOT) | \
 				 issecure_mask(SECURE_NO_SETUID_FIXUP) | \
 				 issecure_mask(SECURE_KEEP_CAPS) | \
-				 issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE))
+				 issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \
+				 issecure_mask(SECURE_SHOULD_EXEC_CHECK) | \
+				 issecure_mask(SECURE_SHOULD_EXEC_RESTRICT))
 #define SECURE_ALL_LOCKS	(SECURE_ALL_BITS << 1)
 
+#define SECURE_ALL_UNPRIVILEGED (issecure_mask(SECURE_SHOULD_EXEC_CHECK) | \
+				 issecure_mask(SECURE_SHOULD_EXEC_RESTRICT))
+
 #endif /* _UAPI_LINUX_SECUREBITS_H */
diff --git a/security/commoncap.c b/security/commoncap.c
index 162d96b3a676..34b4493e2a69 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -117,6 +117,33 @@  int cap_settime(const struct timespec64 *ts, const struct timezone *tz)
 	return 0;
 }
 
+static bool ptrace_secbits_allowed(const struct cred *tracer,
+				   const struct cred *tracee)
+{
+	const unsigned long tracer_secbits = SECURE_ALL_UNPRIVILEGED &
+					     tracer->securebits;
+	const unsigned long tracee_secbits = SECURE_ALL_UNPRIVILEGED &
+					     tracee->securebits;
+	/* Ignores locking of unset secure bits (cf. SECURE_ALL_LOCKS). */
+	const unsigned long tracer_locked = (tracer_secbits << 1) &
+					    tracer->securebits;
+	const unsigned long tracee_locked = (tracee_secbits << 1) &
+					    tracee->securebits;
+
+	/* The tracee must not have fewer constraints than the tracer. */
+	if ((tracer_secbits | tracee_secbits) != tracee_secbits)
+		return false;
+
+	/*
+	 * Makes sure that the tracer's locks for restrictions are the same for
+	 * the tracee.
+	 */
+	if ((tracer_locked | tracee_locked) != tracee_locked)
+		return false;
+
+	return true;
+}
+
 /**
  * cap_ptrace_access_check - Determine whether the current process may access
  *			   another
@@ -146,7 +173,8 @@  int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
 	else
 		caller_caps = &cred->cap_permitted;
 	if (cred->user_ns == child_cred->user_ns &&
-	    cap_issubset(child_cred->cap_permitted, *caller_caps))
+	    cap_issubset(child_cred->cap_permitted, *caller_caps) &&
+	    ptrace_secbits_allowed(cred, child_cred))
 		goto out;
 	if (ns_capable(child_cred->user_ns, CAP_SYS_PTRACE))
 		goto out;
@@ -178,7 +206,8 @@  int cap_ptrace_traceme(struct task_struct *parent)
 	cred = __task_cred(parent);
 	child_cred = current_cred();
 	if (cred->user_ns == child_cred->user_ns &&
-	    cap_issubset(child_cred->cap_permitted, cred->cap_permitted))
+	    cap_issubset(child_cred->cap_permitted, cred->cap_permitted) &&
+	    ptrace_secbits_allowed(cred, child_cred))
 		goto out;
 	if (has_ns_capability(parent, child_cred->user_ns, CAP_SYS_PTRACE))
 		goto out;
@@ -1302,21 +1331,39 @@  int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 		     & (old->securebits ^ arg2))			/*[1]*/
 		    || ((old->securebits & SECURE_ALL_LOCKS & ~arg2))	/*[2]*/
 		    || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))	/*[3]*/
-		    || (cap_capable(current_cred(),
-				    current_cred()->user_ns,
-				    CAP_SETPCAP,
-				    CAP_OPT_NONE) != 0)			/*[4]*/
 			/*
 			 * [1] no changing of bits that are locked
 			 * [2] no unlocking of locks
 			 * [3] no setting of unsupported bits
-			 * [4] doing anything requires privilege (go read about
-			 *     the "sendmail capabilities bug")
 			 */
 		    )
 			/* cannot change a locked bit */
 			return -EPERM;
 
+		/*
+		 * Doing anything requires privilege (go read about the
+		 * "sendmail capabilities bug"), except for unprivileged bits.
+		 * Indeed, the SECURE_ALL_UNPRIVILEGED bits are not
+		 * restrictions enforced by the kernel but by user space on
+		 * itself.  The kernel is only in charge of protecting against
+		 * privilege escalation with ptrace protections.
+		 */
+		if (cap_capable(current_cred(), current_cred()->user_ns,
+				CAP_SETPCAP, CAP_OPT_NONE) != 0) {
+			const unsigned long unpriv_and_locks =
+				SECURE_ALL_UNPRIVILEGED |
+				SECURE_ALL_UNPRIVILEGED << 1;
+			const unsigned long changed = old->securebits ^ arg2;
+
+			/* For legacy reasons, denies non-change. */
+			if (!changed)
+				return -EPERM;
+
+			/* Denies privileged changes. */
+			if (changed & ~unpriv_and_locks)
+				return -EPERM;
+		}
+
 		new = prepare_creds();
 		if (!new)
 			return -ENOMEM;