
[PATCHv4] exec: Fix a deadlock in ptrace

Message ID AM6PR03MB5170B976E6387FDDAD59A118E4E70@AM6PR03MB5170.eurprd03.prod.outlook.com (mailing list archive)
State New, archived
Series: [PATCHv4] exec: Fix a deadlock in ptrace

Commit Message

Bernd Edlinger March 2, 2020, 10:18 p.m. UTC
This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but strace is no
longer able to respond, since it is blocked in vm_access.

The deadlock always happens when strace needs to access the
tracee's process mm, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The proposed solution is to take the cred_guard_mutex only in
critical sections at the beginning and at the end of the execve
function, and to let PTRACE_ATTACH fail with EAGAIN while execve
is not complete, while other functions like vm_access are allowed
to complete normally.

I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst    | 19 +++++----
 fs/exec.c                                 | 28 +++++++++++--
 include/linux/binfmts.h                   |  6 ++-
 include/linux/sched/signal.h              |  1 +
 init/init_task.c                          |  1 +
 kernel/cred.c                             |  2 +-
 kernel/fork.c                             |  1 +
 kernel/ptrace.c                           |  4 ++
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 66 +++++++++++++++++++++++++++++++
 11 files changed, 117 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case.

Comments

Kees Cook March 3, 2020, 2:26 a.m. UTC | #1
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
> 
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
> 
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> strace          D    0 30614  30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> expect          D    0 31933  30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The proposed solution is to take the cred_guard_mutex only
> in a critical section at the beginning, and at the end of the
> execve function, and let PTRACE_ATTACH fail with EAGAIN while
> execve is not complete, but other functions like vm_access are
> allowed to complete normally.

Sorry to be a bummer, but I don't think this will work. A few more things
during the exec process depend on cred_guard_mutex being held.

If I'm reading this patch correctly, this changes the lifetime of the
cred_guard_mutex lock to be:
	- during prepare_bprm_creds()
	- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through
install_exec_creds().

That means, for example, that check_unsafe_exec()'s documented invariant
is violated:
    /*
     * determine how safe it is to execute the proposed program
     * - the caller must hold ->cred_guard_mutex to protect against
     *   PTRACE_ATTACH or seccomp thread-sync
     */
    static void check_unsafe_exec(struct linux_binprm *bprm) ...
which is looking at no_new_privs as well as other details, and making
decisions about the bprm state from the current state.

I think it also means that the potentially multiple invocations
of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
a lock (another place where current's no_new_privs is evaluated).

Related, it also means that cred_guard_mutex is unheld for every
invocation of search_binary_handler() (which can loop via the previously
mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
currently.)

For seccomp, the expectations about existing thread states risk races
too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from
  appearing/disappearing, which would destroy filter refcounting and
  lead to memory corruption.
- cred_guard_mutex is held to keep no_new_privs in sync with filters to
  avoid no_new_privs and filter confusion during exec, which could
  lead to exploitable setuid conditions (see below).

Just racing a malicious thread during TSYNC is not a very strong
example (a malicious thread could do lots of fun things to "current"
before it ever got near calling TSYNC), but I think there is the risk
of mismatched/confused states that we don't want to allow. One is a
particularly bad state that could lead to privilege escalations (in the
form of the old "sendmail doesn't check setuid" flaw; if a setuid process
has a filter attached that silently fails a priv-dropping setuid call
and continues execution with elevated privs, it can be tricked into
doing bad things on behalf of the unprivileged parent, which was the
primary goal of the original use of cred_guard_mutex with TSYNC[1]):

thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen
         by thread A, _and_ nnp set (which doesn't matter)

With the original locking, thread B will fail check_unsafe_exec()
because filter and nnp state are changed together, with "atomicity"
protected by the cred_guard_mutex.

And this is just the bad state I _can_ see. I'm worried there are more...

All this said, I do see a small similarity here to the work I did to
stabilize stack rlimits (there was an ongoing problem with making multiple
decisions for the bprm based on current's state -- but current's state
was mutable during exec). For this, I saved rlim_stack to bprm and ignored
current's copy until exec ended and then stored bprm's copy into current.
If the only problem anyone can see here is the handling of no_new_privs,
we might be able to solve that similarly, at least disentangling tsync/nnp
from cred_guard_mutex.

-Kees

[1] https://lore.kernel.org/lkml/20140625142121.GD7892@redhat.com/
Bernd Edlinger March 3, 2020, 4:54 a.m. UTC | #2
On 3/3/20 3:26 AM, Kees Cook wrote:
> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated.  They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>
>> strace          D    0 30614  30584 0x00000000
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> schedule_preempt_disabled+0x15/0x20
>> __mutex_lock.isra.13+0x1ec/0x520
>> __mutex_lock_killable_slowpath+0x13/0x20
>> mutex_lock_killable+0x28/0x30
>> mm_access+0x27/0xa0
>> process_vm_rw_core.isra.3+0xff/0x550
>> process_vm_rw+0xdd/0xf0
>> __x64_sys_process_vm_readv+0x31/0x40
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> expect          D    0 31933  30876 0x80004003
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> flush_old_exec+0xc4/0x770
>> load_elf_binary+0x35a/0x16c0
>> search_binary_handler+0x97/0x1d0
>> __do_execve_file.isra.40+0x5d4/0x8a0
>> __x64_sys_execve+0x49/0x60
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> The proposed solution is to take the cred_guard_mutex only
>> in a critical section at the beginning, and at the end of the
>> execve function, and let PTRACE_ATTACH fail with EAGAIN while
>> execve is not complete, but other functions like vm_access are
>> allowed to complete normally.
> 
> Sorry to be bummer, but I don't think this will work. A few more things
> during the exec process depend on cred_guard_mutex being held.
> 
> If I'm reading this patch correctly, this changes the lifetime of the
> cred_guard_mutex lock to be:
> 	- during prepare_bprm_creds()
> 	- from flush_old_exec() through install_exec_creds()
> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> install_exec_creds().
> 
> That means, for example, that check_unsafe_exec()'s documented invariant
> is violated:
>     /*
>      * determine how safe it is to execute the proposed program
>      * - the caller must hold ->cred_guard_mutex to protect against
>      *   PTRACE_ATTACH or seccomp thread-sync
>      */

Oh, right, I haven't understood that hint...

>     static void check_unsafe_exec(struct linux_binprm *bprm) ...
> which is looking at no_new_privs as well as other details, and making
> decisions about the bprm state from the current state.
> 
> I think it also means that the potentially multiple invocations
> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> a lock (another place where current's no_new_privs is evaluated).

So no_new_privs can change from 0->1, but should not while
execve is running.

As long as the calling thread is in execve it won't do this, and
the only other place where it may be set for other threads is in
seccomp_sync_threads, but that can easily be avoided; see below.

> 
> Related, it also means that cred_guard_mutex is unheld for every
> invocation of search_binary_handler() (which can loop via the previously
> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid()
> currently.)
> 
> For seccomp, the expectations about existing thread states risks races
> too. There are two locks held for TSYNC:
> - current->sighand->siglock is held to keep new threads from
>   appearing/disappearing, which would destroy filter refcounting and
>   lead to memory corruption.

I don't understand what you mean here.
How can this lead to memory corruption?

> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
>   avoid no_new_privs and filter confusion during exec, which could
>   lead to exploitable setuid conditions (see below).
> 
> Just racing a malicious thread during TSYNC is not a very strong
> example (a malicious thread could do lots of fun things to "current"
> before it ever got near calling TSYNC), but I think there is the risk
> of mismatched/confused states that we don't want to allow. One is a
> particularly bad state that could lead to privilege escalations (in the
> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> has a filter attached that silently fails a priv-dropping setuid call
> and continues execution with elevated privs, it can be tricked into
> doing bad things on behalf of the unprivileged parent, which was the
> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> 
> thread A clones thread B
> thread B starts setuid exec
> thread A sets no_new_privs
> thread A calls seccomp with TSYNC
> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> thread B passes check_unsafe_exec() with no_new_privs unset
> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> thread B finishes exec, now running with elevated privs, a filter chosen
>          by thread A, _and_ nnp set (which doesn't matter)
> 
> With the original locking, thread B will fail check_unsafe_exec()
> because filter and nnp state are changed together, with "atomicity"
> protected by the cred_guard_mutex.
> 

Ah, good point, thanks!

This can be fixed by checking current->signal->cred_locked_for_ptrace
while the cred_guard_mutex is locked, like this for instance:

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
        BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
        assert_spin_locked(&current->sighand->siglock);
 
+       if (current->signal->cred_locked_for_ptrace)
+               return -EAGAIN;
+
        /* Validate all threads being eligible for synchronization. */
        caller = current;
        for_each_thread(caller, thread) {


> And this is just the bad state I _can_ see. I'm worried there are more...
> 
> All this said, I do see a small similarity here to the work I did to
> stabilize stack rlimits (there was an ongoing problem with making multiple
> decisions for the bprm based on current's state -- but current's state
> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> current's copy until exec ended and then stored bprm's copy into current.
> If the only problem anyone can see here is the handling of no_new_privs,
> we might be able to solve that similarly, at least disentangling tsync/nnp
> from cred_guard_mutex.
> 

I still think that is solvable by using cred_locked_for_ptrace and
simply making TSYNC fail if it would otherwise be blocked.


Thanks
Bernd.
Kees Cook March 3, 2020, 5:29 a.m. UTC | #3
On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> On 3/3/20 3:26 AM, Kees Cook wrote:
> > On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > > [...]
> >
> > If I'm reading this patch correctly, this changes the lifetime of the
> > cred_guard_mutex lock to be:
> > 	- during prepare_bprm_creds()
> > 	- from flush_old_exec() through install_exec_creds()
> > Before, cred_guard_mutex was held from prepare_bprm_creds() through
> > install_exec_creds().

BTW, I think the effect of this change (i.e. my paragraph above) should
be distinctly called out in the commit log if this solution moves
forward.

> > That means, for example, that check_unsafe_exec()'s documented invariant
> > is violated:
> >     /*
> >      * determine how safe it is to execute the proposed program
> >      * - the caller must hold ->cred_guard_mutex to protect against
> >      *   PTRACE_ATTACH or seccomp thread-sync
> >      */
> 
> Oh, right, I haven't understood that hint...

I know no_new_privs is checked there, but I haven't studied the
PTRACE_ATTACH part of that comment. If that is handled with the new
check, this comment should be updated.

> > I think it also means that the potentially multiple invocations
> > of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> > binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> > a lock (another place where current's no_new_privs is evaluated).
> 
> So no_new_privs can change from 0->1, but should not
> when execve is running.
> 
> As long as the calling thread is in execve it won't do this,
> and the only other place, where it may set for other threads
> is in seccomp_sync_threads, but that can easily be avoided see below.

Yeah, everything was fine until I had to go complicate things with
TSYNC. ;) The real goal is making sure an exec cannot gain privs while
later gaining a seccomp filter from an unpriv process. The no_new_privs
flag was used to control this, but it required that the filter not get
applied during exec.

> > Related, it also means that cred_guard_mutex is unheld for every
> > invocation of search_binary_handler() (which can loop via the previously
> > mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> > dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid()
> > currently.)
> > 
> > For seccomp, the expectations about existing thread states risks races
> > too. There are two locks held for TSYNC:
> > - current->sighand->siglock is held to keep new threads from
> >   appearing/disappearing, which would destroy filter refcounting and
> >   lead to memory corruption.
> 
> I don't understand what you mean here.
> How can this lead to memory corruption?

Mainly this is a matter of how seccomp manages its filter hierarchy
(since the filters are shared through process ancestry), so if a thread
appears in the middle of TSYNC it may be racing another TSYNC and break
ancestry, leading to bad reference counting on process death, etc.
(Though, yes, with refcount_t now, things should never corrupt, just
waste memory.)

> > - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >   avoid no_new_privs and filter confusion during exec, which could
> >   lead to exploitable setuid conditions (see below).
> > 
> > Just racing a malicious thread during TSYNC is not a very strong
> > example (a malicious thread could do lots of fun things to "current"
> > before it ever got near calling TSYNC), but I think there is the risk
> > of mismatched/confused states that we don't want to allow. One is a
> > particularly bad state that could lead to privilege escalations (in the
> > form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> > has a filter attached that silently fails a priv-dropping setuid call
> > and continues execution with elevated privs, it can be tricked into
> > doing bad things on behalf of the unprivileged parent, which was the
> > primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> > 
> > thread A clones thread B
> > thread B starts setuid exec
> > thread A sets no_new_privs
> > thread A calls seccomp with TSYNC
> > thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> > thread B passes check_unsafe_exec() with no_new_privs unset
> > thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> > thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> > thread B finishes exec, now running with elevated privs, a filter chosen
> >          by thread A, _and_ nnp set (which doesn't matter)
> > 
> > With the original locking, thread B will fail check_unsafe_exec()
> > because filter and nnp state are changed together, with "atomicity"
> > protected by the cred_guard_mutex.
> > 
> 
> Ah, good point, thanks!
> 
> This can be fixed by checking current->signal->cred_locked_for_ptrace
> while the cred_guard_mutex is locked, like this for instance:
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b6ea3dc..377abf0 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
>         assert_spin_locked(&current->sighand->siglock);
>  
> +       if (current->signal->cred_locked_for_ptrace)
> +               return -EAGAIN;
> +

Hmm. I guess something like that could work. TSYNC expects to be able to
report _which_ thread wrecked the call, though... I wonder if in_execve
could be used to figure out the offending thread. Hm, nope, that would
be outside the lock too (and all users are "current" right now, so the
lock wasn't needed before).

>         /* Validate all threads being eligible for synchronization. */
>         caller = current;
>         for_each_thread(caller, thread) {
> 
> 
> > And this is just the bad state I _can_ see. I'm worried there are more...
> > 
> > All this said, I do see a small similarity here to the work I did to
> > stabilize stack rlimits (there was an ongoing problem with making multiple
> > decisions for the bprm based on current's state -- but current's state
> > was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> > current's copy until exec ended and then stored bprm's copy into current.
> > If the only problem anyone can see here is the handling of no_new_privs,
> > we might be able to solve that similarly, at least disentangling tsync/nnp
> > from cred_guard_mutex.
> > 
> 
> I still think that is solvable with using cred_locked_for_ptrace and
> simply make the tsync fail if it would otherwise be blocked.

I wonder if we can find a better name than "cred_locked_for_ptrace"?
Maybe "cred_unfinished" or "cred_locked_in_exec" or something?

And the comment on bool cred_locked_for_ptrace should mention that
access is only allowed under cred_guard_mutex lock.

> > > +	sig->cred_locked_for_ptrace = false;

This is redundant with the zalloc -- I think you can drop it (unless
someone wants to keep it for clarity?)

Also, I think cred_locked_for_ptrace needs checking deeper, in
__ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
be sufficient to see a stable version of the thread...

(I remain very nervous about weakening cred_guard_mutex without
addressing the many many users...)
Bernd Edlinger March 3, 2020, 8:08 a.m. UTC | #4
On 3/3/20 6:29 AM, Kees Cook wrote:
> On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
>> On 3/3/20 3:26 AM, Kees Cook wrote:
>>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>>>> [...]
>>>
>>> If I'm reading this patch correctly, this changes the lifetime of the
>>> cred_guard_mutex lock to be:
>>> 	- during prepare_bprm_creds()
>>> 	- from flush_old_exec() through install_exec_creds()
>>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
>>> install_exec_creds().
> 
> BTW, I think the effect of this change (i.e. my paragraph above) should
> be distinctly called out in the commit log if this solution moves
> forward.
> 

Okay, will do.

>>> That means, for example, that check_unsafe_exec()'s documented invariant
>>> is violated:
>>>     /*
>>>      * determine how safe it is to execute the proposed program
>>>      * - the caller must hold ->cred_guard_mutex to protect against
>>>      *   PTRACE_ATTACH or seccomp thread-sync
>>>      */
>>
>> Oh, right, I haven't understood that hint...
> 
> I know no_new_privs is checked there, but I haven't studied the
> PTRACE_ATTACH part of that comment. If that is handled with the new
> check, this comment should be updated.
> 

Okay, I change that comment to:

/*
 * determine how safe it is to execute the proposed program
 * - the caller must have set ->cred_locked_in_execve to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */

>>> I think it also means that the potentially multiple invocations
>>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
>>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
>>> a lock (another place where current's no_new_privs is evaluated).
>>
>> So no_new_privs can change from 0->1, but should not
>> when execve is running.
>>
>> As long as the calling thread is in execve it won't do this,
>> and the only other place, where it may set for other threads
>> is in seccomp_sync_threads, but that can easily be avoided see below.
> 
> Yeah, everything was fine until I had to go complicate things with
> TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> later gaining a seccomp filter from an unpriv process. The no_new_privs
> flag was used to control this, but it required that the filter not get
> applied during exec.
> 
>>> Related, it also means that cred_guard_mutex is unheld for every
>>> invocation of search_binary_handler() (which can loop via the previously
>>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
>>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid()
>>> currently.)
>>>
>>> For seccomp, the expectations about existing thread states risks races
>>> too. There are two locks held for TSYNC:
>>> - current->sighand->siglock is held to keep new threads from
>>>   appearing/disappearing, which would destroy filter refcounting and
>>>   lead to memory corruption.
>>
>> I don't understand what you mean here.
>> How can this lead to memory corruption?
> 
> Mainly this is a matter of how seccomp manages its filter hierarchy
> (since the filters are shared through process ancestry), so if a thread
> appears in the middle of TSYNC it may be racing another TSYNC and break
> ancestry, leading to bad reference counting on process death, etc.
> (Though, yes, with refcount_t now, things should never corrupt, just
> waste memory.)
> 

I assume for now that holding current->sighand->siglock while iterating
all threads is sufficient here.

>>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
>>>   avoid no_new_privs and filter confusion during exec, which could
>>>   lead to exploitable setuid conditions (see below).
>>>
>>> Just racing a malicious thread during TSYNC is not a very strong
>>> example (a malicious thread could do lots of fun things to "current"
>>> before it ever got near calling TSYNC), but I think there is the risk
>>> of mismatched/confused states that we don't want to allow. One is a
>>> particularly bad state that could lead to privilege escalations (in the
>>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
>>> has a filter attached that silently fails a priv-dropping setuid call
>>> and continues execution with elevated privs, it can be tricked into
>>> doing bad things on behalf of the unprivileged parent, which was the
>>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
>>>
>>> thread A clones thread B
>>> thread B starts setuid exec
>>> thread A sets no_new_privs
>>> thread A calls seccomp with TSYNC
>>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
>>> thread B passes check_unsafe_exec() with no_new_privs unset
>>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
>>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
>>> thread B finishes exec, now running with elevated privs, a filter chosen
>>>          by thread A, _and_ nnp set (which doesn't matter)
>>>
>>> With the original locking, thread B will fail check_unsafe_exec()
>>> because filter and nnp state are changed together, with "atomicity"
>>> protected by the cred_guard_mutex.
>>>
>>
>> Ah, good point, thanks!
>>
>> This can be fixed by checking current->signal->cred_locked_for_ptrace
>> while the cred_guard_mutex is locked, like this for instance:
>>
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index b6ea3dc..377abf0 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
>>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
>>         assert_spin_locked(&current->sighand->siglock);
>>  
>> +       if (current->signal->cred_locked_for_ptrace)
>> +               return -EAGAIN;
>> +
> 
> Hmm. I guess something like that could work. TSYNC expects to be able to
> report _which_ thread wrecked the call, though... I wonder if in_execve
> could be used to figure out the offending thread. Hm, nope, that would
> be outside of lock too (and all users are "current" right now, so the
> lock wasn't needed before).
> 

I could move that in_execve = 1 to prepare_bprm_creds if it really matters,
but the caller will die quickly and cannot do anything with that information
when another thread executes execve, right?

>>         /* Validate all threads being eligible for synchronization. */
>>         caller = current;
>>         for_each_thread(caller, thread) {
>>
>>
>>> And this is just the bad state I _can_ see. I'm worried there are more...
>>>
>>> All this said, I do see a small similarity here to the work I did to
>>> stabilize stack rlimits (there was an ongoing problem with making multiple
>>> decisions for the bprm based on current's state -- but current's state
>>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
>>> current's copy until exec ended and then stored bprm's copy into current.
>>> If the only problem anyone can see here is the handling of no_new_privs,
>>> we might be able to solve that similarly, at least disentangling tsync/nnp
>>> from cred_guard_mutex.
>>>
>>
>> I still think that is solvable with using cred_locked_for_ptrace and
>> simply make the tsync fail if it would otherwise be blocked.
> 
> I wonder if we can find a better name than "cred_locked_for_ptrace"?
> Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> 

Yeah, I'd go with "cred_locked_in_execve".

> And the comment on bool cred_locked_for_ptrace should mention that
> access is only allowed under cred_guard_mutex lock.
> 

okay.

>>>> +	sig->cred_locked_for_ptrace = false;
> 
> This is redundant to the zalloc -- I think you can drop it (unless
> someone wants to keep it for clarify?)
> 

I'll remove that here and in init/init_task.c

> Also, I think cred_locked_for_ptrace needs checking deeper, in
> __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> be sufficient to see a stable version of the thread...
> 

No, these need to be addressed individually.  Most users just want to
know whether the current credentials are sufficient at this moment;
unlike ptrace and TSYNC, they will not change the credentials.

BTW: not all callers hold cred_guard_mutex, see mm/migrate.c,
mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
So adding a check of cred_locked_for_execve to ptrace_may_access is
probably not an option.

However, one nice piece of added value from this change is this:

#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <unistd.h>

void *thread(void *arg)
{
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}

int main(void)
{
	int pid = fork();

	if (!pid) {
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}

	sleep(1000);
	ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	return 0;
}

cat /proc/3812/stack 
[<0>] flush_old_exec+0xbf/0x760
[<0>] load_elf_binary+0x35a/0x16c0
[<0>] search_binary_handler+0x97/0x1d0
[<0>] __do_execve_file.isra.40+0x624/0x920
[<0>] __x64_sys_execve+0x49/0x60
[<0>] do_syscall_64+0x64/0x220
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9


> (I remain very nervous about weakening cred_guard_mutex without
> addressing the many many users...)
> 

They need to be looked at closely, that's pretty clear.
Most fall into the class where just the current credentials need
to stay stable for a certain time.


Bernd.
Christian Brauner March 3, 2020, 8:34 a.m. UTC | #5
On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> On 3/3/20 6:29 AM, Kees Cook wrote:
> > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> >> On 3/3/20 3:26 AM, Kees Cook wrote:
> >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> >>>> [...]
> >>>
> >>> If I'm reading this patch correctly, this changes the lifetime of the
> >>> cred_guard_mutex lock to be:
> >>> 	- during prepare_bprm_creds()
> >>> 	- from flush_old_exec() through install_exec_creds()
> >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> >>> install_exec_creds().
> > 
> > BTW, I think the effect of this change (i.e. my paragraph above) should
> > be distinctly called out in the commit log if this solution moves
> > forward.
> > 
> 
> Okay, will do.
> 
> >>> That means, for example, that check_unsafe_exec()'s documented invariant
> >>> is violated:
> >>>     /*
> >>>      * determine how safe it is to execute the proposed program
> >>>      * - the caller must hold ->cred_guard_mutex to protect against
> >>>      *   PTRACE_ATTACH or seccomp thread-sync
> >>>      */
> >>
> >> Oh, right, I haven't understood that hint...
> > 
> > I know no_new_privs is checked there, but I haven't studied the
> > PTRACE_ATTACH part of that comment. If that is handled with the new
> > check, this comment should be updated.
> > 
> 
> Okay, I change that comment to:
> 
> /*
>  * determine how safe it is to execute the proposed program
>  * - the caller must have set ->cred_locked_in_execve to protect against
>  *   PTRACE_ATTACH or seccomp thread-sync
>  */
> 
> >>> I think it also means that the potentially multiple invocations
> >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> >>> a lock (another place where current's no_new_privs is evaluated).
> >>
> >> So no_new_privs can change from 0->1, but should not
> >> when execve is running.
> >>
> >> As long as the calling thread is in execve it won't do this,
> >> and the only other place, where it may set for other threads
> >> is in seccomp_sync_threads, but that can easily be avoided see below.
> > 
> > Yeah, everything was fine until I had to go complicate things with
> > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > flag was used to control this, but it required that the filter not get
> > applied during exec.
> > 
> >>> Related, it also means that cred_guard_mutex is unheld for every
> >>> invocation of search_binary_handler() (which can loop via the previously
> >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> >>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> >>> currently.)
> >>>
> >>> For seccomp, the expectations about existing thread states risks races
> >>> too. There are two locks held for TSYNC:
> >>> - current->sighand->siglock is held to keep new threads from
> >>>   appearing/disappearing, which would destroy filter refcounting and
> >>>   lead to memory corruption.
> >>
> >> I don't understand what you mean here.
> >> How can this lead to memory corruption?
> > 
> > Mainly this is a matter of how seccomp manages its filter hierarchy
> > (since the filters are shared through process ancestry), so if a thread
> > appears in the middle of TSYNC it may be racing another TSYNC and break
> > ancestry, leading to bad reference counting on process death, etc.
> > (Though, yes, with refcount_t now, things should never corrupt, just
> > waste memory.)
> > 
> 
> I assume for now, that the current->sighand->siglock held while iterating all
> threads is sufficient here.
> 
> >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >>>   avoid no_new_privs and filter confusion during exec, which could
> >>>   lead to exploitable setuid conditions (see below).
> >>>
> >>> Just racing a malicious thread during TSYNC is not a very strong
> >>> example (a malicious thread could do lots of fun things to "current"
> >>> before it ever got near calling TSYNC), but I think there is the risk
> >>> of mismatched/confused states that we don't want to allow. One is a
> >>> particularly bad state that could lead to privilege escalations (in the
> >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> >>> has a filter attached that silently fails a priv-dropping setuid call
> >>> and continues execution with elevated privs, it can be tricked into
> >>> doing bad things on behalf of the unprivileged parent, which was the
> >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> >>>
> >>> thread A clones thread B
> >>> thread B starts setuid exec
> >>> thread A sets no_new_privs
> >>> thread A calls seccomp with TSYNC
> >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> >>> thread B passes check_unsafe_exec() with no_new_privs unset
> >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> >>> thread B finishes exec, now running with elevated privs, a filter chosen
> >>>          by thread A, _and_ nnp set (which doesn't matter)
> >>>
> >>> With the original locking, thread B will fail check_unsafe_exec()
> >>> because filter and nnp state are changed together, with "atomicity"
> >>> protected by the cred_guard_mutex.
> >>>
> >>
> >> Ah, good point, thanks!
> >>
> >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> >> while the cred_guard_mutex is locked, like this for instance:
> >>
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index b6ea3dc..377abf0 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> >>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> >>         assert_spin_locked(&current->sighand->siglock);
> >>  
> >> +       if (current->signal->cred_locked_for_ptrace)
> >> +               return -EAGAIN;
> >> +
> > 
> > Hmm. I guess something like that could work. TSYNC expects to be able to
> > report _which_ thread wrecked the call, though... I wonder if in_execve
> > could be used to figure out the offending thread. Hm, nope, that would
> > be outside of lock too (and all users are "current" right now, so the
> > lock wasn't needed before).
> > 
> 
> [...]

I remain rather set on wanting some very basic tests with this change.
Imho, looking through tools/testing/selftests again, we don't have nearly
enough tests for these codepaths, if any at all. Basically, if someone
wants to make a change affecting the current problem, we should really
have at least one simple test/reproducer that can be run without digging
through lore. And hopefully over time we'll add more tests.

Christian
Christian Brauner March 3, 2020, 8:43 a.m. UTC | #6
On Tue, Mar 03, 2020 at 09:34:26AM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> > [...]
> 
> I remain rather set on wanting some very basic tests with this change.
> Imho, looking through tools/testing/selftests again we don't have nearly
> enough for these codepaths; not to say none. Basically, if someone wants
> to make a change affecting the current problem we should really have at
> least a single simple test/reproducer that can be run without digging
> through lore. And hopefully over time we'll have more tests.

Which you added in v4. Which is great! (I should've mentioned this in my
first mail.)
Christian
Christian Brauner March 3, 2020, 8:58 a.m. UTC | #7
On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > [...]
> 
> Sorry to be a bummer, but I don't think this will work. A few more things
> during the exec process depend on cred_guard_mutex being held.
> 
> If I'm reading this patch correctly, this changes the lifetime of the
> cred_guard_mutex lock to be:
> 	- during prepare_bprm_creds()
> 	- from flush_old_exec() through install_exec_creds()
> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> install_exec_creds().
> 
> That means, for example, that check_unsafe_exec()'s documented invariant
> is violated:
>     /*
>      * determine how safe it is to execute the proposed program
>      * - the caller must hold ->cred_guard_mutex to protect against
>      *   PTRACE_ATTACH or seccomp thread-sync
>      */
>     static void check_unsafe_exec(struct linux_binprm *bprm) ...
> which is looking at no_new_privs as well as other details, and making
> decisions about the bprm state from the current state.
> 
> I think it also means that the potentially multiple invocations
> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> a lock (another place where current's no_new_privs is evaluated).
> 
> Related, it also means that cred_guard_mutex is unheld for every
> invocation of search_binary_handler() (which can loop via the previously
> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> currently.)

So one issue I see with having to reacquire the cred_guard_mutex might
be that this would allow tasks holding the cred_guard_mutex to block a
killed exec'ing task from exiting, right?
Bernd Edlinger March 3, 2020, 10:34 a.m. UTC | #8
On 3/3/20 9:58 AM, Christian Brauner wrote:
> On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
>> [...]
> 
> So one issue I see with having to reacquire the cred_guard_mutex might
> be that this would allow tasks holding the cred_guard_mutex to block a
> killed exec'ing task from exiting, right?
> 

Yes, maybe, but I think it will not be worse than it is now.
The second time the mutex is acquired, it is done with
mutex_lock_killable, so at least kill -9 should get it terminated.


Bernd.
Bernd Edlinger March 3, 2020, 11:23 a.m. UTC | #9
On 3/3/20 11:34 AM, Bernd Edlinger wrote:
> On 3/3/20 9:58 AM, Christian Brauner wrote:
>> So one issue I see with having to reacquire the cred_guard_mutex might
>> be that this would allow tasks holding the cred_guard_mutex to block a
>> killed exec'ing task from exiting, right?
>>
> 
> Yes maybe, but I think it will not be worse than it is now.
> Since the second time the mutex is acquired it is done with
> mutex_lock_killable, so at least kill -9 should get it terminated.
> 



>  static void free_bprm(struct linux_binprm *bprm)
>  {
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> +		if (!bprm->called_flush_old_exec)
> +			mutex_lock(&current->signal->cred_guard_mutex);
> +		current->signal->cred_locked_for_ptrace = false;
>  		mutex_unlock(&current->signal->cred_guard_mutex);


Hmm, cough...
Actually, when mutex_lock_killable fails in flush_old_exec due to
kill -9, free_bprm locks the same mutex again, this time unkillably.
I had better use mutex_lock_killable here as well, and if that fails,
I can leave cred_locked_for_ptrace set; it shouldn't matter, since
this is a fatal signal anyway, right?
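In rough pseudocode modelled on the snippet quoted above (called_flush_old_exec and cred_locked_for_ptrace are fields introduced by this patch series, not mainline; the rest of the cleanup is elided), that adjustment would look something like:

```
static void free_bprm(struct linux_binprm *bprm)
{
	free_arg_pages(bprm);
	if (bprm->cred) {
		/* if flush_old_exec() already ran, the mutex is still held */
		if (bprm->called_flush_old_exec ||
		    !mutex_lock_killable(&current->signal->cred_guard_mutex)) {
			current->signal->cred_locked_for_ptrace = false;
			mutex_unlock(&current->signal->cred_guard_mutex);
		}
		/* on a fatal signal the flag stays set; the task is exiting */
		abort_creds(bprm->cred);
	}
	/* ... */
}
```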

Bernd.
Christian Brauner March 3, 2020, 2:20 p.m. UTC | #10
On Tue, Mar 03, 2020 at 11:23:31AM +0000, Bernd Edlinger wrote:
> On 3/3/20 11:34 AM, Bernd Edlinger wrote:
> > On 3/3/20 9:58 AM, Christian Brauner wrote:
> >> So one issue I see with having to reacquire the cred_guard_mutex might
> >> be that this would allow tasks holding the cred_guard_mutex to block a
> >> killed exec'ing task from exiting, right?
> >>
> > 
> > Yes, maybe, but I think it will not be worse than it is now:
> > the second time the mutex is acquired it is done with
> > mutex_lock_killable, so at least kill -9 should get it terminated.
> > 
> 
> 
> 
> >  static void free_bprm(struct linux_binprm *bprm)
> >  {
> >  	free_arg_pages(bprm);
> >  	if (bprm->cred) {
> > +		if (!bprm->called_flush_old_exec)
> > +			mutex_lock(&current->signal->cred_guard_mutex);
> > +		current->signal->cred_locked_for_ptrace = false;
> >  		mutex_unlock(&current->signal->cred_guard_mutex);
> 
> 
> Hmm, cough...
> actually, when mutex_lock_killable fails in flush_old_exec due to
> kill -9, free_bprm then locks the same mutex, this time unkillably.
> I had better use mutex_lock_killable here as well, and if that fails,
> I can leave cred_locked_for_ptrace set; it shouldn't matter, since
> this is a fatal signal anyway, right?

I think so, yes.
Christian Brauner March 4, 2020, 3:30 p.m. UTC | #11
On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> On 3/3/20 6:29 AM, Kees Cook wrote:
> > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> >> On 3/3/20 3:26 AM, Kees Cook wrote:
> >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> >>>> [...]
> >>>
> >>> If I'm reading this patch correctly, this changes the lifetime of the
> >>> cred_guard_mutex lock to be:
> >>> 	- during prepare_bprm_creds()
> >>> 	- from flush_old_exec() through install_exec_creds()
> >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> >>> install_exec_creds().
> > 
> > BTW, I think the effect of this change (i.e. my paragraph above) should
> > be distinctly called out in the commit log if this solution moves
> > forward.
> > 
> 
> Okay, will do.
> 
> >>> That means, for example, that check_unsafe_exec()'s documented invariant
> >>> is violated:
> >>>     /*
> >>>      * determine how safe it is to execute the proposed program
> >>>      * - the caller must hold ->cred_guard_mutex to protect against
> >>>      *   PTRACE_ATTACH or seccomp thread-sync
> >>>      */
> >>
> >> Oh, right, I haven't understood that hint...
> > 
> > I know no_new_privs is checked there, but I haven't studied the
> > PTRACE_ATTACH part of that comment. If that is handled with the new
> > check, this comment should be updated.
> > 
> 
> Okay, I change that comment to:
> 
> /*
>  * determine how safe it is to execute the proposed program
>  * - the caller must have set ->cred_locked_in_execve to protect against
>  *   PTRACE_ATTACH or seccomp thread-sync
>  */
> 
> >>> I think it also means that the potentially multiple invocations
> >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> >>> a lock (another place where current's no_new_privs is evaluated).
> >>
> >> So no_new_privs can change from 0->1, but should not
> >> when execve is running.
> >>
> >> As long as the calling thread is in execve it won't do this,
> >> and the only other place where it may be set for other threads
> >> is in seccomp_sync_threads, but that can easily be avoided; see below.
> > 
> > Yeah, everything was fine until I had to go complicate things with
> > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > flag was used to control this, but it required that the filter not get
> > applied during exec.
> > 
> >>> Related, it also means that cred_guard_mutex is unheld for every
> >>> invocation of search_binary_handler() (which can loop via the previously
> >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> >>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> >>> currently.)
> >>>
> >>> For seccomp, the expectations about existing thread states risks races
> >>> too. There are two locks held for TSYNC:
> >>> - current->sighand->siglock is held to keep new threads from
> >>>   appearing/disappearing, which would destroy filter refcounting and
> >>>   lead to memory corruption.
> >>
> >> I don't understand what you mean here.
> >> How can this lead to memory corruption?
> > 
> > Mainly this is a matter of how seccomp manages its filter hierarchy
> > (since the filters are shared through process ancestry), so if a thread
> > appears in the middle of TSYNC it may be racing another TSYNC and break
> > ancestry, leading to bad reference counting on process death, etc.
> > (Though, yes, with refcount_t now, things should never corrupt, just
> > waste memory.)
> > 
> 
> I assume for now that holding current->sighand->siglock while
> iterating all threads is sufficient here.
> 
> >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >>>   avoid no_new_privs and filter confusion during exec, which could
> >>>   lead to exploitable setuid conditions (see below).
> >>>
> >>> Just racing a malicious thread during TSYNC is not a very strong
> >>> example (a malicious thread could do lots of fun things to "current"
> >>> before it ever got near calling TSYNC), but I think there is the risk
> >>> of mismatched/confused states that we don't want to allow. One is a
> >>> particularly bad state that could lead to privilege escalations (in the
> >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> >>> has a filter attached that silently fails a priv-dropping setuid call
> >>> and continues execution with elevated privs, it can be tricked into
> >>> doing bad things on behalf of the unprivileged parent, which was the
> >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> >>>
> >>> thread A clones thread B
> >>> thread B starts setuid exec
> >>> thread A sets no_new_privs
> >>> thread A calls seccomp with TSYNC
> >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> >>> thread B passes check_unsafe_exec() with no_new_privs unset
> >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> >>> thread B finishes exec, now running with elevated privs, a filter chosen
> >>>          by thread A, _and_ nnp set (which doesn't matter)
> >>>
> >>> With the original locking, thread B will fail check_unsafe_exec()
> >>> because filter and nnp state are changed together, with "atomicity"
> >>> protected by the cred_guard_mutex.
> >>>
> >>
> >> Ah, good point, thanks!
> >>
> >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> >> while the cred_guard_mutex is locked, like this for instance:
> >>
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index b6ea3dc..377abf0 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> >>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> >>         assert_spin_locked(&current->sighand->siglock);
> >>  
> >> +       if (current->signal->cred_locked_for_ptrace)
> >> +               return -EAGAIN;
> >> +
> > 
> > Hmm. I guess something like that could work. TSYNC expects to be able to
> > report _which_ thread wrecked the call, though... I wonder if in_execve
> > could be used to figure out the offending thread. Hm, nope, that would
> > be outside of lock too (and all users are "current" right now, so the
> > lock wasn't needed before).
> > 
> 
> I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> but the caller will die quickly and cannot do anything with that information
> when another thread executes execve, right?
> 
> >>         /* Validate all threads being eligible for synchronization. */
> >>         caller = current;
> >>         for_each_thread(caller, thread) {
> >>
> >>
> >>> And this is just the bad state I _can_ see. I'm worried there are more...
> >>>
> >>> All this said, I do see a small similarity here to the work I did to
> >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> >>> decisions for the bprm based on current's state -- but current's state
> >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> >>> current's copy until exec ended and then stored bprm's copy into current.
> >>> If the only problem anyone can see here is the handling of no_new_privs,
> >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> >>> from cred_guard_mutex.
> >>>
> >>
> >> I still think that is solvable with using cred_locked_for_ptrace and
> >> simply make the tsync fail if it would otherwise be blocked.
> > 
> > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> > 
> 
> Yeah, I'd go with "cred_locked_in_execve".
> 
> > And the comment on bool cred_locked_for_ptrace should mention that
> > access is only allowed under cred_guard_mutex lock.
> > 
> 
> okay.
> 
> >>>> +	sig->cred_locked_for_ptrace = false;
> > 
> > This is redundant to the zalloc -- I think you can drop it (unless
> > someone wants to keep it for clarify?)
> > 
> 
> I'll remove that here and in init/init_task.c
> 
> > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > be sufficient to see a stable version of the thread...
> > 
> 
> No, these need to be addressed individually.  Most users just want
> to know whether the current credentials are sufficient at this moment;
> they will not change the credentials, as ptrace and TSYNC do.
> 
> BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> So adding an access to cred_locked_for_ptrace in ptrace_may_access is
> probably not an option.

That could be solved by, e.g., adding ptrace_may_access_{no}exec()
variants that take cred_guard_mutex.
diff mbox series

Patch

diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..61d6704 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@  new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful.  It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_for_ptrace.  The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_for_ptrace is reset again.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@  by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@  invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 
 A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..e466301 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@  int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1398,28 +1404,41 @@  void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);
 
 /*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_for_ptrace.
  * install_exec_creds() commits the new creds and drops the lock.
  * Or, if exec fails before, free_bprm() should release ->cred and
  * and unlock.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
+	int ret;
+
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	ret = -EAGAIN;
+	if (unlikely(current->signal->cred_locked_for_ptrace))
+		goto out;
+
+	ret = -ENOMEM;
 	bprm->cred = prepare_exec_creds();
-	if (likely(bprm->cred))
-		return 0;
+	if (likely(bprm->cred)) {
+		current->signal->cred_locked_for_ptrace = true;
+		ret = 0;
+	}
 
+out:
 	mutex_unlock(&current->signal->cred_guard_mutex);
-	return -ENOMEM;
+	return ret;
 }
 
 static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (!bprm->called_flush_old_exec)
+			mutex_lock(&current->signal->cred_guard_mutex);
+		current->signal->cred_locked_for_ptrace = false;
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1469,6 +1488,7 @@  void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	current->signal->cred_locked_for_ptrace = false;
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2930253 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@  struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_guard_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..073a2b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@  struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	bool cred_locked_for_ptrace;	/* set while in execve */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..ecefff28 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@ 
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.cred_locked_for_ptrace = false,
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@  void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..a2b2ec8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@  static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	sig->cred_locked_for_ptrace = false;
 
 	return 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..abf09ba 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@  static int ptrace_attach(struct task_struct *task, long request,
 	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
 		goto out;
 
+	retval = -EAGAIN;
+	if (task->signal->cred_locked_for_ptrace)
+		goto unlock_creds;
+
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 	task_unlock(task);
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@  static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@ 
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..6d8a048
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,66 @@ 
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+	int f, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(EAGAIN, errno);
+	ASSERT_EQ(f, -1);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN