diff mbox series

exec: Fix a deadlock in ptrace

Message ID AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com (mailing list archive)
State New, archived
Headers show
Series exec: Fix a deadlock in ptrace | expand

Commit Message

Bernd Edlinger March 1, 2020, 11:27 a.m. UTC
This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread are running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but the strace is no
longer able to respond, since it is blocked in vm_access.

The deadlock is always happening when strace needs to access the
tracees process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The proposed solution is to have a second mutex that is
used in mm_access, so it is allowed to continue while the
dying threads are not yet terminated.

I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst | 18 ++++++++++--------
 fs/exec.c                              |  9 +++++++++
 include/linux/binfmts.h                |  6 +++++-
 include/linux/sched/signal.h           |  1 +
 init/init_task.c                       |  1 +
 kernel/cred.c                          |  2 +-
 kernel/fork.c                          |  5 +++--
 mm/process_vm_access.c                 |  2 +-
 8 files changed, 31 insertions(+), 13 deletions(-)

Comments

Aleksa Sarai March 1, 2020, 3:13 p.m. UTC | #1
On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
> 
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated.  They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
> 
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> 
> strace          D    0 30614  30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> expect          D    0 31933  30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The proposed solution is to have a second mutex that is
> used in mm_access, so it is allowed to continue while the
> dying threads are not yet terminated.
> 
> I also took the opportunity to improve the documentation
> of prepare_creds, which is obviously out of sync.
>
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

I can't comment on the validity of the patch, but I also found and
reported this issue in 2016[1] and the discussion quickly veered into
the problem being more complicated (and uglier) than it seems at first
glance.

You should probably also Cc stable, given this has been a long-standing
issue and your patch doesn't look (too) invasive.

[1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Christian Brauner March 1, 2020, 3:58 p.m. UTC | #2
On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote:
> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> > This fixes a deadlock in the tracer when tracing a multi-threaded
> > application that calls execve while more than one thread are running.
> > 
> > I observed that when running strace on the gcc test suite, it always
> > blocks after a while, when expect calls execve, because other threads
> > have to be terminated.  They send ptrace events, but the strace is no
> > longer able to respond, since it is blocked in vm_access.
> > 
> > The deadlock is always happening when strace needs to access the
> > tracees process mmap, while another thread in the tracee starts to
> > execve a child process, but that cannot continue until the
> > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> > 
> > strace          D    0 30614  30584 0x00000000
> > Call Trace:
> > __schedule+0x3ce/0x6e0
> > schedule+0x5c/0xd0
> > schedule_preempt_disabled+0x15/0x20
> > __mutex_lock.isra.13+0x1ec/0x520
> > __mutex_lock_killable_slowpath+0x13/0x20
> > mutex_lock_killable+0x28/0x30
> > mm_access+0x27/0xa0
> > process_vm_rw_core.isra.3+0xff/0x550
> > process_vm_rw+0xdd/0xf0
> > __x64_sys_process_vm_readv+0x31/0x40
> > do_syscall_64+0x64/0x220
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > expect          D    0 31933  30876 0x80004003
> > Call Trace:
> > __schedule+0x3ce/0x6e0
> > schedule+0x5c/0xd0
> > flush_old_exec+0xc4/0x770
> > load_elf_binary+0x35a/0x16c0
> > search_binary_handler+0x97/0x1d0
> > __do_execve_file.isra.40+0x5d4/0x8a0
> > __x64_sys_execve+0x49/0x60
> > do_syscall_64+0x64/0x220
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > The proposed solution is to have a second mutex that is
> > used in mm_access, so it is allowed to continue while the
> > dying threads are not yet terminated.
> > 
> > I also took the opportunity to improve the documentation
> > of prepare_creds, which is obviously out of sync.
> >
> > Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> 
> I can't comment on the validity of the patch, but I also found and
> reported this issue in 2016[1] and the discussion quickly veered into
> the problem being more complicated (and uglier) than it seems at first
> glance.
> 
> You should probably also Cc stable, given this has been a long-standing
> issue and your patch doesn't look (too) invasive.
> 
> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/

Yeah, I remember you mentioning this a while back.

Bernd, we really want a reproducer for this sent alongside with this
patch added to:
tools/testing/selftests/ptrace/
Having a test for this bug irrespective of whether or not we go with
this as fix seems really worth it.

Oleg seems to have suggested that a potential alternative fix is to wait
in de_thread() until all other threads in the thread-group have passed
exit_notiy(). Right now we only kill them but don't wait. Currently
de_thread() only waits for the thread-group leader to pass exit_notify()
whenever a non-thread-group leader thread execs (because the exec'ing
thread becomes the new thread-group leader with the same pid as the
former thread-group leader).

Christian
Bernd Edlinger March 1, 2020, 5:24 p.m. UTC | #3
Hi Aleksa,

On 3/1/20 4:13 PM, Aleksa Sarai wrote:
> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated.  They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>
>> strace          D    0 30614  30584 0x00000000
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> schedule_preempt_disabled+0x15/0x20
>> __mutex_lock.isra.13+0x1ec/0x520
>> __mutex_lock_killable_slowpath+0x13/0x20
>> mutex_lock_killable+0x28/0x30
>> mm_access+0x27/0xa0
>> process_vm_rw_core.isra.3+0xff/0x550
>> process_vm_rw+0xdd/0xf0
>> __x64_sys_process_vm_readv+0x31/0x40
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> expect          D    0 31933  30876 0x80004003
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> flush_old_exec+0xc4/0x770
>> load_elf_binary+0x35a/0x16c0
>> search_binary_handler+0x97/0x1d0
>> __do_execve_file.isra.40+0x5d4/0x8a0
>> __x64_sys_execve+0x49/0x60
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> The proposed solution is to have a second mutex that is
>> used in mm_access, so it is allowed to continue while the
>> dying threads are not yet terminated.
>>
>> I also took the opportunity to improve the documentation
>> of prepare_creds, which is obviously out of sync.
>>
>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> 
> I can't comment on the validity of the patch, but I also found and
> reported this issue in 2016[1] and the discussion quickly veered into
> the problem being more complicated (and uglier) than it seems at first
> glance.
> 
> You should probably also Cc stable, given this has been a long-standing
> issue and your patch doesn't look (too) invasive.
> 

I am fully aware that this patch won't fix the case then PTRACE_ACCESS is racing
with de_thread.  But I don't see a problem with allowing vm access based on the
current credentials as they are still the same until de_thread is done with it's
job.  And in a practical way this fixes 99% of the real problem here, as it only
happens since strace is currently tracing something and needs access to the parameters
in the tracee's vm space.
Of course you could fork the strace process to do any PTRACE_ACCESS when necessary,
and, well, maybe that would fix the remaining problem here...

However before I considered changing the kernel for this I tried to fix this
within strace.  First I tried to wait in the signal handler.  See attached
strace-patch-1.diff, but that did not work, BUT I think it is possible that your
patch you proposed previously would actually make it work.

I tried then another approach, using a worker thread to wait for the childs,
but it did only work when I remove PTRACE_O_TRACEEXIT from the ptrace options,
because the ptrace(PTRACE_SYSCALL, pid, 0L, 0L) does not work in the worker thread,
rv = -1, errno = 3 there, and unfortunately the main thread is blocked and unable
to do the ptrace call, that makes the thread continue.
So I consider that second patch really ugly, and wouldn't propose something like
that seriously.


@@ -69,7 +71,7 @@
 cflag_t cflag = CFLAG_NONE;
 unsigned int followfork;
 unsigned int ptrace_setoptions = PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEEXEC
-                                | PTRACE_O_TRACEEXIT;
+                                ;//| PTRACE_O_TRACEEXIT;
 unsigned int xflag;
 bool debug_flag;
 bool Tflag;

so it only works because of this line, without that it is not able to make the
thread continue after the PTRACE_EVENT_EXIT. 


Thanks
Bernd.

> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
>
Bernd Edlinger March 1, 2020, 5:46 p.m. UTC | #4
On 3/1/20 4:58 PM, Christian Brauner wrote:
> On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote:
>> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>> This fixes a deadlock in the tracer when tracing a multi-threaded
>>> application that calls execve while more than one thread are running.
>>>
>>> I observed that when running strace on the gcc test suite, it always
>>> blocks after a while, when expect calls execve, because other threads
>>> have to be terminated.  They send ptrace events, but the strace is no
>>> longer able to respond, since it is blocked in vm_access.
>>>
>>> The deadlock is always happening when strace needs to access the
>>> tracees process mmap, while another thread in the tracee starts to
>>> execve a child process, but that cannot continue until the
>>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>>
>>> strace          D    0 30614  30584 0x00000000
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> schedule_preempt_disabled+0x15/0x20
>>> __mutex_lock.isra.13+0x1ec/0x520
>>> __mutex_lock_killable_slowpath+0x13/0x20
>>> mutex_lock_killable+0x28/0x30
>>> mm_access+0x27/0xa0
>>> process_vm_rw_core.isra.3+0xff/0x550
>>> process_vm_rw+0xdd/0xf0
>>> __x64_sys_process_vm_readv+0x31/0x40
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> expect          D    0 31933  30876 0x80004003
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> flush_old_exec+0xc4/0x770
>>> load_elf_binary+0x35a/0x16c0
>>> search_binary_handler+0x97/0x1d0
>>> __do_execve_file.isra.40+0x5d4/0x8a0
>>> __x64_sys_execve+0x49/0x60
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> The proposed solution is to have a second mutex that is
>>> used in mm_access, so it is allowed to continue while the
>>> dying threads are not yet terminated.
>>>
>>> I also took the opportunity to improve the documentation
>>> of prepare_creds, which is obviously out of sync.
>>>
>>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>>
>> I can't comment on the validity of the patch, but I also found and
>> reported this issue in 2016[1] and the discussion quickly veered into
>> the problem being more complicated (and uglier) than it seems at first
>> glance.
>>
>> You should probably also Cc stable, given this has been a long-standing
>> issue and your patch doesn't look (too) invasive.
>>
>> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> 
> Yeah, I remember you mentioning this a while back.
> 
> Bernd, we really want a reproducer for this sent alongside with this
> patch added to:
> tools/testing/selftests/ptrace/
> Having a test for this bug irrespective of whether or not we go with
> this as fix seems really worth it.
> 

I ran into this issue, because I wanted to fix an issue in the gcc testsuite,
namely why it forgets to remove some temp files,
so I did the following:

strace -ftt -o trace.txt make check-gcc-c -k -j4

I reproduced with v4.20 and v5.5 kernel, and I don't know why but it is
not happening on all systems I tested, maybe it is something that the expect program
does, because, always when I try to reproduce this, the deadlock was always in "expect".

I use expect version 5.45 on the computer where the above test freezes after
a couple of minutes.

I think the issue with strace is that it is using vm_access to get the parameters
of a syscall that is going on in one thread, and that races with another thread
that calls execve, and blocks the cred_guard_mutex.

While Olg's test case here, will certainly not be fixed:

https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/

he mentions the access to "anything else which needs ->cred_guard_mutex,
say open(/proc/$pid/mem)", I don't know for sure how that can be done, but if
that is possible, it would probably work as a test case.

What do you think?


Bernd.


> Oleg seems to have suggested that a potential alternative fix is to wait
> in de_thread() until all other threads in the thread-group have passed
> exit_notiy(). Right now we only kill them but don't wait. Currently
> de_thread() only waits for the thread-group leader to pass exit_notify()
> whenever a non-thread-group leader thread execs (because the exec'ing
> thread becomes the new thread-group leader with the same pid as the
> former thread-group leader).
> 
> Christian
>
Christian Brauner March 1, 2020, 6:20 p.m. UTC | #5
On Sun, Mar 01, 2020 at 05:46:08PM +0000, Bernd Edlinger wrote:
> On 3/1/20 4:58 PM, Christian Brauner wrote:
> > On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote:
> >> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>> This fixes a deadlock in the tracer when tracing a multi-threaded
> >>> application that calls execve while more than one thread are running.
> >>>
> >>> I observed that when running strace on the gcc test suite, it always
> >>> blocks after a while, when expect calls execve, because other threads
> >>> have to be terminated.  They send ptrace events, but the strace is no
> >>> longer able to respond, since it is blocked in vm_access.
> >>>
> >>> The deadlock is always happening when strace needs to access the
> >>> tracees process mmap, while another thread in the tracee starts to
> >>> execve a child process, but that cannot continue until the
> >>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> >>>
> >>> strace          D    0 30614  30584 0x00000000
> >>> Call Trace:
> >>> __schedule+0x3ce/0x6e0
> >>> schedule+0x5c/0xd0
> >>> schedule_preempt_disabled+0x15/0x20
> >>> __mutex_lock.isra.13+0x1ec/0x520
> >>> __mutex_lock_killable_slowpath+0x13/0x20
> >>> mutex_lock_killable+0x28/0x30
> >>> mm_access+0x27/0xa0
> >>> process_vm_rw_core.isra.3+0xff/0x550
> >>> process_vm_rw+0xdd/0xf0
> >>> __x64_sys_process_vm_readv+0x31/0x40
> >>> do_syscall_64+0x64/0x220
> >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>
> >>> expect          D    0 31933  30876 0x80004003
> >>> Call Trace:
> >>> __schedule+0x3ce/0x6e0
> >>> schedule+0x5c/0xd0
> >>> flush_old_exec+0xc4/0x770
> >>> load_elf_binary+0x35a/0x16c0
> >>> search_binary_handler+0x97/0x1d0
> >>> __do_execve_file.isra.40+0x5d4/0x8a0
> >>> __x64_sys_execve+0x49/0x60
> >>> do_syscall_64+0x64/0x220
> >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>
> >>> The proposed solution is to have a second mutex that is
> >>> used in mm_access, so it is allowed to continue while the
> >>> dying threads are not yet terminated.
> >>>
> >>> I also took the opportunity to improve the documentation
> >>> of prepare_creds, which is obviously out of sync.
> >>>
> >>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> >>
> >> I can't comment on the validity of the patch, but I also found and
> >> reported this issue in 2016[1] and the discussion quickly veered into
> >> the problem being more complicated (and uglier) than it seems at first
> >> glance.
> >>
> >> You should probably also Cc stable, given this has been a long-standing
> >> issue and your patch doesn't look (too) invasive.
> >>
> >> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
> > 
> > Yeah, I remember you mentioning this a while back.
> > 
> > Bernd, we really want a reproducer for this sent alongside with this
> > patch added to:
> > tools/testing/selftests/ptrace/
> > Having a test for this bug irrespective of whether or not we go with
> > this as fix seems really worth it.
> > 
> 
> I ran into this issue, because I wanted to fix an issue in the gcc testsuite,
> namely why it forgets to remove some temp files,
> so I did the following:
> 
> strace -ftt -o trace.txt make check-gcc-c -k -j4
> 
> I reproduced with v4.20 and v5.5 kernel, and I don't know why but it is
> not happening on all systems I tested, maybe it is something that the expect program
> does, because, always when I try to reproduce this, the deadlock was always in "expect".
> 
> I use expect version 5.45 on the computer where the above test freezes after
> a couple of minutes.
> 
> I think the issue with strace is that it is using vm_access to get the parameters
> of a syscall that is going on in one thread, and that races with another thread
> that calls execve, and blocks the cred_guard_mutex.
> 
> While Olg's test case here, will certainly not be fixed:
> 
> https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
> 
> he mentions the access to "anything else which needs ->cred_guard_mutex,
> say open(/proc/$pid/mem)", I don't know for sure how that can be done, but if
> that is possible, it would probably work as a test case.
> 
> What do you think?

Yeah, anything that calls ptrace_may_access() is fine and
open(/proc/$pid/mem) will work so long as $pid is not in the same
thread-group as the caller. A polished version of the reproducer you
linked in would probably be good.
Jann Horn March 1, 2020, 6:21 p.m. UTC | #6
On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
> The proposed solution is to have a second mutex that is
> used in mm_access, so it is allowed to continue while the
> dying threads are not yet terminated.

Just for context: When I proposed something similar back in 2016,
https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
was the resulting discussion thread. At least back then, I looked
through the various existing users of cred_guard_mutex, and the only
places that couldn't be converted to the new second mutex were
PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.


The ideal solution would IMO be something like this: Decide what the
new task's credentials should be *before* reaching de_thread(),
install them into a second cred* on the task (together with the new
dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
check against both. After that, some further restructuring might even
allow the cred_guard_mutex to not be held across all of the VFS
operations that happen early on in execve, which may block
indefinitely. But that would be pretty complicated, so I think your
proposed solution makes sense for now, given that nobody has managed
to implement anything better in the last few years.
Christian Brauner March 1, 2020, 6:52 p.m. UTC | #7
On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
> > The proposed solution is to have a second mutex that is
> > used in mm_access, so it is allowed to continue while the
> > dying threads are not yet terminated.
> 
> Just for context: When I proposed something similar back in 2016,
> https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> was the resulting discussion thread. At least back then, I looked
> through the various existing users of cred_guard_mutex, and the only
> places that couldn't be converted to the new second mutex were
> PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> 
> 
> The ideal solution would IMO be something like this: Decide what the
> new task's credentials should be *before* reaching de_thread(),
> install them into a second cred* on the task (together with the new
> dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> check against both. After that, some further restructuring might even

Hm, so essentially a private ptrace_access_cred member in task_struct?
That would presumably also involve altering various LSM hooks to look at
ptrace_access_cred.

(Minor side-note, de_thread() takes a struct task_struct argument but
 only ever is passed current.)

> allow the cred_guard_mutex to not be held across all of the VFS
> operations that happen early on in execve, which may block
> indefinitely. But that would be pretty complicated, so I think your
> proposed solution makes sense for now, given that nobody has managed
> to implement anything better in the last few years.

Reading through the old threads and how often this issue came up, I tend
to agree.
Bernd Edlinger March 1, 2020, 7 p.m. UTC | #8
On 3/1/20 7:52 PM, Christian Brauner wrote:
> On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
>> On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
>> <bernd.edlinger@hotmail.de> wrote:
>>> The proposed solution is to have a second mutex that is
>>> used in mm_access, so it is allowed to continue while the
>>> dying threads are not yet terminated.
>>
>> Just for context: When I proposed something similar back in 2016,
>> https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
>> was the resulting discussion thread. At least back then, I looked
>> through the various existing users of cred_guard_mutex, and the only
>> places that couldn't be converted to the new second mutex were
>> PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
>>
>>
>> The ideal solution would IMO be something like this: Decide what the
>> new task's credentials should be *before* reaching de_thread(),
>> install them into a second cred* on the task (together with the new
>> dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
>> check against both. After that, some further restructuring might even
> 
> Hm, so essentially a private ptrace_access_cred member in task_struct?
> That would presumably also involve altering various LSM hooks to look at
> ptrace_access_cred.
> 
> (Minor side-note, de_thread() takes a struct task_struct argument but
>  only ever is passed current.)
> 
>> allow the cred_guard_mutex to not be held across all of the VFS
>> operations that happen early on in execve, which may block
>> indefinitely. But that would be pretty complicated, so I think your
>> proposed solution makes sense for now, given that nobody has managed
>> to implement anything better in the last few years.
> 
> Reading through the old threads and how often this issue came up, I tend
> to agree.
> 

Okay, fine.

I managed to change Oleg's test case, into one that shows what exactly
is changed with this patch:


$ cat t.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/signal.h>
#include <sys/ptrace.h>

void *thread(void *arg)
{
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}

int main(void)
{
	int f, pid = fork();
	char mm[64];

	if (!pid) {
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}

	sleep(1);
	sprintf(mm, "/proc/%d/mem", pid);
        printf("open(%s)\n", mm);
	f = open(mm, O_RDONLY);
        printf("f = %d\n", f);
	// this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	if (f >= 0)
		close(f);
	return 0;
}
$ gcc -pthread -Wall t.c
$ ./a.out 
open(/proc/2802/mem)
f = 3
$ passed

previously this did block, how can I make a test case for this?
I am not so experienced in this matter.


Thanks
Bernd.
Jann Horn March 1, 2020, 8 p.m. UTC | #9
On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > <bernd.edlinger@hotmail.de> wrote:
> > > The proposed solution is to have a second mutex that is
> > > used in mm_access, so it is allowed to continue while the
> > > dying threads are not yet terminated.
> >
> > Just for context: When I proposed something similar back in 2016,
> > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> > was the resulting discussion thread. At least back then, I looked
> > through the various existing users of cred_guard_mutex, and the only
> > places that couldn't be converted to the new second mutex were
> > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> >
> >
> > The ideal solution would IMO be something like this: Decide what the
> > new task's credentials should be *before* reaching de_thread(),
> > install them into a second cred* on the task (together with the new
> > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > check against both. After that, some further restructuring might even
>
> Hm, so essentially a private ptrace_access_cred member in task_struct?

And a second dumpability field, because that changes together with the
creds during execve. (Btw, currently the dumpability is in the
mm_struct, but that's kinda wrong. The mm_struct is removed from a
task on exit while access checks can still be performed against it, and
currently ptrace_may_access() just lets the access go through in that
case, which weakens the protection offered by PR_SET_DUMPABLE when
used for security purposes. I think it ought to be moved over into the
task_struct.)

> That would presumably also involve altering various LSM hooks to look at
> ptrace_access_cred.

When I tried to implement this in the past, I changed the LSM hook to
take the target task's cred* as an argument, and then called the LSM
hook twice from ptrace_may_access(). IIRC having the target task's
creds as an argument works for almost all the LSMs, with the exception
of Yama, which doesn't really care about the target task's creds, so
you have to pass in both the task_struct* and the cred*.
Christian Brauner March 2, 2020, 7:47 a.m. UTC | #10
On Sun, Mar 01, 2020 at 09:00:22PM +0100, Jann Horn wrote:
> On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > > <bernd.edlinger@hotmail.de> wrote:
> > > > The proposed solution is to have a second mutex that is
> > > > used in mm_access, so it is allowed to continue while the
> > > > dying threads are not yet terminated.
> > >
> > > Just for context: When I proposed something similar back in 2016,
> > > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> > > was the resulting discussion thread. At least back then, I looked
> > > through the various existing users of cred_guard_mutex, and the only
> > > places that couldn't be converted to the new second mutex were
> > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> > >
> > >
> > > The ideal solution would IMO be something like this: Decide what the
> > > new task's credentials should be *before* reaching de_thread(),
> > > install them into a second cred* on the task (together with the new
> > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > > check against both. After that, some further restructuring might even
> >
> > Hm, so essentially a private ptrace_access_cred member in task_struct?
> 
> And a second dumpability field, because that changes together with the
> creds during execve. (Btw, currently the dumpability is in the
> mm_struct, but that's kinda wrong. The mm_struct is removed from a
> task on exit while access checks can still be performed against it, and
> currently ptrace_may_access() just lets the access go through in that
> case, which weakens the protection offered by PR_SET_DUMPABLE when
> used for security purposes. I think it ought to be moved over into the
> task_struct.)
> 
> > That would presumably also involve altering various LSM hooks to look at
> > ptrace_access_cred.
> 
> When I tried to implement this in the past, I changed the LSM hook to
> take the target task's cred* as an argument, and then called the LSM
> hook twice from ptrace_may_access(). IIRC having the target task's
> creds as an argument works for almost all the LSMs, with the exception
> of Yama, which doesn't really care about the target task's creds, so
> you have to pass in both the task_struct* and the cred*.

It seems we should try PoCing this.

Christian
Christian Brauner March 2, 2020, 7:48 a.m. UTC | #11
On Mon, Mar 02, 2020 at 08:47:53AM +0100, Christian Brauner wrote:
> On Sun, Mar 01, 2020 at 09:00:22PM +0100, Jann Horn wrote:
> > On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > > > <bernd.edlinger@hotmail.de> wrote:
> > > > > The proposed solution is to have a second mutex that is
> > > > > used in mm_access, so it is allowed to continue while the
> > > > > dying threads are not yet terminated.
> > > >
> > > > Just for context: When I proposed something similar back in 2016,
> > > > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
> > > > was the resulting discussion thread. At least back then, I looked
> > > > through the various existing users of cred_guard_mutex, and the only
> > > > places that couldn't be converted to the new second mutex were
> > > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> > > >
> > > >
> > > > The ideal solution would IMO be something like this: Decide what the
> > > > new task's credentials should be *before* reaching de_thread(),
> > > > install them into a second cred* on the task (together with the new
> > > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > > > check against both. After that, some further restructuring might even
> > >
> > > Hm, so essentially a private ptrace_access_cred member in task_struct?
> > 
> > And a second dumpability field, because that changes together with the
> > creds during execve. (Btw, currently the dumpability is in the
> > mm_struct, but that's kinda wrong. The mm_struct is removed from a
> > task on exit while access checks can still be performed against it, and
> > currently ptrace_may_access() just lets the access go through in that
> > case, which weakens the protection offered by PR_SET_DUMPABLE when
> > used for security purposes. I think it ought to be moved over into the
> > task_struct.)
> > 
> > > That would presumably also involve altering various LSM hooks to look at
> > > ptrace_access_cred.
> > 
> > When I tried to implement this in the past, I changed the LSM hook to
> > take the target task's cred* as an argument, and then called the LSM
> > hook twice from ptrace_may_access(). IIRC having the target task's
> > creds as an argument works for almost all the LSMs, with the exception
> > of Yama, which doesn't really care about the target task's creds, so
> > you have to pass in both the task_struct* and the cred*.
> 
> It seems we should try PoCing this.

Independent of the fix for Bernd's issue that is.
diff mbox series

Patch

diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..c98e0a8 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,13 @@  new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful.  It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and the mutex
+current->signal->cred_change_mutex is acquired later, while the credentials
+and the process mmap are actually changed.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +470,8 @@  by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +489,7 @@  invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 
 A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a6884e4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@  int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_change_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1420,6 +1426,8 @@  static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->called_flush_old_exec)
+			mutex_unlock(&current->signal->cred_change_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1469,6 +1477,7 @@  void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->cred_change_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2e1318b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@  struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_change_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..37eeabe 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@  struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	struct mutex cred_change_mutex; /* guard against credentials change */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..6cd9a0f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@ 
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@  void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..0395154 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@  struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err =  mutex_lock_killable(&task->signal->cred_change_mutex);
 	if (err)
 		return ERR_PTR(err);
 
@@ -1234,7 +1234,7 @@  struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 		mmput(mm);
 		mm = ERR_PTR(-EACCES);
 	}
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_change_mutex);
 
 	return mm;
 }
@@ -1594,6 +1594,7 @@  static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->cred_change_mutex);
 
 	return 0;
 }
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@  static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)