Message ID | AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | exec: Fix a deadlock in ptrace | expand |
On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > This fixes a deadlock in the tracer when tracing a multi-threaded > application that calls execve while more than one thread are running. > > I observed that when running strace on the gcc test suite, it always > blocks after a while, when expect calls execve, because other threads > have to be terminated. They send ptrace events, but the strace is no > longer able to respond, since it is blocked in vm_access. > > The deadlock is always happening when strace needs to access the > tracees process mmap, while another thread in the tracee starts to > execve a child process, but that cannot continue until the > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: > > strace D 0 30614 30584 0x00000000 > Call Trace: > __schedule+0x3ce/0x6e0 > schedule+0x5c/0xd0 > schedule_preempt_disabled+0x15/0x20 > __mutex_lock.isra.13+0x1ec/0x520 > __mutex_lock_killable_slowpath+0x13/0x20 > mutex_lock_killable+0x28/0x30 > mm_access+0x27/0xa0 > process_vm_rw_core.isra.3+0xff/0x550 > process_vm_rw+0xdd/0xf0 > __x64_sys_process_vm_readv+0x31/0x40 > do_syscall_64+0x64/0x220 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > expect D 0 31933 30876 0x80004003 > Call Trace: > __schedule+0x3ce/0x6e0 > schedule+0x5c/0xd0 > flush_old_exec+0xc4/0x770 > load_elf_binary+0x35a/0x16c0 > search_binary_handler+0x97/0x1d0 > __do_execve_file.isra.40+0x5d4/0x8a0 > __x64_sys_execve+0x49/0x60 > do_syscall_64+0x64/0x220 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > The proposed solution is to have a second mutex that is > used in mm_access, so it is allowed to continue while the > dying threads are not yet terminated. > > I also took the opportunity to improve the documentation > of prepare_creds, which is obviously out of sync. > > Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> I can't comment on the validity of the patch, but I also found and reported this issue in 2016[1] and the discussion quickly veered into the problem being more complicated (and uglier) than it seems at first glance. You should probably also Cc stable, given this has been a long-standing issue and your patch doesn't look (too) invasive. [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote: > On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > > This fixes a deadlock in the tracer when tracing a multi-threaded > > application that calls execve while more than one thread are running. > > > > I observed that when running strace on the gcc test suite, it always > > blocks after a while, when expect calls execve, because other threads > > have to be terminated. They send ptrace events, but the strace is no > > longer able to respond, since it is blocked in vm_access. > > > > The deadlock is always happening when strace needs to access the > > tracees process mmap, while another thread in the tracee starts to > > execve a child process, but that cannot continue until the > > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: > > > > strace D 0 30614 30584 0x00000000 > > Call Trace: > > __schedule+0x3ce/0x6e0 > > schedule+0x5c/0xd0 > > schedule_preempt_disabled+0x15/0x20 > > __mutex_lock.isra.13+0x1ec/0x520 > > __mutex_lock_killable_slowpath+0x13/0x20 > > mutex_lock_killable+0x28/0x30 > > mm_access+0x27/0xa0 > > process_vm_rw_core.isra.3+0xff/0x550 > > process_vm_rw+0xdd/0xf0 > > __x64_sys_process_vm_readv+0x31/0x40 > > do_syscall_64+0x64/0x220 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > expect D 0 31933 30876 0x80004003 > > Call Trace: > > __schedule+0x3ce/0x6e0 > > schedule+0x5c/0xd0 > > flush_old_exec+0xc4/0x770 > > load_elf_binary+0x35a/0x16c0 > > search_binary_handler+0x97/0x1d0 > > __do_execve_file.isra.40+0x5d4/0x8a0 > > __x64_sys_execve+0x49/0x60 > > do_syscall_64+0x64/0x220 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > The proposed solution is to have a second mutex that is > > used in mm_access, so it is allowed to continue while the > > dying threads are not yet terminated. > > > > I also took the opportunity to improve the documentation > > of prepare_creds, which is obviously out of sync. > > > > Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> > > I can't comment on the validity of the patch, but I also found and > reported this issue in 2016[1] and the discussion quickly veered into > the problem being more complicated (and uglier) than it seems at first > glance. > > You should probably also Cc stable, given this has been a long-standing > issue and your patch doesn't look (too) invasive. > > [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Yeah, I remember you mentioning this a while back. Bernd, we really want a reproducer for this sent alongside with this patch added to: tools/testing/selftests/ptrace/ Having a test for this bug irrespective of whether or not we go with this as fix seems really worth it. Oleg seems to have suggested that a potential alternative fix is to wait in de_thread() until all other threads in the thread-group have passed exit_notiy(). Right now we only kill them but don't wait. Currently de_thread() only waits for the thread-group leader to pass exit_notify() whenever a non-thread-group leader thread execs (because the exec'ing thread becomes the new thread-group leader with the same pid as the former thread-group leader). Christian
Hi Aleksa, On 3/1/20 4:13 PM, Aleksa Sarai wrote: > On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: >> This fixes a deadlock in the tracer when tracing a multi-threaded >> application that calls execve while more than one thread are running. >> >> I observed that when running strace on the gcc test suite, it always >> blocks after a while, when expect calls execve, because other threads >> have to be terminated. They send ptrace events, but the strace is no >> longer able to respond, since it is blocked in vm_access. >> >> The deadlock is always happening when strace needs to access the >> tracees process mmap, while another thread in the tracee starts to >> execve a child process, but that cannot continue until the >> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: >> >> strace D 0 30614 30584 0x00000000 >> Call Trace: >> __schedule+0x3ce/0x6e0 >> schedule+0x5c/0xd0 >> schedule_preempt_disabled+0x15/0x20 >> __mutex_lock.isra.13+0x1ec/0x520 >> __mutex_lock_killable_slowpath+0x13/0x20 >> mutex_lock_killable+0x28/0x30 >> mm_access+0x27/0xa0 >> process_vm_rw_core.isra.3+0xff/0x550 >> process_vm_rw+0xdd/0xf0 >> __x64_sys_process_vm_readv+0x31/0x40 >> do_syscall_64+0x64/0x220 >> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> >> expect D 0 31933 30876 0x80004003 >> Call Trace: >> __schedule+0x3ce/0x6e0 >> schedule+0x5c/0xd0 >> flush_old_exec+0xc4/0x770 >> load_elf_binary+0x35a/0x16c0 >> search_binary_handler+0x97/0x1d0 >> __do_execve_file.isra.40+0x5d4/0x8a0 >> __x64_sys_execve+0x49/0x60 >> do_syscall_64+0x64/0x220 >> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> >> The proposed solution is to have a second mutex that is >> used in mm_access, so it is allowed to continue while the >> dying threads are not yet terminated. >> >> I also took the opportunity to improve the documentation >> of prepare_creds, which is obviously out of sync. >> >> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> > > I can't comment on the validity of the patch, but I also found and > reported this issue in 2016[1] and the discussion quickly veered into > the problem being more complicated (and uglier) than it seems at first > glance. > > You should probably also Cc stable, given this has been a long-standing > issue and your patch doesn't look (too) invasive. > I am fully aware that this patch won't fix the case then PTRACE_ACCESS is racing with de_thread. But I don't see a problem with allowing vm access based on the current credentials as they are still the same until de_thread is done with it's job. And in a practical way this fixes 99% of the real problem here, as it only happens since strace is currently tracing something and needs access to the parameters in the tracee's vm space. Of course you could fork the strace process to do any PTRACE_ACCESS when necessary, and, well, maybe that would fix the remaining problem here... However before I considered changing the kernel for this I tried to fix this within strace. First I tried to wait in the signal handler. See attached strace-patch-1.diff, but that did not work, BUT I think it is possible that your patch you proposed previously would actually make it work. I tried then another approach, using a worker thread to wait for the childs, but it did only work when I remove PTRACE_O_TRACEEXIT from the ptrace options, because the ptrace(PTRACE_SYSCALL, pid, 0L, 0L) does not work in the worker thread, rv = -1, errno = 3 there, and unfortunately the main thread is blocked and unable to do the ptrace call, that makes the thread continue. So I consider that second patch really ugly, and wouldn't propose something like that seriously. @@ -69,7 +71,7 @@ cflag_t cflag = CFLAG_NONE; unsigned int followfork; unsigned int ptrace_setoptions = PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEEXEC - | PTRACE_O_TRACEEXIT; + ;//| PTRACE_O_TRACEEXIT; unsigned int xflag; bool debug_flag; bool Tflag; so it only works because of this line, without that it is not able to make the thread continue after the PTRACE_EVENT_EXIT. Thanks Bernd. > [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ >
On 3/1/20 4:58 PM, Christian Brauner wrote: > On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote: >> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: >>> This fixes a deadlock in the tracer when tracing a multi-threaded >>> application that calls execve while more than one thread are running. >>> >>> I observed that when running strace on the gcc test suite, it always >>> blocks after a while, when expect calls execve, because other threads >>> have to be terminated. They send ptrace events, but the strace is no >>> longer able to respond, since it is blocked in vm_access. >>> >>> The deadlock is always happening when strace needs to access the >>> tracees process mmap, while another thread in the tracee starts to >>> execve a child process, but that cannot continue until the >>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: >>> >>> strace D 0 30614 30584 0x00000000 >>> Call Trace: >>> __schedule+0x3ce/0x6e0 >>> schedule+0x5c/0xd0 >>> schedule_preempt_disabled+0x15/0x20 >>> __mutex_lock.isra.13+0x1ec/0x520 >>> __mutex_lock_killable_slowpath+0x13/0x20 >>> mutex_lock_killable+0x28/0x30 >>> mm_access+0x27/0xa0 >>> process_vm_rw_core.isra.3+0xff/0x550 >>> process_vm_rw+0xdd/0xf0 >>> __x64_sys_process_vm_readv+0x31/0x40 >>> do_syscall_64+0x64/0x220 >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> >>> expect D 0 31933 30876 0x80004003 >>> Call Trace: >>> __schedule+0x3ce/0x6e0 >>> schedule+0x5c/0xd0 >>> flush_old_exec+0xc4/0x770 >>> load_elf_binary+0x35a/0x16c0 >>> search_binary_handler+0x97/0x1d0 >>> __do_execve_file.isra.40+0x5d4/0x8a0 >>> __x64_sys_execve+0x49/0x60 >>> do_syscall_64+0x64/0x220 >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> >>> The proposed solution is to have a second mutex that is >>> used in mm_access, so it is allowed to continue while the >>> dying threads are not yet terminated. >>> >>> I also took the opportunity to improve the documentation >>> of prepare_creds, which is obviously out of sync. >>> >>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> >> >> I can't comment on the validity of the patch, but I also found and >> reported this issue in 2016[1] and the discussion quickly veered into >> the problem being more complicated (and uglier) than it seems at first >> glance. >> >> You should probably also Cc stable, given this has been a long-standing >> issue and your patch doesn't look (too) invasive. >> >> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ > > Yeah, I remember you mentioning this a while back. > > Bernd, we really want a reproducer for this sent alongside with this > patch added to: > tools/testing/selftests/ptrace/ > Having a test for this bug irrespective of whether or not we go with > this as fix seems really worth it. > I ran into this issue, because I wanted to fix an issue in the gcc testsuite, namely why it forgets to remove some temp files, so I did the following: strace -ftt -o trace.txt make check-gcc-c -k -j4 I reproduced with v4.20 and v5.5 kernel, and I don't know why but it is not happening on all systems I tested, maybe it is something that the expect program does, because, always when I try to reproduce this, the deadlock was always in "expect". I use expect version 5.45 on the computer where the above test freezes after a couple of minutes. I think the issue with strace is that it is using vm_access to get the parameters of a syscall that is going on in one thread, and that races with another thread that calls execve, and blocks the cred_guard_mutex. While Olg's test case here, will certainly not be fixed: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ he mentions the access to "anything else which needs ->cred_guard_mutex, say open(/proc/$pid/mem)", I don't know for sure how that can be done, but if that is possible, it would probably work as a test case. What do you think? Bernd. > Oleg seems to have suggested that a potential alternative fix is to wait > in de_thread() until all other threads in the thread-group have passed > exit_notiy(). Right now we only kill them but don't wait. Currently > de_thread() only waits for the thread-group leader to pass exit_notify() > whenever a non-thread-group leader thread execs (because the exec'ing > thread becomes the new thread-group leader with the same pid as the > former thread-group leader). > > Christian >
On Sun, Mar 01, 2020 at 05:46:08PM +0000, Bernd Edlinger wrote: > On 3/1/20 4:58 PM, Christian Brauner wrote: > > On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote: > >> On 2020-03-01, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > >>> This fixes a deadlock in the tracer when tracing a multi-threaded > >>> application that calls execve while more than one thread are running. > >>> > >>> I observed that when running strace on the gcc test suite, it always > >>> blocks after a while, when expect calls execve, because other threads > >>> have to be terminated. They send ptrace events, but the strace is no > >>> longer able to respond, since it is blocked in vm_access. > >>> > >>> The deadlock is always happening when strace needs to access the > >>> tracees process mmap, while another thread in the tracee starts to > >>> execve a child process, but that cannot continue until the > >>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: > >>> > >>> strace D 0 30614 30584 0x00000000 > >>> Call Trace: > >>> __schedule+0x3ce/0x6e0 > >>> schedule+0x5c/0xd0 > >>> schedule_preempt_disabled+0x15/0x20 > >>> __mutex_lock.isra.13+0x1ec/0x520 > >>> __mutex_lock_killable_slowpath+0x13/0x20 > >>> mutex_lock_killable+0x28/0x30 > >>> mm_access+0x27/0xa0 > >>> process_vm_rw_core.isra.3+0xff/0x550 > >>> process_vm_rw+0xdd/0xf0 > >>> __x64_sys_process_vm_readv+0x31/0x40 > >>> do_syscall_64+0x64/0x220 > >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9 > >>> > >>> expect D 0 31933 30876 0x80004003 > >>> Call Trace: > >>> __schedule+0x3ce/0x6e0 > >>> schedule+0x5c/0xd0 > >>> flush_old_exec+0xc4/0x770 > >>> load_elf_binary+0x35a/0x16c0 > >>> search_binary_handler+0x97/0x1d0 > >>> __do_execve_file.isra.40+0x5d4/0x8a0 > >>> __x64_sys_execve+0x49/0x60 > >>> do_syscall_64+0x64/0x220 > >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9 > >>> > >>> The proposed solution is to have a second mutex that is > >>> used in mm_access, so it is allowed to continue while the > >>> dying threads are not yet terminated. > >>> > >>> I also took the opportunity to improve the documentation > >>> of prepare_creds, which is obviously out of sync. > >>> > >>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> > >> > >> I can't comment on the validity of the patch, but I also found and > >> reported this issue in 2016[1] and the discussion quickly veered into > >> the problem being more complicated (and uglier) than it seems at first > >> glance. > >> > >> You should probably also Cc stable, given this has been a long-standing > >> issue and your patch doesn't look (too) invasive. > >> > >> [1]: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ > > > > Yeah, I remember you mentioning this a while back. > > > > Bernd, we really want a reproducer for this sent alongside with this > > patch added to: > > tools/testing/selftests/ptrace/ > > Having a test for this bug irrespective of whether or not we go with > > this as fix seems really worth it. > > > > I ran into this issue, because I wanted to fix an issue in the gcc testsuite, > namely why it forgets to remove some temp files, > so I did the following: > > strace -ftt -o trace.txt make check-gcc-c -k -j4 > > I reproduced with v4.20 and v5.5 kernel, and I don't know why but it is > not happening on all systems I tested, maybe it is something that the expect program > does, because, always when I try to reproduce this, the deadlock was always in "expect". > > I use expect version 5.45 on the computer where the above test freezes after > a couple of minutes. > > I think the issue with strace is that it is using vm_access to get the parameters > of a syscall that is going on in one thread, and that races with another thread > that calls execve, and blocks the cred_guard_mutex. > > While Olg's test case here, will certainly not be fixed: > > https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ > > he mentions the access to "anything else which needs ->cred_guard_mutex, > say open(/proc/$pid/mem)", I don't know for sure how that can be done, but if > that is possible, it would probably work as a test case. > > What do you think? Yeah, anything that calls ptrace_may_access() is fine and open(/proc/$pid/mem) will work so long as $pid is not in the same thread-group as the caller. A polished version of the reproducer you linked in would probably be good.
On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > The proposed solution is to have a second mutex that is > used in mm_access, so it is allowed to continue while the > dying threads are not yet terminated. Just for context: When I proposed something similar back in 2016, https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ was the resulting discussion thread. At least back then, I looked through the various existing users of cred_guard_mutex, and the only places that couldn't be converted to the new second mutex were PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC. The ideal solution would IMO be something like this: Decide what the new task's credentials should be *before* reaching de_thread(), install them into a second cred* on the task (together with the new dumpability), drop the cred_guard_mutex, and let ptrace_may_access() check against both. After that, some further restructuring might even allow the cred_guard_mutex to not be held across all of the VFS operations that happen early on in execve, which may block indefinitely. But that would be pretty complicated, so I think your proposed solution makes sense for now, given that nobody has managed to implement anything better in the last few years.
On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote: > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger > <bernd.edlinger@hotmail.de> wrote: > > The proposed solution is to have a second mutex that is > > used in mm_access, so it is allowed to continue while the > > dying threads are not yet terminated. > > Just for context: When I proposed something similar back in 2016, > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ > was the resulting discussion thread. At least back then, I looked > through the various existing users of cred_guard_mutex, and the only > places that couldn't be converted to the new second mutex were > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC. > > > The ideal solution would IMO be something like this: Decide what the > new task's credentials should be *before* reaching de_thread(), > install them into a second cred* on the task (together with the new > dumpability), drop the cred_guard_mutex, and let ptrace_may_access() > check against both. After that, some further restructuring might even Hm, so essentially a private ptrace_access_cred member in task_struct? That would presumably also involve altering various LSM hooks to look at ptrace_access_cred. (Minor side-note, de_thread() takes a struct task_struct argument but only ever is passed current.) > allow the cred_guard_mutex to not be held across all of the VFS > operations that happen early on in execve, which may block > indefinitely. But that would be pretty complicated, so I think your > proposed solution makes sense for now, given that nobody has managed > to implement anything better in the last few years. Reading through the old threads and how often this issue came up, I tend to agree.
On 3/1/20 7:52 PM, Christian Brauner wrote: > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote: >> On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger >> <bernd.edlinger@hotmail.de> wrote: >>> The proposed solution is to have a second mutex that is >>> used in mm_access, so it is allowed to continue while the >>> dying threads are not yet terminated. >> >> Just for context: When I proposed something similar back in 2016, >> https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ >> was the resulting discussion thread. At least back then, I looked >> through the various existing users of cred_guard_mutex, and the only >> places that couldn't be converted to the new second mutex were >> PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC. >> >> >> The ideal solution would IMO be something like this: Decide what the >> new task's credentials should be *before* reaching de_thread(), >> install them into a second cred* on the task (together with the new >> dumpability), drop the cred_guard_mutex, and let ptrace_may_access() >> check against both. After that, some further restructuring might even > > Hm, so essentially a private ptrace_access_cred member in task_struct? > That would presumably also involve altering various LSM hooks to look at > ptrace_access_cred. > > (Minor side-note, de_thread() takes a struct task_struct argument but > only ever is passed current.) > >> allow the cred_guard_mutex to not be held across all of the VFS >> operations that happen early on in execve, which may block >> indefinitely. But that would be pretty complicated, so I think your >> proposed solution makes sense for now, given that nobody has managed >> to implement anything better in the last few years. > > Reading through the old threads and how often this issue came up, I tend > to agree. > Okay, fine. I managed to change Oleg's test case, into one that shows what exactly is changed with this patch: $ cat t.c #include <stdio.h> #include <fcntl.h> #include <unistd.h> #include <pthread.h> #include <sys/signal.h> #include <sys/ptrace.h> void *thread(void *arg) { ptrace(PTRACE_TRACEME, 0,0,0); return NULL; } int main(void) { int f, pid = fork(); char mm[64]; if (!pid) { pthread_t pt; pthread_create(&pt, NULL, thread, NULL); pthread_join(pt, NULL); execlp("echo", "echo", "passed", NULL); } sleep(1); sprintf(mm, "/proc/%d/mem", pid); printf("open(%s)\n", mm); f = open(mm, O_RDONLY); printf("f = %d\n", f); // this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0); kill(pid, SIGCONT); if (f >= 0) close(f); return 0; } $ gcc -pthread -Wall t.c $ ./a.out open(/proc/2802/mem) f = 3 $ passed previously this did block, how can I make a test case for this? I am not so experienced in this matter. Thanks Bernd.
On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner <christian.brauner@ubuntu.com> wrote: > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote: > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger > > <bernd.edlinger@hotmail.de> wrote: > > > The proposed solution is to have a second mutex that is > > > used in mm_access, so it is allowed to continue while the > > > dying threads are not yet terminated. > > > > Just for context: When I proposed something similar back in 2016, > > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ > > was the resulting discussion thread. At least back then, I looked > > through the various existing users of cred_guard_mutex, and the only > > places that couldn't be converted to the new second mutex were > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC. > > > > > > The ideal solution would IMO be something like this: Decide what the > > new task's credentials should be *before* reaching de_thread(), > > install them into a second cred* on the task (together with the new > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access() > > check against both. After that, some further restructuring might even > > Hm, so essentially a private ptrace_access_cred member in task_struct? And a second dumpability field, because that changes together with the creds during execve. (Btw, currently the dumpability is in the mm_struct, but that's kinda wrong. The mm_struct is removed from a task on exit while access checks can still be performed against it, and currently ptrace_may_access() just lets the access go through in that case, which weakens the protection offered by PR_SET_DUMPABLE when used for security purposes. I think it ought to be moved over into the task_struct.) > That would presumably also involve altering various LSM hooks to look at > ptrace_access_cred. When I tried to implement this in the past, I changed the LSM hook to take the target task's cred* as an argument, and then called the LSM hook twice from ptrace_may_access(). IIRC having the target task's creds as an argument works for almost all the LSMs, with the exception of Yama, which doesn't really care about the target task's creds, so you have to pass in both the task_struct* and the cred*.
On Sun, Mar 01, 2020 at 09:00:22PM +0100, Jann Horn wrote: > On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner > <christian.brauner@ubuntu.com> wrote: > > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote: > > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger > > > <bernd.edlinger@hotmail.de> wrote: > > > > The proposed solution is to have a second mutex that is > > > > used in mm_access, so it is allowed to continue while the > > > > dying threads are not yet terminated. > > > > > > Just for context: When I proposed something similar back in 2016, > > > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ > > > was the resulting discussion thread. At least back then, I looked > > > through the various existing users of cred_guard_mutex, and the only > > > places that couldn't be converted to the new second mutex were > > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC. > > > > > > > > > The ideal solution would IMO be something like this: Decide what the > > > new task's credentials should be *before* reaching de_thread(), > > > install them into a second cred* on the task (together with the new > > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access() > > > check against both. After that, some further restructuring might even > > > > Hm, so essentially a private ptrace_access_cred member in task_struct? > > And a second dumpability field, because that changes together with the > creds during execve. (Btw, currently the dumpability is in the > mm_struct, but that's kinda wrong. The mm_struct is removed from a > task on exit while access checks can still be performed against it, and > currently ptrace_may_access() just lets the access go through in that > case, which weakens the protection offered by PR_SET_DUMPABLE when > used for security purposes. I think it ought to be moved over into the > task_struct.) > > > That would presumably also involve altering various LSM hooks to look at > > ptrace_access_cred. > > When I tried to implement this in the past, I changed the LSM hook to > take the target task's cred* as an argument, and then called the LSM > hook twice from ptrace_may_access(). IIRC having the target task's > creds as an argument works for almost all the LSMs, with the exception > of Yama, which doesn't really care about the target task's creds, so > you have to pass in both the task_struct* and the cred*. It seems we should try PoCing this. Christian
On Mon, Mar 02, 2020 at 08:47:53AM +0100, Christian Brauner wrote: > On Sun, Mar 01, 2020 at 09:00:22PM +0100, Jann Horn wrote: > > On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner > > <christian.brauner@ubuntu.com> wrote: > > > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote: > > > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger > > > > <bernd.edlinger@hotmail.de> wrote: > > > > > The proposed solution is to have a second mutex that is > > > > > used in mm_access, so it is allowed to continue while the > > > > > dying threads are not yet terminated. > > > > > > > > Just for context: When I proposed something similar back in 2016, > > > > https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ > > > > was the resulting discussion thread. At least back then, I looked > > > > through the various existing users of cred_guard_mutex, and the only > > > > places that couldn't be converted to the new second mutex were > > > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC. > > > > > > > > > > > > The ideal solution would IMO be something like this: Decide what the > > > > new task's credentials should be *before* reaching de_thread(), > > > > install them into a second cred* on the task (together with the new > > > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access() > > > > check against both. After that, some further restructuring might even > > > > > > Hm, so essentially a private ptrace_access_cred member in task_struct? > > > > And a second dumpability field, because that changes together with the > > creds during execve. (Btw, currently the dumpability is in the > > mm_struct, but that's kinda wrong. The mm_struct is removed from a > > task on exit while access checks can still be performed against it, and > > currently ptrace_may_access() just lets the access go through in that > > case, which weakens the protection offered by PR_SET_DUMPABLE when > > used for security purposes. I think it ought to be moved over into the > > task_struct.) > > > > > That would presumably also involve altering various LSM hooks to look at > > > ptrace_access_cred. > > > > When I tried to implement this in the past, I changed the LSM hook to > > take the target task's cred* as an argument, and then called the LSM > > hook twice from ptrace_may_access(). IIRC having the target task's > > creds as an argument works for almost all the LSMs, with the exception > > of Yama, which doesn't really care about the target task's creds, so > > you have to pass in both the task_struct* and the cred*. > > It seems we should try PoCing this. Independent of the fix for Bernd's issue that is.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst index 282e79f..c98e0a8 100644 --- a/Documentation/security/credentials.rst +++ b/Documentation/security/credentials.rst @@ -437,9 +437,13 @@ new set of credentials by calling:: struct cred *prepare_creds(void); -this locks current->cred_replace_mutex and then allocates and constructs a -duplicate of the current process's credentials, returning with the mutex still -held if successful. It returns NULL if not successful (out of memory). +this allocates and constructs a duplicate of the current process's credentials. +It returns NULL if not successful (out of memory). + +If called from __do_execve_file, the mutex current->signal->cred_guard_mutex +is acquired before this function gets called, and the mutex +current->signal->cred_change_mutex is acquired later, while the credentials +and the process mmap are actually changed. The mutex prevents ``ptrace()`` from altering the ptrace state of a process while security checks on credentials construction and changing is taking place @@ -466,9 +470,8 @@ by calling:: This will alter various aspects of the credentials and the process, giving the LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to -actually commit the new credentials to ``current->cred``, it will release -``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it -will notify the scheduler and others of the changes. +actually commit the new credentials to ``current->cred``, and it will notify +the scheduler and others of the changes. This function is guaranteed to return 0, so that it can be tail-called at the end of such functions as ``sys_setresuid()``. @@ -486,8 +489,7 @@ invoked:: void abort_creds(struct cred *new); -This releases the lock on ``current->cred_replace_mutex`` that -``prepare_creds()`` got and then releases the new credentials. +This releases the new credentials. A typical credentials alteration function would look something like this:: diff --git a/fs/exec.c b/fs/exec.c index 74d88da..a6884e4 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm) if (retval) goto out; + retval = mutex_lock_killable(¤t->signal->cred_change_mutex); + if (retval) + goto out; + + bprm->called_flush_old_exec = 1; + /* * Must be called _before_ exec_mmap() as bprm->mm is * not visibile until then. This also enables the update @@ -1420,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm) { free_arg_pages(bprm); if (bprm->cred) { + if (bprm->called_flush_old_exec) + mutex_unlock(¤t->signal->cred_change_mutex); mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1469,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm) * credentials; any time after this it may be unlocked. */ security_bprm_committed_creds(bprm); + mutex_unlock(¤t->signal->cred_change_mutex); mutex_unlock(¤t->signal->cred_guard_mutex); } EXPORT_SYMBOL(install_exec_creds); diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index b40fc63..2e1318b 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -44,7 +44,11 @@ struct linux_binprm { * exec has happened. Used to sanitize execution environment * and to set AT_SECURE auxv for glibc. */ - secureexec:1; + secureexec:1, + /* + * Set by flush_old_exec, when the cred_change_mutex is taken. + */ + called_flush_old_exec:1; #ifdef __alpha__ unsigned int taso:1; #endif diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 8805025..37eeabe 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -225,6 +225,7 @@ struct signal_struct { struct mutex cred_guard_mutex; /* guard against foreign influences on * credential calculations * (notably. ptrace) */ + struct mutex cred_change_mutex; /* guard against credentials change */ } __randomize_layout; /* diff --git a/init/init_task.c b/init/init_task.c index 9e5cbe5..6cd9a0f 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -26,6 +26,7 @@ .multiprocess = HLIST_HEAD_INIT, .rlim = INIT_RLIMITS, .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), + .cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex), #ifdef CONFIG_POSIX_TIMERS .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers), .cputimer = { diff --git a/kernel/cred.c b/kernel/cred.c index 809a985..e4c78de 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -676,7 +676,7 @@ void __init cred_init(void) * * Returns the new credentials or NULL if out of memory. * - * Does not take, and does not return holding current->cred_replace_mutex. + * Does not take, and does not return holding ->cred_guard_mutex. */ struct cred *prepare_kernel_cred(struct task_struct *daemon) { diff --git a/kernel/fork.c b/kernel/fork.c index 0808095..0395154 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode) struct mm_struct *mm; int err; - err = mutex_lock_killable(&task->signal->cred_guard_mutex); + err = mutex_lock_killable(&task->signal->cred_change_mutex); if (err) return ERR_PTR(err); @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode) mmput(mm); mm = ERR_PTR(-EACCES); } - mutex_unlock(&task->signal->cred_guard_mutex); + mutex_unlock(&task->signal->cred_change_mutex); return mm; } @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->oom_score_adj_min = current->signal->oom_score_adj_min; mutex_init(&sig->cred_guard_mutex); + mutex_init(&sig->cred_change_mutex); return 0; } diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 357aa7b..b3e6eb5 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter, if (!mm || IS_ERR(mm)) { rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; /* - * Explicitly map EACCES to EPERM as EPERM is a more a + * Explicitly map EACCES to EPERM as EPERM is a more * appropriate error code for process_vw_readv/writev */ if (rc == -EACCES)
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running. I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access. The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The proposed solution is to have a second mutex that is used in mm_access, so it is allowed to continue while the dying threads are not yet terminated. I also took the opportunity to improve the documentation of prepare_creds, which is obviously out of sync. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> --- Documentation/security/credentials.rst | 18 ++++++++++-------- fs/exec.c | 9 +++++++++ include/linux/binfmts.h | 6 +++++- include/linux/sched/signal.h | 1 + init/init_task.c | 1 + kernel/cred.c | 2 +- kernel/fork.c | 5 +++-- mm/process_vm_access.c | 2 +- 8 files changed, 31 insertions(+), 13 deletions(-)