Message ID | 1477863998-3298-2-git-send-email-jann@thejh.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 10/30, Jann Horn wrote: > > This is a new per-threadgroup lock that can often be taken instead of > cred_guard_mutex and has less deadlock potential. I'm doing this because > Oleg Nesterov mentioned the potential for deadlocks, in particular if a > debugged task is stuck in execve, trying to get rid of a ptrace-stopped > thread, and the debugger attempts to inspect procfs files of the debugged > task. Yes, but let me repeat that we need to fix this anyway. So I don't really understand why should we add yet another mutex. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote: > On 10/30, Jann Horn wrote: > > > > This is a new per-threadgroup lock that can often be taken instead of > > cred_guard_mutex and has less deadlock potential. I'm doing this because > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped > > thread, and the debugger attempts to inspect procfs files of the debugged > > task. > > Yes, but let me repeat that we need to fix this anyway. So I don't really > understand why should we add yet another mutex. execve() only takes the new mutex immediately after de_thread(), so this problem shouldn't occur there. Basically, I think that I'm not making the problem worse with my patches this way. I believe that it should be possible to convert most existing users of the cred_guard_mutex to the new cred_guard_light - exceptions to that that I see are: - PTRACE_ATTACH - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task) Beyond that, conceptually, the new cred_guard_light could also be turned into a read-write mutex to prevent deadlocks between its users (where execve would take it for writing and everyone else would take it for reading), but afaik the kernel doesn't have an implementation of read-write mutexes yet? cred_guard_light would mean that you could theoretically still create deadlocks, but afaics only if you do things like trying to read /proc/$pid/mem in the FUSE read handler for the file that is currently being executed - and in that case, I think it's okay to have a killable deadlock. Do you think that, if (apart from execve) only PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC remain as users of cred_guard_mutex and everything else used my new cred_guard_light, that would be sufficient to fix the races you are concerned about? It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have deadlocking issues. PTRACE_ATTACH isn't that clear to me; if a debugger tries to attach to a newly spawned thread while another ptraced thread is dying because of de_thread() in a third thread, that might still cause the debugger to deadlock, right? The problem with PTRACE_ATTACH is basically that bprm->unsafe is used in the bprm_set_creds LSM hook, so it needs to have been calculated when that hook is executed. (Also in the bprm_secureexec hook, but that one happens after install_exec_creds(), so that's unproblematic.) security_bprm_set_creds() is called in prepare_binprm(), which is executed very early in do_execveat_common(), at a point where failures should still be graceful (return an error code instead of killing the whole process), and therefore other threads can still run and debuggers can still attach. The LSM hooks that execute at that point e.g. inspect and modify bprm->cred, and they can still cleanly prohibit execution. E.g. SELinux does this - it can cancel execution with errors like -EPERM and -EACCES. AFAICS the hard case is: - Multithreaded process with tasks A and B is running. - Task C attaches to B via ptrace. - Task A calls execve(), takes the mutex, reaches de_thread(), kills task B. - Task C tries to attach to A, tries to take the mutex again, deadlock. I'm not sure whether it'd be possible to get rid of the deadlock for PTRACE_ATTACH without ABI changes, and I would be surprised if it was doable without nontrivial additional logic.
On Wed, 2016-11-02 at 21:50 +0100, Jann Horn wrote: > On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote: > > On 10/30, Jann Horn wrote: > > > > > > This is a new per-threadgroup lock that can often be taken instead of > > > cred_guard_mutex and has less deadlock potential. I'm doing this because > > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a > > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped > > > thread, and the debugger attempts to inspect procfs files of the debugged > > > task. > > > > > > Yes, but let me repeat that we need to fix this anyway. So I don't really > > understand why should we add yet another mutex. > > > execve() only takes the new mutex immediately after de_thread(), so this > problem shouldn't occur there. Basically, I think that I'm not making the > problem worse with my patches this way. > > I believe that it should be possible to convert most existing users of the > cred_guard_mutex to the new cred_guard_light - exceptions to that that I > see are: > > - PTRACE_ATTACH > - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task) > > Beyond that, conceptually, the new cred_guard_light could also be turned > into a read-write mutex to prevent deadlocks between its users (where > execve would take it for writing and everyone else would take it for > reading), but afaik the kernel doesn't have an implementation of > read-write mutexes yet? [...] It does (rw_semaphore), but with PREEMPT_RT enabled they become simple mutexes. So I don't think we should rely on that to avoid deadlock. Ben.
On Wed, Nov 02, 2016 at 03:38:53PM -0600, Ben Hutchings wrote: > On Wed, 2016-11-02 at 21:50 +0100, Jann Horn wrote: > > On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote: > > > On 10/30, Jann Horn wrote: > > > > > > > > This is a new per-threadgroup lock that can often be taken instead of > > > > cred_guard_mutex and has less deadlock potential. I'm doing this because > > > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a > > > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped > > > > thread, and the debugger attempts to inspect procfs files of the debugged > > > > task. > > > > > > > > > Yes, but let me repeat that we need to fix this anyway. So I don't really > > > understand why should we add yet another mutex. > > > > > > execve() only takes the new mutex immediately after de_thread(), so this > > problem shouldn't occur there. Basically, I think that I'm not making the > > problem worse with my patches this way. > > > > I believe that it should be possible to convert most existing users of the > > cred_guard_mutex to the new cred_guard_light - exceptions to that that I > > see are: > > > > - PTRACE_ATTACH > > - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task) > > > > Beyond that, conceptually, the new cred_guard_light could also be turned > > into a read-write mutex to prevent deadlocks between its users (where > > execve would take it for writing and everyone else would take it for > > reading), but afaik the kernel doesn't have an implementation of > > read-write mutexes yet? > [...] > > It does (rw_semaphore) Aren't those spinlocks? I don't think those would be appropriate here, considering that the cred_guard_light can be held during filesystem access during execve(). > but with PREEMPT_RT enabled they become simple > mutexes. So I don't think we should rely on that to avoid deadlock. Well, I don't think it's really necessary - as far as I can tell, the only locking operations that would be executed with the cred_guard_light held would be taking the sighand lock, filesystem stuff and LSM stuff.
On 11/02, Jann Horn wrote: > > On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote: > > On 10/30, Jann Horn wrote: > > > > > > This is a new per-threadgroup lock that can often be taken instead of > > > cred_guard_mutex and has less deadlock potential. I'm doing this because > > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a > > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped > > > thread, and the debugger attempts to inspect procfs files of the debugged > > > task. > > > > Yes, but let me repeat that we need to fix this anyway. So I don't really > > understand why should we add yet another mutex. > > execve() only takes the new mutex immediately after de_thread(), so this > problem shouldn't occur there. Yes, I see. > Basically, I think that I'm not making the > problem worse with my patches this way. In a sense that it doesn't add the new deadlocks, I agree. But it adds yet another per-process mutex while we already have the similar one, > I believe that it should be possible to convert most existing users of the > cred_guard_mutex to the new cred_guard_light - exceptions to that that I > see are: > > - PTRACE_ATTACH This is the main problem afaics. So "strace -f" can hang if it races with mt-exec. And we need to fix this. I constantly forget about this problem, but I tried many times to find a reasonable solution, still can't. IMO, it would be nice to rework the lsm hooks, so that we could take cred_guard_mutex after de_thread() (like your cred_guard_light) or at least drop it earlier, but unlikely this is possible... So the only plan I currently have is change de_thread() to wait until other threads pass exit_notify() or even exit_signals(), but I don't like this. > - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task) I forgot about this one... Need to re-check but at first glance this is not a real problem. > Beyond that, conceptually, the new cred_guard_light could also be turned > into a read-write mutex Not sure I understand how this can help... doesn't matter. My point is, imo you should not add the new mutex. Just use the old one in (say) 4/8 (which I do not personally like as you know ;), this won't add the new problem. > It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have > deadlocking issues. Yes, agreed. > PTRACE_ATTACH isn't that clear to me; if a debugger > tries to attach to a newly spawned thread while another ptraced thread is > dying because of de_thread() in a third thread, that might still cause > the debugger to deadlock, right? This is the trivial test-case I wrote when the problem was initially reported. And damn, I always knew that cred_guard_mutex needs fixes, but somehow I completely forgot that it is used by PTRACE_ATTACH when I was going to try to remove from fs/proc a long ago. void *thread(void *arg) { ptrace(PTRACE_TRACEME, 0,0,0); return NULL; } int main(void) { int pid = fork(); if (!pid) { pthread_t pt; pthread_create(&pt, NULL, thread, NULL); pthread_join(pt, NULL); execlp("echo", "echo", "passed", NULL); } sleep(1); // or anything else which needs ->cred_guard_mutex, // say open(/proc/$pid/mem) ptrace(PTRACE_ATTACH, pid, 0,0); kill(pid, SIGCONT); return 0; } The problem is trivial. The execing thread waits until its sub-thread goes away, it should be reaped by the tracer, the tracer waits for cred_guard_mutex. > security_bprm_set_creds() is called in prepare_binprm(), which is > executed very early in do_execveat_common(), at a point where failures > should still be graceful (return an error code instead of killing the > whole process), Yes, yes. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 03, 2016 at 07:12:25PM +0100, Oleg Nesterov wrote: > On 11/02, Jann Horn wrote: > > > > On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote: > > > On 10/30, Jann Horn wrote: [...] > > I believe that it should be possible to convert most existing users of the > > cred_guard_mutex to the new cred_guard_light - exceptions to that that I > > see are: > > > > - PTRACE_ATTACH > > This is the main problem afaics. So "strace -f" can hang if it races > with mt-exec. And we need to fix this. I constantly forget about this > problem, but I tried many times to find a reasonable solution, still > can't. Ah, okay, it wasn't clear to me that you consider the race with PTRACE_ATTACH to be a similarly big problem as the other ones. > IMO, it would be nice to rework the lsm hooks, so that we could take > cred_guard_mutex after de_thread() (like your cred_guard_light) or > at least drop it earlier, but unlikely this is possible... An idea: Maybe we can change the LSM hook so that, immediately before de_thread(), the LSMs can handle the execve() based on the current state and report the circumstances under which they would deny the execution or treat it differently. Then, during de_thread(), we can immediately reject any access that would have changed the execution and immediately permit any access that wouldn't have. This could theoretically cause userland to see spurious permission denials, but only if an LSM has an inconsistent security policy where some access degrades execution although it would have been permitted after a normal execution. Does that make sense? > So the only plan I currently have is change de_thread() to wait until > other threads pass exit_notify() or even exit_signals(), but I don't > like this. [...] > My point is, imo you should not add the new mutex. Just use the old > one in (say) 4/8 (which I do not personally like as you know ;), this > won't add the new problem. Okay, I'll do that.
diff --git a/fs/exec.c b/fs/exec.c index 4e497b9ee71e..67b76cb319d8 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1243,6 +1243,10 @@ int flush_old_exec(struct linux_binprm * bprm) if (retval) goto out; + retval = mutex_lock_killable(¤t->signal->cred_guard_light); + if (retval) + goto out; + /* * Must be called _before_ exec_mmap() as bprm->mm is * not visibile until then. This also enables the update @@ -1256,7 +1260,7 @@ int flush_old_exec(struct linux_binprm * bprm) acct_arg_size(bprm, 0); retval = exec_mmap(bprm->mm); if (retval) - goto out; + goto out_unlock; bprm->mm = NULL; /* We're using it now */ @@ -1268,6 +1272,8 @@ int flush_old_exec(struct linux_binprm * bprm) return 0; +out_unlock: + mutex_unlock(¤t->signal->cred_guard_light); out: return retval; } @@ -1391,6 +1397,7 @@ void install_exec_creds(struct linux_binprm *bprm) * credentials; any time after this it may be unlocked. */ security_bprm_committed_creds(bprm); + mutex_unlock(¤t->signal->cred_guard_light); mutex_unlock(¤t->signal->cred_guard_mutex); } EXPORT_SYMBOL(install_exec_creds); @@ -1758,6 +1765,12 @@ static int do_execveat_common(int fd, struct filename *filename, return retval; out: + if (!bprm->mm && bprm->cred) { + /* failure after flush_old_exec(), but before + * install_exec_creds() + */ + mutex_unlock(¤t->signal->cred_guard_light); + } if (bprm->mm) { acct_arg_size(bprm, 0); mmput(bprm->mm); diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 325f649d77ff..c6819468e79a 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -60,6 +60,7 @@ extern struct fs_struct init_fs; INIT_PREV_CPUTIME(sig) \ .cred_guard_mutex = \ __MUTEX_INITIALIZER(sig.cred_guard_mutex), \ + .cred_guard_light = __MUTEX_INITIALIZER(sig.cred_guard_light) \ } extern struct nsproxy init_nsproxy; diff --git a/include/linux/sched.h b/include/linux/sched.h index 348f51b0ec92..0ccb379895b3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -812,6 +812,16 @@ struct signal_struct { struct mutex cred_guard_mutex; /* guard against foreign influences on * credential calculations * (notably. ptrace) */ + /* + * Lightweight version of cred_guard_mutex; used to prevent race + * conditions where a user can gain information about the post-execve + * state of a task to which access should only be granted pre-execve. + * Hold this mutex while performing remote task inspection associated + * with a security check. + * This mutex MUST NOT be used in cases where anything changes about + * the security properties of a running execve(). + */ + struct mutex cred_guard_light; }; /* diff --git a/kernel/fork.c b/kernel/fork.c index 623259fc794d..d0e1d6fa4d00 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1361,6 +1361,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) current->signal->is_child_subreaper; mutex_init(&sig->cred_guard_mutex); + mutex_init(&sig->cred_guard_light); return 0; } diff --git a/kernel/ptrace.c b/kernel/ptrace.c index e6474f7272ec..c3312e9e0078 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -285,6 +285,16 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode) return security_ptrace_access_check(task, mode); } +/* + * NOTE: When you call this function, you need to ensure that the target task + * can't acquire (via setuid execve) credentials between the ptrace access + * check and the privileged access. The recommended way to do this is to hold + * one of task->signal->{cred_guard_mutex,cred_guard_light} while calling this + * function and performing the requested access. + * + * This function may only be used if access is requested in the name of + * current_cred(). + */ bool ptrace_may_access(struct task_struct *task, unsigned int mode) { int err;
This is a new per-threadgroup lock that can often be taken instead of cred_guard_mutex and has less deadlock potential. I'm doing this because Oleg Nesterov mentioned the potential for deadlocks, in particular if a debugged task is stuck in execve, trying to get rid of a ptrace-stopped thread, and the debugger attempts to inspect procfs files of the debugged task. The binfmt handlers (in particular for elf_fdpic and flat) might still call VFS read and mmap operations on the binary with the lock held, but not open operations (as is the case with cred_guard_mutex). An rwlock would be more appropriate here, but apparently those don't have _killable variants of the locking functions? This is a preparation patch for using proper locking in more places. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jann Horn <jann@thejh.net> --- fs/exec.c | 15 ++++++++++++++- include/linux/init_task.h | 1 + include/linux/sched.h | 10 ++++++++++ kernel/fork.c | 1 + kernel/ptrace.c | 10 ++++++++++ 5 files changed, 36 insertions(+), 1 deletion(-)