diff mbox

[v3,1/8] exec: introduce cred_guard_light

Message ID 1477863998-3298-2-git-send-email-jann@thejh.net (mailing list archive)
State New, archived
Headers show

Commit Message

Jann Horn Oct. 30, 2016, 9:46 p.m. UTC
This is a new per-threadgroup lock that can often be taken instead of
cred_guard_mutex and has less deadlock potential. I'm doing this because
Oleg Nesterov mentioned the potential for deadlocks, in particular if a
debugged task is stuck in execve, trying to get rid of a ptrace-stopped
thread, and the debugger attempts to inspect procfs files of the debugged
task.

The binfmt handlers (in particular for elf_fdpic and flat) might still
call VFS read and mmap operations on the binary with the lock held, but
not open operations (as is the case with cred_guard_mutex).

An rwlock would be more appropriate here, but apparently those don't
have _killable variants of the locking functions?

This is a preparation patch for using proper locking in more places.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/exec.c                 | 15 ++++++++++++++-
 include/linux/init_task.h |  1 +
 include/linux/sched.h     | 10 ++++++++++
 kernel/fork.c             |  1 +
 kernel/ptrace.c           | 10 ++++++++++
 5 files changed, 36 insertions(+), 1 deletion(-)

Comments

Oleg Nesterov Nov. 2, 2016, 6:18 p.m. UTC | #1
On 10/30, Jann Horn wrote:
>
> This is a new per-threadgroup lock that can often be taken instead of
> cred_guard_mutex and has less deadlock potential. I'm doing this because
> Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> thread, and the debugger attempts to inspect procfs files of the debugged
> task.

Yes, but let me repeat that we need to fix this anyway. So I don't really
understand why should we add yet another mutex.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jann Horn Nov. 2, 2016, 8:50 p.m. UTC | #2
On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> On 10/30, Jann Horn wrote:
> >
> > This is a new per-threadgroup lock that can often be taken instead of
> > cred_guard_mutex and has less deadlock potential. I'm doing this because
> > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > thread, and the debugger attempts to inspect procfs files of the debugged
> > task.
> 
> Yes, but let me repeat that we need to fix this anyway. So I don't really
> understand why should we add yet another mutex.

execve() only takes the new mutex immediately after de_thread(), so this
problem shouldn't occur there. Basically, I think that I'm not making the
problem worse with my patches this way.

I believe that it should be possible to convert most existing users of the
cred_guard_mutex to the new cred_guard_light - exceptions to that that I
see are:

 - PTRACE_ATTACH
 - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)

Beyond that, conceptually, the new cred_guard_light could also be turned
into a read-write mutex to prevent deadlocks between its users (where
execve would take it for writing and everyone else would take it for
reading), but afaik the kernel doesn't have an implementation of
read-write mutexes yet?

cred_guard_light would mean that you could theoretically still create
deadlocks, but afaics only if you do things like trying to read
/proc/$pid/mem in the FUSE read handler for the file that is currently
being executed - and in that case, I think it's okay to have a killable
deadlock.

Do you think that, if (apart from execve) only PTRACE_ATTACH and
SECCOMP_FILTER_FLAG_TSYNC remain as users of cred_guard_mutex and
everything else used my new cred_guard_light, that would be sufficient to
fix the races you are concerned about?

It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
deadlocking issues. PTRACE_ATTACH isn't that clear to me; if a debugger
tries to attach to a newly spawned thread while another ptraced thread is
dying because of de_thread() in a third thread, that might still cause
the debugger to deadlock, right?

The problem with PTRACE_ATTACH is basically that bprm->unsafe is used
in the bprm_set_creds LSM hook, so it needs to have been calculated when
that hook is executed. (Also in the bprm_secureexec hook, but that
one happens after install_exec_creds(), so that's unproblematic.)
security_bprm_set_creds() is called in prepare_binprm(), which is
executed very early in do_execveat_common(), at a point where failures
should still be graceful (return an error code instead of killing the
whole process), and therefore other threads can still run and debuggers
can still attach.

The LSM hooks that execute at that point e.g. inspect and modify
bprm->cred, and they can still cleanly prohibit execution. E.g. SELinux
does this - it can cancel execution with errors like -EPERM and -EACCES.

AFAICS the hard case is:

 - Multithreaded process with tasks A and B is running.
 - Task C attaches to B via ptrace.
 - Task A calls execve(), takes the mutex, reaches de_thread(), kills
   task B.
 - Task C tries to attach to A, tries to take the mutex again,
   deadlock.

I'm not sure whether it'd be possible to get rid of the deadlock
for PTRACE_ATTACH without ABI changes, and I would be surprised if it
was doable without nontrivial additional logic.
Ben Hutchings Nov. 2, 2016, 9:38 p.m. UTC | #3
On Wed, 2016-11-02 at 21:50 +0100, Jann Horn wrote:
> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> > On 10/30, Jann Horn wrote:
> > > 
> > > This is a new per-threadgroup lock that can often be taken instead of
> > > cred_guard_mutex and has less deadlock potential. I'm doing this because
> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > > thread, and the debugger attempts to inspect procfs files of the debugged
> > > task.
> > 
> > 
> > Yes, but let me repeat that we need to fix this anyway. So I don't really
> > understand why should we add yet another mutex.
> 
> 
> execve() only takes the new mutex immediately after de_thread(), so this
> problem shouldn't occur there. Basically, I think that I'm not making the
> problem worse with my patches this way.
> 
> I believe that it should be possible to convert most existing users of the
> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
> see are:
> 
>  - PTRACE_ATTACH
>  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
> 
> Beyond that, conceptually, the new cred_guard_light could also be turned
> into a read-write mutex to prevent deadlocks between its users (where
> execve would take it for writing and everyone else would take it for
> reading), but afaik the kernel doesn't have an implementation of
> read-write mutexes yet?
[...]

It does (rw_semaphore), but with PREEMPT_RT enabled they become simple
mutexes.  So I don't think we should rely on that to avoid deadlock.

Ben.
Jann Horn Nov. 2, 2016, 9:54 p.m. UTC | #4
On Wed, Nov 02, 2016 at 03:38:53PM -0600, Ben Hutchings wrote:
> On Wed, 2016-11-02 at 21:50 +0100, Jann Horn wrote:
> > On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> > > On 10/30, Jann Horn wrote:
> > > > 
> > > > This is a new per-threadgroup lock that can often be taken instead of
> > > > cred_guard_mutex and has less deadlock potential. I'm doing this because
> > > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > > > thread, and the debugger attempts to inspect procfs files of the debugged
> > > > task.
> > > 
> > > 
> > > Yes, but let me repeat that we need to fix this anyway. So I don't really
> > > understand why should we add yet another mutex.
> > 
> > 
> > execve() only takes the new mutex immediately after de_thread(), so this
> > problem shouldn't occur there. Basically, I think that I'm not making the
> > problem worse with my patches this way.
> > 
> > I believe that it should be possible to convert most existing users of the
> > cred_guard_mutex to the new cred_guard_light - exceptions to that that I
> > see are:
> > 
> >  - PTRACE_ATTACH
> >  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
> > 
> > Beyond that, conceptually, the new cred_guard_light could also be turned
> > into a read-write mutex to prevent deadlocks between its users (where
> > execve would take it for writing and everyone else would take it for
> > reading), but afaik the kernel doesn't have an implementation of
> > read-write mutexes yet?
> [...]
> 
> It does (rw_semaphore)

Aren't those spinlocks? I don't think those would be appropriate here,
considering that the cred_guard_light can be held during filesystem
access during execve().

> but with PREEMPT_RT enabled they become simple
> mutexes.  So I don't think we should rely on that to avoid deadlock.

Well, I don't think it's really necessary - as far as I can tell, the
only locking operations that would be executed with the
cred_guard_light held would be taking the sighand lock, filesystem
stuff and LSM stuff.
Oleg Nesterov Nov. 3, 2016, 6:12 p.m. UTC | #5
On 11/02, Jann Horn wrote:
>
> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> > On 10/30, Jann Horn wrote:
> > >
> > > This is a new per-threadgroup lock that can often be taken instead of
> > > cred_guard_mutex and has less deadlock potential. I'm doing this because
> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
> > > thread, and the debugger attempts to inspect procfs files of the debugged
> > > task.
> >
> > Yes, but let me repeat that we need to fix this anyway. So I don't really
> > understand why should we add yet another mutex.
>
> execve() only takes the new mutex immediately after de_thread(), so this
> problem shouldn't occur there.

Yes, I see.

> Basically, I think that I'm not making the
> problem worse with my patches this way.

In a sense that it doesn't add the new deadlocks, I agree. But it adds
yet another per-process mutex while we already have the similar one,

> I believe that it should be possible to convert most existing users of the
> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
> see are:
>
>  - PTRACE_ATTACH

This is the main problem afaics. So "strace -f" can hang if it races
with mt-exec. And we need to fix this. I constantly forget about this
problem, but I tried many times to find a reasonable solution, still
can't.

IMO, it would be nice to rework the lsm hooks, so that we could take
cred_guard_mutex after de_thread() (like your cred_guard_light) or
at least drop it earlier, but unlikely this is possible...

So the only plan I currently have is change de_thread() to wait until
other threads pass exit_notify() or even exit_signals(), but I don't
like this.

>  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)

I forgot about this one... Need to re-check but at first glance this
is not a real problem.

> Beyond that, conceptually, the new cred_guard_light could also be turned
> into a read-write mutex

Not sure I understand how this can help... doesn't matter.

My point is, imo you should not add the new mutex. Just use the old
one in (say) 4/8 (which I do not personally like as you know ;), this
won't add the new problem.


> It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
> deadlocking issues.

Yes, agreed.

> PTRACE_ATTACH isn't that clear to me; if a debugger
> tries to attach to a newly spawned thread while another ptraced thread is
> dying because of de_thread() in a third thread, that might still cause
> the debugger to deadlock, right?

This is the trivial test-case I wrote when the problem was initially
reported. And damn, I always knew that cred_guard_mutex needs fixes,
but somehow I completely forgot that it is used by PTRACE_ATTACH when
I was going to try to remove from fs/proc a long ago.

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}

		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0,0);
		kill(pid, SIGCONT);

		return 0;
	}

The problem is trivial. The execing thread waits until its sub-thread
goes away, it should be reaped by the tracer, the tracer waits for
cred_guard_mutex.

> security_bprm_set_creds() is called in prepare_binprm(), which is
> executed very early in do_execveat_common(), at a point where failures
> should still be graceful (return an error code instead of killing the
> whole process),

Yes, yes.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jann Horn Nov. 3, 2016, 9:17 p.m. UTC | #6
On Thu, Nov 03, 2016 at 07:12:25PM +0100, Oleg Nesterov wrote:
> On 11/02, Jann Horn wrote:
> >
> > On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
> > > On 10/30, Jann Horn wrote:
[...]
> > I believe that it should be possible to convert most existing users of the
> > cred_guard_mutex to the new cred_guard_light - exceptions to that that I
> > see are:
> >
> >  - PTRACE_ATTACH
> 
> This is the main problem afaics. So "strace -f" can hang if it races
> with mt-exec. And we need to fix this. I constantly forget about this
> problem, but I tried many times to find a reasonable solution, still
> can't.

Ah, okay, it wasn't clear to me that you consider the race with
PTRACE_ATTACH to be a similarly big problem as the other ones.

> IMO, it would be nice to rework the lsm hooks, so that we could take
> cred_guard_mutex after de_thread() (like your cred_guard_light) or
> at least drop it earlier, but unlikely this is possible...

An idea: Maybe we can change the LSM hook so that, immediately before
de_thread(), the LSMs can handle the execve() based on the current
state and report the circumstances under which they would deny the
execution or treat it differently. Then, during de_thread(), we can
immediately reject any access that would have changed the execution
and immediately permit any access that wouldn't have.

This could theoretically cause userland to see spurious permission
denials, but only if an LSM has an inconsistent security policy where
some access degrades execution although it would have been permitted
after a normal execution.

Does that make sense?

> So the only plan I currently have is change de_thread() to wait until
> other threads pass exit_notify() or even exit_signals(), but I don't
> like this.
[...]
> My point is, imo you should not add the new mutex. Just use the old
> one in (say) 4/8 (which I do not personally like as you know ;), this
> won't add the new problem.

Okay, I'll do that.
diff mbox

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 4e497b9ee71e..67b76cb319d8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1243,6 +1243,10 @@  int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_light);
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1256,7 +1260,7 @@  int flush_old_exec(struct linux_binprm * bprm)
 	acct_arg_size(bprm, 0);
 	retval = exec_mmap(bprm->mm);
 	if (retval)
-		goto out;
+		goto out_unlock;
 
 	bprm->mm = NULL;		/* We're using it now */
 
@@ -1268,6 +1272,8 @@  int flush_old_exec(struct linux_binprm * bprm)
 
 	return 0;
 
+out_unlock:
+	mutex_unlock(&current->signal->cred_guard_light);
 out:
 	return retval;
 }
@@ -1391,6 +1397,7 @@  void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->cred_guard_light);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
@@ -1758,6 +1765,12 @@  static int do_execveat_common(int fd, struct filename *filename,
 	return retval;
 
 out:
+	if (!bprm->mm && bprm->cred) {
+		/* failure after flush_old_exec(), but before
+		 * install_exec_creds()
+		 */
+		mutex_unlock(&current->signal->cred_guard_light);
+	}
 	if (bprm->mm) {
 		acct_arg_size(bprm, 0);
 		mmput(bprm->mm);
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 325f649d77ff..c6819468e79a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -60,6 +60,7 @@  extern struct fs_struct init_fs;
 	INIT_PREV_CPUTIME(sig)						\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	.cred_guard_light = __MUTEX_INITIALIZER(sig.cred_guard_light)	\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b0ec92..0ccb379895b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -812,6 +812,16 @@  struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	/*
+	 * Lightweight version of cred_guard_mutex; used to prevent race
+	 * conditions where a user can gain information about the post-execve
+	 * state of a task to which access should only be granted pre-execve.
+	 * Hold this mutex while performing remote task inspection associated
+	 * with a security check.
+	 * This mutex MUST NOT be used in cases where anything changes about
+	 * the security properties of a running execve().
+	 */
+	struct mutex cred_guard_light;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 623259fc794d..d0e1d6fa4d00 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1361,6 +1361,7 @@  static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 				   current->signal->is_child_subreaper;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->cred_guard_light);
 
 	return 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index e6474f7272ec..c3312e9e0078 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -285,6 +285,16 @@  static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	return security_ptrace_access_check(task, mode);
 }
 
+/*
+ * NOTE: When you call this function, you need to ensure that the target task
+ * can't acquire (via setuid execve) credentials between the ptrace access
+ * check and the privileged access. The recommended way to do this is to hold
+ * one of task->signal->{cred_guard_mutex,cred_guard_light} while calling this
+ * function and performing the requested access.
+ *
+ * This function may only be used if access is requested in the name of
+ * current_cred().
+ */
 bool ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
 	int err;