diff mbox

[v8,9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

Message ID 1403642893-23107-10-git-send-email-keescook@chromium.org (mailing list archive)
State New, archived
Headers show

Commit Message

Kees Cook June 24, 2014, 8:48 p.m. UTC
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.

The race conditions are against another thread requesting TSYNC, another
thread performing a clone, and another thread changing its filter. The
sighand lock is sufficient for these cases. The clone case is assisted
by the fact that new threads will have their seccomp state duplicated
from their parent before appearing on the tasklist.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/linux/seccomp.h      |    2 +
 include/uapi/linux/seccomp.h |    3 ++
 kernel/seccomp.c             |  115 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 119 insertions(+), 1 deletion(-)

Comments

Oleg Nesterov June 25, 2014, 2:21 p.m. UTC | #1
On 06/24, Kees Cook wrote:
>
> +static void seccomp_sync_threads(void)
> +{
> +	struct task_struct *thread, *caller;
> +
> +	BUG_ON(!spin_is_locked(&current->sighand->siglock));
> +
> +	/* Synchronize all threads. */
> +	caller = current;
> +	for_each_thread(caller, thread) {
> +		/* Get a task reference for the new leaf node. */
> +		get_seccomp_filter(caller);
> +		/*
> +		 * Drop the task reference to the shared ancestor since
> +		 * current's path will hold a reference.  (This also
> +		 * allows a put before the assignment.)
> +		 */
> +		put_seccomp_filter(thread);
> +		thread->seccomp.filter = caller->seccomp.filter;
> +		/* Opt the other thread into seccomp if needed.
> +		 * As threads are considered to be trust-realm
> +		 * equivalent (see ptrace_may_access), it is safe to
> +		 * allow one thread to transition the other.
> +		 */
> +		if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
> +			/*
> +			 * Don't let an unprivileged task work around
> +			 * the no_new_privs restriction by creating
> +			 * a thread that sets it up, enters seccomp,
> +			 * then dies.
> +			 */
> +			if (task_no_new_privs(caller))
> +				task_set_no_new_privs(thread);
> +
> +			seccomp_assign_mode(thread, SECCOMP_MODE_FILTER);
> +		}
> +	}
> +}

OK, personally I think this all make sense. I even think that perhaps
SECCOMP_FILTER_FLAG_TSYNC should allow filter == NULL, a thread might
want to "sync" without adding the new filter, but this is minor/offtopic.

But. Doesn't this change add the new security hole?

Obviously, we should not allow to install a filter and then (say) exec
a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS.

But what if "thread->seccomp.filter = caller->seccomp.filter" races with
any user of task_no_new_privs() ? Say, suppose this thread has already
passed check_unsafe_exec/etc and it is going to exec the suid binary?

Oleg.
Kees Cook June 25, 2014, 3:08 p.m. UTC | #2
On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 06/24, Kees Cook wrote:
>>
>> +static void seccomp_sync_threads(void)
>> +{
>> +     struct task_struct *thread, *caller;
>> +
>> +     BUG_ON(!spin_is_locked(&current->sighand->siglock));
>> +
>> +     /* Synchronize all threads. */
>> +     caller = current;
>> +     for_each_thread(caller, thread) {
>> +             /* Get a task reference for the new leaf node. */
>> +             get_seccomp_filter(caller);
>> +             /*
>> +              * Drop the task reference to the shared ancestor since
>> +              * current's path will hold a reference.  (This also
>> +              * allows a put before the assignment.)
>> +              */
>> +             put_seccomp_filter(thread);
>> +             thread->seccomp.filter = caller->seccomp.filter;
>> +             /* Opt the other thread into seccomp if needed.
>> +              * As threads are considered to be trust-realm
>> +              * equivalent (see ptrace_may_access), it is safe to
>> +              * allow one thread to transition the other.
>> +              */
>> +             if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
>> +                     /*
>> +                      * Don't let an unprivileged task work around
>> +                      * the no_new_privs restriction by creating
>> +                      * a thread that sets it up, enters seccomp,
>> +                      * then dies.
>> +                      */
>> +                     if (task_no_new_privs(caller))
>> +                             task_set_no_new_privs(thread);
>> +
>> +                     seccomp_assign_mode(thread, SECCOMP_MODE_FILTER);
>> +             }
>> +     }
>> +}
>
> OK, personally I think this all make sense. I even think that perhaps
> SECCOMP_FILTER_FLAG_TSYNC should allow filter == NULL, a thread might
> want to "sync" without adding the new filter, but this is minor/offtopic.
>
> But. Doesn't this change add a new security hole?
>
> Obviously, we should not allow to install a filter and then (say) exec
> a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS.
>
> But what if "thread->seccomp.filter = caller->seccomp.filter" races with
> any user of task_no_new_privs() ? Say, suppose this thread has already
> passed check_unsafe_exec/etc and it is going to exec the suid binary?

Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this?

I wonder if changes to nnp need to "flushed" during syscall entry
instead of getting updated externally/asynchronously? That way it
won't be out of sync with the seccomp mode/filters.

Perhaps secure computing needs to check some (maybe seccomp-only)
atomic flags and flip on the "real" nnp if found?

-Kees
Oleg Nesterov June 25, 2014, 4:52 p.m. UTC | #3
On 06/25, Kees Cook wrote:
>
> On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > But. Doesn't this change add a new security hole?
> >
> > Obviously, we should not allow to install a filter and then (say) exec
> > a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS.
> >
> > But what if "thread->seccomp.filter = caller->seccomp.filter" races with
> > any user of task_no_new_privs() ? Say, suppose this thread has already
> > passed check_unsafe_exec/etc and it is going to exec the suid binary?
>
> Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this?

Yes, cred_guard_mutex looks like an obvious choice... Hmm, but somehow
initially I thought that the fix won't be simple. Not sure why.

Yes, at least this should close the race with suid-exec. And there are no
other users. Except apparmor, and I hope you will check it because I simply
do not know what it does ;)

> I wonder if changes to nnp need to "flushed" during syscall entry
> instead of getting updated externally/asynchronously? That way it
> won't be out of sync with the seccomp mode/filters.
>
> Perhaps secure computing needs to check some (maybe seccomp-only)
> atomic flags and flip on the "real" nnp if found?

Not sure I understand you, could you clarify?

But I was also worried that task_no_new_privs(current) is no longer stable
inside the syscall paths, perhaps this is what you meant? However I do not
see something bad here... And this has nothing to do with the race above.



Also. Even ignoring no_new_privs, SECCOMP_FILTER_FLAG_TSYNC is not atomic
and we can do nothing with this fact (unless it try to freeze the thread
group somehow), perhaps it makes sense to document this somehow.

I mean, suppose you want to ensure write-to-file is not possible, so you
do seccomp(SECCOMP_FILTER_FLAG_TSYNC, nack_write_to_file_filter). You can't
assume that this has effect right after seccomp() returns, this can obviously
race with a sub-thread which has already entered sys_write().

Once again, I am not arguing, just I think it makes sense to at least mention
the limitations during the discussion.

Oleg.
Kees Cook June 25, 2014, 5:09 p.m. UTC | #4
On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 06/25, Kees Cook wrote:
>>
>> On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > But. Doesn't this change add a new security hole?
>> >
>> > Obviously, we should not allow to install a filter and then (say) exec
>> > a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS.
>> >
>> > But what if "thread->seccomp.filter = caller->seccomp.filter" races with
>> > any user of task_no_new_privs() ? Say, suppose this thread has already
>> > passed check_unsafe_exec/etc and it is going to exec the suid binary?
>>
>> Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this?
>
> Yes, cred_guard_mutex looks like an obvious choice... Hmm, but somehow
> initially I thought that the fix won't be simple. Not sure why.
>
> Yes, at least this should close the race with suid-exec. And there are no
> other users. Except apparmor, and I hope you will check it because I simply
> do not know what it does ;)
>
>> I wonder if changes to nnp need to "flushed" during syscall entry
>> instead of getting updated externally/asynchronously? That way it
>> won't be out of sync with the seccomp mode/filters.
>>
>> Perhaps secure computing needs to check some (maybe seccomp-only)
>> atomic flags and flip on the "real" nnp if found?
>
> Not sure I understand you, could you clarify?

Instead of having TSYNC change the nnp bit, it can set a new flag, say:

    task->seccomp.flags |= SECCOMP_NEEDS_NNP;

This would be set along with seccomp.mode, seccomp.filter, and
TIF_SECCOMP. Then, during the next secure_computing() call that thread
makes, it would check the flag:

    if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
        task->nnp = 1;

This means that nnp couldn't change in the middle of a running syscall.

Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
above would actually make things worse, since now we'd have a thread
with seccomp set up, and no nnp. If it was in the middle of exec,
we're still causing a problem.

I think we'd also need a way to either delay the seccomp changes, or
to notice this condition during exec. Bleh.

What actually happens with a multi-threaded process calls exec? I
assume all the other threads are destroyed?

> But I was also worried that task_no_new_privs(current) is no longer stable
> inside the syscall paths, perhaps this is what you meant? However I do not
> see something bad here... And this has nothing to do with the race above.
>
> Also. Even ignoring no_new_privs, SECCOMP_FILTER_FLAG_TSYNC is not atomic
> and we can do nothing with this fact (unless it try to freeze the thread
> group somehow), perhaps it makes sense to document this somehow.
>
> I mean, suppose you want to ensure write-to-file is not possible, so you
> do seccomp(SECCOMP_FILTER_FLAG_TSYNC, nack_write_to_file_filter). You can't
> assume that this has effect right after seccomp() returns, this can obviously
> race with a sub-thread which has already entered sys_write().
>
> Once again, I am not arguing, just I think it makes sense to at least mention
> the limitations during the discussion.

Right -- this is an accepted limitation. I will call it out
specifically in the man-page; that's a good idea.

-Kees
Oleg Nesterov June 25, 2014, 5:24 p.m. UTC | #5
On 06/25, Kees Cook wrote:
>
> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, at least this should close the race with suid-exec. And there are no
> > other users. Except apparmor, and I hope you will check it because I simply
> > do not know what it does ;)
> >
> >> I wonder if changes to nnp need to "flushed" during syscall entry
> >> instead of getting updated externally/asynchronously? That way it
> >> won't be out of sync with the seccomp mode/filters.
> >>
> >> Perhaps secure computing needs to check some (maybe seccomp-only)
> >> atomic flags and flip on the "real" nnp if found?
> >
> > Not sure I understand you, could you clarify?
>
> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>
>     task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>
> This would be set along with seccomp.mode, seccomp.filter, and
> TIF_SECCOMP. Then, during the next secure_computing() call that thread
> makes, it would check the flag:
>
>     if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>         task->nnp = 1;
>
> This means that nnp couldn't change in the middle of a running syscall.

Aha, so you were worried about the same thing. Not sure we need this,
but at least I understand you and...

> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
> above would actually make things worse, since now we'd have a thread
> with seccomp set up, and no nnp. If it was in the middle of exec,
> we're still causing a problem.

Yes ;)

> I think we'd also need a way to either delay the seccomp changes, or
> to notice this condition during exec. Bleh.

Hmm. confused again,

> What actually happens with a multi-threaded process calls exec? I
> assume all the other threads are destroyed?

Yes. But this is the point-of-no-return, de_thread() is called after the execing
thared has already passed (say) check_unsafe_exec().

However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
and drops it in install_exec_creds(), so it should solve the problem?

Oleg.
Andy Lutomirski June 25, 2014, 5:40 p.m. UTC | #6
On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 06/25, Kees Cook wrote:
>>
>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > Yes, at least this should close the race with suid-exec. And there are no
>> > other users. Except apparmor, and I hope you will check it because I simply
>> > do not know what it does ;)
>> >
>> >> I wonder if changes to nnp need to "flushed" during syscall entry
>> >> instead of getting updated externally/asynchronously? That way it
>> >> won't be out of sync with the seccomp mode/filters.
>> >>
>> >> Perhaps secure computing needs to check some (maybe seccomp-only)
>> >> atomic flags and flip on the "real" nnp if found?
>> >
>> > Not sure I understand you, could you clarify?
>>
>> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>>
>>     task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>>
>> This would be set along with seccomp.mode, seccomp.filter, and
>> TIF_SECCOMP. Then, during the next secure_computing() call that thread
>> makes, it would check the flag:
>>
>>     if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>>         task->nnp = 1;
>>
>> This means that nnp couldn't change in the middle of a running syscall.
>
> Aha, so you were worried about the same thing. Not sure we need this,
> but at least I understand you and...
>
>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
>> above would actually make things worse, since now we'd have a thread
>> with seccomp set up, and no nnp. If it was in the middle of exec,
>> we're still causing a problem.
>
> Yes ;)
>
>> I think we'd also need a way to either delay the seccomp changes, or
>> to notice this condition during exec. Bleh.
>
> Hmm. confused again,
>
>> What actually happens with a multi-threaded process calls exec? I
>> assume all the other threads are destroyed?
>
> Yes. But this is the point-of-no-return, de_thread() is called after the execing
> thared has already passed (say) check_unsafe_exec().
>
> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
> and drops it in install_exec_creds(), so it should solve the problem?

If you rely on this, then please fix this comment in fs/exec.c:

/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH
 */
static void check_unsafe_exec(struct linux_binprm *bprm)

It sounds like cred_guard_mutex is there for exactly this reason  :)

--Andy
Kees Cook June 25, 2014, 5:57 p.m. UTC | #7
On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 06/25, Kees Cook wrote:
>>
>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > Yes, at least this should close the race with suid-exec. And there are no
>> > other users. Except apparmor, and I hope you will check it because I simply
>> > do not know what it does ;)
>> >
>> >> I wonder if changes to nnp need to "flushed" during syscall entry
>> >> instead of getting updated externally/asynchronously? That way it
>> >> won't be out of sync with the seccomp mode/filters.
>> >>
>> >> Perhaps secure computing needs to check some (maybe seccomp-only)
>> >> atomic flags and flip on the "real" nnp if found?
>> >
>> > Not sure I understand you, could you clarify?
>>
>> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>>
>>     task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>>
>> This would be set along with seccomp.mode, seccomp.filter, and
>> TIF_SECCOMP. Then, during the next secure_computing() call that thread
>> makes, it would check the flag:
>>
>>     if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>>         task->nnp = 1;
>>
>> This means that nnp couldn't change in the middle of a running syscall.
>
> Aha, so you were worried about the same thing. Not sure we need this,
> but at least I understand you and...
>
>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
>> above would actually make things worse, since now we'd have a thread
>> with seccomp set up, and no nnp. If it was in the middle of exec,
>> we're still causing a problem.
>
> Yes ;)
>
>> I think we'd also need a way to either delay the seccomp changes, or
>> to notice this condition during exec. Bleh.
>
> Hmm. confused again,

I mean to suggest that the tsync changes would be stored in each
thread, but somewhere other than the true seccomp struct, but with
TIF_SECCOMP set. When entering secure_computing(), current would check
for the "changes to sync", and apply them, then start the syscall. In
this way, we can never race a syscall (like exec).

>> What actually happens with a multi-threaded process calls exec? I
>> assume all the other threads are destroyed?
>
> Yes. But this is the point-of-no-return, de_thread() is called after the execing
> thared has already passed (say) check_unsafe_exec().
>
> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
> and drops it in install_exec_creds(), so it should solve the problem?

I can't tell yet. I'm still trying to understand the order of
operations here. It looks like de_thread() takes the sighand lock.
do_execve_common does:

prepare_bprm_creds (takes cred_guard_mutex)
check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
prepare_binprm (handles suid escalation, checks nnp separately)
    security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
exec_binprm
    load_elf_binary
        flush_old_exec
            de_thread (takes and releases sighand->lock)
        install_exec_creds (releases cred_guard_mutex)

I don't see a way to use cred_guard_mutex during tsync (which holds
sighand->lock) without dead-locking. What were you considering here?

-Kees
Andy Lutomirski June 25, 2014, 6:09 p.m. UTC | #8
On Wed, Jun 25, 2014 at 10:57 AM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 06/25, Kees Cook wrote:
>>>
>>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> >
>>> > Yes, at least this should close the race with suid-exec. And there are no
>>> > other users. Except apparmor, and I hope you will check it because I simply
>>> > do not know what it does ;)
>>> >
>>> >> I wonder if changes to nnp need to "flushed" during syscall entry
>>> >> instead of getting updated externally/asynchronously? That way it
>>> >> won't be out of sync with the seccomp mode/filters.
>>> >>
>>> >> Perhaps secure computing needs to check some (maybe seccomp-only)
>>> >> atomic flags and flip on the "real" nnp if found?
>>> >
>>> > Not sure I understand you, could you clarify?
>>>
>>> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>>>
>>>     task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>>>
>>> This would be set along with seccomp.mode, seccomp.filter, and
>>> TIF_SECCOMP. Then, during the next secure_computing() call that thread
>>> makes, it would check the flag:
>>>
>>>     if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>>>         task->nnp = 1;
>>>
>>> This means that nnp couldn't change in the middle of a running syscall.
>>
>> Aha, so you were worried about the same thing. Not sure we need this,
>> but at least I understand you and...
>>
>>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
>>> above would actually make things worse, since now we'd have a thread
>>> with seccomp set up, and no nnp. If it was in the middle of exec,
>>> we're still causing a problem.
>>
>> Yes ;)
>>
>>> I think we'd also need a way to either delay the seccomp changes, or
>>> to notice this condition during exec. Bleh.
>>
>> Hmm. confused again,
>
> I mean to suggest that the tsync changes would be stored in each
> thread, but somewhere other than the true seccomp struct, but with
> TIF_SECCOMP set. When entering secure_computing(), current would check
> for the "changes to sync", and apply them, then start the syscall. In
> this way, we can never race a syscall (like exec).

I'm not sure that helps.  If you set a pending filter part-way through
exec, and exec copies that pending filter but doesn't notice NNP, then
there's an exploitable race.

>
>>> What actually happens with a multi-threaded process calls exec? I
>>> assume all the other threads are destroyed?
>>
>> Yes. But this is the point-of-no-return, de_thread() is called after the execing
>> thared has already passed (say) check_unsafe_exec().
>>
>> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
>> and drops it in install_exec_creds(), so it should solve the problem?
>
> I can't tell yet. I'm still trying to understand the order of
> operations here. It looks like de_thread() takes the sighand lock.
> do_execve_common does:
>
> prepare_bprm_creds (takes cred_guard_mutex)
> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
> prepare_binprm (handles suid escalation, checks nnp separately)
>     security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
> exec_binprm
>     load_elf_binary
>         flush_old_exec
>             de_thread (takes and releases sighand->lock)
>         install_exec_creds (releases cred_guard_mutex)
>
> I don't see a way to use cred_guard_mutex during tsync (which holds
> sighand->lock) without dead-locking. What were you considering here?
>

Grab cred_guard_mutex and then sighand->lock, perhaps?

> -Kees
>
> --
> Kees Cook
> Chrome OS Security
Oleg Nesterov June 25, 2014, 6:20 p.m. UTC | #9
On 06/25, Kees Cook wrote:
>
> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
> > and drops it in install_exec_creds(), so it should solve the problem?
>
> I can't tell yet. I'm still trying to understand the order of
> operations here. It looks like de_thread() takes the sighand lock.
> do_execve_common does:
>
> prepare_bprm_creds (takes cred_guard_mutex)
> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
> prepare_binprm (handles suid escalation, checks nnp separately)
>     security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
> exec_binprm
>     load_elf_binary
>         flush_old_exec
>             de_thread (takes and releases sighand->lock)
>         install_exec_creds (releases cred_guard_mutex)

Yes, and note that when cred_guard_mutex is dropped all other threads
are already killed,

> I don't see a way to use cred_guard_mutex during tsync (which holds
> sighand->lock) without dead-locking. What were you considering here?

Just take/drop current->signal->cred_guard_mutex along with ->siglock
in seccomp_set_mode_filter() ? Unconditionally on depending on
SECCOMP_FILTER_FLAG_TSYNC.

Oleg.
Kees Cook June 25, 2014, 6:25 p.m. UTC | #10
On Wed, Jun 25, 2014 at 11:09 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Jun 25, 2014 at 10:57 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 06/25, Kees Cook wrote:
>>>>
>>>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> >
>>>> > Yes, at least this should close the race with suid-exec. And there are no
>>>> > other users. Except apparmor, and I hope you will check it because I simply
>>>> > do not know what it does ;)
>>>> >
>>>> >> I wonder if changes to nnp need to "flushed" during syscall entry
>>>> >> instead of getting updated externally/asynchronously? That way it
>>>> >> won't be out of sync with the seccomp mode/filters.
>>>> >>
>>>> >> Perhaps secure computing needs to check some (maybe seccomp-only)
>>>> >> atomic flags and flip on the "real" nnp if found?
>>>> >
>>>> > Not sure I understand you, could you clarify?
>>>>
>>>> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>>>>
>>>>     task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>>>>
>>>> This would be set along with seccomp.mode, seccomp.filter, and
>>>> TIF_SECCOMP. Then, during the next secure_computing() call that thread
>>>> makes, it would check the flag:
>>>>
>>>>     if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>>>>         task->nnp = 1;
>>>>
>>>> This means that nnp couldn't change in the middle of a running syscall.
>>>
>>> Aha, so you were worried about the same thing. Not sure we need this,
>>> but at least I understand you and...
>>>
>>>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
>>>> above would actually make things worse, since now we'd have a thread
>>>> with seccomp set up, and no nnp. If it was in the middle of exec,
>>>> we're still causing a problem.
>>>
>>> Yes ;)
>>>
>>>> I think we'd also need a way to either delay the seccomp changes, or
>>>> to notice this condition during exec. Bleh.
>>>
>>> Hmm. confused again,
>>
>> I mean to suggest that the tsync changes would be stored in each
>> thread, but somewhere other than the true seccomp struct, but with
>> TIF_SECCOMP set. When entering secure_computing(), current would check
>> for the "changes to sync", and apply them, then start the syscall. In
>> this way, we can never race a syscall (like exec).
>
> I'm not sure that helps.  If you set a pending filter part-way through
> exec, and exec copies that pending filter but doesn't notice NNP, then
> there's an exploitable race.
>
>>
>>>> What actually happens with a multi-threaded process calls exec? I
>>>> assume all the other threads are destroyed?
>>>
>>> Yes. But this is the point-of-no-return, de_thread() is called after the execing
>>> thared has already passed (say) check_unsafe_exec().
>>>
>>> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
>>> and drops it in install_exec_creds(), so it should solve the problem?
>>
>> I can't tell yet. I'm still trying to understand the order of
>> operations here. It looks like de_thread() takes the sighand lock.
>> do_execve_common does:
>>
>> prepare_bprm_creds (takes cred_guard_mutex)
>> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
>> prepare_binprm (handles suid escalation, checks nnp separately)
>>     security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
>> exec_binprm
>>     load_elf_binary
>>         flush_old_exec
>>             de_thread (takes and releases sighand->lock)
>>         install_exec_creds (releases cred_guard_mutex)
>>
>> I don't see a way to use cred_guard_mutex during tsync (which holds
>> sighand->lock) without dead-locking. What were you considering here?
>
> Grab cred_guard_mutex and then sighand->lock, perhaps?

Ah, yes, task->signal is like sighand: shared across all threads.
Kees Cook June 25, 2014, 6:31 p.m. UTC | #11
On Wed, Jun 25, 2014 at 11:20 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 06/25, Kees Cook wrote:
>>
>> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
>> > and drops it in install_exec_creds(), so it should solve the problem?
>>
>> I can't tell yet. I'm still trying to understand the order of
>> operations here. It looks like de_thread() takes the sighand lock.
>> do_execve_common does:
>>
>> prepare_bprm_creds (takes cred_guard_mutex)
>> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
>> prepare_binprm (handles suid escalation, checks nnp separately)
>>     security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
>> exec_binprm
>>     load_elf_binary
>>         flush_old_exec
>>             de_thread (takes and releases sighand->lock)
>>         install_exec_creds (releases cred_guard_mutex)
>
> Yes, and note that when cred_guard_mutex is dropped all other threads
> are already killed,
>
>> I don't see a way to use cred_guard_mutex during tsync (which holds
>> sighand->lock) without dead-locking. What were you considering here?
>
> Just take/drop current->signal->cred_guard_mutex along with ->siglock
> in seccomp_set_mode_filter() ? Unconditionally on depending on
> SECCOMP_FILTER_FLAG_TSYNC.

Yeah, this looks good. *whew* Testing it now, so far so good.

Thanks!

-Kees
diff mbox

Patch

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9ff98b4bfe2e..15de2a711518 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,6 +3,8 @@ 
 
 #include <uapi/linux/seccomp.h>
 
+#define SECCOMP_FILTER_FLAG_MASK	~(SECCOMP_FILTER_FLAG_TSYNC)
+
 #ifdef CONFIG_SECCOMP
 
 #include <linux/thread_info.h>
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index b258878ba754..0f238a43ff1e 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -14,6 +14,9 @@ 
 #define SECCOMP_SET_MODE_STRICT	0
 #define SECCOMP_SET_MODE_FILTER	1
 
+/* Valid flags for SECCOMP_SET_MODE_FILTER */
+#define SECCOMP_FILTER_FLAG_TSYNC	1
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 051c3da4fb4c..c58b49375125 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -26,6 +26,7 @@ 
 #ifdef CONFIG_SECCOMP_FILTER
 #include <asm/syscall.h>
 #include <linux/filter.h>
+#include <linux/pid.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/tracehook.h>
@@ -218,6 +219,105 @@  static inline void seccomp_assign_mode(struct task_struct *task,
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+/* Returns 1 if the candidate is an ancestor. */
+static int is_ancestor(struct seccomp_filter *candidate,
+		       struct seccomp_filter *child)
+{
+	/* NULL is the root ancestor. */
+	if (candidate == NULL)
+		return 1;
+	for (; child; child = child->prev)
+		if (child == candidate)
+			return 1;
+	return 0;
+}
+
+/**
+ * seccomp_can_sync_threads: checks if all threads can be synchronized
+ *
+ * Expects current->sighand->siglock to be held.
+ *
+ * Returns 0 on success, -ve on error, or the pid of a thread which was
+ * either not in the correct seccomp mode or it did not have an ancestral
+ * seccomp filter.
+ */
+static pid_t seccomp_can_sync_threads(void)
+{
+	struct task_struct *thread, *caller;
+
+	BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+	if (current->seccomp.mode != SECCOMP_MODE_FILTER)
+		return -EACCES;
+
+	/* Validate all threads being eligible for synchronization. */
+	caller = current;
+	for_each_thread(caller, thread) {
+		pid_t failed;
+
+		if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
+		    (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
+		     is_ancestor(thread->seccomp.filter,
+				 caller->seccomp.filter)))
+			continue;
+
+		/* Return the first thread that cannot be synchronized. */
+		failed = task_pid_vnr(thread);
+		/* If the pid cannot be resolved, then return -ESRCH */
+		if (unlikely(WARN_ON(failed == 0)))
+			failed = -ESRCH;
+		return failed;
+	}
+
+	return 0;
+}
+
+/**
+ * seccomp_sync_threads: sets all threads to use current's filter
+ *
+ * Expects current->sighand->siglock to be held,
+ * and for seccomp_can_sync_threads() to have returned success already
+ * without dropping the locks.
+ *
+ */
+static void seccomp_sync_threads(void)
+{
+	struct task_struct *thread, *caller;
+
+	BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+	/* Synchronize all threads. */
+	caller = current;
+	for_each_thread(caller, thread) {
+		/* Get a task reference for the new leaf node. */
+		get_seccomp_filter(caller);
+		/*
+		 * Drop the task reference to the shared ancestor since
+		 * current's path will hold a reference.  (This also
+		 * allows a put before the assignment.)
+		 */
+		put_seccomp_filter(thread);
+		thread->seccomp.filter = caller->seccomp.filter;
+		/* Opt the other thread into seccomp if needed.
+		 * As threads are considered to be trust-realm
+		 * equivalent (see ptrace_may_access), it is safe to
+		 * allow one thread to transition the other.
+		 */
+		if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
+			/*
+			 * Don't let an unprivileged task work around
+			 * the no_new_privs restriction by creating
+			 * a thread that sets it up, enters seccomp,
+			 * then dies.
+			 */
+			if (task_no_new_privs(caller))
+				task_set_no_new_privs(thread);
+
+			seccomp_assign_mode(thread, SECCOMP_MODE_FILTER);
+		}
+	}
+}
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -349,7 +449,7 @@  static long seccomp_attach_filter(unsigned int flags,
 	BUG_ON(!spin_is_locked(&current->sighand->siglock));
 
 	/* Validate flags. */
-	if (flags != 0)
+	if (flags & SECCOMP_FILTER_FLAG_MASK)
 		return -EINVAL;
 
 	/* Validate resulting filter length. */
@@ -359,6 +459,15 @@  static long seccomp_attach_filter(unsigned int flags,
 	if (total_insns > MAX_INSNS_PER_PATH)
 		return -ENOMEM;
 
+	/* If thread sync has been requested, check that it is possible. */
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+		int ret;
+
+		ret = seccomp_can_sync_threads();
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * If there is an existing filter, make it the prev and don't drop its
 	 * task reference.
@@ -366,6 +475,10 @@  static long seccomp_attach_filter(unsigned int flags,
 	filter->prev = current->seccomp.filter;
 	smp_store_release(&current->seccomp.filter, filter);
 
+	/* Now that the new filter is in place, synchronize to all threads. */
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+		seccomp_sync_threads();
+
 	return 0;
 }