Message ID | 1403642893-23107-10-git-send-email-keescook@chromium.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 06/24, Kees Cook wrote: > > +static void seccomp_sync_threads(void) > +{ > + struct task_struct *thread, *caller; > + > + BUG_ON(!spin_is_locked(¤t->sighand->siglock)); > + > + /* Synchronize all threads. */ > + caller = current; > + for_each_thread(caller, thread) { > + /* Get a task reference for the new leaf node. */ > + get_seccomp_filter(caller); > + /* > + * Drop the task reference to the shared ancestor since > + * current's path will hold a reference. (This also > + * allows a put before the assignment.) > + */ > + put_seccomp_filter(thread); > + thread->seccomp.filter = caller->seccomp.filter; > + /* Opt the other thread into seccomp if needed. > + * As threads are considered to be trust-realm > + * equivalent (see ptrace_may_access), it is safe to > + * allow one thread to transition the other. > + */ > + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) { > + /* > + * Don't let an unprivileged task work around > + * the no_new_privs restriction by creating > + * a thread that sets it up, enters seccomp, > + * then dies. > + */ > + if (task_no_new_privs(caller)) > + task_set_no_new_privs(thread); > + > + seccomp_assign_mode(thread, SECCOMP_MODE_FILTER); > + } > + } > +} OK, personally I think this all make sense. I even think that perhaps SECCOMP_FILTER_FLAG_TSYNC should allow filter == NULL, a thread might want to "sync" without adding the new filter, but this is minor/offtopic. But. Doesn't this change add the new security hole? Obviously, we should not allow to install a filter and then (say) exec a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS. But what if "thread->seccomp.filter = caller->seccomp.filter" races with any user of task_no_new_privs() ? Say, suppose this thread has already passed check_unsafe_exec/etc and it is going to exec the suid binary? Oleg.
On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 06/24, Kees Cook wrote: >> >> +static void seccomp_sync_threads(void) >> +{ >> + struct task_struct *thread, *caller; >> + >> + BUG_ON(!spin_is_locked(¤t->sighand->siglock)); >> + >> + /* Synchronize all threads. */ >> + caller = current; >> + for_each_thread(caller, thread) { >> + /* Get a task reference for the new leaf node. */ >> + get_seccomp_filter(caller); >> + /* >> + * Drop the task reference to the shared ancestor since >> + * current's path will hold a reference. (This also >> + * allows a put before the assignment.) >> + */ >> + put_seccomp_filter(thread); >> + thread->seccomp.filter = caller->seccomp.filter; >> + /* Opt the other thread into seccomp if needed. >> + * As threads are considered to be trust-realm >> + * equivalent (see ptrace_may_access), it is safe to >> + * allow one thread to transition the other. >> + */ >> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) { >> + /* >> + * Don't let an unprivileged task work around >> + * the no_new_privs restriction by creating >> + * a thread that sets it up, enters seccomp, >> + * then dies. >> + */ >> + if (task_no_new_privs(caller)) >> + task_set_no_new_privs(thread); >> + >> + seccomp_assign_mode(thread, SECCOMP_MODE_FILTER); >> + } >> + } >> +} > > OK, personally I think this all make sense. I even think that perhaps > SECCOMP_FILTER_FLAG_TSYNC should allow filter == NULL, a thread might > want to "sync" without adding the new filter, but this is minor/offtopic. > > But. Doesn't this change add a new security hole? > > Obviously, we should not allow to install a filter and then (say) exec > a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS. > > But what if "thread->seccomp.filter = caller->seccomp.filter" races with > any user of task_no_new_privs() ? Say, suppose this thread has already > passed check_unsafe_exec/etc and it is going to exec the suid binary? Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this? I wonder if changes to nnp need to "flushed" during syscall entry instead of getting updated externally/asynchronously? That way it won't be out of sync with the seccomp mode/filters. Perhaps secure computing needs to check some (maybe seccomp-only) atomic flags and flip on the "real" nnp if found? -Kees
On 06/25, Kees Cook wrote: > > On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > But. Doesn't this change add a new security hole? > > > > Obviously, we should not allow to install a filter and then (say) exec > > a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS. > > > > But what if "thread->seccomp.filter = caller->seccomp.filter" races with > > any user of task_no_new_privs() ? Say, suppose this thread has already > > passed check_unsafe_exec/etc and it is going to exec the suid binary? > > Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this? Yes, cred_guard_mutex looks like an obvious choice... Hmm, but somehow initially I thought that the fix won't be simple. Not sure why. Yes, at least this should close the race with suid-exec. And there are no other users. Except apparmor, and I hope you will check it because I simply do not know what it does ;) > I wonder if changes to nnp need to "flushed" during syscall entry > instead of getting updated externally/asynchronously? That way it > won't be out of sync with the seccomp mode/filters. > > Perhaps secure computing needs to check some (maybe seccomp-only) > atomic flags and flip on the "real" nnp if found? Not sure I understand you, could you clarify? But I was also worried that task_no_new_privs(current) is no longer stable inside the syscall paths, perhaps this is what you meant? However I do not see something bad here... And this has nothing to do with the race above. Also. Even ignoring no_new_privs, SECCOMP_FILTER_FLAG_TSYNC is not atomic and we can do nothing with this fact (unless it try to freeze the thread group somehow), perhaps it makes sense to document this somehow. I mean, suppose you want to ensure write-to-file is not possible, so you do seccomp(SECCOMP_FILTER_FLAG_TSYNC, nack_write_to_file_filter). You can't assume that this has effect right after seccomp() returns, this can obviously race with a sub-thread which has already entered sys_write(). Once again, I am not arguing, just I think it makes sense to at least mention the limitations during the discussion. Oleg.
On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 06/25, Kees Cook wrote: >> >> On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> > >> > But. Doesn't this change add a new security hole? >> > >> > Obviously, we should not allow to install a filter and then (say) exec >> > a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS. >> > >> > But what if "thread->seccomp.filter = caller->seccomp.filter" races with >> > any user of task_no_new_privs() ? Say, suppose this thread has already >> > passed check_unsafe_exec/etc and it is going to exec the suid binary? >> >> Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this? > > Yes, cred_guard_mutex looks like an obvious choice... Hmm, but somehow > initially I thought that the fix won't be simple. Not sure why. > > Yes, at least this should close the race with suid-exec. And there are no > other users. Except apparmor, and I hope you will check it because I simply > do not know what it does ;) > >> I wonder if changes to nnp need to "flushed" during syscall entry >> instead of getting updated externally/asynchronously? That way it >> won't be out of sync with the seccomp mode/filters. >> >> Perhaps secure computing needs to check some (maybe seccomp-only) >> atomic flags and flip on the "real" nnp if found? > > Not sure I understand you, could you clarify? Instead of having TSYNC change the nnp bit, it can set a new flag, say: task->seccomp.flags |= SECCOMP_NEEDS_NNP; This would be set along with seccomp.mode, seccomp.filter, and TIF_SECCOMP. Then, during the next secure_computing() call that thread makes, it would check the flag: if (task->seccomp.flags & SECCOMP_NEEDS_NNP) task->nnp = 1; This means that nnp couldn't change in the middle of a running syscall. Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal above would actually make things worse, since now we'd have a thread with seccomp set up, and no nnp. If it was in the middle of exec, we're still causing a problem. I think we'd also need a way to either delay the seccomp changes, or to notice this condition during exec. Bleh. What actually happens with a multi-threaded process calls exec? I assume all the other threads are destroyed? > But I was also worried that task_no_new_privs(current) is no longer stable > inside the syscall paths, perhaps this is what you meant? However I do not > see something bad here... And this has nothing to do with the race above. > > Also. Even ignoring no_new_privs, SECCOMP_FILTER_FLAG_TSYNC is not atomic > and we can do nothing with this fact (unless it try to freeze the thread > group somehow), perhaps it makes sense to document this somehow. > > I mean, suppose you want to ensure write-to-file is not possible, so you > do seccomp(SECCOMP_FILTER_FLAG_TSYNC, nack_write_to_file_filter). You can't > assume that this has effect right after seccomp() returns, this can obviously > race with a sub-thread which has already entered sys_write(). > > Once again, I am not arguing, just I think it makes sense to at least mention > the limitations during the discussion. Right -- this is an accepted limitation. I will call it out specifically in the man-page; that's a good idea. -Kees
On 06/25, Kees Cook wrote: > > On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > Yes, at least this should close the race with suid-exec. And there are no > > other users. Except apparmor, and I hope you will check it because I simply > > do not know what it does ;) > > > >> I wonder if changes to nnp need to "flushed" during syscall entry > >> instead of getting updated externally/asynchronously? That way it > >> won't be out of sync with the seccomp mode/filters. > >> > >> Perhaps secure computing needs to check some (maybe seccomp-only) > >> atomic flags and flip on the "real" nnp if found? > > > > Not sure I understand you, could you clarify? > > Instead of having TSYNC change the nnp bit, it can set a new flag, say: > > task->seccomp.flags |= SECCOMP_NEEDS_NNP; > > This would be set along with seccomp.mode, seccomp.filter, and > TIF_SECCOMP. Then, during the next secure_computing() call that thread > makes, it would check the flag: > > if (task->seccomp.flags & SECCOMP_NEEDS_NNP) > task->nnp = 1; > > This means that nnp couldn't change in the middle of a running syscall. Aha, so you were worried about the same thing. Not sure we need this, but at least I understand you and... > Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal > above would actually make things worse, since now we'd have a thread > with seccomp set up, and no nnp. If it was in the middle of exec, > we're still causing a problem. Yes ;) > I think we'd also need a way to either delay the seccomp changes, or > to notice this condition during exec. Bleh. Hmm. confused again, > What actually happens with a multi-threaded process calls exec? I > assume all the other threads are destroyed? Yes. But this is the point-of-no-return, de_thread() is called after the execing thared has already passed (say) check_unsafe_exec(). However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds() and drops it in install_exec_creds(), so it should solve the problem? Oleg.
On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 06/25, Kees Cook wrote: >> >> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> > >> > Yes, at least this should close the race with suid-exec. And there are no >> > other users. Except apparmor, and I hope you will check it because I simply >> > do not know what it does ;) >> > >> >> I wonder if changes to nnp need to "flushed" during syscall entry >> >> instead of getting updated externally/asynchronously? That way it >> >> won't be out of sync with the seccomp mode/filters. >> >> >> >> Perhaps secure computing needs to check some (maybe seccomp-only) >> >> atomic flags and flip on the "real" nnp if found? >> > >> > Not sure I understand you, could you clarify? >> >> Instead of having TSYNC change the nnp bit, it can set a new flag, say: >> >> task->seccomp.flags |= SECCOMP_NEEDS_NNP; >> >> This would be set along with seccomp.mode, seccomp.filter, and >> TIF_SECCOMP. Then, during the next secure_computing() call that thread >> makes, it would check the flag: >> >> if (task->seccomp.flags & SECCOMP_NEEDS_NNP) >> task->nnp = 1; >> >> This means that nnp couldn't change in the middle of a running syscall. > > Aha, so you were worried about the same thing. Not sure we need this, > but at least I understand you and... > >> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal >> above would actually make things worse, since now we'd have a thread >> with seccomp set up, and no nnp. If it was in the middle of exec, >> we're still causing a problem. > > Yes ;) > >> I think we'd also need a way to either delay the seccomp changes, or >> to notice this condition during exec. Bleh. > > Hmm. confused again, > >> What actually happens with a multi-threaded process calls exec? I >> assume all the other threads are destroyed? > > Yes. But this is the point-of-no-return, de_thread() is called after the execing > thared has already passed (say) check_unsafe_exec(). > > However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds() > and drops it in install_exec_creds(), so it should solve the problem? If you rely on this, then please fix this comment in fs/exec.c: /* * determine how safe it is to execute the proposed program * - the caller must hold ->cred_guard_mutex to protect against * PTRACE_ATTACH */ static void check_unsafe_exec(struct linux_binprm *bprm) It sounds like cred_guard_mutex is there for exactly this reason :) --Andy
On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 06/25, Kees Cook wrote: >> >> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> > >> > Yes, at least this should close the race with suid-exec. And there are no >> > other users. Except apparmor, and I hope you will check it because I simply >> > do not know what it does ;) >> > >> >> I wonder if changes to nnp need to "flushed" during syscall entry >> >> instead of getting updated externally/asynchronously? That way it >> >> won't be out of sync with the seccomp mode/filters. >> >> >> >> Perhaps secure computing needs to check some (maybe seccomp-only) >> >> atomic flags and flip on the "real" nnp if found? >> > >> > Not sure I understand you, could you clarify? >> >> Instead of having TSYNC change the nnp bit, it can set a new flag, say: >> >> task->seccomp.flags |= SECCOMP_NEEDS_NNP; >> >> This would be set along with seccomp.mode, seccomp.filter, and >> TIF_SECCOMP. Then, during the next secure_computing() call that thread >> makes, it would check the flag: >> >> if (task->seccomp.flags & SECCOMP_NEEDS_NNP) >> task->nnp = 1; >> >> This means that nnp couldn't change in the middle of a running syscall. > > Aha, so you were worried about the same thing. Not sure we need this, > but at least I understand you and... > >> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal >> above would actually make things worse, since now we'd have a thread >> with seccomp set up, and no nnp. If it was in the middle of exec, >> we're still causing a problem. > > Yes ;) > >> I think we'd also need a way to either delay the seccomp changes, or >> to notice this condition during exec. Bleh. > > Hmm. confused again, I mean to suggest that the tsync changes would be stored in each thread, but somewhere other than the true seccomp struct, but with TIF_SECCOMP set. When entering secure_computing(), current would check for the "changes to sync", and apply them, then start the syscall. In this way, we can never race a syscall (like exec). >> What actually happens with a multi-threaded process calls exec? I >> assume all the other threads are destroyed? > > Yes. But this is the point-of-no-return, de_thread() is called after the execing > thared has already passed (say) check_unsafe_exec(). > > However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds() > and drops it in install_exec_creds(), so it should solve the problem? I can't tell yet. I'm still trying to understand the order of operations here. It looks like de_thread() takes the sighand lock. do_execve_common does: prepare_bprm_creds (takes cred_guard_mutex) check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS) prepare_binprm (handles suid escalation, checks nnp separately) security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS) exec_binprm load_elf_binary flush_old_exec de_thread (takes and releases sighand->lock) install_exec_creds (releases cred_guard_mutex) I don't see a way to use cred_guard_mutex during tsync (which holds sighand->lock) without dead-locking. What were you considering here? -Kees
On Wed, Jun 25, 2014 at 10:57 AM, Kees Cook <keescook@chromium.org> wrote: > On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> On 06/25, Kees Cook wrote: >>> >>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote: >>> > >>> > Yes, at least this should close the race with suid-exec. And there are no >>> > other users. Except apparmor, and I hope you will check it because I simply >>> > do not know what it does ;) >>> > >>> >> I wonder if changes to nnp need to "flushed" during syscall entry >>> >> instead of getting updated externally/asynchronously? That way it >>> >> won't be out of sync with the seccomp mode/filters. >>> >> >>> >> Perhaps secure computing needs to check some (maybe seccomp-only) >>> >> atomic flags and flip on the "real" nnp if found? >>> > >>> > Not sure I understand you, could you clarify? >>> >>> Instead of having TSYNC change the nnp bit, it can set a new flag, say: >>> >>> task->seccomp.flags |= SECCOMP_NEEDS_NNP; >>> >>> This would be set along with seccomp.mode, seccomp.filter, and >>> TIF_SECCOMP. Then, during the next secure_computing() call that thread >>> makes, it would check the flag: >>> >>> if (task->seccomp.flags & SECCOMP_NEEDS_NNP) >>> task->nnp = 1; >>> >>> This means that nnp couldn't change in the middle of a running syscall. >> >> Aha, so you were worried about the same thing. Not sure we need this, >> but at least I understand you and... >> >>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal >>> above would actually make things worse, since now we'd have a thread >>> with seccomp set up, and no nnp. If it was in the middle of exec, >>> we're still causing a problem. >> >> Yes ;) >> >>> I think we'd also need a way to either delay the seccomp changes, or >>> to notice this condition during exec. Bleh. >> >> Hmm. confused again, > > I mean to suggest that the tsync changes would be stored in each > thread, but somewhere other than the true seccomp struct, but with > TIF_SECCOMP set. When entering secure_computing(), current would check > for the "changes to sync", and apply them, then start the syscall. In > this way, we can never race a syscall (like exec). I'm not sure that helps. If you set a pending filter part-way through exec, and exec copies that pending filter but doesn't notice NNP, then there's an exploitable race. > >>> What actually happens with a multi-threaded process calls exec? I >>> assume all the other threads are destroyed? >> >> Yes. But this is the point-of-no-return, de_thread() is called after the execing >> thared has already passed (say) check_unsafe_exec(). >> >> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds() >> and drops it in install_exec_creds(), so it should solve the problem? > > I can't tell yet. I'm still trying to understand the order of > operations here. It looks like de_thread() takes the sighand lock. > do_execve_common does: > > prepare_bprm_creds (takes cred_guard_mutex) > check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS) > prepare_binprm (handles suid escalation, checks nnp separately) > security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS) > exec_binprm > load_elf_binary > flush_old_exec > de_thread (takes and releases sighand->lock) > install_exec_creds (releases cred_guard_mutex) > > I don't see a way to use cred_guard_mutex during tsync (which holds > sighand->lock) without dead-locking. What were you considering here? > Grab cred_guard_mutex and then sighand->lock, perhaps? > -Kees > > -- > Kees Cook > Chrome OS Security
On 06/25, Kees Cook wrote: > > On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds() > > and drops it in install_exec_creds(), so it should solve the problem? > > I can't tell yet. I'm still trying to understand the order of > operations here. It looks like de_thread() takes the sighand lock. > do_execve_common does: > > prepare_bprm_creds (takes cred_guard_mutex) > check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS) > prepare_binprm (handles suid escalation, checks nnp separately) > security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS) > exec_binprm > load_elf_binary > flush_old_exec > de_thread (takes and releases sighand->lock) > install_exec_creds (releases cred_guard_mutex) Yes, and note that when cred_guard_mutex is dropped all other threads are already killed, > I don't see a way to use cred_guard_mutex during tsync (which holds > sighand->lock) without dead-locking. What were you considering here? Just take/drop current->signal->cred_guard_mutex along with ->siglock in seccomp_set_mode_filter() ? Unconditionally on depending on SECCOMP_FILTER_FLAG_TSYNC. Oleg.
On Wed, Jun 25, 2014 at 11:09 AM, Andy Lutomirski <luto@amacapital.net> wrote: > On Wed, Jun 25, 2014 at 10:57 AM, Kees Cook <keescook@chromium.org> wrote: >> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote: >>> On 06/25, Kees Cook wrote: >>>> >>>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <oleg@redhat.com> wrote: >>>> > >>>> > Yes, at least this should close the race with suid-exec. And there are no >>>> > other users. Except apparmor, and I hope you will check it because I simply >>>> > do not know what it does ;) >>>> > >>>> >> I wonder if changes to nnp need to "flushed" during syscall entry >>>> >> instead of getting updated externally/asynchronously? That way it >>>> >> won't be out of sync with the seccomp mode/filters. >>>> >> >>>> >> Perhaps secure computing needs to check some (maybe seccomp-only) >>>> >> atomic flags and flip on the "real" nnp if found? >>>> > >>>> > Not sure I understand you, could you clarify? >>>> >>>> Instead of having TSYNC change the nnp bit, it can set a new flag, say: >>>> >>>> task->seccomp.flags |= SECCOMP_NEEDS_NNP; >>>> >>>> This would be set along with seccomp.mode, seccomp.filter, and >>>> TIF_SECCOMP. Then, during the next secure_computing() call that thread >>>> makes, it would check the flag: >>>> >>>> if (task->seccomp.flags & SECCOMP_NEEDS_NNP) >>>> task->nnp = 1; >>>> >>>> This means that nnp couldn't change in the middle of a running syscall. >>> >>> Aha, so you were worried about the same thing. Not sure we need this, >>> but at least I understand you and... >>> >>>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal >>>> above would actually make things worse, since now we'd have a thread >>>> with seccomp set up, and no nnp. If it was in the middle of exec, >>>> we're still causing a problem. >>> >>> Yes ;) >>> >>>> I think we'd also need a way to either delay the seccomp changes, or >>>> to notice this condition during exec. Bleh. >>> >>> Hmm. confused again, >> >> I mean to suggest that the tsync changes would be stored in each >> thread, but somewhere other than the true seccomp struct, but with >> TIF_SECCOMP set. When entering secure_computing(), current would check >> for the "changes to sync", and apply them, then start the syscall. In >> this way, we can never race a syscall (like exec). > > I'm not sure that helps. If you set a pending filter part-way through > exec, and exec copies that pending filter but doesn't notice NNP, then > there's an exploitable race. > >> >>>> What actually happens with a multi-threaded process calls exec? I >>>> assume all the other threads are destroyed? >>> >>> Yes. But this is the point-of-no-return, de_thread() is called after the execing >>> thared has already passed (say) check_unsafe_exec(). >>> >>> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds() >>> and drops it in install_exec_creds(), so it should solve the problem? >> >> I can't tell yet. I'm still trying to understand the order of >> operations here. It looks like de_thread() takes the sighand lock. >> do_execve_common does: >> >> prepare_bprm_creds (takes cred_guard_mutex) >> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS) >> prepare_binprm (handles suid escalation, checks nnp separately) >> security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS) >> exec_binprm >> load_elf_binary >> flush_old_exec >> de_thread (takes and releases sighand->lock) >> install_exec_creds (releases cred_guard_mutex) >> >> I don't see a way to use cred_guard_mutex during tsync (which holds >> sighand->lock) without dead-locking. What were you considering here? > > Grab cred_guard_mutex and then sighand->lock, perhaps? Ah, yes, task->signal is like sighand: shared across all threads.
On Wed, Jun 25, 2014 at 11:20 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 06/25, Kees Cook wrote: >> >> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> > >> > However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds() >> > and drops it in install_exec_creds(), so it should solve the problem? >> >> I can't tell yet. I'm still trying to understand the order of >> operations here. It looks like de_thread() takes the sighand lock. >> do_execve_common does: >> >> prepare_bprm_creds (takes cred_guard_mutex) >> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS) >> prepare_binprm (handles suid escalation, checks nnp separately) >> security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS) >> exec_binprm >> load_elf_binary >> flush_old_exec >> de_thread (takes and releases sighand->lock) >> install_exec_creds (releases cred_guard_mutex) > > Yes, and note that when cred_guard_mutex is dropped all other threads > are already killed, > >> I don't see a way to use cred_guard_mutex during tsync (which holds >> sighand->lock) without dead-locking. What were you considering here? > > Just take/drop current->signal->cred_guard_mutex along with ->siglock > in seccomp_set_mode_filter() ? Unconditionally on depending on > SECCOMP_FILTER_FLAG_TSYNC. Yeah, this looks good. *whew* Testing it now, so far so good. Thanks! -Kees
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 9ff98b4bfe2e..15de2a711518 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -3,6 +3,8 @@ #include <uapi/linux/seccomp.h> +#define SECCOMP_FILTER_FLAG_MASK ~(SECCOMP_FILTER_FLAG_TSYNC) + #ifdef CONFIG_SECCOMP #include <linux/thread_info.h> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index b258878ba754..0f238a43ff1e 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -14,6 +14,9 @@ #define SECCOMP_SET_MODE_STRICT 0 #define SECCOMP_SET_MODE_FILTER 1 +/* Valid flags for SECCOMP_SET_MODE_FILTER */ +#define SECCOMP_FILTER_FLAG_TSYNC 1 + /* * All BPF programs must return a 32-bit value. * The bottom 16-bits are for optional return data. diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 051c3da4fb4c..c58b49375125 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -26,6 +26,7 @@ #ifdef CONFIG_SECCOMP_FILTER #include <asm/syscall.h> #include <linux/filter.h> +#include <linux/pid.h> #include <linux/ptrace.h> #include <linux/security.h> #include <linux/tracehook.h> @@ -218,6 +219,105 @@ static inline void seccomp_assign_mode(struct task_struct *task, } #ifdef CONFIG_SECCOMP_FILTER +/* Returns 1 if the candidate is an ancestor. */ +static int is_ancestor(struct seccomp_filter *candidate, + struct seccomp_filter *child) +{ + /* NULL is the root ancestor. */ + if (candidate == NULL) + return 1; + for (; child; child = child->prev) + if (child == candidate) + return 1; + return 0; +} + +/** + * seccomp_can_sync_threads: checks if all threads can be synchronized + * + * Expects current->sighand->siglock to be held. + * + * Returns 0 on success, -ve on error, or the pid of a thread which was + * either not in the correct seccomp mode or it did not have an ancestral + * seccomp filter. + */ +static pid_t seccomp_can_sync_threads(void) +{ + struct task_struct *thread, *caller; + + BUG_ON(!spin_is_locked(¤t->sighand->siglock)); + + if (current->seccomp.mode != SECCOMP_MODE_FILTER) + return -EACCES; + + /* Validate all threads being eligible for synchronization. */ + caller = current; + for_each_thread(caller, thread) { + pid_t failed; + + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED || + (thread->seccomp.mode == SECCOMP_MODE_FILTER && + is_ancestor(thread->seccomp.filter, + caller->seccomp.filter))) + continue; + + /* Return the first thread that cannot be synchronized. */ + failed = task_pid_vnr(thread); + /* If the pid cannot be resolved, then return -ESRCH */ + if (unlikely(WARN_ON(failed == 0))) + failed = -ESRCH; + return failed; + } + + return 0; +} + +/** + * seccomp_sync_threads: sets all threads to use current's filter + * + * Expects current->sighand->siglock to be held, + * and for seccomp_can_sync_threads() to have returned success already + * without dropping the locks. + * + */ +static void seccomp_sync_threads(void) +{ + struct task_struct *thread, *caller; + + BUG_ON(!spin_is_locked(¤t->sighand->siglock)); + + /* Synchronize all threads. */ + caller = current; + for_each_thread(caller, thread) { + /* Get a task reference for the new leaf node. */ + get_seccomp_filter(caller); + /* + * Drop the task reference to the shared ancestor since + * current's path will hold a reference. (This also + * allows a put before the assignment.) + */ + put_seccomp_filter(thread); + thread->seccomp.filter = caller->seccomp.filter; + /* Opt the other thread into seccomp if needed. + * As threads are considered to be trust-realm + * equivalent (see ptrace_may_access), it is safe to + * allow one thread to transition the other. + */ + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) { + /* + * Don't let an unprivileged task work around + * the no_new_privs restriction by creating + * a thread that sets it up, enters seccomp, + * then dies. + */ + if (task_no_new_privs(caller)) + task_set_no_new_privs(thread); + + seccomp_assign_mode(thread, SECCOMP_MODE_FILTER); + } + } +} + /** * seccomp_prepare_filter: Prepares a seccomp filter for use. * @fprog: BPF program to install @@ -349,7 +449,7 @@ static long seccomp_attach_filter(unsigned int flags, BUG_ON(!spin_is_locked(¤t->sighand->siglock)); /* Validate flags. */ - if (flags != 0) + if (flags & SECCOMP_FILTER_FLAG_MASK) return -EINVAL; /* Validate resulting filter length. */ @@ -359,6 +459,15 @@ static long seccomp_attach_filter(unsigned int flags, if (total_insns > MAX_INSNS_PER_PATH) return -ENOMEM; + /* If thread sync has been requested, check that it is possible. */ + if (flags & SECCOMP_FILTER_FLAG_TSYNC) { + int ret; + + ret = seccomp_can_sync_threads(); + if (ret) + return ret; + } + /* * If there is an existing filter, make it the prev and don't drop its * task reference. @@ -366,6 +475,10 @@ static long seccomp_attach_filter(unsigned int flags, filter->prev = current->seccomp.filter; smp_store_release(¤t->seccomp.filter, filter); + /* Now that the new filter is in place, synchronize to all threads. */ + if (flags & SECCOMP_FILTER_FLAG_TSYNC) + seccomp_sync_threads(); + return 0; }
Applying restrictive seccomp filter programs to large or diverse codebases often requires handling threads which may be started early in the process lifetime (e.g., by code that is linked in). While it is possible to apply permissive programs prior to process start up, it is difficult to further restrict the kernel ABI to those threads after that point. This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for synchronizing thread group seccomp filters at filter installation time. When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, filter) an attempt will be made to synchronize all threads in current's threadgroup to its new seccomp filter program. This is possible iff all threads are using a filter that is an ancestor to the filter current is attempting to synchronize to. NULL filters (where the task is running as SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the calling thread, no_new_privs will be set for all synchronized threads too. On success, 0 is returned. On failure, the pid of one of the failing threads will be returned and no filters will have been applied. The race conditions are against another thread requesting TSYNC, another thread performing a clone, and another thread changing its filter. The sighand lock is sufficient for these cases. The clone case is assisted by the fact that new threads will have their seccomp state duplicated from their parent before appearing on the tasklist. Based on patches by Will Drewry. Suggested-by: Julien Tinnes <jln@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> --- include/linux/seccomp.h | 2 + include/uapi/linux/seccomp.h | 3 ++ kernel/seccomp.c | 115 +++++++++++++++++++++++++++++++++++++++++- 3 files changed, 119 insertions(+), 1 deletion(-)