Message ID | 1474211117-16674-3-git-send-email-jann@thejh.net (mailing list archive) |
---|---|
State | New, archived |
On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> This ensures that self_privunit_id ("privilege unit ID") is only shared by
> processes that share the mm_struct and the signal_struct; not just
> spatially, but also temporally. In other words, if you do execve() or
> clone() without CLONE_THREAD, you get a new privunit_id that has never been
> used before.
[...]
> +void increment_privunit_counter(void)
> +{
> +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> +}
[...]

This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
unusual but allowed).

Ben.
On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > processes that share the mm_struct and the signal_struct; not just
> > spatially, but also temporally. In other words, if you do execve() or
> > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > used before.
> [...]
> > +void increment_privunit_counter(void)
> > +{
> > +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > +}
> [...]
>
> This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> unusual but allowed).

If this wraps, hell breaks loose permission-wise - processes that have
no relationship whatsoever with each other will suddenly be able to ptrace
each other.

The idea is that it never wraps. It wraps after (2^64)/NR_CPUS execs or
forks on one CPU core. NR_CPUS is bounded to <=2^16, so in the worst case,
it wraps after 2^48 execs or forks.

On my system with 3.7GHz per core, 2^16 minimal sequential non-thread clone()
calls need 1 second system time (and 2 seconds wall clock time, but let's
disregard that), so 2^48 non-thread clone() calls should need over 100 years.

But I guess both the kernel and machines get faster - if you think the margin
might not be future-proof enough (or if you think I measured wrong and it's
actually much faster), I guess I could bump this to a 128bit number.
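To spell out the arithmetic behind the 100-year estimate: a hypothetical
userspace helper (illustrative only, not part of the patch; the 2^16
clones-per-second rate is the measurement quoted above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Worst case: NR_CPUS == 2^16, so each CPU's counter steps through
	 * 2^64 / 2^16 == 2^48 distinct values before it would wrap. */
	uint64_t wrap_steps = UINT64_C(1) << 48;

	/* Measured above: ~2^16 minimal sequential non-thread clone()
	 * calls per second of system time on one 3.7GHz core. */
	uint64_t clones_per_sec = UINT64_C(1) << 16;

	uint64_t seconds = wrap_steps / clones_per_sec;	/* 2^32 seconds */
	uint64_t years = seconds / (365ULL * 24 * 3600);

	/* Prints roughly 136 years, consistent with "over 100 years". */
	printf("worst-case wrap horizon: ~%llu years per CPU\n",
	       (unsigned long long)years);
	return 0;
}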
On Sun, Sep 18, 2016 at 08:31:37PM +0200, Jann Horn wrote:
> On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> > On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > > processes that share the mm_struct and the signal_struct; not just
> > > spatially, but also temporally. In other words, if you do execve() or
> > > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > > used before.
> > [...]
> > > +void increment_privunit_counter(void)
> > > +{
> > > +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > > +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > > +}
> > [...]
> >
> > This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> > unusual but allowed).
>
> If this wraps, hell breaks loose permission-wise - processes that have
> no relationship whatsoever with each other will suddenly be able to ptrace
> each other.
>
> The idea is that it never wraps.

That's what I suspected, but wasn't sure. In that case you can
initialise each counter to U64_MAX/NR_CPUS*cpu and increment by
1 each time, which might be more efficient on some architectures.

> It wraps after (2^64)/NR_CPUS execs or
> forks on one CPU core. NR_CPUS is bounded to <=2^16, so in the worst case,
> it wraps after 2^48 execs or forks.
>
> On my system with 3.7GHz per core, 2^16 minimal sequential non-thread clone()
> calls need 1 second system time (and 2 seconds wall clock time, but let's
> disregard that), so 2^48 non-thread clone() calls should need over 100 years.
>
> But I guess both the kernel and machines get faster - if you think the margin
> might not be future-proof enough (or if you think I measured wrong and it's
> actually much faster), I guess I could bump this to a 128bit number.

Sequential execution speed isn't likely to get significantly faster so
with those current numbers this seems to be quite safe.

Ben.
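For context on the power-of-2 constraint: the posted scheme keeps each CPU's
counter on a fixed residue class modulo NR_CPUS, and those classes stay
disjoint across a 2^64 wrap only if NR_CPUS divides 2^64 evenly, i.e. is a
power of 2. Ben's alternative instead gives each CPU a contiguous slice of
the u64 space. A minimal sketch of that suggestion, adapting the init code
from the patch (illustrative, not a tested implementation):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(u64, exec_counter);

static int __init init_exec_counters(void)
{
	unsigned int cpu;

	/* Give each CPU a disjoint slice of the u64 space:
	 * [U64_MAX/NR_CPUS*cpu, U64_MAX/NR_CPUS*(cpu+1)). */
	for_each_possible_cpu(cpu)
		per_cpu(exec_counter, cpu) = U64_MAX / NR_CPUS * cpu;

	return 0;
}
early_initcall(init_exec_counters);

void increment_privunit_counter(void)
{
	/* A plain increment by 1; no power-of-2 assumption on NR_CPUS,
	 * and a cheaper add on architectures where adding 1 is special. */
	current->self_privunit_id = this_cpu_add_return(exec_counter, 1);
}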
On Sun, Sep 18, 2016 at 07:45:07PM +0100, Ben Hutchings wrote:
> On Sun, Sep 18, 2016 at 08:31:37PM +0200, Jann Horn wrote:
> > On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> > > On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > > > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > > > processes that share the mm_struct and the signal_struct; not just
> > > > spatially, but also temporally. In other words, if you do execve() or
> > > > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > > > used before.
> > > [...]
> > > > +void increment_privunit_counter(void)
> > > > +{
> > > > +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > > > +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > > > +}
> > > [...]
> > >
> > > This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> > > unusual but allowed).
> >
> > If this wraps, hell breaks loose permission-wise - processes that have
> > no relationship whatsoever with each other will suddenly be able to ptrace
> > each other.
> >
> > The idea is that it never wraps.
>
> That's what I suspected, but wasn't sure. In that case you can
> initialise each counter to U64_MAX/NR_CPUS*cpu and increment by
> 1 each time, which might be more efficient on some architectures.

Makes sense. Will do that!
On Sep 18, 2016 8:45 AM, "Ben Hutchings" <ben@decadent.org.uk> wrote:
>
> On Sun, Sep 18, 2016 at 08:31:37PM +0200, Jann Horn wrote:
> > On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> > > On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > > > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > > > processes that share the mm_struct and the signal_struct; not just
> > > > spatially, but also temporally. In other words, if you do execve() or
> > > > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > > > used before.
> > > [...]
> > > > +void increment_privunit_counter(void)
> > > > +{
> > > > +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > > > +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > > > +}
> > > [...]
> > >
> > > This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> > > unusual but allowed).
> >
> > If this wraps, hell breaks loose permission-wise - processes that have
> > no relationship whatsoever with each other will suddenly be able to ptrace
> > each other.
> >
> > The idea is that it never wraps.
>
> That's what I suspected, but wasn't sure. In that case you can
> initialise each counter to U64_MAX/NR_CPUS*cpu and increment by
> 1 each time, which might be more efficient on some architectures.
>
> > It wraps after (2^64)/NR_CPUS execs or
> > forks on one CPU core. NR_CPUS is bounded to <=2^16, so in the worst case,
> > it wraps after 2^48 execs or forks.
> >
> > On my system with 3.7GHz per core, 2^16 minimal sequential non-thread clone()
> > calls need 1 second system time (and 2 seconds wall clock time, but let's
> > disregard that), so 2^48 non-thread clone() calls should need over 100 years.
> >
> > But I guess both the kernel and machines get faster - if you think the margin
> > might not be future-proof enough (or if you think I measured wrong and it's
> > actually much faster), I guess I could bump this to a 128bit number.
>
> Sequential execution speed isn't likely to get significantly faster so
> with those current numbers this seems to be quite safe.

But how big can NR_CPUS get before this gets uncomfortable?

We could do:

struct luid {
	u64 count;
	unsigned cpu;
};

(LUID = locally unique ID).

IIRC my draft PCID code does something similar to uniquely identify
mms. If I accidentally reused a PCID without a flush, everything
would explode.

--Andy
On Sun, Sep 18, 2016 at 12:57:46PM -0700, Andy Lutomirski wrote:
> On Sep 18, 2016 8:45 AM, "Ben Hutchings" <ben@decadent.org.uk> wrote:
> >
> > On Sun, Sep 18, 2016 at 08:31:37PM +0200, Jann Horn wrote:
> > > On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> > > > On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > > > > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > > > > processes that share the mm_struct and the signal_struct; not just
> > > > > spatially, but also temporally. In other words, if you do execve() or
> > > > > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > > > > used before.
> > > > [...]
> > > > > +void increment_privunit_counter(void)
> > > > > +{
> > > > > +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > > > > +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > > > > +}
> > > > [...]
> > > >
> > > > This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> > > > unusual but allowed).
> > >
> > > If this wraps, hell breaks loose permission-wise - processes that have
> > > no relationship whatsoever with each other will suddenly be able to ptrace
> > > each other.
> > >
> > > The idea is that it never wraps.
> >
> > That's what I suspected, but wasn't sure. In that case you can
> > initialise each counter to U64_MAX/NR_CPUS*cpu and increment by
> > 1 each time, which might be more efficient on some architectures.
> >
> > > It wraps after (2^64)/NR_CPUS execs or
> > > forks on one CPU core. NR_CPUS is bounded to <=2^16, so in the worst case,
> > > it wraps after 2^48 execs or forks.
> > >
> > > On my system with 3.7GHz per core, 2^16 minimal sequential non-thread clone()
> > > calls need 1 second system time (and 2 seconds wall clock time, but let's
> > > disregard that), so 2^48 non-thread clone() calls should need over 100 years.
> > >
> > > But I guess both the kernel and machines get faster - if you think the margin
> > > might not be future-proof enough (or if you think I measured wrong and it's
> > > actually much faster), I guess I could bump this to a 128bit number.
> >
> > Sequential execution speed isn't likely to get significantly faster so
> > with those current numbers this seems to be quite safe.
>
> But how big can NR_CPUS get before this gets uncomfortable?
>
> We could do:
>
> struct luid {
> 	u64 count;
> 	unsigned cpu;
> };
>
> (LUID = locally unique ID).
>
> IIRC my draft PCID code does something similar to uniquely identify
> mms. If I accidentally reused a PCID without a flush, everything
> would explode.

So I guess for generating a new LUID, I'd have to do something like this?

struct luid new_luid;

preempt_disable();
raw_cpu_add(luid_counters, 1);
new_luid.count = raw_cpu_read(luid_counters);
new_luid.cpu = smp_processor_id();
preempt_enable();

Disabling preemption should be sufficient as long as nobody generates
LUIDs from IRQ context, right?
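A self-contained sketch of what that could look like, assuming a per-cpu
counter named luid_counters and a helper generate_luid() (both hypothetical
names), with the add and read folded into one raw_cpu_add_return() so the
counter update and the CPU id are taken inside the same preempt-off region:

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/smp.h>

struct luid {
	u64 count;
	unsigned int cpu;
};

static DEFINE_PER_CPU(u64, luid_counters);

static void generate_luid(struct luid *luid)
{
	/* Disable preemption so the counter we bump and the CPU number
	 * we record are guaranteed to belong to the same CPU; the raw_*
	 * accessor is safe here because migration is excluded. */
	preempt_disable();
	luid->count = raw_cpu_add_return(luid_counters, 1);
	luid->cpu = smp_processor_id();
	preempt_enable();
}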
diff --git a/fs/exec.c b/fs/exec.c
index 84430ee..1a15cb0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1281,6 +1281,25 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
 }
 EXPORT_SYMBOL(would_dump);
 
+static DEFINE_PER_CPU(u64, exec_counter);
+static int __init init_exec_counters(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(exec_counter, cpu) = (u64)cpu;
+	}
+
+	return 0;
+}
+early_initcall(init_exec_counters);
+
+void increment_privunit_counter(void)
+{
+	BUILD_BUG_ON(NR_CPUS > (1 << 16));
+	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
+}
+
 void setup_new_exec(struct linux_binprm * bprm)
 {
 	arch_pick_mmap_layout(current->mm);
@@ -1314,7 +1333,7 @@ void setup_new_exec(struct linux_binprm * bprm)
 
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
-	current->self_exec_id++;
+	increment_privunit_counter();
 	flush_signal_handlers(current, 0);
 	do_close_on_exec(current->files);
 }
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 1303b57..9570bd0 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -100,6 +100,7 @@ extern int prepare_binprm(struct linux_binprm *);
 extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int search_binary_handler(struct linux_binprm *);
 extern int flush_old_exec(struct linux_binprm * bprm);
+extern void increment_privunit_counter(void);
 extern void setup_new_exec(struct linux_binprm * bprm);
 extern void would_dump(struct linux_binprm *, struct file *);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2a1df2f..e4bf894 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1688,8 +1688,8 @@ struct task_struct {
 	struct seccomp seccomp;
 
 /* Thread group tracking */
-	u32 parent_exec_id;
-	u32 self_exec_id;
+	u64 parent_privunit_id;
+	u64 self_privunit_id;
 /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
  * mempolicy */
 	spinlock_t alloc_lock;
diff --git a/kernel/fork.c b/kernel/fork.c
index 2d46f3a..537c117 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1567,6 +1567,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		p->exit_signal = (clone_flags & CSIGNAL);
 		p->group_leader = p;
 		p->tgid = p->pid;
+		increment_privunit_counter();
 	}
 
 	p->nr_dirtied = 0;
@@ -1597,10 +1598,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* CLONE_PARENT re-uses the old parent */
 	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
 		p->real_parent = current->real_parent;
-		p->parent_exec_id = current->parent_exec_id;
+		p->parent_privunit_id = current->parent_privunit_id;
 	} else {
 		p->real_parent = current;
-		p->parent_exec_id = current->self_exec_id;
+		p->parent_privunit_id = current->self_privunit_id;
 	}
 
 	spin_lock(&current->sighand->siglock);
diff --git a/kernel/signal.c b/kernel/signal.c
index af21afc..e4e3e1b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1590,7 +1590,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 		/*
 		 * This is only possible if parent == real_parent.
 		 * Check if it has changed security domain.
 		 */
-		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
+		if (tsk->parent_privunit_id != tsk->parent->self_privunit_id)
 			sig = SIGCHLD;
 	}
This ensures that self_privunit_id ("privilege unit ID") is only shared by
processes that share the mm_struct and the signal_struct; not just
spatially, but also temporally. In other words, if you do execve() or
clone() without CLONE_THREAD, you get a new privunit_id that has never been
used before.

One reason for doing this is that it prevents an attacker from sending an
arbitrary signal to a parent process after performing 2^32-1 execve()
calls.

The second reason is that it permits using the self_privunit_id in a later
patch to check during a ptrace access whether subject and object are
temporally and spatially equal for privilege checking purposes.

This patch was grabbed from grsecurity and modified. Credit for the
original patch goes to Brad Spengler <spender@grsecurity.net>.

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/exec.c               | 21 ++++++++++++++++++++-
 include/linux/binfmts.h |  1 +
 include/linux/sched.h   |  4 ++--
 kernel/fork.c           |  5 +++--
 kernel/signal.c         |  2 +-
 5 files changed, 27 insertions(+), 6 deletions(-)
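To make the 2^32-1 figure concrete: the old exec_id fields are u32, so after
2^32 increments self_exec_id arrives back at the value recorded in a child's
parent_exec_id, and the do_notify_parent() check in the diff above no longer
detects that the parent changed security domain. A hypothetical userspace
illustration of the modular arithmetic:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t parent_exec_id = 7;		/* recorded at clone() time */
	uint32_t self_exec_id = parent_exec_id;
	uint64_t execs = UINT64_C(1) << 32;	/* attacker's execve() count */

	/* u32 arithmetic is modulo 2^32, so adding 2^32 brings the
	 * counter exactly back to its starting value. */
	self_exec_id += (uint32_t)execs;

	printf("%s\n", self_exec_id == parent_exec_id ?
	       "wrapped: exec_id check defeated" : "still distinct");
	return 0;
}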