[2/9] exec: turn self_exec_id into self_privunit_id

Message ID 1474211117-16674-3-git-send-email-jann@thejh.net

Commit Message

Jann Horn Sept. 18, 2016, 3:05 p.m. UTC
This ensures that self_privunit_id ("privilege unit ID") is only shared by
processes that share the mm_struct and the signal_struct; not just
spatially, but also temporally. In other words, if you do execve() or
clone() without CLONE_THREAD, you get a new privunit_id that has never been
used before.

One reason for doing this is that it prevents an attacker from sending an
arbitrary signal to a parent process after performing 2^32-1 execve()
calls.

The second reason for this is that it permits using the self_privunit_id in
a later patch to check during a ptrace access whether subject and object
are temporally and spatially equal for privilege checking purposes.

This patch was grabbed from grsecurity and modified. Credit for the
original patch goes to Brad Spengler <spender@grsecurity.net>.

Signed-off-by: Jann Horn <jann@thejh.net>
---
 fs/exec.c               | 21 ++++++++++++++++++++-
 include/linux/binfmts.h |  1 +
 include/linux/sched.h   |  4 ++--
 kernel/fork.c           |  5 +++--
 kernel/signal.c         |  2 +-
 5 files changed, 27 insertions(+), 6 deletions(-)

Comments

Ben Hutchings Sept. 18, 2016, 6:13 p.m. UTC | #1
On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> This ensures that self_privunit_id ("privilege unit ID") is only shared by
> processes that share the mm_struct and the signal_struct; not just
> spatially, but also temporally. In other words, if you do execve() or
> clone() without CLONE_THREAD, you get a new privunit_id that has never been
> used before.
[...]
> +void increment_privunit_counter(void)
> +{
> +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> +}
[...]

This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
unusual but allowed).

Ben.
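A small standalone model of the wrap Ben points out (my illustration, not from the thread; an 8-bit counter stands in for the u64 so the wrap is actually reachable): with a power-of-two stride, a counter stays in its CPU's residue class forever even after wrapping, but with a stride of 3 it crosses into another CPU's ID space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the per-CPU stride scheme: the counter starts at `cpu`
 * (as in init_exec_counters()) and is bumped by nr_cpus per exec
 * (as in this_cpu_add_return(exec_counter, NR_CPUS)).  A uint8_t
 * stands in for the u64 so wraparound is reachable in a test. */
static bool stays_in_lane(unsigned int nr_cpus, unsigned int cpu,
			  unsigned int steps)
{
	uint8_t counter = (uint8_t)cpu;
	unsigned int i;

	for (i = 0; i < steps; i++) {
		counter = (uint8_t)(counter + nr_cpus);	/* may wrap mod 256 */
		if (counter % nr_cpus != cpu % nr_cpus)
			return false;	/* collided with another CPU's IDs */
	}
	return true;
}
```

With nr_cpus = 4 (256 is a multiple of 4) each CPU keeps its residue class across the wrap; with nr_cpus = 3 the wrap shifts the counter into a different class, which is exactly the incorrect-wrap case.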
Jann Horn Sept. 18, 2016, 6:31 p.m. UTC | #2
On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > processes that share the mm_struct and the signal_struct; not just
> > spatially, but also temporally. In other words, if you do execve() or
> > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > used before.
> [...]
> > +void increment_privunit_counter(void)
> > +{
> > +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > +}
> [...]
> 
> This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> unusual but allowed).

If this wraps, hell breaks loose permission-wise - processes that have
no relationship whatsoever with each other will suddenly be able to ptrace
each other.

The idea is that it never wraps. It wraps after (2^64)/NR_CPUS execs or
forks on one CPU core. NR_CPUS is bounded to <=2^16, so in the worst case,
it wraps after 2^48 execs or forks.

On my system with 3.7GHz per core, 2^16 minimal sequential non-thread clone()
calls take 1 second of system time (and 2 seconds of wall-clock time, but let's
disregard that), so 2^48 non-thread clone() calls should take over 100 years.

But I guess both the kernel and machines get faster - if you think the margin
might not be future-proof enough (or if you think I measured wrong and it's
actually much faster), I guess I could bump this to a 128bit number.
Ben Hutchings Sept. 18, 2016, 6:45 p.m. UTC | #3
On Sun, Sep 18, 2016 at 08:31:37PM +0200, Jann Horn wrote:
> On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> > On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > > processes that share the mm_struct and the signal_struct; not just
> > > spatially, but also temporally. In other words, if you do execve() or
> > > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > > used before.
> > [...]
> > > +void increment_privunit_counter(void)
> > > +{
> > > +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > > +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > > +}
> > [...]
> > 
> > This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> > unusual but allowed).
> 
> If this wraps, hell breaks loose permission-wise - processes that have
> no relationship whatsoever with each other will suddenly be able to ptrace
> each other.
> 
> The idea is that it never wraps.

That's what I suspected, but wasn't sure.  In that case you can
initialise each counter to U64_MAX/NR_CPUS*cpu and increment by
1 each time, which might be more efficient on some architectures.

> It wraps after (2^64)/NR_CPUS execs or
> forks on one CPU core. NR_CPUS is bounded to <=2^16, so in the worst case,
> it wraps after 2^48 execs or forks.
> 
> On my system with 3.7GHz per core, 2^16 minimal sequential non-thread clone()
> calls need 1 second system time (and 2 seconds wall clock time, but let's
> disregard that), so 2^48 non-thread clone() calls should need over 100 years.
> 
> But I guess both the kernel and machines get faster - if you think the margin
> might not be future-proof enough (or if you think I measured wrong and it's
> actually much faster), I guess I could bump this to a 128bit number.

Sequential execution speed isn't likely to get significantly faster so
with those current numbers this seems to be quite safe.

Ben.
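Ben's variant (initialise each counter to U64_MAX/NR_CPUS*cpu, increment by 1) can be sketched as follows — an illustrative userspace model with made-up names, not kernel code. Each CPU owns a contiguous block of the 64-bit space, so the hot path is a plain increment and no power-of-two assumption is needed:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4ULL	/* illustrative; any value works */

/* Start of the contiguous ID block owned by `cpu`. */
static uint64_t block_base(uint64_t cpu)
{
	return UINT64_MAX / DEMO_NR_CPUS * cpu;
}

/* The n-th ID handed out on `cpu`: increment-by-1 from the base. */
static uint64_t nth_id(uint64_t cpu, uint64_t n)
{
	return block_base(cpu) + n;
}
```

As long as no CPU hands out more than U64_MAX/NR_CPUS IDs, the blocks never overlap.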
Jann Horn Sept. 18, 2016, 7:08 p.m. UTC | #4
On Sun, Sep 18, 2016 at 07:45:07PM +0100, Ben Hutchings wrote:
> On Sun, Sep 18, 2016 at 08:31:37PM +0200, Jann Horn wrote:
> > On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> > > On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > > > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > > > processes that share the mm_struct and the signal_struct; not just
> > > > spatially, but also temporally. In other words, if you do execve() or
> > > > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > > > used before.
> > > [...]
> > > > +void increment_privunit_counter(void)
> > > > +{
> > > > +	BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > > > +	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > > > +}
> > > [...]
> > > 
> > > This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> > > unusual but allowed).
> > 
> > If this wraps, hell breaks loose permission-wise - processes that have
> > no relationship whatsoever with each other will suddenly be able to ptrace
> > each other.
> > 
> > The idea is that it never wraps.
> 
> That's what I suspected, but wasn't sure.  In that case you can
> initialise each counter to U64_MAX/NR_CPUS*cpu and increment by
> 1 each time, which might be more efficient on some architectures.

Makes sense. Will do that!
Andy Lutomirski Sept. 18, 2016, 7:57 p.m. UTC | #5
On Sep 18, 2016 8:45 AM, "Ben Hutchings" <ben@decadent.org.uk> wrote:
>
> On Sun, Sep 18, 2016 at 08:31:37PM +0200, Jann Horn wrote:
> > On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> > > On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > > > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > > > processes that share the mm_struct and the signal_struct; not just
> > > > spatially, but also temporally. In other words, if you do execve() or
> > > > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > > > used before.
> > > [...]
> > > > +void increment_privunit_counter(void)
> > > > +{
> > > > + BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > > > + current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > > > +}
> > > [...]
> > >
> > > This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> > > unusual but allowed).
> >
> > If this wraps, hell breaks loose permission-wise - processes that have
> > no relationship whatsoever with each other will suddenly be able to ptrace
> > each other.
> >
> > The idea is that it never wraps.
>
> That's what I suspected, but wasn't sure.  In that case you can
> initialise each counter to U64_MAX/NR_CPUS*cpu and increment by
> 1 each time, which might be more efficient on some architectures.
>
> > It wraps after (2^64)/NR_CPUS execs or
> > forks on one CPU core. NR_CPUS is bounded to <=2^16, so in the worst case,
> > it wraps after 2^48 execs or forks.
> >
> > On my system with 3.7GHz per core, 2^16 minimal sequential non-thread clone()
> > calls need 1 second system time (and 2 seconds wall clock time, but let's
> > disregard that), so 2^48 non-thread clone() calls should need over 100 years.
> >
> > But I guess both the kernel and machines get faster - if you think the margin
> > might not be future-proof enough (or if you think I measured wrong and it's
> > actually much faster), I guess I could bump this to a 128bit number.
>
> Sequential execution speed isn't likely to get significantly faster so
> with those current numbers this seems to be quite safe.
>

But how big can NR_CPUS get before this gets uncomfortable?

We could do:

struct luid {
  u64 count;
  unsigned cpu;
};

(LUID = locally unique ID).

IIRC my draft PCID code does something similar to uniquely identify
mms.  If I accidentally reused a PCID without a flush, everything
would explode.

--Andy
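Andy's (count, cpu) pair can be modelled directly (a sketch with assumed names; `u64`/`unsigned` become stdint types in userspace). Equality requires both fields to match, so IDs generated on different CPUs can never collide, regardless of how any single per-CPU count behaves relative to NR_CPUS:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* LUID = locally unique ID: a per-CPU count plus the CPU it was
 * generated on.  Uniqueness per CPU plus the cpu field gives global
 * uniqueness without any stride or power-of-two constraint. */
struct luid {
	uint64_t count;
	unsigned int cpu;
};

static bool luid_eq(const struct luid *a, const struct luid *b)
{
	return a->count == b->count && a->cpu == b->cpu;
}
```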
Jann Horn Sept. 19, 2016, 3:31 p.m. UTC | #6
On Sun, Sep 18, 2016 at 12:57:46PM -0700, Andy Lutomirski wrote:
> On Sep 18, 2016 8:45 AM, "Ben Hutchings" <ben@decadent.org.uk> wrote:
> >
> > On Sun, Sep 18, 2016 at 08:31:37PM +0200, Jann Horn wrote:
> > > On Sun, Sep 18, 2016 at 07:13:27PM +0100, Ben Hutchings wrote:
> > > > On Sun, 2016-09-18 at 17:05 +0200, Jann Horn wrote:
> > > > > This ensures that self_privunit_id ("privilege unit ID") is only shared by
> > > > > processes that share the mm_struct and the signal_struct; not just
> > > > > spatially, but also temporally. In other words, if you do execve() or
> > > > > clone() without CLONE_THREAD, you get a new privunit_id that has never been
> > > > > used before.
> > > > [...]
> > > > > +void increment_privunit_counter(void)
> > > > > +{
> > > > > + BUILD_BUG_ON(NR_CPUS > (1 << 16));
> > > > > + current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
> > > > > +}
> > > > [...]
> > > >
> > > > This will wrap incorrectly if NR_CPUS is not a power of 2 (which is
> > > > unusual but allowed).
> > >
> > > If this wraps, hell breaks loose permission-wise - processes that have
> > > no relationship whatsoever with each other will suddenly be able to ptrace
> > > each other.
> > >
> > > The idea is that it never wraps.
> >
> > That's what I suspected, but wasn't sure.  In that case you can
> > initialise each counter to U64_MAX/NR_CPUS*cpu and increment by
> > 1 each time, which might be more efficient on some architectures.
> >
> > > It wraps after (2^64)/NR_CPUS execs or
> > > forks on one CPU core. NR_CPUS is bounded to <=2^16, so in the worst case,
> > > it wraps after 2^48 execs or forks.
> > >
> > > On my system with 3.7GHz per core, 2^16 minimal sequential non-thread clone()
> > > calls need 1 second system time (and 2 seconds wall clock time, but let's
> > > disregard that), so 2^48 non-thread clone() calls should need over 100 years.
> > >
> > > But I guess both the kernel and machines get faster - if you think the margin
> > > might not be future-proof enough (or if you think I measured wrong and it's
> > > actually much faster), I guess I could bump this to a 128bit number.
> >
> > Sequential execution speed isn't likely to get significantly faster so
> > with those current numbers this seems to be quite safe.
> >
> 
> But how big can NR_CPUs get before this gets uncomfortable?
> 
> We could do:
> 
> struct luid {
>   u64 count;
>   unsigned cpu;
> };
> 
> (LUID = locally unique ID).
> 
> IIRC my draft PCID code does something similar to uniquely identify
> mms.  If I accidentally reused a PCID without a flush, everything
> would explode.

So I guess for generating a new LUID, I'd have to do something like this?

  struct luid new_luid;
  preempt_disable();
  raw_cpu_add(luid_counters, 1);
  new_luid.count = raw_cpu_read(luid_counters);
  new_luid.cpu = smp_processor_id();
  preempt_enable();

Disabling preemption should be sufficient as long as nobody generates LUIDs
from IRQ context, right?
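As an aside (my observation, not from the thread): this_cpu_add_return() already returns the incremented value safely with respect to preemption, so the add and the read could be a single call; the preempt_disable() section is still needed so that the counter bump and smp_processor_id() refer to the same CPU. A userspace model of the generation step (hypothetical names, no real per-cpu machinery — the CPU is just a parameter here):

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Model of per-CPU LUID generation: one counter per CPU, and the ID
 * is the (count, cpu) pair.  In the kernel, disabling preemption is
 * what guarantees that both fields are taken on the same CPU. */
struct model_luid {
	uint64_t count;
	unsigned int cpu;
};

static uint64_t model_counters[MODEL_NR_CPUS];

static struct model_luid model_generate_luid(unsigned int cpu)
{
	struct model_luid id;

	id.count = ++model_counters[cpu];	/* per-CPU counter, stride 1 */
	id.cpu = cpu;
	return id;
}
```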

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 84430ee..1a15cb0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1281,6 +1281,25 @@  void would_dump(struct linux_binprm *bprm, struct file *file)
 }
 EXPORT_SYMBOL(would_dump);
 
+static DEFINE_PER_CPU(u64, exec_counter);
+static int __init init_exec_counters(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(exec_counter, cpu) = (u64)cpu;
+	}
+
+	return 0;
+}
+early_initcall(init_exec_counters);
+
+void increment_privunit_counter(void)
+{
+	BUILD_BUG_ON(NR_CPUS > (1 << 16));
+	current->self_privunit_id = this_cpu_add_return(exec_counter, NR_CPUS);
+}
+
 void setup_new_exec(struct linux_binprm * bprm)
 {
 	arch_pick_mmap_layout(current->mm);
@@ -1314,7 +1333,7 @@  void setup_new_exec(struct linux_binprm * bprm)
 
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
-	current->self_exec_id++;
+	increment_privunit_counter();
 	flush_signal_handlers(current, 0);
 	do_close_on_exec(current->files);
 }
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 1303b57..9570bd0 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -100,6 +100,7 @@  extern int prepare_binprm(struct linux_binprm *);
 extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int search_binary_handler(struct linux_binprm *);
 extern int flush_old_exec(struct linux_binprm * bprm);
+extern void increment_privunit_counter(void);
 extern void setup_new_exec(struct linux_binprm * bprm);
 extern void would_dump(struct linux_binprm *, struct file *);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2a1df2f..e4bf894 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1688,8 +1688,8 @@  struct task_struct {
 	struct seccomp seccomp;
 
 /* Thread group tracking */
-   	u32 parent_exec_id;
-   	u32 self_exec_id;
+	u64 parent_privunit_id;
+	u64 self_privunit_id;
 /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
  * mempolicy */
 	spinlock_t alloc_lock;
diff --git a/kernel/fork.c b/kernel/fork.c
index 2d46f3a..537c117 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1567,6 +1567,7 @@  static struct task_struct *copy_process(unsigned long clone_flags,
 			p->exit_signal = (clone_flags & CSIGNAL);
 		p->group_leader = p;
 		p->tgid = p->pid;
+		increment_privunit_counter();
 	}
 
 	p->nr_dirtied = 0;
@@ -1597,10 +1598,10 @@  static struct task_struct *copy_process(unsigned long clone_flags,
 	/* CLONE_PARENT re-uses the old parent */
 	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
 		p->real_parent = current->real_parent;
-		p->parent_exec_id = current->parent_exec_id;
+		p->parent_privunit_id = current->parent_privunit_id;
 	} else {
 		p->real_parent = current;
-		p->parent_exec_id = current->self_exec_id;
+		p->parent_privunit_id = current->self_privunit_id;
 	}
 
 	spin_lock(&current->sighand->siglock);
diff --git a/kernel/signal.c b/kernel/signal.c
index af21afc..e4e3e1b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1590,7 +1590,7 @@  bool do_notify_parent(struct task_struct *tsk, int sig)
 		 * This is only possible if parent == real_parent.
 		 * Check if it has changed security domain.
 		 */
-		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
+		if (tsk->parent_privunit_id != tsk->parent->self_privunit_id)
 			sig = SIGCHLD;
 	}