diff mbox series

[ghak90,V9,11/13] audit: contid check descendancy and nesting

Message ID 01229b93733d9baf6ac9bb0cc243eeb08ad579cd.1593198710.git.rgb@redhat.com (mailing list archive)
State New, archived
Headers show
Series audit: implement container identifier | expand

Commit Message

Richard Guy Briggs June 27, 2020, 1:20 p.m. UTC
Require the target task to be a descendant of the container
orchestrator/engine.

You would only change the audit container ID from one set or inherited
value to another if you were nesting containers.

If changing the contid, the container orchestrator/engine must be a
descendant and not same orchestrator as the one that set it so it is not
possible to change the contid of another orchestrator's container.

Since the task_is_descendant() function is used in YAMA and in audit,
remove the duplication and pull the function into kernel/core/sched.c

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
 include/linux/sched.h    |  3 +++
 kernel/audit.c           | 23 +++++++++++++++++++++--
 kernel/sched/core.c      | 33 +++++++++++++++++++++++++++++++++
 security/yama/yama_lsm.c | 33 ---------------------------------
 4 files changed, 57 insertions(+), 35 deletions(-)

Comments

Paul Moore July 5, 2020, 3:11 p.m. UTC | #1
On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> Require the target task to be a descendant of the container
> orchestrator/engine.
>
> You would only change the audit container ID from one set or inherited
> value to another if you were nesting containers.
>
> If changing the contid, the container orchestrator/engine must be a
> descendant and not same orchestrator as the one that set it so it is not
> possible to change the contid of another orchestrator's container.
>
> Since the task_is_descendant() function is used in YAMA and in audit,
> remove the duplication and pull the function into kernel/core/sched.c
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
>  include/linux/sched.h    |  3 +++
>  kernel/audit.c           | 23 +++++++++++++++++++++--
>  kernel/sched/core.c      | 33 +++++++++++++++++++++++++++++++++
>  security/yama/yama_lsm.c | 33 ---------------------------------
>  4 files changed, 57 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2213ac670386..06938d0b9e0c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2047,4 +2047,7 @@ static inline void rseq_syscall(struct pt_regs *regs)
>
>  const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>
> +extern int task_is_descendant(struct task_struct *parent,
> +                             struct task_struct *child);
> +
>  #endif
> diff --git a/kernel/audit.c b/kernel/audit.c
> index a862721dfd9b..efa65ec01239 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -2713,6 +2713,20 @@ int audit_signal_info(int sig, struct task_struct *t)
>         return audit_signal_info_syscall(t);
>  }
>
> +static bool audit_contid_isnesting(struct task_struct *tsk)
> +{
> +       bool isowner = false;
> +       bool ownerisparent = false;
> +
> +       rcu_read_lock();
> +       if (tsk->audit && tsk->audit->cont) {
> +               isowner = current == tsk->audit->cont->owner;
> +               ownerisparent = task_is_descendant(tsk->audit->cont->owner, current);

I want to make sure I'm understanding this correctly and I keep
mentally tripping over something: it seems like for a given audit
container ID a task is either the owner or a descendent, there is no
third state, is that correct?

Assuming that is true, can the descendent check simply be a negative
owner check given they both have the same audit container ID?

> +       }
> +       rcu_read_unlock();
> +       return !isowner && ownerisparent;
> +}
> +
>  /*
>   * audit_set_contid - set current task's audit contid
>   * @task: target task
> @@ -2755,8 +2769,13 @@ int audit_set_contid(struct task_struct *task, u64 contid)
>                 rc = -EBUSY;
>                 goto unlock;
>         }
> -       /* if contid is already set, deny */
> -       if (audit_contid_set(task))
> +       /* if task is not descendant, block */
> +       if (task == current || !task_is_descendant(current, task)) {

I'm also still fuzzy on why we can't let a task set it's own audit
container ID, assuming it meets all the criteria established in patch
2/13.  It somewhat made sense when you were tracking inherited vs
explicitly set audit container IDs, but that doesn't appear to be the
case so far in this patchset, yes?

> +               rc = -EXDEV;

I'm fairly confident we had a discussion about not using all these
different error codes, but that may be a moot point given my next
comment.

> +               goto unlock;
> +       }
> +       /* only allow contid setting again if nesting */
> +       if (audit_contid_set(task) && !audit_contid_isnesting(task))
>                 rc = -EEXIST;

It seems like what we need in audit_set_contid() is a check to ensure
that the task being modified is only modified by the owner of the
audit container ID, yes?  If so, I would think we could do this quite
easily with the following, or similar logic, (NOTE: assumes both
current and tsk are properly setup):

  if ((current->audit->cont != tsk->audit->cont) ||
(current->audit->cont->owner != current))
    return -EACCESS;

This is somewhat independent of the above issue, but we may also want
to add to the capability check.  Patch 2 adds a
"capable(CAP_AUDIT_CONTROL)" which is good, but perhaps we also need a
"ns_capable(CAP_AUDIT_CONTROL)" to allow a given audit container ID
orchestrator/owner the ability to control which of it's descendants
can change their audit container ID, for example:

  if (!capable(CAP_AUDIT_CONTROL) ||
      !ns_capable(current->nsproxy->user_ns, CAP_AUDIT_CONTROL))
    return -EPERM;

--
paul moore
www.paul-moore.com
Richard Guy Briggs Aug. 7, 2020, 5:10 p.m. UTC | #2
On 2020-07-05 11:11, Paul Moore wrote:
> On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > Require the target task to be a descendant of the container
> > orchestrator/engine.
> >
> > You would only change the audit container ID from one set or inherited
> > value to another if you were nesting containers.
> >
> > If changing the contid, the container orchestrator/engine must be a
> > descendant and not same orchestrator as the one that set it so it is not
> > possible to change the contid of another orchestrator's container.

Are we able to agree on the premises above?  Is anything asserted that
should not be and is there anything missing?

I've been sitting on my response below for more than a week trying to
understand the issues raised and to give it the proper attention to a
reply.  Please excuse my tardiness at replying on this issue since I'm
still having trouble thinking through all the scenarios for nesting.

> > Since the task_is_descendant() function is used in YAMA and in audit,
> > remove the duplication and pull the function into kernel/core/sched.c
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> >  include/linux/sched.h    |  3 +++
> >  kernel/audit.c           | 23 +++++++++++++++++++++--
> >  kernel/sched/core.c      | 33 +++++++++++++++++++++++++++++++++
> >  security/yama/yama_lsm.c | 33 ---------------------------------
> >  4 files changed, 57 insertions(+), 35 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 2213ac670386..06938d0b9e0c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2047,4 +2047,7 @@ static inline void rseq_syscall(struct pt_regs *regs)
> >
> >  const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> >
> > +extern int task_is_descendant(struct task_struct *parent,
> > +                             struct task_struct *child);
> > +
> >  #endif
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index a862721dfd9b..efa65ec01239 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2713,6 +2713,20 @@ int audit_signal_info(int sig, struct task_struct *t)
> >         return audit_signal_info_syscall(t);
> >  }
> >
> > +static bool audit_contid_isnesting(struct task_struct *tsk)
> > +{
> > +       bool isowner = false;
> > +       bool ownerisparent = false;
> > +
> > +       rcu_read_lock();
> > +       if (tsk->audit && tsk->audit->cont) {
> > +               isowner = current == tsk->audit->cont->owner;
> > +               ownerisparent = task_is_descendant(tsk->audit->cont->owner, current);
> 
> I want to make sure I'm understanding this correctly and I keep
> mentally tripping over something: it seems like for a given audit
> container ID a task is either the owner or a descendent, there is no
> third state, is that correct?

Sure there is.  It could be another owner (which is addressed when we
search for an existing contobj match), or in the next patch, the
owner's parent if nested or a peer.

> Assuming that is true, can the descendent check simply be a negative
> owner check given they both have the same audit container ID?

There isn't actually a check in my code for the orchestrator contid and
task contid being the same.  Maybe I was making this check more
complicated than necessary, and still incomplete, but see below for more...

> > +       }
> > +       rcu_read_unlock();
> > +       return !isowner && ownerisparent;
> > +}
> > +
> >  /*
> >   * audit_set_contid - set current task's audit contid
> >   * @task: target task
> > @@ -2755,8 +2769,13 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> >                 rc = -EBUSY;
> >                 goto unlock;
> >         }
> > -       /* if contid is already set, deny */
> > -       if (audit_contid_set(task))
> > +       /* if task is not descendant, block */
> > +       if (task == current || !task_is_descendant(current, task)) {
> 
> I'm also still fuzzy on why we can't let a task set it's own audit
> container ID, assuming it meets all the criteria established in patch
> 2/13.  It somewhat made sense when you were tracking inherited vs
> explicitly set audit container IDs, but that doesn't appear to be the
> case so far in this patchset, yes?

I'm still having a strong reluctance to permit this but can't come up
with a solid technical reason right now, but it feels like a layer
violation.  If we forbid it and discover it necessary and harmless, then
permitting it won't break the API.  If we permit it and later discover a
reason it causes a problem, then blocking it will break the API.  I have
heard that there are cases where there is no orchestrator/engine, so in
those cases I conclude that a process would need to set its own contid
but I'm having trouble recalling what those circumstances are.

I also was seriously considering blocking any contid set on the initial
user or PID namespaces to avoid polluting them, and even had a tested
patch to implement it, but this starts making assumptions about the
definition of a container with respect to namespaces which we have been
deliberately avoiding.

> > +               rc = -EXDEV;
> 
> I'm fairly confident we had a discussion about not using all these
> different error codes, but that may be a moot point given my next
> comment.

Yes, we did.  I reduced both circumstances down to what you requested,
shedding two along the way.  Given the number of different ways
orchestrators, contids and tasks can be related, I'd rather have more,
not fewer diagnostics to understand what it thinks is happenning.  This
is a realtively minor detail in the context of the rest of the
discussion in this thread.

> > +               goto unlock;
> > +       }
> > +       /* only allow contid setting again if nesting */
> > +       if (audit_contid_set(task) && !audit_contid_isnesting(task))
> >                 rc = -EEXIST;
> 
> It seems like what we need in audit_set_contid() is a check to ensure
> that the task being modified is only modified by the owner of the
> audit container ID, yes?  If so, I would think we could do this quite
> easily with the following, or similar logic, (NOTE: assumes both
> current and tsk are properly setup):
> 
>   if ((current->audit->cont != tsk->audit->cont) || (current->audit->cont->owner != current))
>     return -EACCESS;

Not necessarily.

If we start from the premise that once set, a contid on a task cannot be
unset, and then that it cannot be set to another value, then the oldest
ancestor in any container must not be able to change contid.  That
leaves any descendant (that hasn't threaded or parented) free to nest.

If we allow a task to modify its own contid (from the potential change
above), then if it inherited its contid, it could set its own.  This
still looks like a layer violation to me.  Going back to some
discussions with Eric Biederman from a number of years ago, it seems
wrong to me that a task should be able to see its own contid, let alone
be able to set it.  This came out of a CRIU concern about serial nsIDs
based on proc inode numbers not being portable.  Is it still a
consideration?

Another scenario comes to mind.  Should an orchestrator be able to set
the contid of a descendant of one of the former's child orchestrators?
This doesn't sound like a good idea leaping generations and I can't come
up with a valid use case.

> This is somewhat independent of the above issue, but we may also want
> to add to the capability check.  Patch 2 adds a
> "capable(CAP_AUDIT_CONTROL)" which is good, but perhaps we also need a
> "ns_capable(CAP_AUDIT_CONTROL)" to allow a given audit container ID
> orchestrator/owner the ability to control which of it's descendants
> can change their audit container ID, for example:
> 
>   if (!capable(CAP_AUDIT_CONTROL) ||
>       !ns_capable(current->nsproxy->user_ns, CAP_AUDIT_CONTROL))
>     return -EPERM;

Why does ns_capable keep being raised?  The last patch, capcontid, was
developed to solve this previously raised issue.  The issue was an
unprivileged user creating a user namespace with full capabilities,
circumventing capable() and being able to change the main audit
configuration.  It was already discussed in v8 and before that and my
last posting in the thread was left dangling with an unanswered
question:
https://lkml.org/lkml/2020/2/6/333

I only see this being potentially useful with audit namespaces in
conjunction with unprivileged user namespaces in the future with the
implementation of multiple audit daemons for the ability of an
unprivileged user to run their own distro container without influencing
the master audit configuration.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore Aug. 21, 2020, 8:13 p.m. UTC | #3
On Fri, Aug 7, 2020 at 1:10 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-07-05 11:11, Paul Moore wrote:
> > On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > Require the target task to be a descendant of the container
> > > orchestrator/engine.

If you want to get formal about this, you need to define "target" in
the sentence above.  Target of what?

FWIW, I read the above to basically mean that a task can only set the
audit container ID of processes which are beneath it in the "process
tree" where the "process tree" is defined as the relationship between
a parent and children processes such that the children processes are
branches below the parent process.

I have no problem with that, with the understanding that nesting
complicates it somewhat.  For example, this isn't true when one of the
children is a nested orchestrator, is it?

> > > You would only change the audit container ID from one set or inherited
> > > value to another if you were nesting containers.

I thought we decided we were going to allow an orchestrator to move a
process between audit container IDs, yes?  no?

> > > If changing the contid, the container orchestrator/engine must be a
> > > descendant and not same orchestrator as the one that set it so it is not
> > > possible to change the contid of another orchestrator's container.

Try rephrasing the above please, it isn't clear to me what you are
trying to say.

> Are we able to agree on the premises above?  Is anything asserted that
> should not be and is there anything missing?

See above.

If you want to go back to the definitions/assumptions stage, it
probably isn't worth worrying about the other comments until we get
the above sorted.
Richard Guy Briggs Oct. 6, 2020, 8:03 p.m. UTC | #4
On 2020-08-21 16:13, Paul Moore wrote:
> On Fri, Aug 7, 2020 at 1:10 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-07-05 11:11, Paul Moore wrote:
> > > On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > Require the target task to be a descendant of the container
> > > > orchestrator/engine.
> 
> If you want to get formal about this, you need to define "target" in
> the sentence above.  Target of what?

The target is the task having its audit container identifier modified by
the orchestrator current task.

> FWIW, I read the above to basically mean that a task can only set the
> audit container ID of processes which are beneath it in the "process
> tree" where the "process tree" is defined as the relationship between
> a parent and children processes such that the children processes are
> branches below the parent process.

Yes.

> I have no problem with that, with the understanding that nesting
> complicates it somewhat.  For example, this isn't true when one of the
> children is a nested orchestrator, is it?

It should still be true if that child is a nested orchestrator that has
not yet spawned any children or threads (or they have all died off).

It does get more complicated when we consider the scenario outlined
below about perceived layer violations...

> > > > You would only change the audit container ID from one set or inherited
> > > > value to another if you were nesting containers.
> 
> I thought we decided we were going to allow an orchestrator to move a
> process between audit container IDs, yes?  no?

We did?  I don't remember anything about that.  Has this been requested?
This seems to violate the rule that we can't change the audit container
identifier once it has been set (other than nesting).  Can you suggest a
use case?

> > > > If changing the contid, the container orchestrator/engine must be a
> > > > descendant and not same orchestrator as the one that set it so it is not
> > > > possible to change the contid of another orchestrator's container.
> 
> Try rephrasing the above please, it isn't clear to me what you are
> trying to say.

This is harder than I expected to rephrase...  It also makes it clear
that there are some scenarios that have not been considered that may
need to be restricted.

Orchestrator A spawned task B which is itself an orchestrator without
chidren yet.  Orchestrator A sets the audit container identifier of B.
Neither A, nor B, nor any other child of A (or any of their
descendants), nor any orchestrator outside the tree of A (uncles, aunts
and cousins are outside), can change the audit container identifier of
B.

Orchestrator B spawns task C.  Here's where it gets tricky.  It seems
like a layer violation for B to spawn a child C and have A reach over B
to set the audit container identifier of C, especially if B is also an
orchestrator.  This all will be especially hard to police if we don't
limit the ability of an orchestrator task to set an audit container
identifier to that orchestrator's immediate children, only once.

> > Are we able to agree on the premises above?  Is anything asserted that
> > should not be and is there anything missing?
> 
> See above.
> 
> If you want to go back to the definitions/assumptions stage, it
> probably isn't worth worrying about the other comments until we get
> the above sorted.

I don't want to.  I'm trying to confirm that we are on the same page.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
diff mbox series

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2213ac670386..06938d0b9e0c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2047,4 +2047,7 @@  static inline void rseq_syscall(struct pt_regs *regs)
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+extern int task_is_descendant(struct task_struct *parent,
+			      struct task_struct *child);
+
 #endif
diff --git a/kernel/audit.c b/kernel/audit.c
index a862721dfd9b..efa65ec01239 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2713,6 +2713,20 @@  int audit_signal_info(int sig, struct task_struct *t)
 	return audit_signal_info_syscall(t);
 }
 
+static bool audit_contid_isnesting(struct task_struct *tsk)
+{
+	bool isowner = false;
+	bool ownerisparent = false;
+
+	rcu_read_lock();
+	if (tsk->audit && tsk->audit->cont) {
+		isowner = current == tsk->audit->cont->owner;
+		ownerisparent = task_is_descendant(tsk->audit->cont->owner, current);
+	}
+	rcu_read_unlock();
+	return !isowner && ownerisparent;
+}
+
 /*
  * audit_set_contid - set current task's audit contid
  * @task: target task
@@ -2755,8 +2769,13 @@  int audit_set_contid(struct task_struct *task, u64 contid)
 		rc = -EBUSY;
 		goto unlock;
 	}
-	/* if contid is already set, deny */
-	if (audit_contid_set(task))
+	/* if task is not descendant, block */
+	if (task == current || !task_is_descendant(current, task)) {
+		rc = -EXDEV;
+		goto unlock;
+	}
+	/* only allow contid setting again if nesting */
+	if (audit_contid_set(task) && !audit_contid_isnesting(task))
 		rc = -EEXIST;
 unlock:
 	read_unlock(&tasklist_lock);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f360326861e..e6b24c52b3c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8012,6 +8012,39 @@  void dump_cpu_task(int cpu)
 }
 
 /*
+ * task_is_descendant - walk up a process family tree looking for a match
+ * @parent: the process to compare against while walking up from child
+ * @child: the process to start from while looking upwards for parent
+ *
+ * Returns 1 if child is a descendant of parent, 0 if not.
+ */
+int task_is_descendant(struct task_struct *parent,
+			      struct task_struct *child)
+{
+	int rc = 0;
+	struct task_struct *walker = child;
+
+	if (!parent || !child)
+		return 0;
+
+	rcu_read_lock();
+	if (!thread_group_leader(parent))
+		parent = rcu_dereference(parent->group_leader);
+	while (walker->pid > 0) {
+		if (!thread_group_leader(walker))
+			walker = rcu_dereference(walker->group_leader);
+		if (walker == parent) {
+			rc = 1;
+			break;
+		}
+		walker = rcu_dereference(walker->real_parent);
+	}
+	rcu_read_unlock();
+
+	return rc;
+}
+
+/*
  * Nice levels are multiplicative, with a gentle 10% change for every
  * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
  * nice 1, it will get ~10% less CPU time than another CPU-bound task
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index 536c99646f6a..24939f765df5 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -263,39 +263,6 @@  static int yama_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 }
 
 /**
- * task_is_descendant - walk up a process family tree looking for a match
- * @parent: the process to compare against while walking up from child
- * @child: the process to start from while looking upwards for parent
- *
- * Returns 1 if child is a descendant of parent, 0 if not.
- */
-static int task_is_descendant(struct task_struct *parent,
-			      struct task_struct *child)
-{
-	int rc = 0;
-	struct task_struct *walker = child;
-
-	if (!parent || !child)
-		return 0;
-
-	rcu_read_lock();
-	if (!thread_group_leader(parent))
-		parent = rcu_dereference(parent->group_leader);
-	while (walker->pid > 0) {
-		if (!thread_group_leader(walker))
-			walker = rcu_dereference(walker->group_leader);
-		if (walker == parent) {
-			rc = 1;
-			break;
-		}
-		walker = rcu_dereference(walker->real_parent);
-	}
-	rcu_read_unlock();
-
-	return rc;
-}
-
-/**
  * ptracer_exception_found - tracer registered as exception for this tracee
  * @tracer: the task_struct of the process attempting ptrace
  * @tracee: the task_struct of the process to be ptraced