[RFC] seccomp: don't kill process for resource control syscalls
diff mbox series

Message ID 20190313094903.10006-1-berrange@redhat.com
State New
Headers show
Series
  • [RFC] seccomp: don't kill process for resource control syscalls
Related show

Commit Message

Daniel P. Berrangé March 13, 2019, 9:49 a.m. UTC
The Mesa library tries to set process affinity on some of its threads in
order to optimize its performance. Currently this results in QEMU being
immediately terminated when seccomp is enabled.

Mesa doesn't consider failure of the process affinity settings to be
fatal to its operation, but our seccomp policy gives it no choice in
gracefully handling this denial.

It is reasonable to consider that malicious code using the resource
control syscalls to be a less serious attack than if they were trying
to spawn processes or change UIDs and other such things. Generally
speaking changing the resource control setting will "merely" affect
quality of service of processes on the host. With this in mind, rather
than kill the process, we can relax the policy for these syscalls to
return the EPERM errno value. This allows callers to detect that QEMU
does not want them to change resource allocations, and apply some
reasonable fallback logic.

The main downside to this is for code which uses these syscalls but does
not check the return value, blindly assuming they will always
succeeed. Returning an errno could result in sub-optimal behaviour.
Arguably though such code is already broken & needs fixing regardless.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 qemu-seccomp.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

Comments

Marc-André Lureau March 13, 2019, 12:20 p.m. UTC | #1
Hi

On Wed, Mar 13, 2019 at 10:49 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> The Mesa library tries to set process affinity on some of its threads in
> order to optimize its performance. Currently this results in QEMU being
> immediately terminated when seccomp is enabled.
>
> Mesa doesn't consider failure of the process affinity settings to be
> fatal to its operation, but our seccomp policy gives it no choice in
> gracefully handling this denial.
>
> It is reasonable to consider that malicious code using the resource
> control syscalls to be a less serious attack than if they were trying
> to spawn processes or change UIDs and other such things. Generally
> speaking changing the resource control setting will "merely" affect
> quality of service of processes on the host. With this in mind, rather
> than kill the process, we can relax the policy for these syscalls to
> return the EPERM errno value. This allows callers to detect that QEMU
> does not want them to change resource allocations, and apply some
> reasonable fallback logic.
>
> The main downside to this is for code which uses these syscalls but does
> not check the return value, blindly assuming they will always
> succeeed. Returning an errno could result in sub-optimal behaviour.
> Arguably though such code is already broken & needs fixing regardless.

Given that the situation is stuck with mesa, I also think we need to
"downgrade" the seccomp policy.

I was preparing a similar patch. Backporting it to f29 would be nice...

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

thanks

>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  qemu-seccomp.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/qemu-seccomp.c b/qemu-seccomp.c
> index 36d5829831..9776c9ef40 100644
> --- a/qemu-seccomp.c
> +++ b/qemu-seccomp.c
> @@ -121,20 +121,37 @@ qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
>  #endif
>  }
>
> -static uint32_t qemu_seccomp_get_kill_action(void)
> +static uint32_t qemu_seccomp_get_kill_action(int set)
>  {
> +    switch (set) {
> +    case QEMU_SECCOMP_SET_DEFAULT:
> +    case QEMU_SECCOMP_SET_OBSOLETE:
> +    case QEMU_SECCOMP_SET_PRIVILEGED:
> +    case QEMU_SECCOMP_SET_SPAWN: {
>  #if defined(SECCOMP_GET_ACTION_AVAIL) && defined(SCMP_ACT_KILL_PROCESS) && \
>      defined(SECCOMP_RET_KILL_PROCESS)
> -    {
> -        uint32_t action = SECCOMP_RET_KILL_PROCESS;
> +        static int kill_process = -1;
> +        if (kill_process == -1) {
> +            uint32_t action = SECCOMP_RET_KILL_PROCESS;
>
> -        if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
> +            if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
> +                kill_process = 1;
> +            }
> +            kill_process = 0;
> +        }
> +        if (kill_process == 1) {
>              return SCMP_ACT_KILL_PROCESS;
>          }
> -    }
>  #endif
> +        return SCMP_ACT_TRAP;
> +    }
> +
> +    case QEMU_SECCOMP_SET_RESOURCECTL:
> +        return SCMP_ACT_ERRNO(EPERM);
>
> -    return SCMP_ACT_TRAP;
> +    default:
> +        g_assert_not_reached();
> +    }
>  }
>
>
> @@ -143,7 +160,6 @@ static int seccomp_start(uint32_t seccomp_opts)
>      int rc = 0;
>      unsigned int i = 0;
>      scmp_filter_ctx ctx;
> -    uint32_t action = qemu_seccomp_get_kill_action();
>
>      ctx = seccomp_init(SCMP_ACT_ALLOW);
>      if (ctx == NULL) {
> @@ -157,10 +173,12 @@ static int seccomp_start(uint32_t seccomp_opts)
>      }
>
>      for (i = 0; i < ARRAY_SIZE(blacklist); i++) {
> +        uint32_t action;
>          if (!(seccomp_opts & blacklist[i].set)) {
>              continue;
>          }
>
> +        action = qemu_seccomp_get_kill_action(blacklist[i].set);
>          rc = seccomp_rule_add_array(ctx, action, blacklist[i].num,
>                                      blacklist[i].narg, blacklist[i].arg_cmp);
>          if (rc < 0) {
> --
> 2.20.1
>
Marc-André Lureau March 13, 2019, 12:22 p.m. UTC | #2
Hi

On Wed, Mar 13, 2019 at 10:49 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> The Mesa library tries to set process affinity on some of its threads in
> order to optimize its performance. Currently this results in QEMU being
> immediately terminated when seccomp is enabled.
>
> Mesa doesn't consider failure of the process affinity settings to be
> fatal to its operation, but our seccomp policy gives it no choice in
> gracefully handling this denial.
>
> It is reasonable to consider that malicious code using the resource
> control syscalls to be a less serious attack than if they were trying
> to spawn processes or change UIDs and other such things. Generally
> speaking changing the resource control setting will "merely" affect
> quality of service of processes on the host. With this in mind, rather
> than kill the process, we can relax the policy for these syscalls to
> return the EPERM errno value. This allows callers to detect that QEMU
> does not want them to change resource allocations, and apply some
> reasonable fallback logic.
>
> The main downside to this is for code which uses these syscalls but does
> not check the return value, blindly assuming they will always
> succeeed. Returning an errno could result in sub-optimal behaviour.
> Arguably though such code is already broken & needs fixing regardless.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  qemu-seccomp.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/qemu-seccomp.c b/qemu-seccomp.c
> index 36d5829831..9776c9ef40 100644
> --- a/qemu-seccomp.c
> +++ b/qemu-seccomp.c
> @@ -121,20 +121,37 @@ qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
>  #endif
>  }
>
> -static uint32_t qemu_seccomp_get_kill_action(void)
> +static uint32_t qemu_seccomp_get_kill_action(int set)

Minor nit, let's rename qemu_seccomp_get_kill_action() ->
qemu_seccomp_get_action()

>  {
> +    switch (set) {
> +    case QEMU_SECCOMP_SET_DEFAULT:
> +    case QEMU_SECCOMP_SET_OBSOLETE:
> +    case QEMU_SECCOMP_SET_PRIVILEGED:
> +    case QEMU_SECCOMP_SET_SPAWN: {
>  #if defined(SECCOMP_GET_ACTION_AVAIL) && defined(SCMP_ACT_KILL_PROCESS) && \
>      defined(SECCOMP_RET_KILL_PROCESS)
> -    {
> -        uint32_t action = SECCOMP_RET_KILL_PROCESS;
> +        static int kill_process = -1;
> +        if (kill_process == -1) {
> +            uint32_t action = SECCOMP_RET_KILL_PROCESS;
>
> -        if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
> +            if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
> +                kill_process = 1;
> +            }
> +            kill_process = 0;
> +        }
> +        if (kill_process == 1) {
>              return SCMP_ACT_KILL_PROCESS;
>          }
> -    }
>  #endif
> +        return SCMP_ACT_TRAP;
> +    }
> +
> +    case QEMU_SECCOMP_SET_RESOURCECTL:
> +        return SCMP_ACT_ERRNO(EPERM);
>
> -    return SCMP_ACT_TRAP;
> +    default:
> +        g_assert_not_reached();
> +    }
>  }
>
>
> @@ -143,7 +160,6 @@ static int seccomp_start(uint32_t seccomp_opts)
>      int rc = 0;
>      unsigned int i = 0;
>      scmp_filter_ctx ctx;
> -    uint32_t action = qemu_seccomp_get_kill_action();
>
>      ctx = seccomp_init(SCMP_ACT_ALLOW);
>      if (ctx == NULL) {
> @@ -157,10 +173,12 @@ static int seccomp_start(uint32_t seccomp_opts)
>      }
>
>      for (i = 0; i < ARRAY_SIZE(blacklist); i++) {
> +        uint32_t action;
>          if (!(seccomp_opts & blacklist[i].set)) {
>              continue;
>          }
>
> +        action = qemu_seccomp_get_kill_action(blacklist[i].set);
>          rc = seccomp_rule_add_array(ctx, action, blacklist[i].num,
>                                      blacklist[i].narg, blacklist[i].arg_cmp);
>          if (rc < 0) {
> --
> 2.20.1
>
Mathias Fröhlich March 14, 2019, 6:39 a.m. UTC | #3
Hi,

Thanks for not just killing processes anymore!
See the mesa thread
https://www.mail-archive.com/mesa-dev@lists.freedesktop.org/msg214474.html
for some background.

Thanks a lot and best

Mathias

On Wednesday, 13 March 2019 10:49:03 CET Daniel P. Berrangé wrote:
> The Mesa library tries to set process affinity on some of its threads in
> order to optimize its performance. Currently this results in QEMU being
> immediately terminated when seccomp is enabled.
> 
> Mesa doesn't consider failure of the process affinity settings to be
> fatal to its operation, but our seccomp policy gives it no choice in
> gracefully handling this denial.
> 
> It is reasonable to consider that malicious code using the resource
> control syscalls to be a less serious attack than if they were trying
> to spawn processes or change UIDs and other such things. Generally
> speaking changing the resource control setting will "merely" affect
> quality of service of processes on the host. With this in mind, rather
> than kill the process, we can relax the policy for these syscalls to
> return the EPERM errno value. This allows callers to detect that QEMU
> does not want them to change resource allocations, and apply some
> reasonable fallback logic.
> 
> The main downside to this is for code which uses these syscalls but does
> not check the return value, blindly assuming they will always
> succeeed. Returning an errno could result in sub-optimal behaviour.
> Arguably though such code is already broken & needs fixing regardless.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  qemu-seccomp.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/qemu-seccomp.c b/qemu-seccomp.c
> index 36d5829831..9776c9ef40 100644
> --- a/qemu-seccomp.c
> +++ b/qemu-seccomp.c
> @@ -121,20 +121,37 @@ qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
>  #endif
>  }
>  
> -static uint32_t qemu_seccomp_get_kill_action(void)
> +static uint32_t qemu_seccomp_get_kill_action(int set)
>  {
> +    switch (set) {
> +    case QEMU_SECCOMP_SET_DEFAULT:
> +    case QEMU_SECCOMP_SET_OBSOLETE:
> +    case QEMU_SECCOMP_SET_PRIVILEGED:
> +    case QEMU_SECCOMP_SET_SPAWN: {
>  #if defined(SECCOMP_GET_ACTION_AVAIL) && defined(SCMP_ACT_KILL_PROCESS) && \
>      defined(SECCOMP_RET_KILL_PROCESS)
> -    {
> -        uint32_t action = SECCOMP_RET_KILL_PROCESS;
> +        static int kill_process = -1;
> +        if (kill_process == -1) {
> +            uint32_t action = SECCOMP_RET_KILL_PROCESS;
>  
> -        if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
> +            if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
> +                kill_process = 1;
> +            }
> +            kill_process = 0;
> +        }
> +        if (kill_process == 1) {
>              return SCMP_ACT_KILL_PROCESS;
>          }
> -    }
>  #endif
> +        return SCMP_ACT_TRAP;
> +    }
> +
> +    case QEMU_SECCOMP_SET_RESOURCECTL:
> +        return SCMP_ACT_ERRNO(EPERM);
>  
> -    return SCMP_ACT_TRAP;
> +    default:
> +        g_assert_not_reached();
> +    }
>  }
>  
>  
> @@ -143,7 +160,6 @@ static int seccomp_start(uint32_t seccomp_opts)
>      int rc = 0;
>      unsigned int i = 0;
>      scmp_filter_ctx ctx;
> -    uint32_t action = qemu_seccomp_get_kill_action();
>  
>      ctx = seccomp_init(SCMP_ACT_ALLOW);
>      if (ctx == NULL) {
> @@ -157,10 +173,12 @@ static int seccomp_start(uint32_t seccomp_opts)
>      }
>  
>      for (i = 0; i < ARRAY_SIZE(blacklist); i++) {
> +        uint32_t action;
>          if (!(seccomp_opts & blacklist[i].set)) {
>              continue;
>          }
>  
> +        action = qemu_seccomp_get_kill_action(blacklist[i].set);
>          rc = seccomp_rule_add_array(ctx, action, blacklist[i].num,
>                                      blacklist[i].narg, blacklist[i].arg_cmp);
>          if (rc < 0) {
>
Eduardo Otubo March 15, 2019, 1:51 p.m. UTC | #4
On 13/03/2019 - 13:22:13, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Mar 13, 2019 at 10:49 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > The Mesa library tries to set process affinity on some of its threads in
> > order to optimize its performance. Currently this results in QEMU being
> > immediately terminated when seccomp is enabled.
> >
> > Mesa doesn't consider failure of the process affinity settings to be
> > fatal to its operation, but our seccomp policy gives it no choice in
> > gracefully handling this denial.
> >
> > It is reasonable to consider that malicious code using the resource
> > control syscalls to be a less serious attack than if they were trying
> > to spawn processes or change UIDs and other such things. Generally
> > speaking changing the resource control setting will "merely" affect
> > quality of service of processes on the host. With this in mind, rather
> > than kill the process, we can relax the policy for these syscalls to
> > return the EPERM errno value. This allows callers to detect that QEMU
> > does not want them to change resource allocations, and apply some
> > reasonable fallback logic.
> >
> > The main downside to this is for code which uses these syscalls but does
> > not check the return value, blindly assuming they will always
> > succeeed. Returning an errno could result in sub-optimal behaviour.
> > Arguably though such code is already broken & needs fixing regardless.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >  qemu-seccomp.c | 32 +++++++++++++++++++++++++-------
> >  1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/qemu-seccomp.c b/qemu-seccomp.c
> > index 36d5829831..9776c9ef40 100644
> > --- a/qemu-seccomp.c
> > +++ b/qemu-seccomp.c
> > @@ -121,20 +121,37 @@ qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
> >  #endif
> >  }
> >
> > -static uint32_t qemu_seccomp_get_kill_action(void)
> > +static uint32_t qemu_seccomp_get_kill_action(int set)
> 
> Minor nit, let's rename qemu_seccomp_get_kill_action() ->
> qemu_seccomp_get_action()

I think that would be better too.

And thanks for the contribution, Daniel.
I can fix this when I send the pull request.


Acked-by: Eduardo Otubo <otubo@redhat.com>

> 
> >  {
> > +    switch (set) {
> > +    case QEMU_SECCOMP_SET_DEFAULT:
> > +    case QEMU_SECCOMP_SET_OBSOLETE:
> > +    case QEMU_SECCOMP_SET_PRIVILEGED:
> > +    case QEMU_SECCOMP_SET_SPAWN: {
> >  #if defined(SECCOMP_GET_ACTION_AVAIL) && defined(SCMP_ACT_KILL_PROCESS) && \
> >      defined(SECCOMP_RET_KILL_PROCESS)
> > -    {
> > -        uint32_t action = SECCOMP_RET_KILL_PROCESS;
> > +        static int kill_process = -1;
> > +        if (kill_process == -1) {
> > +            uint32_t action = SECCOMP_RET_KILL_PROCESS;
> >
> > -        if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
> > +            if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
> > +                kill_process = 1;
> > +            }
> > +            kill_process = 0;
> > +        }
> > +        if (kill_process == 1) {
> >              return SCMP_ACT_KILL_PROCESS;
> >          }
> > -    }
> >  #endif
> > +        return SCMP_ACT_TRAP;
> > +    }
> > +
> > +    case QEMU_SECCOMP_SET_RESOURCECTL:
> > +        return SCMP_ACT_ERRNO(EPERM);
> >
> > -    return SCMP_ACT_TRAP;
> > +    default:
> > +        g_assert_not_reached();
> > +    }
> >  }
> >
> >
> > @@ -143,7 +160,6 @@ static int seccomp_start(uint32_t seccomp_opts)
> >      int rc = 0;
> >      unsigned int i = 0;
> >      scmp_filter_ctx ctx;
> > -    uint32_t action = qemu_seccomp_get_kill_action();
> >
> >      ctx = seccomp_init(SCMP_ACT_ALLOW);
> >      if (ctx == NULL) {
> > @@ -157,10 +173,12 @@ static int seccomp_start(uint32_t seccomp_opts)
> >      }
> >
> >      for (i = 0; i < ARRAY_SIZE(blacklist); i++) {
> > +        uint32_t action;
> >          if (!(seccomp_opts & blacklist[i].set)) {
> >              continue;
> >          }
> >
> > +        action = qemu_seccomp_get_kill_action(blacklist[i].set);
> >          rc = seccomp_rule_add_array(ctx, action, blacklist[i].num,
> >                                      blacklist[i].narg, blacklist[i].arg_cmp);
> >          if (rc < 0) {
> > --
> > 2.20.1
> >
> 
> 
> -- 
> Marc-André Lureau

Patch
diff mbox series

diff --git a/qemu-seccomp.c b/qemu-seccomp.c
index 36d5829831..9776c9ef40 100644
--- a/qemu-seccomp.c
+++ b/qemu-seccomp.c
@@ -121,20 +121,37 @@  qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
 #endif
 }
 
-static uint32_t qemu_seccomp_get_kill_action(void)
+static uint32_t qemu_seccomp_get_kill_action(int set)
 {
+    switch (set) {
+    case QEMU_SECCOMP_SET_DEFAULT:
+    case QEMU_SECCOMP_SET_OBSOLETE:
+    case QEMU_SECCOMP_SET_PRIVILEGED:
+    case QEMU_SECCOMP_SET_SPAWN: {
 #if defined(SECCOMP_GET_ACTION_AVAIL) && defined(SCMP_ACT_KILL_PROCESS) && \
     defined(SECCOMP_RET_KILL_PROCESS)
-    {
-        uint32_t action = SECCOMP_RET_KILL_PROCESS;
+        static int kill_process = -1;
+        if (kill_process == -1) {
+            uint32_t action = SECCOMP_RET_KILL_PROCESS;
 
-        if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
+            if (qemu_seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
+                kill_process = 1;
+            }
+            kill_process = 0;
+        }
+        if (kill_process == 1) {
             return SCMP_ACT_KILL_PROCESS;
         }
-    }
 #endif
+        return SCMP_ACT_TRAP;
+    }
+
+    case QEMU_SECCOMP_SET_RESOURCECTL:
+        return SCMP_ACT_ERRNO(EPERM);
 
-    return SCMP_ACT_TRAP;
+    default:
+        g_assert_not_reached();
+    }
 }
 
 
@@ -143,7 +160,6 @@  static int seccomp_start(uint32_t seccomp_opts)
     int rc = 0;
     unsigned int i = 0;
     scmp_filter_ctx ctx;
-    uint32_t action = qemu_seccomp_get_kill_action();
 
     ctx = seccomp_init(SCMP_ACT_ALLOW);
     if (ctx == NULL) {
@@ -157,10 +173,12 @@  static int seccomp_start(uint32_t seccomp_opts)
     }
 
     for (i = 0; i < ARRAY_SIZE(blacklist); i++) {
+        uint32_t action;
         if (!(seccomp_opts & blacklist[i].set)) {
             continue;
         }
 
+        action = qemu_seccomp_get_kill_action(blacklist[i].set);
         rc = seccomp_rule_add_array(ctx, action, blacklist[i].num,
                                     blacklist[i].narg, blacklist[i].arg_cmp);
         if (rc < 0) {