[v1] os-posix: Add -unshare option

Message ID: 20171019160419.11611-1-ross.lagerwall@citrix.com
State: New, archived
From: Ross Lagerwall <ross.lagerwall@citrix.com>
To: <qemu-devel@nongnu.org>
Cc: Ross Lagerwall <ross.lagerwall@citrix.com>, Markus Armbruster <armbru@redhat.com>
Date: Thu, 19 Oct 2017 17:04:19 +0100
Subject: [Qemu-devel] [PATCH v1] os-posix: Add -unshare option

Ross Lagerwall Oct. 19, 2017, 4:04 p.m. UTC

Add an option to allow calling unshare() just before starting guest
execution. The option allows unsharing one or more of the mount
namespace, the network namespace, and the IPC namespace. This is useful
to restrict the ability of QEMU to cause damage to the system should it
be compromised.

An example of using this would be to have QEMU open a QMP socket at
startup and unshare the network namespace. The instance of QEMU could
still be controlled by the QMP socket since that belongs in the original
namespace, but if QEMU were compromised it wouldn't be able to open any
new connections, even to other processes on the same machine.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 os-posix.c      | 34 ++++++++++++++++++++++++++++++++++
 qemu-options.hx | 14 ++++++++++++++
 2 files changed, 48 insertions(+)

Daniel P. Berrangé Oct. 19, 2017, 4:24 p.m. UTC | #1

On Thu, Oct 19, 2017 at 05:04:19PM +0100, Ross Lagerwall wrote:
> Add an option to allow calling unshare() just before starting guest
> execution. The option allows unsharing one or more of the mount
> namespace, the network namespace, and the IPC namespace. This is useful
> to restrict the ability of QEMU to cause damage to the system should it
> be compromised.
> 
> An example of using this would be to have QEMU open a QMP socket at
> startup and unshare the network namespace. The instance of QEMU could
> still be controlled by the QMP socket since that belongs in the original
> namespace, but if QEMU were compromised it wouldn't be able to open any
> new connections, even to other processes on the same machine.

Unless I'm misunderstanding you, what's described here is already possible
by just using the 'unshare' command to spawn QEMU:

  # unshare --ipc --mount --net qemu-system-x86_64 -qmp unix:/tmp/foo,server -vnc :1
  qemu-system-x86_64: -qmp unix:/tmp/foo,server: QEMU waiting for connection on: disconnected:unix:/tmp/foo,server

And in another shell I can still access the QMP socket from the original host
namespace

  # ./scripts/qmp/qmp-shell /tmp/foo
  Welcome to the QMP low-level shell!
  Connected to QEMU 2.9.1

  (QEMU) query-kvm
  {"return": {"enabled": false, "present": true}}

FWIW, even if that were not possible, you could still do it by wrapping the
qmp-shell in an 'nsenter' call. eg

  nsenter --target $QEMUPID --net ./scripts/qmp/qmp-shell /tmp/foo

> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  os-posix.c      | 34 ++++++++++++++++++++++++++++++++++
>  qemu-options.hx | 14 ++++++++++++++
>  2 files changed, 48 insertions(+)
> 
> diff --git a/os-posix.c b/os-posix.c
> index b9c2343..cfc5c38 100644
> --- a/os-posix.c
> +++ b/os-posix.c
> @@ -45,6 +45,7 @@ static struct passwd *user_pwd;
>  static const char *chroot_dir;
>  static int daemonize;
>  static int daemon_pipe;
> +static int unshare_flags;
>  
>  void os_setup_early_signal_handling(void)
>  {
> @@ -160,6 +161,28 @@ void os_parse_cmd_args(int index, const char *optarg)
>          fips_set_state(true);
>          break;
>  #endif
> +#ifdef CONFIG_SETNS
> +    case QEMU_OPTION_unshare:
> +        {
> +            char *flag;
> +            char *opts = g_strdup(optarg);
> +
> +            while ((flag = qemu_strsep(&opts, ",")) != NULL) {
> +                if (!strcmp(flag, "mount")) {
> +                    unshare_flags |= CLONE_NEWNS;
> +                } else if (!strcmp(flag, "net")) {
> +                    unshare_flags |= CLONE_NEWNET;
> +                } else if (!strcmp(flag, "ipc")) {
> +                    unshare_flags |= CLONE_NEWIPC;
> +                } else {
> +                    fprintf(stderr, "Unknown unshare option: %s\n", flag);
> +                    exit(1);
> +                }
> +            }
> +            g_free(opts);
> +        }
> +        break;
> +#endif
>      }
>  }
>  
> @@ -201,6 +224,16 @@ static void change_root(void)
>  
>  }
>  
> +static void unshare_namespaces(void)
> +{
> +    if (unshare_flags) {
> +        if (unshare(unshare_flags) < 0) {
> +            perror("could not unshare");
> +            exit(1);
> +        }
> +    }
> +}
> +
>  void os_daemonize(void)
>  {
>      if (daemonize) {
> @@ -266,6 +299,7 @@ void os_setup_post(void)
>      }
>  
>      change_root();
> +    unshare_namespaces();
>      change_process_uid();

This has some really bad implications.  All the command line options that are
given are processed *before* os_setup_post() is called. IOW, -chardev, -vnc,
-migrate, -net, etc will all be configured in the context of the host namespace.

If you then use the QMP monitor to run  chardev_add,  device_add, migrate,
hostnet_add, etc this will all take place in the new namespace.

So the exact same args given as ARGV now have completely different semantics
when given via QMP.

I think this is really very undesirable.

If you wrap QEMU execution in 'unshare' as I illustrate above, then the
semantics of ARGV & QMP remain consistent.

FWIW, as a further point that might be of interest, libvirt will now spawn
a new private mount namespace for QEMU by default. We do this so that we can
give QEMU a private /dev filesystem with only the devices it's permitted to
use present as device nodes.  The ability to do such setup tasks in between
namespace creation and QEMU launching is broadly useful. For example, if
using a private network namespace, you might want to create a veth pair and
put one end in the namespace, so that QEMU's network services have some
level of outside network connectivity - eg to enable QEMU to connect to a remote
QEMU for live migration.

So overall, I absolutely encourage the use of namespaces to confine QEMU,
but I tend to think namespace creation/setup is better done outside QEMU
before launching it.
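
As a rough illustration of doing the namespace setup outside QEMU, the sketch
below maps the same "mount,net,ipc" names used by the patch onto CLONE_* flags,
unshares, and then execs the target program. The function names
parse_unshare_flags() and unshare_and_exec() are hypothetical, plain strsep()
stands in for qemu_strsep(), and the caller is assumed to hold the privileges
that unshare() requires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Map a comma-separated list such as "mount,net,ipc" to CLONE_* flags.
 * Returns the flag mask, or -1 on an unknown name or allocation failure.
 */
int parse_unshare_flags(const char *spec)
{
    int flags = 0;
    char *copy;
    char *cursor;
    char *name;

    assert(spec != NULL);
    copy = strdup(spec);
    if (copy == NULL) {
        return -1;
    }
    /* strsep() advances cursor, so keep copy around for free(). */
    cursor = copy;
    while ((name = strsep(&cursor, ",")) != NULL) {
        if (!strcmp(name, "mount")) {
            flags |= CLONE_NEWNS;
        } else if (!strcmp(name, "net")) {
            flags |= CLONE_NEWNET;
        } else if (!strcmp(name, "ipc")) {
            flags |= CLONE_NEWIPC;
        } else {
            flags = -1;
            break;
        }
    }
    free(copy);
    return flags;
}

/*
 * Unshare the requested namespaces, then replace this process with argv[0].
 * Any setup inside the new namespaces would go between the two calls.
 * Only returns (with -1) if unshare() or execvp() fails.
 */
int unshare_and_exec(int flags, char *const argv[])
{
    if (flags != 0 && unshare(flags) < 0) {
        perror("could not unshare");
        return -1;
    }
    execvp(argv[0], argv);
    perror("could not exec");
    return -1;
}
```

Because everything QEMU does, whether from ARGV or later via QMP, then happens
after the unshare() call, both see the same namespaces.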

Regards,
Daniel

Ross Lagerwall Oct. 23, 2017, 2:30 p.m. UTC | #2

On 10/19/2017 05:24 PM, Daniel P. Berrange wrote:
> On Thu, Oct 19, 2017 at 05:04:19PM +0100, Ross Lagerwall wrote:
>> Add an option to allow calling unshare() just before starting guest
>> execution. The option allows unsharing one or more of the mount
>> namespace, the network namespace, and the IPC namespace. This is useful
>> to restrict the ability of QEMU to cause damage to the system should it
>> be compromised.
>>
>> An example of using this would be to have QEMU open a QMP socket at
>> startup and unshare the network namespace. The instance of QEMU could
>> still be controlled by the QMP socket since that belongs in the original
>> namespace, but if QEMU were compromised it wouldn't be able to open any
>> new connections, even to other processes on the same machine.
> 
> Unless I'm misunderstanding you, what's described here is already possible
> by just using the 'unshare' command to spawn QEMU:
> 
>    # unshare --ipc --mount --net qemu-system-x86_64 -qmp unix:/tmp/foo,server -vnc :1
>    qemu-system-x86_64: -qmp unix:/tmp/foo,server: QEMU waiting for connection on: disconnected:unix:/tmp/foo,server
> 
> And in another shell I can still access the QMP socket from the original host
> namespace

So that works because UNIX domain sockets are not restricted by network
namespaces. But if you try to connect to the VNC server listening on TCP 
port 5901, it won't work.

> 
>    # ./scripts/qmp/qmp-shell /tmp/foo
>    Welcome to the QMP low-level shell!
>    Connected to QEMU 2.9.1
> 
>    (QEMU) query-kvm
>    {"return": {"enabled": false, "present": true}}
> 
> 
> FWIW, even if that were not possible, you could still do it by wrapping the
> qmp-shell in an 'nsenter' call. eg
> 
>    nsenter --target $QEMUPID --net ./scripts/qmp/qmp-shell /tmp/foo

I have a single process which connects to all the QEMUs' listening VNC 
sockets so I'm not sure that this would work.

> 
>> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
>> ---
>>   os-posix.c      | 34 ++++++++++++++++++++++++++++++++++
>>   qemu-options.hx | 14 ++++++++++++++
>>   2 files changed, 48 insertions(+)
>>
>> diff --git a/os-posix.c b/os-posix.c
>> index b9c2343..cfc5c38 100644
>> --- a/os-posix.c
>> +++ b/os-posix.c
>> @@ -45,6 +45,7 @@ static struct passwd *user_pwd;
>>   static const char *chroot_dir;
>>   static int daemonize;
>>   static int daemon_pipe;
>> +static int unshare_flags;
>>   
>>   void os_setup_early_signal_handling(void)
>>   {
>> @@ -160,6 +161,28 @@ void os_parse_cmd_args(int index, const char *optarg)
>>           fips_set_state(true);
>>           break;
>>   #endif
>> +#ifdef CONFIG_SETNS
>> +    case QEMU_OPTION_unshare:
>> +        {
>> +            char *flag;
>> +            char *opts = g_strdup(optarg);
>> +
>> +            while ((flag = qemu_strsep(&opts, ",")) != NULL) {
>> +                if (!strcmp(flag, "mount")) {
>> +                    unshare_flags |= CLONE_NEWNS;
>> +                } else if (!strcmp(flag, "net")) {
>> +                    unshare_flags |= CLONE_NEWNET;
>> +                } else if (!strcmp(flag, "ipc")) {
>> +                    unshare_flags |= CLONE_NEWIPC;
>> +                } else {
>> +                    fprintf(stderr, "Unknown unshare option: %s\n", flag);
>> +                    exit(1);
>> +                }
>> +            }
>> +            g_free(opts);
>> +        }
>> +        break;
>> +#endif
>>       }
>>   }
>>   
>> @@ -201,6 +224,16 @@ static void change_root(void)
>>   
>>   }
>>   
>> +static void unshare_namespaces(void)
>> +{
>> +    if (unshare_flags) {
>> +        if (unshare(unshare_flags) < 0) {
>> +            perror("could not unshare");
>> +            exit(1);
>> +        }
>> +    }
>> +}
>> +
>>   void os_daemonize(void)
>>   {
>>       if (daemonize) {
>> @@ -266,6 +299,7 @@ void os_setup_post(void)
>>       }
>>   
>>       change_root();
>> +    unshare_namespaces();
>>       change_process_uid();
> 
> This has some really bad implications.  All the command line options that are
> given are processed *before* os_setup_post() is called. IOW, -chardev, -vnc,
> -migrate, -net, etc will all be configured in the context of the host namespace.
> 
> If you then use the QMP monitor to run  chardev_add,  device_add, migrate,
> hostnet_add, etc this will all take place in the new namespace.
> 
> So the exact same args given as ARGV now have completely different semantics
> when given via QMP.
> 
> I think this is really very undesirable.

I consider this to be broadly similar to using -chroot -- adding devices 
and so on after chrooting would have a different effect compared with 
adding them before chrooting. I do agree though that both -chroot and 
-unshare could have confusing semantics.

> 
> If you wrap QEMU execution in 'unshare' as I illustrate above, then the
> semantics of ARGV & QMP remain consistent.
> 
> FWIW, as a further point that might be of interest, libvirt will now spawn
> a new private mount namespace for QEMU by default. We do this so that we can
> give QEMU a private /dev filesystem with only the devices it's permitted to
> use present as device nodes.  The ability to do such setup tasks in between
> namespace creation and QEMU launching is broadly useful. For example, if
> using a private network namespace, you might want to create a veth pair and
> put one end in the namespace, so that QEMU's network services have some
> level of outside network connectivity - eg to enable QEMU to connect to a remote
> QEMU for live migration.

Hmm, I think having a veth pair per VM might be a little too much 
overhead and management just to expose the VNC port.

> 
> So overall, I absolutely encourage the use of namespaces to confine QEMU,
> but I tend to think namespace creation/setup is better done outside QEMU
> before launching it.
> 

Thanks for the extensive comments. While I do think that there is some 
value in being able to unshare namespaces after doing the initial setup 
(much like chrooting and dropping privileges), I think for now I can 
work around this by unsharing before starting QEMU and then ensuring 
that QEMU only listens on UNIX domain sockets rather than TCP sockets.

Regards,

Daniel P. Berrangé Oct. 23, 2017, 2:50 p.m. UTC | #3

On Mon, Oct 23, 2017 at 03:30:05PM +0100, Ross Lagerwall wrote:
> On 10/19/2017 05:24 PM, Daniel P. Berrange wrote:
> > On Thu, Oct 19, 2017 at 05:04:19PM +0100, Ross Lagerwall wrote:
> > > Add an option to allow calling unshare() just before starting guest
> > > execution. The option allows unsharing one or more of the mount
> > > namespace, the network namespace, and the IPC namespace. This is useful
> > > to restrict the ability of QEMU to cause damage to the system should it
> > > be compromised.
> > > 
> > > An example of using this would be to have QEMU open a QMP socket at
> > > startup and unshare the network namespace. The instance of QEMU could
> > > still be controlled by the QMP socket since that belongs in the original
> > > namespace, but if QEMU were compromised it wouldn't be able to open any
> > > new connections, even to other processes on the same machine.
> > 
> > Unless I'm misunderstanding you, what's described here is already possible
> > by just using the 'unshare' command to spawn QEMU:
> > 
> >    # unshare --ipc --mount --net qemu-system-x86_64 -qmp unix:/tmp/foo,server -vnc :1
> >    qemu-system-x86_64: -qmp unix:/tmp/foo,server: QEMU waiting for connection on: disconnected:unix:/tmp/foo,server
> > 
> > And in another shell I can still access the QMP socket from the original host
> > namespace
> 
> So that works because UNIX domain sockets are not restricted by network
> namespaces. But if you try to connect to the VNC server listening on TCP
> port 5901, it won't work.
> 
> > 
> >    # ./scripts/qmp/qmp-shell /tmp/foo
> >    Welcome to the QMP low-level shell!
> >    Connected to QEMU 2.9.1
> > 
> >    (QEMU) query-kvm
> >    {"return": {"enabled": false, "present": true}}
> > 
> > 
> > FWIW, even if that were not possible, you could still do it by wrapping the
> > qmp-shell in an 'nsenter' call. eg
> > 
> >    nsenter --target $QEMUPID --net ./scripts/qmp/qmp-shell /tmp/foo
> 
> I have a single process which connects to all the QEMUs' listening VNC
> sockets so I'm not sure that this would work.

Yes, it can still work - you simply need to use setns() to temporarily
switch into QEMU's namespace, and then switch back again afterwards

  oldns = open("/proc/self/ns/net")
  newns = open("/proc/$PID-OF-QEMU/ns/net")
  setns(newns, CLONE_NEWNET)

  ...open connection to VNC...
  
  setns(oldns, CLONE_NEWNET)

  ...use connection to VNC...

The setns() call is thread-local, so you can safely use different
namespaces in each thread if you need to have concurrent comms with
many QEMUs.
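
A fuller C sketch of the sequence above might look like the following. The
function names netns_path() and connect_in_netns() are hypothetical, and the
sketch assumes the caller has CAP_SYS_ADMIN and that the loopback interface is
up inside QEMU's network namespace (a freshly unshared namespace starts with
it down).

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write "/proc/<pid>/ns/net" into buf. Returns 0, or -1 if it does not fit. */
int netns_path(pid_t pid, char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "/proc/%ld/ns/net", (long)pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Connect to 127.0.0.1:port inside the network namespace of qemu_pid.
 * The calling thread is switched back to its original namespace before
 * returning. Returns the connected socket, or -1 on error.
 */
int connect_in_netns(pid_t qemu_pid, uint16_t port)
{
    char path[64];
    int oldns, newns, sock = -1;

    if (netns_path(qemu_pid, path, sizeof(path)) < 0) {
        return -1;
    }
    oldns = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
    newns = open(path, O_RDONLY | O_CLOEXEC);
    if (oldns < 0 || newns < 0 || setns(newns, CLONE_NEWNET) < 0) {
        goto out;
    }

    /* The socket is created in, and stays bound to, QEMU's namespace. */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock >= 0) {
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(sock);
            sock = -1;
        }
    }

    /* Switch back; if this fails the thread is left in QEMU's namespace. */
    if (setns(oldns, CLONE_NEWNET) < 0 && sock >= 0) {
        close(sock);
        sock = -1;
    }
out:
    if (newns >= 0) {
        close(newns);
    }
    if (oldns >= 0) {
        close(oldns);
    }
    return sock;
}
```

With QEMU started as "-vnc :1", a caller would use something like
connect_in_netns(qemu_pid, 5901) and then talk to the returned socket from its
original namespace.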

Regards,
Daniel

Ross Lagerwall Oct. 23, 2017, 3:01 p.m. UTC | #4

On 10/23/2017 03:50 PM, Daniel P. Berrange wrote:
> On Mon, Oct 23, 2017 at 03:30:05PM +0100, Ross Lagerwall wrote:
>> On 10/19/2017 05:24 PM, Daniel P. Berrange wrote:
>>> On Thu, Oct 19, 2017 at 05:04:19PM +0100, Ross Lagerwall wrote:
>>>> Add an option to allow calling unshare() just before starting guest
>>>> execution. The option allows unsharing one or more of the mount
>>>> namespace, the network namespace, and the IPC namespace. This is useful
>>>> to restrict the ability of QEMU to cause damage to the system should it
>>>> be compromised.
>>>>
>>>> An example of using this would be to have QEMU open a QMP socket at
>>>> startup and unshare the network namespace. The instance of QEMU could
>>>> still be controlled by the QMP socket since that belongs in the original
>>>> namespace, but if QEMU were compromised it wouldn't be able to open any
>>>> new connections, even to other processes on the same machine.
>>>
>>> Unless I'm misunderstanding you, what's described here is already possible
>>> by just using the 'unshare' command to spawn QEMU:
>>>
>>>     # unshare --ipc --mount --net qemu-system-x86_64 -qmp unix:/tmp/foo,server -vnc :1
>>>     qemu-system-x86_64: -qmp unix:/tmp/foo,server: QEMU waiting for connection on: disconnected:unix:/tmp/foo,server
>>>
>>> And in another shell I can still access the QMP socket from the original host
>>> namespace
>>
>> So that works because UNIX domain sockets are not restricted by network
>> namespaces. But if you try to connect to the VNC server listening on TCP
>> port 5901, it won't work.
>>
>>>
>>>     # ./scripts/qmp/qmp-shell /tmp/foo
>>>     Welcome to the QMP low-level shell!
>>>     Connected to QEMU 2.9.1
>>>
>>>     (QEMU) query-kvm
>>>     {"return": {"enabled": false, "present": true}}
>>>
>>>
>>> FWIW, even if that were not possible, you could still do it by wrapping the
>>> qmp-shell in an 'nsenter' call. eg
>>>
>>>     nsenter --target $QEMUPID --net ./scripts/qmp/qmp-shell /tmp/foo
>>
>> I have a single process which connects to all the QEMUs' listening VNC
>> sockets so I'm not sure that this would work.
> 
> Yes, it can still work - you simply need to use setns() to temporarily
> switch into QEMU's namespace, and then switch back again afterwards
> 
>    oldns = open("/proc/self/ns/net")
>    newns = open("/proc/$PID-OF-QEMU/ns/net")
>    setns(newns, CLONE_NEWNET)
> 
>    ...open connection to VNC...
>    
>    setns(oldns, CLONE_NEWNET)
> 
>    ...use connection to VNC...
> 
> The setns() call is thread-local, so you can safely use different
> namespaces in each thread if you need to have concurrent comms with
> many QEMUs.
> 

Ah, I didn't realize that the namespace could be set locally for a 
thread. That's helpful, thanks!

Daniel P. Berrangé Oct. 23, 2017, 3:05 p.m. UTC | #5

On Mon, Oct 23, 2017 at 04:01:12PM +0100, Ross Lagerwall wrote:
> On 10/23/2017 03:50 PM, Daniel P. Berrange wrote:
> > On Mon, Oct 23, 2017 at 03:30:05PM +0100, Ross Lagerwall wrote:
> > > On 10/19/2017 05:24 PM, Daniel P. Berrange wrote:
> > > > On Thu, Oct 19, 2017 at 05:04:19PM +0100, Ross Lagerwall wrote:
> > > > > Add an option to allow calling unshare() just before starting guest
> > > > > execution. The option allows unsharing one or more of the mount
> > > > > namespace, the network namespace, and the IPC namespace. This is useful
> > > > > to restrict the ability of QEMU to cause damage to the system should it
> > > > > be compromised.
> > > > > 
> > > > > An example of using this would be to have QEMU open a QMP socket at
> > > > > startup and unshare the network namespace. The instance of QEMU could
> > > > > still be controlled by the QMP socket since that belongs in the original
> > > > > namespace, but if QEMU were compromised it wouldn't be able to open any
> > > > > new connections, even to other processes on the same machine.
> > > > 
> > > > Unless I'm misunderstanding you, what's described here is already possible
> > > > by just using the 'unshare' command to spawn QEMU:
> > > > 
> > > >     # unshare --ipc --mount --net qemu-system-x86_64 -qmp unix:/tmp/foo,server -vnc :1
> > > >     qemu-system-x86_64: -qmp unix:/tmp/foo,server: QEMU waiting for connection on: disconnected:unix:/tmp/foo,server
> > > > 
> > > > And in another shell I can still access the QMP socket from the original host
> > > > namespace
> > > 
> > > So that works because UNIX domain sockets are not restricted by network
> > > namespaces. But if you try to connect to the VNC server listening on TCP
> > > port 5901, it won't work.
> > > 
> > > > 
> > > >     # ./scripts/qmp/qmp-shell /tmp/foo
> > > >     Welcome to the QMP low-level shell!
> > > >     Connected to QEMU 2.9.1
> > > > 
> > > >     (QEMU) query-kvm
> > > >     {"return": {"enabled": false, "present": true}}
> > > > 
> > > > 
> > > > FWIW, even if that were not possible, you could still do it by wrapping the
> > > > qmp-shell in an 'nsenter' call. eg
> > > > 
> > > >     nsenter --target $QEMUPID --net ./scripts/qmp/qmp-shell /tmp/foo
> > > 
> > > I have a single process which connects to all the QEMUs' listening VNC
> > > sockets so I'm not sure that this would work.
> > 
> > Yes, it can still work - you simply need to use setns() to temporarily
> > switch into QEMU's namespace, and then switch back again afterwards
> > 
> >    oldns = open("/proc/self/ns/net")
> >    newns = open("/proc/$PID-OF-QEMU/ns/net")
> >    setns(newns, CLONE_NEWNET)
> > 
> >    ...open connection to VNC...
> >    setns(oldns, CLONE_NEWNET)
> > 
> >    ...use connection to VNC...
> > 
> > The setns() call is thread-local, so you can safely use different
> > namespaces in each thread if you need to have concurrent comms with
> > many QEMUs.
> > 
> 
> Ah, I didn't realize that the namespace could be set locally for a thread.
> That's helpful, thanks!

Oh I should have mentioned that there's some caveats with joining some of
the other namespaces (mount + pid) - so be sure to check the setns() manpage
which describes the edge cases. eg changing pid namespace won't take effect
until you fork() a new child.

Regards,
Daniel

Stefan Hajnoczi Oct. 24, 2017, 12:35 p.m. UTC | #6

On Mon, Oct 23, 2017 at 03:30:05PM +0100, Ross Lagerwall wrote:
> On 10/19/2017 05:24 PM, Daniel P. Berrange wrote:
> > On Thu, Oct 19, 2017 at 05:04:19PM +0100, Ross Lagerwall wrote:
> > > Add an option to allow calling unshare() just before starting guest
> > > execution. The option allows unsharing one or more of the mount
> > > namespace, the network namespace, and the IPC namespace. This is useful
> > > to restrict the ability of QEMU to cause damage to the system should it
> > > be compromised.
> > > 
> > > An example of using this would be to have QEMU open a QMP socket at
> > > startup and unshare the network namespace. The instance of QEMU could
> > > still be controlled by the QMP socket since that belongs in the original
> > > namespace, but if QEMU were compromised it wouldn't be able to open any
> > > new connections, even to other processes on the same machine.
> > 
> > Unless I'm misunderstanding you, what's described here is already possible
> > by just using the 'unshare' command to spawn QEMU:
> > 
> >    # unshare --ipc --mount --net qemu-system-x86_64 -qmp unix:/tmp/foo,server -vnc :1
> >    qemu-system-x86_64: -qmp unix:/tmp/foo,server: QEMU waiting for connection on: disconnected:unix:/tmp/foo,server
> > 
> > And in another shell I can still access the QMP socket from the original host
> > namespace
> 
> So that works because UNIX domain sockets are not restricted by network
> namespaces.

Slightly pedantic but hopefully interesting:

It's not correct to say that UNIX domain sockets are not restricted by
network namespaces; it's more complicated than that.  UNIX domain
sockets fall into several groups:

1. pathname (i.e. they have an inode on disk) sockets are namespaced by
   the mount namespace, not the network namespace.  These sockets can
   only be accessed if the thread's mount (filesystem) namespace can
   reach the inode.

2. unnamed (e.g. socketpair(2)) sockets can never be looked up in any
   namespace anyway, only fork() or file descriptor passing transfers
   them between processes.  Namespaces are irrelevant here.

3. abstract (sun_path[0] == '\0') sockets are affected by the network
   namespace.
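
To make the three groups concrete, here is a small C sketch of how each kind
of socket is addressed. The helper names make_unix_addr() and
make_unnamed_pair() are hypothetical; the abstract form is Linux-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Fill addr for a pathname socket (abstract == 0) or an abstract socket
 * (abstract != 0). Returns the length to pass to bind() or connect(),
 * or 0 if name is empty or does not fit in sun_path.
 */
socklen_t make_unix_addr(struct sockaddr_un *addr, const char *name,
                         int abstract)
{
    size_t n;

    assert(addr != NULL && name != NULL);
    n = strlen(name);
    if (n == 0 || n + 1 > sizeof(addr->sun_path)) {
        return 0;
    }
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    /*
     * Pathname: name followed by a NUL terminator; lookup goes through
     * the filesystem, so the mount namespace decides visibility.
     * Abstract: a leading NUL followed by name; lookup is scoped to the
     * network namespace.
     * Either way the address occupies n + 1 bytes of sun_path.
     */
    memcpy(addr->sun_path + (abstract ? 1 : 0), name, n);
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + n + 1);
}

/*
 * Unnamed sockets have no address at all: the pair can only reach another
 * process via fork() or file descriptor passing.
 */
int make_unnamed_pair(int fds[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
}
```

Passing the result of make_unix_addr(&addr, "/tmp/foo", 0) to bind() gives a
socket of the first kind, while make_unix_addr(&addr, "foo", 1) gives one of
the third kind.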

Stefan
