diff mbox series

[v2,3/3] virtiofsd: probe unshare(CLONE_FS) and print an error

Message ID 20200727190223.422280-4-stefanha@redhat.com (mailing list archive)
State New, archived
Headers show
Series virtiofsd: allow virtiofsd to run in a container | expand

Commit Message

Stefan Hajnoczi July 27, 2020, 7:02 p.m. UTC
An assertion failure is raised during request processing if
unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
be detected right away.

Unfortunately Docker/Moby does not include unshare in the seccomp.json
list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
include unshare (e.g. podman is unaffected):
https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json

Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
default seccomp.json is missing unshare.

Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Tomohiro Misono (Fujitsu) July 28, 2020, 1:05 a.m. UTC | #1
> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
> 
> An assertion failure is raised during request processing if
> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> be detected right away.
> 
> Unfortunately Docker/Moby does not include unshare in the seccomp.json
> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> include unshare (e.g. podman is unaffected):
> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> 
> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> default seccomp.json is missing unshare.

Hi, sorry for a bit late.

unshare() was added to fix xattr problem: 
  https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
In theory we don't need to call unshare if xattr is disabled, but it is hard to get to know
if xattr is enabled or disabled in fv_queue_worker(), right?

So, it looks good to me.
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

Regards,
Misono

> 
> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 3b6d16a041..9e5537506c 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
>  {
>      int ret;
> 
> +    /*
> +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> +     * an unprivileged system call but some Docker/Moby versions are known to
> +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> +     *
> +     * Note that the program is single-threaded here so this syscall has no
> +     * visible effect and is safe to make.
> +     */
> +    ret = unshare(CLONE_FS);
> +    if (ret == -1 && errno == EPERM) {
> +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> +                "running in a container please check that the container "
> +                "runtime seccomp policy allows unshare.\n");
> +        return -1;
> +    }
> +
>      ret = fv_create_listen_socket(se);
>      if (ret < 0) {
>          return ret;
> --
> 2.26.2
Roman Mohr July 28, 2020, 10 a.m. UTC | #2
On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
misono.tomohiro@fujitsu.com> wrote:

> > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> error
> >
> > An assertion failure is raised during request processing if
> > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > be detected right away.
> >
> > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > include unshare (e.g. podman is unaffected):
> >
> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> >
> > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > default seccomp.json is missing unshare.
>
> Hi, sorry for a bit late.
>
> unshare() was added to fix xattr problem:
>
> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> In theory we don't need to call unshare if xattr is disabled, but it is
> hard to get to know
> if xattr is enabled or disabled in fv_queue_worker(), right?
>
>
In kubevirt we want to run virtiofsd in containers. We would already not
have xattr support for e.g. overlayfs in the VM after this patch series (an
acceptable con at least for us right now).
If we can get rid of the unshare (and potentially of needing root) that
would be great. We always assume that everything which we run in containers
should work for cri-o and docker.

"Just" pointing docker to a different seccomp.json file is something which
k8s users/admin in many cases can't do.

Best Regards,
Roman


> So, it looks good to me.
> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
>
> Regards,
> Misono
>
> >
> > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/tools/virtiofsd/fuse_virtio.c
> b/tools/virtiofsd/fuse_virtio.c
> > index 3b6d16a041..9e5537506c 100644
> > --- a/tools/virtiofsd/fuse_virtio.c
> > +++ b/tools/virtiofsd/fuse_virtio.c
> > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
> >  {
> >      int ret;
> >
> > +    /*
> > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need
> it. It's
> > +     * an unprivileged system call but some Docker/Moby versions are
> known to
> > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > +     *
> > +     * Note that the program is single-threaded here so this syscall
> has no
> > +     * visible effect and is safe to make.
> > +     */
> > +    ret = unshare(CLONE_FS);
> > +    if (ret == -1 && errno == EPERM) {
> > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If
> "
> > +                "running in a container please check that the container
> "
> > +                "runtime seccomp policy allows unshare.\n");
> > +        return -1;
> > +    }
> > +
> >      ret = fv_create_listen_socket(se);
> >      if (ret < 0) {
> >          return ret;
> > --
> > 2.26.2
>
>
Vivek Goyal July 28, 2020, 1:12 p.m. UTC | #3
On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> misono.tomohiro@fujitsu.com> wrote:
> 
> > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> > error
> > >
> > > An assertion failure is raised during request processing if
> > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > be detected right away.
> > >
> > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > include unshare (e.g. podman is unaffected):
> > >
> > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > >
> > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > default seccomp.json is missing unshare.
> >
> > Hi, sorry for a bit late.
> >
> > unshare() was added to fix xattr problem:
> >
> > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > In theory we don't need to call unshare if xattr is disabled, but it is
> > hard to get to know
> > if xattr is enabled or disabled in fv_queue_worker(), right?
> >
> >
> In kubevirt we want to run virtiofsd in containers. We would already not
> have xattr support for e.g. overlayfs in the VM after this patch series (an
> acceptable con at least for us right now).
> If we can get rid of the unshare (and potentially of needing root) that
> would be great. We always assume that everything which we run in containers
> should work for cri-o and docker.

But cri-o and docker containers run as root, isn't it? (or atleast have
the capability to run as root). Havind said that, it will be nice to be able
to run virtiofsd without root. 

There are few hurdles though.

- For file creation, we switch uid/gid (seteuid/setegid) and that seems
  to require root. If we were to run unpriviliged, probably all files
  on host will have to be owned by unpriviliged user and guest visible
  uid/gid will have to be stored in xattrs. I think virtfs supports
  something similar.

I am sure there are other restrictions but this probably is the biggest
one to overcome.

 > 
> "Just" pointing docker to a different seccomp.json file is something which
> k8s users/admin in many cases can't do.

Or may be issue is that standard seccomp.json does not allow unshare()
and hence you are forced to use a non-standar seccomp.json.

Vivek

> 
> Best Regards,
> Roman
> 
> 
> > So, it looks good to me.
> > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> >
> > Regards,
> > Misono
> >
> > >
> > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > >  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> > >  1 file changed, 16 insertions(+)
> > >
> > > diff --git a/tools/virtiofsd/fuse_virtio.c
> > b/tools/virtiofsd/fuse_virtio.c
> > > index 3b6d16a041..9e5537506c 100644
> > > --- a/tools/virtiofsd/fuse_virtio.c
> > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
> > >  {
> > >      int ret;
> > >
> > > +    /*
> > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need
> > it. It's
> > > +     * an unprivileged system call but some Docker/Moby versions are
> > known to
> > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > +     *
> > > +     * Note that the program is single-threaded here so this syscall
> > has no
> > > +     * visible effect and is safe to make.
> > > +     */
> > > +    ret = unshare(CLONE_FS);
> > > +    if (ret == -1 && errno == EPERM) {
> > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If
> > "
> > > +                "running in a container please check that the container
> > "
> > > +                "runtime seccomp policy allows unshare.\n");
> > > +        return -1;
> > > +    }
> > > +
> > >      ret = fv_create_listen_socket(se);
> > >      if (ret < 0) {
> > >          return ret;
> > > --
> > > 2.26.2
> >
> >
Stefan Hajnoczi July 28, 2020, 3:32 p.m. UTC | #4
On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> misono.tomohiro@fujitsu.com> wrote:
> 
> > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> > error
> > >
> > > An assertion failure is raised during request processing if
> > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > be detected right away.
> > >
> > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > include unshare (e.g. podman is unaffected):
> > >
> > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > >
> > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > default seccomp.json is missing unshare.
> >
> > Hi, sorry for a bit late.
> >
> > unshare() was added to fix xattr problem:
> >
> > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > In theory we don't need to call unshare if xattr is disabled, but it is
> > hard to get to know
> > if xattr is enabled or disabled in fv_queue_worker(), right?
> >
> >
> In kubevirt we want to run virtiofsd in containers. We would already not
> have xattr support for e.g. overlayfs in the VM after this patch series (an
> acceptable con at least for us right now).
> If we can get rid of the unshare (and potentially of needing root) that
> would be great. We always assume that everything which we run in containers
> should work for cri-o and docker.

Root is required to access files with any uid/gid.

Dave Gilbert is working on xattr support without CAP_SYS_ADMIN. He may
be able to find a way to drop unshare (at least in containers).

> "Just" pointing docker to a different seccomp.json file is something which
> k8s users/admin in many cases can't do.

There is a Moby PR to change the default seccomp.json file here but it's
unclear if it will be merged:
https://github.com/moby/moby/pull/41244

Stefan
Daniel P. Berrangé July 28, 2020, 3:52 p.m. UTC | #5
On Tue, Jul 28, 2020 at 09:12:50AM -0400, Vivek Goyal wrote:
> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > misono.tomohiro@fujitsu.com> wrote:
> > 
> > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> > > error
> > > >
> > > > An assertion failure is raised during request processing if
> > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > > be detected right away.
> > > >
> > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > include unshare (e.g. podman is unaffected):
> > > >
> > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > >
> > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > > default seccomp.json is missing unshare.
> > >
> > > Hi, sorry for a bit late.
> > >
> > > unshare() was added to fix xattr problem:
> > >
> > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > > In theory we don't need to call unshare if xattr is disabled, but it is
> > > hard to get to know
> > > if xattr is enabled or disabled in fv_queue_worker(), right?
> > >
> > >
> > In kubevirt we want to run virtiofsd in containers. We would already not
> > have xattr support for e.g. overlayfs in the VM after this patch series (an
> > acceptable con at least for us right now).
> > If we can get rid of the unshare (and potentially of needing root) that
> > would be great. We always assume that everything which we run in containers
> > should work for cri-o and docker.
> 
> But cri-o and docker containers run as root, isn't it? (or atleast have
> the capability to run as root). Havind said that, it will be nice to be able
> to run virtiofsd without root. 
> 
> There are few hurdles though.
> 
> - For file creation, we switch uid/gid (seteuid/setegid) and that seems
>   to require root. If we were to run unpriviliged, probably all files
>   on host will have to be owned by unpriviliged user and guest visible
>   uid/gid will have to be stored in xattrs. I think virtfs supports
>   something similar.

I think I've mentioned before, 9p virtfs supports different modes,
passthrough, squashed or remapped.

passthrough should be reasonably straightforward to support in virtiofs.
The guest sees all the host UID/GIDs ownership as normal, and can read
any files the host user can read, but are obviously restricted to write
to only the files that host user can write too. No DAC-OVERRIDE facility
in essence. You'll just get EPERM, which is fine. This simple passthrough
scenario would be just what's desired for a typical desktop virt use
cases, where you want to share part/all of your home dir with a guest for
easy file access. Personally this is the mode I'd be most interested in
seeing provided for unprivileged virtiofsd usage.

squash is similar to passthrough, except the guest sees everything
as owned by the same user. This can be surprising as the guest might
see a file owned by them, but not be able to write to it, as on the
host its actually owned by some other user. Fairly niche use case
I think.

remapping would be needed for a more general purpose use cases
allowing the guest to do arbitrary UID/GID changes, but on the host
everything is still stored as one user and remapped somehow.

The main challenge for all the unprivileged scenarios is safety of
the sandbox, to avoid risk of guests escaping to access files outside
of the exported dir via symlink attacks or similar.



Regards,
Daniel
Daniel Walsh July 28, 2020, 7:12 p.m. UTC | #6
On 7/28/20 09:12, Vivek Goyal wrote:
> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
>> misono.tomohiro@fujitsu.com> wrote:
>>
>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
>>> error
>>>> An assertion failure is raised during request processing if
>>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
>>>> be detected right away.
>>>>
>>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json
>>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
>>>> include unshare (e.g. podman is unaffected):
>>>>
>>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
>>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
>>>> default seccomp.json is missing unshare.
>>> Hi, sorry for a bit late.
>>>
>>> unshare() was added to fix xattr problem:
>>>
>>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
>>> In theory we don't need to call unshare if xattr is disabled, but it is
>>> hard to get to know
>>> if xattr is enabled or disabled in fv_queue_worker(), right?
>>>
>>>
>> In kubevirt we want to run virtiofsd in containers. We would already not
>> have xattr support for e.g. overlayfs in the VM after this patch series (an
>> acceptable con at least for us right now).
>> If we can get rid of the unshare (and potentially of needing root) that
>> would be great. We always assume that everything which we run in containers
>> should work for cri-o and docker.
> But cri-o and docker containers run as root, isn't it? (or atleast have
> the capability to run as root). Havind said that, it will be nice to be able
> to run virtiofsd without root. 
>
> There are few hurdles though.
>
> - For file creation, we switch uid/gid (seteuid/setegid) and that seems
>   to require root. If we were to run unpriviliged, probably all files
>   on host will have to be owned by unpriviliged user and guest visible
>   uid/gid will have to be stored in xattrs. I think virtfs supports
>   something similar.
>
> I am sure there are other restrictions but this probably is the biggest
> one to overcome.
>
>  > 
You should be able to run it within a user namespace with Namespaces
capabilities.
>> "Just" pointing docker to a different seccomp.json file is something which
>> k8s users/admin in many cases can't do.
> Or may be issue is that standard seccomp.json does not allow unshare()
> and hence you are forced to use a non-standar seccomp.json.
>
> Vivek
>
>> Best Regards,
>> Roman
>>
>>
>>> So, it looks good to me.
>>> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
>>>
>>> Regards,
>>> Misono
>>>
>>>> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
>>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> ---
>>>>  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
>>>>  1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/tools/virtiofsd/fuse_virtio.c
>>> b/tools/virtiofsd/fuse_virtio.c
>>>> index 3b6d16a041..9e5537506c 100644
>>>> --- a/tools/virtiofsd/fuse_virtio.c
>>>> +++ b/tools/virtiofsd/fuse_virtio.c
>>>> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
>>>>  {
>>>>      int ret;
>>>>
>>>> +    /*
>>>> +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need
>>> it. It's
>>>> +     * an unprivileged system call but some Docker/Moby versions are
>>> known to
>>>> +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
>>>> +     *
>>>> +     * Note that the program is single-threaded here so this syscall
>>> has no
>>>> +     * visible effect and is safe to make.
>>>> +     */
>>>> +    ret = unshare(CLONE_FS);
>>>> +    if (ret == -1 && errno == EPERM) {
>>>> +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If
>>> "
>>>> +                "running in a container please check that the container
>>> "
>>>> +                "runtime seccomp policy allows unshare.\n");
>>>> +        return -1;
>>>> +    }
>>>> +
>>>>      ret = fv_create_listen_socket(se);
>>>>      if (ret < 0) {
>>>>          return ret;
>>>> --
>>>> 2.26.2
>>>
Daniel Walsh July 28, 2020, 7:15 p.m. UTC | #7
On 7/28/20 11:32, Stefan Hajnoczi wrote:
> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
>> misono.tomohiro@fujitsu.com> wrote:
>>
>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
>>> error
>>>> An assertion failure is raised during request processing if
>>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
>>>> be detected right away.
>>>>
>>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json
>>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
>>>> include unshare (e.g. podman is unaffected):
>>>>
>>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
>>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
>>>> default seccomp.json is missing unshare.
>>> Hi, sorry for a bit late.
>>>
>>> unshare() was added to fix xattr problem:
>>>
>>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
>>> In theory we don't need to call unshare if xattr is disabled, but it is
>>> hard to get to know
>>> if xattr is enabled or disabled in fv_queue_worker(), right?
>>>
>>>
>> In kubevirt we want to run virtiofsd in containers. We would already not
>> have xattr support for e.g. overlayfs in the VM after this patch series (an
>> acceptable con at least for us right now).
>> If we can get rid of the unshare (and potentially of needing root) that
>> would be great. We always assume that everything which we run in containers
>> should work for cri-o and docker.
> Root is required to access files with any uid/gid.
>
> Dave Gilbert is working on xattr support without CAP_SYS_ADMIN. He may
> be able to find a way to drop unshare (at least in containers).
>
>> "Just" pointing docker to a different seccomp.json file is something which
>> k8s users/admin in many cases can't do.
> There is a Moby PR to change the default seccomp.json file here but it's
> unclear if it will be merged:
> https://github.com/moby/moby/pull/41244
>
> Stefan

Why not try Podman?
Vivek Goyal July 28, 2020, 8:54 p.m. UTC | #8
On Tue, Jul 28, 2020 at 04:52:33PM +0100, Daniel P. Berrangé wrote:
> On Tue, Jul 28, 2020 at 09:12:50AM -0400, Vivek Goyal wrote:
> > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > > misono.tomohiro@fujitsu.com> wrote:
> > > 
> > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> > > > error
> > > > >
> > > > > An assertion failure is raised during request processing if
> > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > > > > be detected right away.
> > > > >
> > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > > include unshare (e.g. podman is unaffected):
> > > > >
> > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > > >
> > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > > > > default seccomp.json is missing unshare.
> > > >
> > > > Hi, sorry for a bit late.
> > > >
> > > > unshare() was added to fix xattr problem:
> > > >
> > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > > > In theory we don't need to call unshare if xattr is disabled, but it is
> > > > hard to get to know
> > > > if xattr is enabled or disabled in fv_queue_worker(), right?
> > > >
> > > >
> > > In kubevirt we want to run virtiofsd in containers. We would already not
> > > have xattr support for e.g. overlayfs in the VM after this patch series (an
> > > acceptable con at least for us right now).
> > > If we can get rid of the unshare (and potentially of needing root) that
> > > would be great. We always assume that everything which we run in containers
> > > should work for cri-o and docker.
> > 
> > But cri-o and docker containers run as root, isn't it? (or atleast have
> > the capability to run as root). Havind said that, it will be nice to be able
> > to run virtiofsd without root. 
> > 
> > There are few hurdles though.
> > 
> > - For file creation, we switch uid/gid (seteuid/setegid) and that seems
> >   to require root. If we were to run unpriviliged, probably all files
> >   on host will have to be owned by unpriviliged user and guest visible
> >   uid/gid will have to be stored in xattrs. I think virtfs supports
> >   something similar.
> 
> I think I've mentioned before, 9p virtfs supports different modes,
> passthrough, squashed or remapped.
> 
> passthrough should be reasonably straightforward to support in virtiofs.
> The guest sees all the host UID/GIDs ownership as normal, and can read
> any files the host user can read, but are obviously restricted to write
> to only the files that host user can write too. No DAC-OVERRIDE facility
> in essence. You'll just get EPERM, which is fine. This simple passthrough
> scenario would be just what's desired for a typical desktop virt use
> cases, where you want to share part/all of your home dir with a guest for
> easy file access. Personally this is the mode I'd be most interested in
> seeing provided for unprivileged virtiofsd usage.

Interesting. So passthrough will have two sub modes. priviliged and
unpriviliged. As of now we support priviliged passthrough. 

I guess it does make sense to look into unpriviliged passthrough
and see what other operations will not be allowed.

Thanks
Vivek

> 
> squash is similar to passthrough, except the guest sees everything
> as owned by the same user. This can be surprising as the guest might
> see a file owned by them, but not be able to write to it, as on the
> host its actually owned by some other user. Fairly niche use case
> I think.
> 
> remapping would be needed for a more general purpose use cases
> allowing the guest to do arbitrary UID/GID changes, but on the host
> everything is still stored as one user and remapped somehow.
> 
> The main challenge for all the unprivileged scenarios is safety of
> the sandbox, to avoid risk of guests escaping to access files outside
> of the exported dir via symlink attacks or similar.
> 
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Vivek Goyal July 28, 2020, 9:01 p.m. UTC | #9
On Tue, Jul 28, 2020 at 03:12:54PM -0400, Daniel Walsh wrote:
> On 7/28/20 09:12, Vivek Goyal wrote:
> > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> >> misono.tomohiro@fujitsu.com> wrote:
> >>
> >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> >>> error
> >>>> An assertion failure is raised during request processing if
> >>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> >>>> be detected right away.
> >>>>
> >>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json
> >>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> >>>> include unshare (e.g. podman is unaffected):
> >>>>
> >>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> >>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> >>>> default seccomp.json is missing unshare.
> >>> Hi, sorry for a bit late.
> >>>
> >>> unshare() was added to fix xattr problem:
> >>>
> >>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> >>> In theory we don't need to call unshare if xattr is disabled, but it is
> >>> hard to get to know
> >>> if xattr is enabled or disabled in fv_queue_worker(), right?
> >>>
> >>>
> >> In kubevirt we want to run virtiofsd in containers. We would already not
> >> have xattr support for e.g. overlayfs in the VM after this patch series (an
> >> acceptable con at least for us right now).
> >> If we can get rid of the unshare (and potentially of needing root) that
> >> would be great. We always assume that everything which we run in containers
> >> should work for cri-o and docker.
> > But cri-o and docker containers run as root, isn't it? (or atleast have
> > the capability to run as root). Havind said that, it will be nice to be able
> > to run virtiofsd without root. 
> >
> > There are few hurdles though.
> >
> > - For file creation, we switch uid/gid (seteuid/setegid) and that seems
> >   to require root. If we were to run unpriviliged, probably all files
> >   on host will have to be owned by unpriviliged user and guest visible
> >   uid/gid will have to be stored in xattrs. I think virtfs supports
> >   something similar.
> >
> > I am sure there are other restrictions but this probably is the biggest
> > one to overcome.
> >
> >  > 
> You should be able to run it within a user namespace with Namespaces
> capabilities.

User namespaces has always been a one TODO item. Few challenges are
how to manage uid/gid mapping for existing directories which will be
shared. We will have to pick a mapping range and then chown the
files accordingly. And chowning itself will require priviliges.

Now Stefan is adding support to run virtiofsd inside container. So
podman should be able virtiofsd inside a user namespace (And even
give CAP_SYS_ADMIN). That way we don't have to worry about setting
a user namespace and select a mapping etc. But persistent data
volumes will continue to be an issue with user namespace, right?

How are you dealing with it in podman. Containers running in 
user namespace and with volume mounts.

Vivek

> >> "Just" pointing docker to a different seccomp.json file is something which
> >> k8s users/admin in many cases can't do.
> > Or may be issue is that standard seccomp.json does not allow unshare()
> > and hence you are forced to use a non-standar seccomp.json.
> >
> > Vivek
> >
> >> Best Regards,
> >> Roman
> >>
> >>
> >>> So, it looks good to me.
> >>> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> >>>
> >>> Regards,
> >>> Misono
> >>>
> >>>> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> >>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>> ---
> >>>>  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> >>>>  1 file changed, 16 insertions(+)
> >>>>
> >>>> diff --git a/tools/virtiofsd/fuse_virtio.c
> >>> b/tools/virtiofsd/fuse_virtio.c
> >>>> index 3b6d16a041..9e5537506c 100644
> >>>> --- a/tools/virtiofsd/fuse_virtio.c
> >>>> +++ b/tools/virtiofsd/fuse_virtio.c
> >>>> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
> >>>>  {
> >>>>      int ret;
> >>>>
> >>>> +    /*
> >>>> +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need
> >>> it. It's
> >>>> +     * an unprivileged system call but some Docker/Moby versions are
> >>> known to
> >>>> +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> >>>> +     *
> >>>> +     * Note that the program is single-threaded here so this syscall
> >>> has no
> >>>> +     * visible effect and is safe to make.
> >>>> +     */
> >>>> +    ret = unshare(CLONE_FS);
> >>>> +    if (ret == -1 && errno == EPERM) {
> >>>> +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If
> >>> "
> >>>> +                "running in a container please check that the container
> >>> "
> >>>> +                "runtime seccomp policy allows unshare.\n");
> >>>> +        return -1;
> >>>> +    }
> >>>> +
> >>>>      ret = fv_create_listen_socket(se);
> >>>>      if (ret < 0) {
> >>>>          return ret;
> >>>> --
> >>>> 2.26.2
> >>>
>
Roman Mohr July 29, 2020, 7:59 a.m. UTC | #10
On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > misono.tomohiro@fujitsu.com> wrote:
> >
> > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print
> an
> > > error
> > > >
> > > > An assertion failure is raised during request processing if
> > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem
> can
> > > > be detected right away.
> > > >
> > > > Unfortunately Docker/Moby does not include unshare in the
> seccomp.json
> > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > > > include unshare (e.g. podman is unaffected):
> > > >
> > >
> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > > >
> > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if
> the
> > > > default seccomp.json is missing unshare.
> > >
> > > Hi, sorry for a bit late.
> > >
> > > unshare() was added to fix xattr problem:
> > >
> > >
> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> > > In theory we don't need to call unshare if xattr is disabled, but it is
> > > hard to get to know
> > > if xattr is enabled or disabled in fv_queue_worker(), right?
> > >
> > >
> > In kubevirt we want to run virtiofsd in containers. We would already not
> > have xattr support for e.g. overlayfs in the VM after this patch series
> (an
> > acceptable con at least for us right now).
> > If we can get rid of the unshare (and potentially of needing root) that
> > would be great. We always assume that everything which we run in
> containers
> > should work for cri-o and docker.
>
> But cri-o and docker containers run as root, isn't it? (or atleast have
> the capability to run as root). Havind said that, it will be nice to be
> able
> to run virtiofsd without root.
>

Yes they can run as root. I can tell you what we plan to do with the
containerized virtiofsd: We run it as part of the user-owned pod (a set of
containers).
One of our main goals at the moment is to run VMs in a user-owned pod
without additional privileges.
So that in case the user (VM-creator/owner) enters the pod or something
breaks out of the VM they are just in the unprivileged container sandbox.
As part of that we try to get also rid of running containers in the
user-context with the root user.

One possible scenario which I could think of as being desirable from a
kubevirt perspective:
We would run the VM in one container and have an unprivileged
virtiofsd container in parallel.
This container already has its own mount namespace and it is not that
critical if something manages to enter this sandbox.

But we are not as far yet as getting completely rid of root right now in
kubevirt, so if as a temporary step it needs root, the current proposed
changes would still be very useful for us.

Best Regards,
Roman

There are few hurdles though.
>
> - For file creation, we switch uid/gid (seteuid/setegid) and that seems
>   to require root. If we were to run unpriviliged, probably all files
>   on host will have to be owned by unpriviliged user and guest visible
>   uid/gid will have to be stored in xattrs. I think virtfs supports
>   something similar.


> I am sure there are other restrictions but this probably is the biggest
> one to overcome.
>
>  >
> > "Just" pointing docker to a different seccomp.json file is something
> which
> > k8s users/admin in many cases can't do.
>
> Or may be issue is that standard seccomp.json does not allow unshare()
> and hence you are forced to use a non-standar seccomp.json.
>
> Vivek
>
> >
> > Best Regards,
> > Roman
> >
> >
> > > So, it looks good to me.
> > > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > >
> > > Regards,
> > > Misono
> > >
> > > >
> > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > ---
> > > >  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> > > >  1 file changed, 16 insertions(+)
> > > >
> > > > diff --git a/tools/virtiofsd/fuse_virtio.c
> > > b/tools/virtiofsd/fuse_virtio.c
> > > > index 3b6d16a041..9e5537506c 100644
> > > > --- a/tools/virtiofsd/fuse_virtio.c
> > > > +++ b/tools/virtiofsd/fuse_virtio.c
> > > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session
> *se)
> > > >  {
> > > >      int ret;
> > > >
> > > > +    /*
> > > > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will
> need
> > > it. It's
> > > > +     * an unprivileged system call but some Docker/Moby versions are
> > > known to
> > > > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > > > +     *
> > > > +     * Note that the program is single-threaded here so this syscall
> > > has no
> > > > +     * visible effect and is safe to make.
> > > > +     */
> > > > +    ret = unshare(CLONE_FS);
> > > > +    if (ret == -1 && errno == EPERM) {
> > > > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with
> EPERM. If
> > > "
> > > > +                "running in a container please check that the
> container
> > > "
> > > > +                "runtime seccomp policy allows unshare.\n");
> > > > +        return -1;
> > > > +    }
> > > > +
> > > >      ret = fv_create_listen_socket(se);
> > > >      if (ret < 0) {
> > > >          return ret;
> > > > --
> > > > 2.26.2
> > >
> > >
>
>
Stefan Hajnoczi July 29, 2020, 2:29 p.m. UTC | #11
On Tue, Jul 28, 2020 at 03:15:25PM -0400, Daniel Walsh wrote:
> On 7/28/20 11:32, Stefan Hajnoczi wrote:
> > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> >> misono.tomohiro@fujitsu.com> wrote:
> >>
> >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an
> >>> error
> >> "Just" pointing docker to a different seccomp.json file is something which
> >> k8s users/admin in many cases can't do.
> > There is a Moby PR to change the default seccomp.json file here but it's
> > unclear if it will be merged:
> > https://github.com/moby/moby/pull/41244
> >
> > Stefan
> 
> Why not try Podman?

Absolutely, Podman allows unshare(2) in its default seccomp policy so it
does not have this problem.

I think Roman's point was mainly about the upstream user experience
where Docker is common.

Stefan
Stefan Hajnoczi July 29, 2020, 2:40 p.m. UTC | #12
On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > > misono.tomohiro@fujitsu.com> wrote:
> > >
> > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print
> > an
> > > > error
> Yes they can run as root. I can tell you what we plan to do with the
> containerized virtiofsd: We run it as part of the user-owned pod (a set of
> containers).
> One of our main goals at the moment is to run VMs in a user-owned pod
> without additional privileges.
> So that in case the user (VM-creator/owner) enters the pod or something
> breaks out of the VM they are just in the unprivileged container sandbox.
> As part of that we try to get also rid of running containers in the
> user-context with the root user.
> 
> One possible scenario which I could think of as being desirable from a
> kubevirt perspective:
> We would run the VM in one container and have an unprivileged
> virtiofsd container in parallel.
> This container already has its own mount namespace and it is not that
> critical if something manages to enter this sandbox.
> 
> But we are not as far yet as getting completely rid of root right now in
> kubevirt, so if as a temporary step it needs root, the current proposed
> changes would still be very useful for us.

What is the issue with root in user namespaces?

I remember a few years ago it was seen as a major security issue but
don't remember if container runtimes were already using user namespaces
back then.

I guess the goal might be simply to minimize Linux capabilities as much
as possible?

virtiofsd could nominally run with an arbitrary uid/gid but it still
needs the Linux capabilities that allow it to change uid/gid and
override file system permission checks just like the root user. Not sure
if there is any advantage to running with uid 1000 when you still have
these Linux capabilities.

Stefan
Daniel Walsh July 30, 2020, 10:21 p.m. UTC | #13
On 7/29/20 10:40, Stefan Hajnoczi wrote:
> On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
>> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>>
>>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
>>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
>>>> misono.tomohiro@fujitsu.com> wrote:
>>>>
>>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print
>>> an
>>>>> error
>> Yes they can run as root. I can tell you what we plan to do with the
>> containerized virtiofsd: We run it as part of the user-owned pod (a set of
>> containers).
>> One of our main goals at the moment is to run VMs in a user-owned pod
>> without additional privileges.
>> So that in case the user (VM-creator/owner) enters the pod or something
>> breaks out of the VM they are just in the unprivileged container sandbox.
>> As part of that we try to get also rid of running containers in the
>> user-context with the root user.
>>
>> One possible scenario which I could think of as being desirable from a
>> kubevirt perspective:
>> We would run the VM in one container and have an unprivileged
>> virtiofsd container in parallel.
>> This container already has its own mount namespace and it is not that
>> critical if something manages to enter this sandbox.
>>
>> But we are not as far yet as getting completely rid of root right now in
>> kubevirt, so if as a temporary step it needs root, the current proposed
>> changes would still be very useful for us.
> What is the issue with root in user namespaces?
>
> I remember a few years ago it was seen as a major security issue but
> don't remember if container runtimes were already using user namespaces
> back then.
>
> I guess the goal might be simply to minimize Linux capabilities as much
> as possible?
>
> virtiofsd could nominally run with an arbitrary uid/gid but it still
> needs the Linux capabilities that allow it to change uid/gid and
> override file system permission checks just like the root user. Not sure
> if there is any advantage to running with uid 1000 when you still have
> these Linux capabilities.
>
> Stefan

When you run in a user namespace, virtiofsd would only have
setuid/setgid over the range of UIDs mapped into the user namespace.  So
if UID=0 on the host is not mapped, then the container can not create
real UID=0 files on disk.

Similarly you can protect the user directories and any content by
running the containers in a really high UID Mapping.
Stefan Hajnoczi July 31, 2020, 8:26 a.m. UTC | #14
On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote:
> On 7/29/20 10:40, Stefan Hajnoczi wrote:
> > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
> >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >>
> >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> >>>> misono.tomohiro@fujitsu.com> wrote:
> >>>>
> >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print
> >>> an
> >>>>> error
> >> Yes they can run as root. I can tell you what we plan to do with the
> >> containerized virtiofsd: We run it as part of the user-owned pod (a set of
> >> containers).
> >> One of our main goals at the moment is to run VMs in a user-owned pod
> >> without additional privileges.
> >> So that in case the user (VM-creator/owner) enters the pod or something
> >> breaks out of the VM they are just in the unprivileged container sandbox.
> >> As part of that we try to get also rid of running containers in the
> >> user-context with the root user.
> >>
> >> One possible scenario which I could think of as being desirable from a
> >> kubevirt perspective:
> >> We would run the VM in one container and have an unprivileged
> >> virtiofsd container in parallel.
> >> This container already has its own mount namespace and it is not that
> >> critical if something manages to enter this sandbox.
> >>
> >> But we are not as far yet as getting completely rid of root right now in
> >> kubevirt, so if as a temporary step it needs root, the current proposed
> >> changes would still be very useful for us.
> > What is the issue with root in user namespaces?
> >
> > I remember a few years ago it was seen as a major security issue but
> > don't remember if container runtimes were already using user namespaces
> > back then.
> >
> > I guess the goal might be simply to minimize Linux capabilities as much
> > as possible?
> >
> > virtiofsd could nominally run with an arbitrary uid/gid but it still
> > needs the Linux capabilities that allow it to change uid/gid and
> > override file system permission checks just like the root user. Not sure
> > if there is any advantage to running with uid 1000 when you still have
> > these Linux capabilities.
> >
> > Stefan
> 
> When you run in a user namespace, virtiofsd would only have
> setuid/setgid over the range of UIDs mapped into the user namespace.  So
> if UID=0 on the host is not mapped, then the container can not create
> real UID=0 files on disk.
> 
> Similarly you can protect the user directories and any content by
> running the containers in a really high UID Mapping.

Roman, do user namespaces address your concerns about uid 0 in
containers?

Stefan
Roman Mohr July 31, 2020, 8:39 a.m. UTC | #15
On Fri, Jul 31, 2020 at 10:26 AM Stefan Hajnoczi <stefanha@redhat.com>
wrote:

> On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote:
> > On 7/29/20 10:40, Stefan Hajnoczi wrote:
> > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
> > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com>
> wrote:
> > >>
> > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > >>>> misono.tomohiro@fujitsu.com> wrote:
> > >>>>
> > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and
> print
> > >>> an
> > >>>>> error
> > >> Yes they can run as root. I can tell you what we plan to do with the
> > >> containerized virtiofsd: We run it as part of the user-owned pod (a
> set of
> > >> containers).
> > >> One of our main goals at the moment is to run VMs in a user-owned pod
> > >> without additional privileges.
> > >> So that in case the user (VM-creator/owner) enters the pod or
> something
> > >> breaks out of the VM they are just in the unprivileged container
> sandbox.
> > >> As part of that we try to get also rid of running containers in the
> > >> user-context with the root user.
> > >>
> > >> One possible scenario which I could think of as being desirable from a
> > >> kubevirt perspective:
> > >> We would run the VM in one container and have an unprivileged
> > >> virtiofsd container in parallel.
> > >> This container already has its own mount namespace and it is not that
> > >> critical if something manages to enter this sandbox.
> > >>
> > >> But we are not as far yet as getting completely rid of root right now
> in
> > >> kubevirt, so if as a temporary step it needs root, the current
> proposed
> > >> changes would still be very useful for us.
> > > What is the issue with root in user namespaces?
> > >
> > > I remember a few years ago it was seen as a major security issue but
> > > don't remember if container runtimes were already using user namespaces
> > > back then.
> > >
> > > I guess the goal might be simply to minimize Linux capabilities as much
> > > as possible?
> > >
> > > virtiofsd could nominally run with an arbitrary uid/gid but it still
> > > needs the Linux capabilities that allow it to change uid/gid and
> > > override file system permission checks just like the root user. Not
> sure
> > > if there is any advantage to running with uid 1000 when you still have
> > > these Linux capabilities.
> > >
> > > Stefan
> >
> > When you run in a user namespace, virtiofsd would only have
> > setuid/setgid over the range of UIDs mapped into the user namespace.  So
> > if UID=0 on the host is not mapped, then the container can not create
> > real UID=0 files on disk.
> >
> > Similarly you can protect the user directories and any content by
> > running the containers in a really high UID Mapping.
>
> Roman, do user namespaces address your concerns about uid 0 in
> containers?
>

They may eventually solve it. I would not let us hang up on this right now,
since as said at least in kubevirt we can't get rid right now of root
anyway.
Even if it is at some point in the future save and supported on
bleeding-edge managed k8s clusters to allow ordinary users to run with uid
0, from my perspective it is right now common to restrict namespaces with
PodSecurityPolicies or SecurityContexts to not allow running pods as root
for normal users.
It is also common that a significant part of the community users run docker
and/or run on managed k8s clusters where they can not influence if
user-namespaces are enabled, if they can run pods as root, if the runtime
points to a seccomp file they like or if the runtime they prefer is used.

But let me repeat again that we require root right now anyway and that we
don't run the pods right now with the user privileges (but we should and we
aim for that). Right now PSPs and SCCs restrict access to these pods by the
users.
So for our use case, at this exact moment root is acceptable, the unshare
call is a little bit more problematic.

Best Regards,
Roman



>
> Stefan
>
Stefan Hajnoczi July 31, 2020, 2:11 p.m. UTC | #16
On Fri, Jul 31, 2020 at 10:39:37AM +0200, Roman Mohr wrote:
> On Fri, Jul 31, 2020 at 10:26 AM Stefan Hajnoczi <stefanha@redhat.com>
> wrote:
> 
> > On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote:
> > > On 7/29/20 10:40, Stefan Hajnoczi wrote:
> > > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote:
> > > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com>
> > wrote:
> > > >>
> > > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote:
> > > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com <
> > > >>>> misono.tomohiro@fujitsu.com> wrote:
> > > >>>>
> > > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and
> > print
> > > >>> an
> > > >>>>> error
> > > >> Yes they can run as root. I can tell you what we plan to do with the
> > > >> containerized virtiofsd: We run it as part of the user-owned pod (a
> > set of
> > > >> containers).
> > > >> One of our main goals at the moment is to run VMs in a user-owned pod
> > > >> without additional privileges.
> > > >> So that in case the user (VM-creator/owner) enters the pod or
> > something
> > > >> breaks out of the VM they are just in the unprivileged container
> > sandbox.
> > > >> As part of that we try to get also rid of running containers in the
> > > >> user-context with the root user.
> > > >>
> > > >> One possible scenario which I could think of as being desirable from a
> > > >> kubevirt perspective:
> > > >> We would run the VM in one container and have an unprivileged
> > > >> virtiofsd container in parallel.
> > > >> This container already has its own mount namespace and it is not that
> > > >> critical if something manages to enter this sandbox.
> > > >>
> > > >> But we are not as far yet as getting completely rid of root right now
> > in
> > > >> kubevirt, so if as a temporary step it needs root, the current
> > proposed
> > > >> changes would still be very useful for us.
> > > > What is the issue with root in user namespaces?
> > > >
> > > > I remember a few years ago it was seen as a major security issue but
> > > > don't remember if container runtimes were already using user namespaces
> > > > back then.
> > > >
> > > > I guess the goal might be simply to minimize Linux capabilities as much
> > > > as possible?
> > > >
> > > > virtiofsd could nominally run with an arbitrary uid/gid but it still
> > > > needs the Linux capabilities that allow it to change uid/gid and
> > > > override file system permission checks just like the root user. Not
> > sure
> > > > if there is any advantage to running with uid 1000 when you still have
> > > > these Linux capabilities.
> > > >
> > > > Stefan
> > >
> > > When you run in a user namespace, virtiofsd would only have
> > > setuid/setgid over the range of UIDs mapped into the user namespace.  So
> > > if UID=0 on the host is not mapped, then the container can not create
> > > real UID=0 files on disk.
> > >
> > > Similarly you can protect the user directories and any content by
> > > running the containers in a really high UID Mapping.
> >
> > Roman, do user namespaces address your concerns about uid 0 in
> > containers?
> >
> 
> They may eventually solve it. I would not let us hang up on this right now,
> since as said at least in kubevirt we can't get rid right now of root
> anyway.
> Even if it is at some point in the future save and supported on
> bleeding-edge managed k8s clusters to allow ordinary users to run with uid
> 0, from my perspective it is right now common to restrict namespaces with
> PodSecurityPolicies or SecurityContexts to not allow running pods as root
> for normal users.
> It is also common that a significant part of the community users run docker
> and/or run on managed k8s clusters where they can not influence if
> user-namespaces are enabled, if they can run pods as root, if the runtime
> points to a seccomp file they like or if the runtime they prefer is used.
> 
> But let me repeat again that we require root right now anyway and that we
> don't run the pods right now with the user privileges (but we should and we
> aim for that). Right now PSPs and SCCs restrict access to these pods by the
> users.
> So for our use case, at this exact moment root is acceptable, the unshare
> call is a little bit more problematic.

Okay, thanks for explaining.

Stefan
Dr. David Alan Gilbert Aug. 7, 2020, 3:29 p.m. UTC | #17
* misono.tomohiro@fujitsu.com (misono.tomohiro@fujitsu.com) wrote:
> > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error
> > 
> > An assertion failure is raised during request processing if
> > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can
> > be detected right away.
> > 
> > Unfortunately Docker/Moby does not include unshare in the seccomp.json
> > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always
> > include unshare (e.g. podman is unaffected):
> > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json
> > 
> > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the
> > default seccomp.json is missing unshare.
> 
> Hi, sorry for a bit late.
> 
> unshare() was added to fix xattr problem: 
>   https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc#
> In theory we don't need to call unshare if xattr is disabled, but it is hard to get to know
> if xattr is enabled or disabled in fv_queue_worker(), right?
> 
> So, it looks good to me.
> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

OK, I think it might also be OK to just fail the xattr operation on a
non-file/directory in this case.

Dave

> Regards,
> Misono
> 
> > 
> > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > index 3b6d16a041..9e5537506c 100644
> > --- a/tools/virtiofsd/fuse_virtio.c
> > +++ b/tools/virtiofsd/fuse_virtio.c
> > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se)
> >  {
> >      int ret;
> > 
> > +    /*
> > +     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
> > +     * an unprivileged system call but some Docker/Moby versions are known to
> > +     * reject it via seccomp when CAP_SYS_ADMIN is not given.
> > +     *
> > +     * Note that the program is single-threaded here so this syscall has no
> > +     * visible effect and is safe to make.
> > +     */
> > +    ret = unshare(CLONE_FS);
> > +    if (ret == -1 && errno == EPERM) {
> > +        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
> > +                "running in a container please check that the container "
> > +                "runtime seccomp policy allows unshare.\n");
> > +        return -1;
> > +    }
> > +
> >      ret = fv_create_listen_socket(se);
> >      if (ret < 0) {
> >          return ret;
> > --
> > 2.26.2
>
diff mbox series

Patch

diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 3b6d16a041..9e5537506c 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -949,6 +949,22 @@  int virtio_session_mount(struct fuse_session *se)
 {
     int ret;
 
+    /*
+     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's
+     * an unprivileged system call but some Docker/Moby versions are known to
+     * reject it via seccomp when CAP_SYS_ADMIN is not given.
+     *
+     * Note that the program is single-threaded here so this syscall has no
+     * visible effect and is safe to make.
+     */
+    ret = unshare(CLONE_FS);
+    if (ret == -1 && errno == EPERM) {
+        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
+                "running in a container please check that the container "
+                "runtime seccomp policy allows unshare.\n");
+        return -1;
+    }
+
     ret = fv_create_listen_socket(se);
     if (ret < 0) {
         return ret;