Message ID | 20200727190223.422280-4-stefanha@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | virtiofsd: allow virtiofsd to run in a container | expand |
> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error > > An assertion failure is raised during request processing if > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > be detected right away. > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > include unshare (e.g. podman is unaffected): > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > default seccomp.json is missing unshare. Hi, sorry for a bit late. unshare() was added to fix xattr problem: https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# In theory we don't need to call unshare if xattr is disabled, but it is hard to get to know if xattr is enabled or disabled in fv_queue_worker(), right? So, it looks good to me. Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Regards, Misono > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c > index 3b6d16a041..9e5537506c 100644 > --- a/tools/virtiofsd/fuse_virtio.c > +++ b/tools/virtiofsd/fuse_virtio.c > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > { > int ret; > > + /* > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's > + * an unprivileged system call but some Docker/Moby versions are known to > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > + * > + * Note that the program is single-threaded here so this syscall has no > + * visible effect and is safe to make. > + */ > + ret = unshare(CLONE_FS); > + if (ret == -1 && errno == EPERM) { > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If " > + "running in a container please check that the container " > + "runtime seccomp policy allows unshare.\n"); > + return -1; > + } > + > ret = fv_create_listen_socket(se); > if (ret < 0) { > return ret; > -- > 2.26.2
On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < misono.tomohiro@fujitsu.com> wrote: > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > error > > > > An assertion failure is raised during request processing if > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > be detected right away. > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > include unshare (e.g. podman is unaffected): > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > default seccomp.json is missing unshare. > > Hi, sorry for a bit late. > > unshare() was added to fix xattr problem: > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > In theory we don't need to call unshare if xattr is disabled, but it is > hard to get to know > if xattr is enabled or disabled in fv_queue_worker(), right? > > In kubevirt we want to run virtiofsd in containers. We would already not have xattr support for e.g. overlayfs in the VM after this patch series (an acceptable con at least for us right now). If we can get rid of the unshare (and potentially of needing root) that would be great. We always assume that everything which we run in containers should work for cri-o and docker. "Just" pointing docker to a different seccomp.json file is something which k8s users/admin in many cases can't do. Best Regards, Roman > So, it looks good to me. > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > Regards, > Misono > > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > > 1 file changed, 16 insertions(+) > > > > diff --git a/tools/virtiofsd/fuse_virtio.c > b/tools/virtiofsd/fuse_virtio.c > > index 3b6d16a041..9e5537506c 100644 > > --- a/tools/virtiofsd/fuse_virtio.c > > +++ b/tools/virtiofsd/fuse_virtio.c > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > > { > > int ret; > > > > + /* > > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need > it. It's > > + * an unprivileged system call but some Docker/Moby versions are > known to > > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > > + * > > + * Note that the program is single-threaded here so this syscall > has no > > + * visible effect and is safe to make. > > + */ > > + ret = unshare(CLONE_FS); > > + if (ret == -1 && errno == EPERM) { > > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If > " > > + "running in a container please check that the container > " > > + "runtime seccomp policy allows unshare.\n"); > > + return -1; > > + } > > + > > ret = fv_create_listen_socket(se); > > if (ret < 0) { > > return ret; > > -- > > 2.26.2 > >
On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > misono.tomohiro@fujitsu.com> wrote: > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > > error > > > > > > An assertion failure is raised during request processing if > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > > be detected right away. > > > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > include unshare (e.g. podman is unaffected): > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > > default seccomp.json is missing unshare. > > > > Hi, sorry for a bit late. > > > > unshare() was added to fix xattr problem: > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > In theory we don't need to call unshare if xattr is disabled, but it is > > hard to get to know > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > have xattr support for e.g. overlayfs in the VM after this patch series (an > acceptable con at least for us right now). > If we can get rid of the unshare (and potentially of needing root) that > would be great. We always assume that everything which we run in containers > should work for cri-o and docker. But cri-o and docker containers run as root, isn't it? (or atleast have the capability to run as root). Havind said that, it will be nice to be able to run virtiofsd without root. There are few hurdles though. - For file creation, we switch uid/gid (seteuid/setegid) and that seems to require root. If we were to run unpriviliged, probably all files on host will have to be owned by unpriviliged user and guest visible uid/gid will have to be stored in xattrs. I think virtfs supports something similar. I am sure there are other restrictions but this probably is the biggest one to overcome. > > "Just" pointing docker to a different seccomp.json file is something which > k8s users/admin in many cases can't do. Or may be issue is that standard seccomp.json does not allow unshare() and hence you are forced to use a non-standar seccomp.json. Vivek > > Best Regards, > Roman > > > > So, it looks good to me. > > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > > > Regards, > > Misono > > > > > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > --- > > > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > > > 1 file changed, 16 insertions(+) > > > > > > diff --git a/tools/virtiofsd/fuse_virtio.c > > b/tools/virtiofsd/fuse_virtio.c > > > index 3b6d16a041..9e5537506c 100644 > > > --- a/tools/virtiofsd/fuse_virtio.c > > > +++ b/tools/virtiofsd/fuse_virtio.c > > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > > > { > > > int ret; > > > > > > + /* > > > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need > > it. It's > > > + * an unprivileged system call but some Docker/Moby versions are > > known to > > > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > > > + * > > > + * Note that the program is single-threaded here so this syscall > > has no > > > + * visible effect and is safe to make. > > > + */ > > > + ret = unshare(CLONE_FS); > > > + if (ret == -1 && errno == EPERM) { > > > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If > > " > > > + "running in a container please check that the container > > " > > > + "runtime seccomp policy allows unshare.\n"); > > > + return -1; > > > + } > > > + > > > ret = fv_create_listen_socket(se); > > > if (ret < 0) { > > > return ret; > > > -- > > > 2.26.2 > > > >
On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > misono.tomohiro@fujitsu.com> wrote: > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > > error > > > > > > An assertion failure is raised during request processing if > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > > be detected right away. > > > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > include unshare (e.g. podman is unaffected): > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > > default seccomp.json is missing unshare. > > > > Hi, sorry for a bit late. > > > > unshare() was added to fix xattr problem: > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > In theory we don't need to call unshare if xattr is disabled, but it is > > hard to get to know > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > have xattr support for e.g. overlayfs in the VM after this patch series (an > acceptable con at least for us right now). > If we can get rid of the unshare (and potentially of needing root) that > would be great. We always assume that everything which we run in containers > should work for cri-o and docker. Root is required to access files with any uid/gid. Dave Gilbert is working on xattr support without CAP_SYS_ADMIN. He may be able to find a way to drop unshare (at least in containers). > "Just" pointing docker to a different seccomp.json file is something which > k8s users/admin in many cases can't do. There is a Moby PR to change the default seccomp.json file here but it's unclear if it will be merged: https://github.com/moby/moby/pull/41244 Stefan
On Tue, Jul 28, 2020 at 09:12:50AM -0400, Vivek Goyal wrote: > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > misono.tomohiro@fujitsu.com> wrote: > > > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > > > error > > > > > > > > An assertion failure is raised during request processing if > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > > > be detected right away. > > > > > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > > include unshare (e.g. podman is unaffected): > > > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > > > default seccomp.json is missing unshare. > > > > > > Hi, sorry for a bit late. > > > > > > unshare() was added to fix xattr problem: > > > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > > In theory we don't need to call unshare if xattr is disabled, but it is > > > hard to get to know > > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > > have xattr support for e.g. overlayfs in the VM after this patch series (an > > acceptable con at least for us right now). > > If we can get rid of the unshare (and potentially of needing root) that > > would be great. We always assume that everything which we run in containers > > should work for cri-o and docker. > > But cri-o and docker containers run as root, isn't it? (or atleast have > the capability to run as root). Havind said that, it will be nice to be able > to run virtiofsd without root. > > There are few hurdles though. > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > to require root. If we were to run unpriviliged, probably all files > on host will have to be owned by unpriviliged user and guest visible > uid/gid will have to be stored in xattrs. I think virtfs supports > something similar. I think I've mentioned before, 9p virtfs supports different modes, passthrough, squashed or remapped. passthrough should be reasonably straightforward to support in virtiofs. The guest sees all the host UID/GIDs ownership as normal, and can read any files the host user can read, but are obviously restricted to write to only the files that host user can write too. No DAC-OVERRIDE facility in essence. You'll just get EPERM, which is fine. This simple passthrough scenario would be just what's desired for a typical desktop virt use cases, where you want to share part/all of your home dir with a guest for easy file access. Personally this is the mode I'd be most interested in seeing provided for unprivileged virtiofsd usage. squash is similar to passthrough, except the guest sees everything as owned by the same user. This can be surprising as the guest might see a file owned by them, but not be able to write to it, as on the host its actually owned by some other user. Fairly niche use case I think. remapping would be needed for a more general purpose use cases allowing the guest to do arbitrary UID/GID changes, but on the host everything is still stored as one user and remapped somehow. The main challenge for all the unprivileged scenarios is safety of the sandbox, to avoid risk of guests escaping to access files outside of the exported dir via symlink attacks or similar. Regards, Daniel
On 7/28/20 09:12, Vivek Goyal wrote: > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < >> misono.tomohiro@fujitsu.com> wrote: >> >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an >>> error >>>> An assertion failure is raised during request processing if >>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can >>>> be detected right away. >>>> >>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json >>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always >>>> include unshare (e.g. podman is unaffected): >>>> >>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json >>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the >>>> default seccomp.json is missing unshare. >>> Hi, sorry for a bit late. >>> >>> unshare() was added to fix xattr problem: >>> >>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# >>> In theory we don't need to call unshare if xattr is disabled, but it is >>> hard to get to know >>> if xattr is enabled or disabled in fv_queue_worker(), right? >>> >>> >> In kubevirt we want to run virtiofsd in containers. We would already not >> have xattr support for e.g. overlayfs in the VM after this patch series (an >> acceptable con at least for us right now). >> If we can get rid of the unshare (and potentially of needing root) that >> would be great. We always assume that everything which we run in containers >> should work for cri-o and docker. > But cri-o and docker containers run as root, isn't it? (or atleast have > the capability to run as root). Havind said that, it will be nice to be able > to run virtiofsd without root. > > There are few hurdles though. > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > to require root. If we were to run unpriviliged, probably all files > on host will have to be owned by unpriviliged user and guest visible > uid/gid will have to be stored in xattrs. I think virtfs supports > something similar. > > I am sure there are other restrictions but this probably is the biggest > one to overcome. > > > You should be able to run it within a user namespace with Namespaces capabilities. >> "Just" pointing docker to a different seccomp.json file is something which >> k8s users/admin in many cases can't do. > Or may be issue is that standard seccomp.json does not allow unshare() > and hence you are forced to use a non-standar seccomp.json. > > Vivek > >> Best Regards, >> Roman >> >> >>> So, it looks good to me. >>> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> >>> >>> Regards, >>> Misono >>> >>>> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> >>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> >>>> --- >>>> tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ >>>> 1 file changed, 16 insertions(+) >>>> >>>> diff --git a/tools/virtiofsd/fuse_virtio.c >>> b/tools/virtiofsd/fuse_virtio.c >>>> index 3b6d16a041..9e5537506c 100644 >>>> --- a/tools/virtiofsd/fuse_virtio.c >>>> +++ b/tools/virtiofsd/fuse_virtio.c >>>> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) >>>> { >>>> int ret; >>>> >>>> + /* >>>> + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need >>> it. It's >>>> + * an unprivileged system call but some Docker/Moby versions are >>> known to >>>> + * reject it via seccomp when CAP_SYS_ADMIN is not given. >>>> + * >>>> + * Note that the program is single-threaded here so this syscall >>> has no >>>> + * visible effect and is safe to make. >>>> + */ >>>> + ret = unshare(CLONE_FS); >>>> + if (ret == -1 && errno == EPERM) { >>>> + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If >>> " >>>> + "running in a container please check that the container >>> " >>>> + "runtime seccomp policy allows unshare.\n"); >>>> + return -1; >>>> + } >>>> + >>>> ret = fv_create_listen_socket(se); >>>> if (ret < 0) { >>>> return ret; >>>> -- >>>> 2.26.2 >>>
On 7/28/20 11:32, Stefan Hajnoczi wrote: > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < >> misono.tomohiro@fujitsu.com> wrote: >> >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an >>> error >>>> An assertion failure is raised during request processing if >>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can >>>> be detected right away. >>>> >>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json >>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always >>>> include unshare (e.g. podman is unaffected): >>>> >>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json >>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the >>>> default seccomp.json is missing unshare. >>> Hi, sorry for a bit late. >>> >>> unshare() was added to fix xattr problem: >>> >>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# >>> In theory we don't need to call unshare if xattr is disabled, but it is >>> hard to get to know >>> if xattr is enabled or disabled in fv_queue_worker(), right? >>> >>> >> In kubevirt we want to run virtiofsd in containers. We would already not >> have xattr support for e.g. overlayfs in the VM after this patch series (an >> acceptable con at least for us right now). >> If we can get rid of the unshare (and potentially of needing root) that >> would be great. We always assume that everything which we run in containers >> should work for cri-o and docker. > Root is required to access files with any uid/gid. > > Dave Gilbert is working on xattr support without CAP_SYS_ADMIN. He may > be able to find a way to drop unshare (at least in containers). > >> "Just" pointing docker to a different seccomp.json file is something which >> k8s users/admin in many cases can't do. > There is a Moby PR to change the default seccomp.json file here but it's > unclear if it will be merged: > https://github.com/moby/moby/pull/41244 > > Stefan Why not try Podman?
On Tue, Jul 28, 2020 at 04:52:33PM +0100, Daniel P. Berrangé wrote: > On Tue, Jul 28, 2020 at 09:12:50AM -0400, Vivek Goyal wrote: > > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > > misono.tomohiro@fujitsu.com> wrote: > > > > > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > > > > error > > > > > > > > > > An assertion failure is raised during request processing if > > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > > > > be detected right away. > > > > > > > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > > > include unshare (e.g. podman is unaffected): > > > > > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > > > > default seccomp.json is missing unshare. > > > > > > > > Hi, sorry for a bit late. > > > > > > > > unshare() was added to fix xattr problem: > > > > > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > > > In theory we don't need to call unshare if xattr is disabled, but it is > > > > hard to get to know > > > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > > > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > > > have xattr support for e.g. overlayfs in the VM after this patch series (an > > > acceptable con at least for us right now). > > > If we can get rid of the unshare (and potentially of needing root) that > > > would be great. We always assume that everything which we run in containers > > > should work for cri-o and docker. > > > > But cri-o and docker containers run as root, isn't it? (or atleast have > > the capability to run as root). Havind said that, it will be nice to be able > > to run virtiofsd without root. > > > > There are few hurdles though. > > > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > > to require root. If we were to run unpriviliged, probably all files > > on host will have to be owned by unpriviliged user and guest visible > > uid/gid will have to be stored in xattrs. I think virtfs supports > > something similar. > > I think I've mentioned before, 9p virtfs supports different modes, > passthrough, squashed or remapped. > > passthrough should be reasonably straightforward to support in virtiofs. > The guest sees all the host UID/GIDs ownership as normal, and can read > any files the host user can read, but are obviously restricted to write > to only the files that host user can write too. No DAC-OVERRIDE facility > in essence. You'll just get EPERM, which is fine. This simple passthrough > scenario would be just what's desired for a typical desktop virt use > cases, where you want to share part/all of your home dir with a guest for > easy file access. Personally this is the mode I'd be most interested in > seeing provided for unprivileged virtiofsd usage. Interesting. So passthrough will have two sub modes. priviliged and unpriviliged. As of now we support priviliged passthrough. I guess it does make sense to look into unpriviliged passthrough and see what other operations will not be allowed. Thanks Vivek > > squash is similar to passthrough, except the guest sees everything > as owned by the same user. This can be surprising as the guest might > see a file owned by them, but not be able to write to it, as on the > host its actually owned by some other user. Fairly niche use case > I think. > > remapping would be needed for a more general purpose use cases > allowing the guest to do arbitrary UID/GID changes, but on the host > everything is still stored as one user and remapped somehow. > > The main challenge for all the unprivileged scenarios is safety of > the sandbox, to avoid risk of guests escaping to access files outside > of the exported dir via symlink attacks or similar. > > > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, Jul 28, 2020 at 03:12:54PM -0400, Daniel Walsh wrote: > On 7/28/20 09:12, Vivek Goyal wrote: > > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > >> misono.tomohiro@fujitsu.com> wrote: > >> > >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > >>> error > >>>> An assertion failure is raised during request processing if > >>>> unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > >>>> be detected right away. > >>>> > >>>> Unfortunately Docker/Moby does not include unshare in the seccomp.json > >>>> list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > >>>> include unshare (e.g. podman is unaffected): > >>>> > >>> https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > >>>> Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > >>>> default seccomp.json is missing unshare. > >>> Hi, sorry for a bit late. > >>> > >>> unshare() was added to fix xattr problem: > >>> > >>> https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > >>> In theory we don't need to call unshare if xattr is disabled, but it is > >>> hard to get to know > >>> if xattr is enabled or disabled in fv_queue_worker(), right? > >>> > >>> > >> In kubevirt we want to run virtiofsd in containers. We would already not > >> have xattr support for e.g. overlayfs in the VM after this patch series (an > >> acceptable con at least for us right now). > >> If we can get rid of the unshare (and potentially of needing root) that > >> would be great. We always assume that everything which we run in containers > >> should work for cri-o and docker. > > But cri-o and docker containers run as root, isn't it? (or atleast have > > the capability to run as root). Havind said that, it will be nice to be able > > to run virtiofsd without root. > > > > There are few hurdles though. > > > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > > to require root. If we were to run unpriviliged, probably all files > > on host will have to be owned by unpriviliged user and guest visible > > uid/gid will have to be stored in xattrs. I think virtfs supports > > something similar. > > > > I am sure there are other restrictions but this probably is the biggest > > one to overcome. > > > > > > You should be able to run it within a user namespace with Namespaces > capabilities. User namespaces has always been a one TODO item. Few challenges are how to manage uid/gid mapping for existing directories which will be shared. We will have to pick a mapping range and then chown the files accordingly. And chowning itself will require priviliges. Now Stefan is adding support to run virtiofsd inside container. So podman should be able virtiofsd inside a user namespace (And even give CAP_SYS_ADMIN). That way we don't have to worry about setting a user namespace and select a mapping etc. But persistent data volumes will continue to be an issue with user namespace, right? How are you dealing with it in podman. Containers running in user namespace and with volume mounts. Vivek > >> "Just" pointing docker to a different seccomp.json file is something which > >> k8s users/admin in many cases can't do. > > Or may be issue is that standard seccomp.json does not allow unshare() > > and hence you are forced to use a non-standar seccomp.json. > > > > Vivek > > > >> Best Regards, > >> Roman > >> > >> > >>> So, it looks good to me. > >>> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > >>> > >>> Regards, > >>> Misono > >>> > >>>> Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > >>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>> --- > >>>> tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > >>>> 1 file changed, 16 insertions(+) > >>>> > >>>> diff --git a/tools/virtiofsd/fuse_virtio.c > >>> b/tools/virtiofsd/fuse_virtio.c > >>>> index 3b6d16a041..9e5537506c 100644 > >>>> --- a/tools/virtiofsd/fuse_virtio.c > >>>> +++ b/tools/virtiofsd/fuse_virtio.c > >>>> @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > >>>> { > >>>> int ret; > >>>> > >>>> + /* > >>>> + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need > >>> it. It's > >>>> + * an unprivileged system call but some Docker/Moby versions are > >>> known to > >>>> + * reject it via seccomp when CAP_SYS_ADMIN is not given. > >>>> + * > >>>> + * Note that the program is single-threaded here so this syscall > >>> has no > >>>> + * visible effect and is safe to make. > >>>> + */ > >>>> + ret = unshare(CLONE_FS); > >>>> + if (ret == -1 && errno == EPERM) { > >>>> + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If > >>> " > >>>> + "running in a container please check that the container > >>> " > >>>> + "runtime seccomp policy allows unshare.\n"); > >>>> + return -1; > >>>> + } > >>>> + > >>>> ret = fv_create_listen_socket(se); > >>>> if (ret < 0) { > >>>> return ret; > >>>> -- > >>>> 2.26.2 > >>> >
On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > misono.tomohiro@fujitsu.com> wrote: > > > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print > an > > > error > > > > > > > > An assertion failure is raised during request processing if > > > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem > can > > > > be detected right away. > > > > > > > > Unfortunately Docker/Moby does not include unshare in the > seccomp.json > > > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > > > include unshare (e.g. podman is unaffected): > > > > > > > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if > the > > > > default seccomp.json is missing unshare. > > > > > > Hi, sorry for a bit late. > > > > > > unshare() was added to fix xattr problem: > > > > > > > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > > > In theory we don't need to call unshare if xattr is disabled, but it is > > > hard to get to know > > > if xattr is enabled or disabled in fv_queue_worker(), right? > > > > > > > > In kubevirt we want to run virtiofsd in containers. We would already not > > have xattr support for e.g. overlayfs in the VM after this patch series > (an > > acceptable con at least for us right now). > > If we can get rid of the unshare (and potentially of needing root) that > > would be great. We always assume that everything which we run in > containers > > should work for cri-o and docker. > > But cri-o and docker containers run as root, isn't it? (or atleast have > the capability to run as root). Havind said that, it will be nice to be > able > to run virtiofsd without root. > Yes they can run as root. I can tell you what we plan to do with the containerized virtiofsd: We run it as part of the user-owned pod (a set of containers). One of our main goals at the moment is to run VMs in a user-owned pod without additional privileges. So that in case the user (VM-creator/owner) enters the pod or something breaks out of the VM they are just in the unprivileged container sandbox. As part of that we try to get also rid of running containers in the user-context with the root user. One possible scenario which I could think of as being desirable from a kubevirt perspective: We would run the VM in one container and have an unprivileged virtiofsd container in parallel. This container already has its own mount namespace and it is not that critical if something manages to enter this sandbox. But we are not as far yet as getting completely rid of root right now in kubevirt, so if as a temporary step it needs root, the current proposed changes would still be very useful for us. Best Regards, Roman There are few hurdles though. > > - For file creation, we switch uid/gid (seteuid/setegid) and that seems > to require root. If we were to run unpriviliged, probably all files > on host will have to be owned by unpriviliged user and guest visible > uid/gid will have to be stored in xattrs. I think virtfs supports > something similar. > I am sure there are other restrictions but this probably is the biggest > one to overcome. > > > > > "Just" pointing docker to a different seccomp.json file is something > which > > k8s users/admin in many cases can't do. > > Or may be issue is that standard seccomp.json does not allow unshare() > and hence you are forced to use a non-standar seccomp.json. > > Vivek > > > > > Best Regards, > > Roman > > > > > > > So, it looks good to me. > > > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > > > > > Regards, > > > Misono > > > > > > > > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > --- > > > > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > > > > 1 file changed, 16 insertions(+) > > > > > > > > diff --git a/tools/virtiofsd/fuse_virtio.c > > > b/tools/virtiofsd/fuse_virtio.c > > > > index 3b6d16a041..9e5537506c 100644 > > > > --- a/tools/virtiofsd/fuse_virtio.c > > > > +++ b/tools/virtiofsd/fuse_virtio.c > > > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session > *se) > > > > { > > > > int ret; > > > > > > > > + /* > > > > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will > need > > > it. It's > > > > + * an unprivileged system call but some Docker/Moby versions are > > > known to > > > > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > > > > + * > > > > + * Note that the program is single-threaded here so this syscall > > > has no > > > > + * visible effect and is safe to make. > > > > + */ > > > > + ret = unshare(CLONE_FS); > > > > + if (ret == -1 && errno == EPERM) { > > > > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with > EPERM. If > > > " > > > > + "running in a container please check that the > container > > > " > > > > + "runtime seccomp policy allows unshare.\n"); > > > > + return -1; > > > > + } > > > > + > > > > ret = fv_create_listen_socket(se); > > > > if (ret < 0) { > > > > return ret; > > > > -- > > > > 2.26.2 > > > > > > > >
On Tue, Jul 28, 2020 at 03:15:25PM -0400, Daniel Walsh wrote: > On 7/28/20 11:32, Stefan Hajnoczi wrote: > > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > >> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > >> misono.tomohiro@fujitsu.com> wrote: > >> > >>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an > >>> error > >> "Just" pointing docker to a different seccomp.json file is something which > >> k8s users/admin in many cases can't do. > > There is a Moby PR to change the default seccomp.json file here but it's > > unclear if it will be merged: > > https://github.com/moby/moby/pull/41244 > > > > Stefan > > Why not try Podman? Absolutely, Podman allows unshare(2) in its default seccomp policy so it does not have this problem. I think Roman's point was mainly about the upstream user experience where Docker is common. Stefan
On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: > On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > > On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > > misono.tomohiro@fujitsu.com> wrote: > > > > > > > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print > > an > > > > error > Yes they can run as root. I can tell you what we plan to do with the > containerized virtiofsd: We run it as part of the user-owned pod (a set of > containers). > One of our main goals at the moment is to run VMs in a user-owned pod > without additional privileges. > So that in case the user (VM-creator/owner) enters the pod or something > breaks out of the VM they are just in the unprivileged container sandbox. > As part of that we try to get also rid of running containers in the > user-context with the root user. > > One possible scenario which I could think of as being desirable from a > kubevirt perspective: > We would run the VM in one container and have an unprivileged > virtiofsd container in parallel. > This container already has its own mount namespace and it is not that > critical if something manages to enter this sandbox. > > But we are not as far yet as getting completely rid of root right now in > kubevirt, so if as a temporary step it needs root, the current proposed > changes would still be very useful for us. What is the issue with root in user namespaces? I remember a few years ago it was seen as a major security issue but don't remember if container runtimes were already using user namespaces back then. I guess the goal might be simply to minimize Linux capabilities as much as possible? virtiofsd could nominally run with an arbitrary uid/gid but it still needs the Linux capabilities that allow it to change uid/gid and override file system permission checks just like the root user. Not sure if there is any advantage to running with uid 1000 when you still have these Linux capabilities. Stefan
On 7/29/20 10:40, Stefan Hajnoczi wrote: > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: >> >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < >>>> misono.tomohiro@fujitsu.com> wrote: >>>> >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print >>> an >>>>> error >> Yes they can run as root. I can tell you what we plan to do with the >> containerized virtiofsd: We run it as part of the user-owned pod (a set of >> containers). >> One of our main goals at the moment is to run VMs in a user-owned pod >> without additional privileges. >> So that in case the user (VM-creator/owner) enters the pod or something >> breaks out of the VM they are just in the unprivileged container sandbox. >> As part of that we try to get also rid of running containers in the >> user-context with the root user. >> >> One possible scenario which I could think of as being desirable from a >> kubevirt perspective: >> We would run the VM in one container and have an unprivileged >> virtiofsd container in parallel. >> This container already has its own mount namespace and it is not that >> critical if something manages to enter this sandbox. >> >> But we are not as far yet as getting completely rid of root right now in >> kubevirt, so if as a temporary step it needs root, the current proposed >> changes would still be very useful for us. > What is the issue with root in user namespaces? > > I remember a few years ago it was seen as a major security issue but > don't remember if container runtimes were already using user namespaces > back then. > > I guess the goal might be simply to minimize Linux capabilities as much > as possible? > > virtiofsd could nominally run with an arbitrary uid/gid but it still > needs the Linux capabilities that allow it to change uid/gid and > override file system permission checks just like the root user. Not sure > if there is any advantage to running with uid 1000 when you still have > these Linux capabilities. > > Stefan When you run in a user namespace, virtiofsd would only have setuid/setgid over the range of UIDs mapped into the user namespace. So if UID=0 on the host is not mapped, then the container can not create real UID=0 files on disk. Similarly you can protect the user directories and any content by running the containers in a really high UID Mapping.
On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote: > On 7/29/20 10:40, Stefan Hajnoczi wrote: > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > >> > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > >>>> misono.tomohiro@fujitsu.com> wrote: > >>>> > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print > >>> an > >>>>> error > >> Yes they can run as root. I can tell you what we plan to do with the > >> containerized virtiofsd: We run it as part of the user-owned pod (a set of > >> containers). > >> One of our main goals at the moment is to run VMs in a user-owned pod > >> without additional privileges. > >> So that in case the user (VM-creator/owner) enters the pod or something > >> breaks out of the VM they are just in the unprivileged container sandbox. > >> As part of that we try to get also rid of running containers in the > >> user-context with the root user. > >> > >> One possible scenario which I could think of as being desirable from a > >> kubevirt perspective: > >> We would run the VM in one container and have an unprivileged > >> virtiofsd container in parallel. > >> This container already has its own mount namespace and it is not that > >> critical if something manages to enter this sandbox. > >> > >> But we are not as far yet as getting completely rid of root right now in > >> kubevirt, so if as a temporary step it needs root, the current proposed > >> changes would still be very useful for us. > > What is the issue with root in user namespaces? > > > > I remember a few years ago it was seen as a major security issue but > > don't remember if container runtimes were already using user namespaces > > back then. > > > > I guess the goal might be simply to minimize Linux capabilities as much > > as possible? > > > > virtiofsd could nominally run with an arbitrary uid/gid but it still > > needs the Linux capabilities that allow it to change uid/gid and > > override file system permission checks just like the root user. Not sure > > if there is any advantage to running with uid 1000 when you still have > > these Linux capabilities. > > > > Stefan > > When you run in a user namespace, virtiofsd would only have > setuid/setgid over the range of UIDs mapped into the user namespace. So > if UID=0 on the host is not mapped, then the container can not create > real UID=0 files on disk. > > Similarly you can protect the user directories and any content by > running the containers in a really high UID Mapping. Roman, do user namespaces address your concerns about uid 0 in containers? Stefan
On Fri, Jul 31, 2020 at 10:26 AM Stefan Hajnoczi <stefanha@redhat.com> wrote: > On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote: > > On 7/29/20 10:40, Stefan Hajnoczi wrote: > > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: > > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> > wrote: > > >> > > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > >>>> misono.tomohiro@fujitsu.com> wrote: > > >>>> > > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and > print > > >>> an > > >>>>> error > > >> Yes they can run as root. I can tell you what we plan to do with the > > >> containerized virtiofsd: We run it as part of the user-owned pod (a > set of > > >> containers). > > >> One of our main goals at the moment is to run VMs in a user-owned pod > > >> without additional privileges. > > >> So that in case the user (VM-creator/owner) enters the pod or > something > > >> breaks out of the VM they are just in the unprivileged container > sandbox. > > >> As part of that we try to get also rid of running containers in the > > >> user-context with the root user. > > >> > > >> One possible scenario which I could think of as being desirable from a > > >> kubevirt perspective: > > >> We would run the VM in one container and have an unprivileged > > >> virtiofsd container in parallel. > > >> This container already has its own mount namespace and it is not that > > >> critical if something manages to enter this sandbox. > > >> > > >> But we are not as far yet as getting completely rid of root right now > in > > >> kubevirt, so if as a temporary step it needs root, the current > proposed > > >> changes would still be very useful for us. > > > What is the issue with root in user namespaces? > > > > > > I remember a few years ago it was seen as a major security issue but > > > don't remember if container runtimes were already using user namespaces > > > back then. > > > > > > I guess the goal might be simply to minimize Linux capabilities as much > > > as possible? > > > > > > virtiofsd could nominally run with an arbitrary uid/gid but it still > > > needs the Linux capabilities that allow it to change uid/gid and > > > override file system permission checks just like the root user. Not > sure > > > if there is any advantage to running with uid 1000 when you still have > > > these Linux capabilities. > > > > > > Stefan > > > > When you run in a user namespace, virtiofsd would only have > > setuid/setgid over the range of UIDs mapped into the user namespace. So > > if UID=0 on the host is not mapped, then the container can not create > > real UID=0 files on disk. > > > > Similarly you can protect the user directories and any content by > > running the containers in a really high UID Mapping. > > Roman, do user namespaces address your concerns about uid 0 in > containers? > They may eventually solve it. I would not let us hang up on this right now, since as said at least in kubevirt we can't get rid right now of root anyway. Even if it is at some point in the future save and supported on bleeding-edge managed k8s clusters to allow ordinary users to run with uid 0, from my perspective it is right now common to restrict namespaces with PodSecurityPolicies or SecurityContexts to not allow running pods as root for normal users. It is also common that a significant part of the community users run docker and/or run on managed k8s clusters where they can not influence if user-namespaces are enabled, if they can run pods as root, if the runtime points to a seccomp file they like or if the runtime they prefer is used. But let me repeat again that we require root right now anyway and that we don't run the pods right now with the user privileges (but we should and we aim for that). Right now PSPs and SCCs restrict access to these pods by the users. So for our use case, at this exact moment root is acceptable, the unshare call is a little bit more problematic. Best Regards, Roman > > Stefan >
On Fri, Jul 31, 2020 at 10:39:37AM +0200, Roman Mohr wrote: > On Fri, Jul 31, 2020 at 10:26 AM Stefan Hajnoczi <stefanha@redhat.com> > wrote: > > > On Thu, Jul 30, 2020 at 06:21:34PM -0400, Daniel Walsh wrote: > > > On 7/29/20 10:40, Stefan Hajnoczi wrote: > > > > On Wed, Jul 29, 2020 at 09:59:01AM +0200, Roman Mohr wrote: > > > >> On Tue, Jul 28, 2020 at 3:13 PM Vivek Goyal <vgoyal@redhat.com> > > wrote: > > > >> > > > >>> On Tue, Jul 28, 2020 at 12:00:20PM +0200, Roman Mohr wrote: > > > >>>> On Tue, Jul 28, 2020 at 3:07 AM misono.tomohiro@fujitsu.com < > > > >>>> misono.tomohiro@fujitsu.com> wrote: > > > >>>> > > > >>>>>> Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and > > print > > > >>> an > > > >>>>> error > > > >> Yes they can run as root. I can tell you what we plan to do with the > > > >> containerized virtiofsd: We run it as part of the user-owned pod (a > > set of > > > >> containers). > > > >> One of our main goals at the moment is to run VMs in a user-owned pod > > > >> without additional privileges. > > > >> So that in case the user (VM-creator/owner) enters the pod or > > something > > > >> breaks out of the VM they are just in the unprivileged container > > sandbox. > > > >> As part of that we try to get also rid of running containers in the > > > >> user-context with the root user. > > > >> > > > >> One possible scenario which I could think of as being desirable from a > > > >> kubevirt perspective: > > > >> We would run the VM in one container and have an unprivileged > > > >> virtiofsd container in parallel. > > > >> This container already has its own mount namespace and it is not that > > > >> critical if something manages to enter this sandbox. > > > >> > > > >> But we are not as far yet as getting completely rid of root right now > > in > > > >> kubevirt, so if as a temporary step it needs root, the current > > proposed > > > >> changes would still be very useful for us. > > > > What is the issue with root in user namespaces? > > > > > > > > I remember a few years ago it was seen as a major security issue but > > > > don't remember if container runtimes were already using user namespaces > > > > back then. > > > > > > > > I guess the goal might be simply to minimize Linux capabilities as much > > > > as possible? > > > > > > > > virtiofsd could nominally run with an arbitrary uid/gid but it still > > > > needs the Linux capabilities that allow it to change uid/gid and > > > > override file system permission checks just like the root user. Not > > sure > > > > if there is any advantage to running with uid 1000 when you still have > > > > these Linux capabilities. > > > > > > > > Stefan > > > > > > When you run in a user namespace, virtiofsd would only have > > > setuid/setgid over the range of UIDs mapped into the user namespace. So > > > if UID=0 on the host is not mapped, then the container can not create > > > real UID=0 files on disk. > > > > > > Similarly you can protect the user directories and any content by > > > running the containers in a really high UID Mapping. > > > > Roman, do user namespaces address your concerns about uid 0 in > > containers? > > > > They may eventually solve it. I would not let us hang up on this right now, > since as said at least in kubevirt we can't get rid right now of root > anyway. > Even if it is at some point in the future save and supported on > bleeding-edge managed k8s clusters to allow ordinary users to run with uid > 0, from my perspective it is right now common to restrict namespaces with > PodSecurityPolicies or SecurityContexts to not allow running pods as root > for normal users. > It is also common that a significant part of the community users run docker > and/or run on managed k8s clusters where they can not influence if > user-namespaces are enabled, if they can run pods as root, if the runtime > points to a seccomp file they like or if the runtime they prefer is used. > > But let me repeat again that we require root right now anyway and that we > don't run the pods right now with the user privileges (but we should and we > aim for that). Right now PSPs and SCCs restrict access to these pods by the > users. > So for our use case, at this exact moment root is acceptable, the unshare > call is a little bit more problematic. Okay, thanks for explaining. Stefan
* misono.tomohiro@fujitsu.com (misono.tomohiro@fujitsu.com) wrote: > > Subject: [PATCH v2 3/3] virtiofsd: probe unshare(CLONE_FS) and print an error > > > > An assertion failure is raised during request processing if > > unshare(CLONE_FS) fails. Implement a probe at startup so the problem can > > be detected right away. > > > > Unfortunately Docker/Moby does not include unshare in the seccomp.json > > list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always > > include unshare (e.g. podman is unaffected): > > https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json > > > > Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the > > default seccomp.json is missing unshare. > > Hi, sorry for a bit late. > > unshare() was added to fix xattr problem: > https://github.com/qemu/qemu/commit/bdfd66788349acc43cd3f1298718ad491663cfcc# > In theory we don't need to call unshare if xattr is disabled, but it is hard to get to know > if xattr is enabled or disabled in fv_queue_worker(), right? > > So, it looks good to me. > Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> OK, I think it might also be OK to just fail the xattr operation on a non-file/directory in this case. Dave > Regards, > Misono > > > > > Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ > > 1 file changed, 16 insertions(+) > > > > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c > > index 3b6d16a041..9e5537506c 100644 > > --- a/tools/virtiofsd/fuse_virtio.c > > +++ b/tools/virtiofsd/fuse_virtio.c > > @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) > > { > > int ret; > > > > + /* > > + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's > > + * an unprivileged system call but some Docker/Moby versions are known to > > + * reject it via seccomp when CAP_SYS_ADMIN is not given. > > + * > > + * Note that the program is single-threaded here so this syscall has no > > + * visible effect and is safe to make. > > + */ > > + ret = unshare(CLONE_FS); > > + if (ret == -1 && errno == EPERM) { > > + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If " > > + "running in a container please check that the container " > > + "runtime seccomp policy allows unshare.\n"); > > + return -1; > > + } > > + > > ret = fv_create_listen_socket(se); > > if (ret < 0) { > > return ret; > > -- > > 2.26.2 >
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c index 3b6d16a041..9e5537506c 100644 --- a/tools/virtiofsd/fuse_virtio.c +++ b/tools/virtiofsd/fuse_virtio.c @@ -949,6 +949,22 @@ int virtio_session_mount(struct fuse_session *se) { int ret; + /* + * Test that unshare(CLONE_FS) works. fv_queue_worker() will need it. It's + * an unprivileged system call but some Docker/Moby versions are known to + * reject it via seccomp when CAP_SYS_ADMIN is not given. + * + * Note that the program is single-threaded here so this syscall has no + * visible effect and is safe to make. + */ + ret = unshare(CLONE_FS); + if (ret == -1 && errno == EPERM) { + fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If " + "running in a container please check that the container " + "runtime seccomp policy allows unshare.\n"); + return -1; + } + ret = fv_create_listen_socket(se); if (ret < 0) { return ret;
An assertion failure is raised during request processing if unshare(CLONE_FS) fails. Implement a probe at startup so the problem can be detected right away. Unfortunately Docker/Moby does not include unshare in the seccomp.json list unless CAP_SYS_ADMIN is given. Other seccomp.json lists always include unshare (e.g. podman is unaffected): https://raw.githubusercontent.com/seccomp/containers-golang/master/seccomp.json Use "docker run --security-opt seccomp=path/to/seccomp.json ..." if the default seccomp.json is missing unshare. Cc: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> --- tools/virtiofsd/fuse_virtio.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)