Message ID | 20181010161430.11633-1-laurent@vivier.eu (mailing list archive) |
---|---|
Series | ns: introduce binfmt_misc namespace |
Hi,

Any comment on this last version?

Any chance to be merged?

Thanks,
Laurent

On 10/10/2018 at 18:14, Laurent Vivier wrote:
> v6: Return &init_binfmt_ns instead of NULL in binfmt_ns()
>     This should never happen, but to stay safe return a
>     value we can use.
>     change subject from "RFC" to "PATCH"
>
> v5: Use READ_ONCE()/WRITE_ONCE()
>     move mount pointer struct init to bm_fill_super() and add smp_wmb()
>     remove useless NULL value init
>     add WARN_ON_ONCE()
>
> v4: first user namespace is initialized with &init_binfmt_ns,
>     all new user namespaces are initialized with a NULL and use
>     the one of the first parent that is not NULL. The pointer
>     is initialized to a valid value the first time the binfmt_misc
>     fs is mounted in the current user namespace.
>     This avoids changing the way it worked before:
>     new ns inherits values from its parent, and if parent value is modified
>     (or parent creates its own binfmt entry by mounting the fs) child
>     inherits it (unless it has itself mounted the fs).
>
> v3: create a structure to store binfmt_misc data,
>     add a pointer to this structure in the user_namespace structure,
>     in init_user_ns structure this pointer points to an init_binfmt_ns
>     structure. And all new user namespaces point to this init structure.
>     A new binfmt namespace structure is allocated if the binfmt_misc
>     filesystem is mounted in a user namespace that is not the initial
>     one but its binfmt namespace pointer points to the initial one.
>     add override_creds()/revert_creds() around open_exec() in
>     bm_register_write()
>
> v2: no new namespace, binfmt_misc data are now part of
>     the mount namespace
>     I put this in mount namespace instead of user namespace
>     because the mount namespace is already needed and
>     I don't want to force the use of a user namespace for that.
>     As this is a filesystem, it seems logical to have it here.
>
> This allows defining a new interpreter for each new container.
>
> But the main goal is to be able to chroot to a directory
> using a binfmt_misc interpreter without being root.
>
> I have a modified version of unshare at:
>
>   git@github.com:vivier/util-linux.git branch unshare-chroot
>
> with some new options to unshare binfmt_misc namespace and to chroot
> to a directory.
>
> If you have a directory /chroot/powerpc/jessie containing debian for powerpc
> binaries and a qemu-ppc interpreter, you can do for instance:
>
>  $ uname -a
>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>  $ ./unshare --map-root-user --fork --pid \
>    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
>    --root=/chroot/powerpc/jessie /bin/bash -l
>  # uname -a
>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
>  # id
>  uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
>  # ls -l
>  total 5940
>  drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
>  drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
>  drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
>  drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
>  drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
>  drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
>  drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
>  drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
>  drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
>  dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
>  -rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
>  drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
>  drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
>  drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
>  drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
>  drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
>  drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
>  drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
>  drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var
>
> If you want to use the qemu binary provided by your distro, you can use
>
>    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"
>
> With the 'F' flag, qemu-ppc-static will then be loaded from the main root
> filesystem before switching to the chroot.
>
> Laurent Vivier (1):
>   ns: add binfmt_misc to the user namespace
>
>  fs/binfmt_misc.c               | 111 ++++++++++++++++++++++++---------
>  include/linux/user_namespace.h |  15 +++++
>  kernel/user.c                  |  14 +++++
>  kernel/user_namespace.c        |   3 +
>  4 files changed, 115 insertions(+), 28 deletions(-)
>

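The lookup rule described in the v4 and v6 notes (use the first ancestor whose pointer is not NULL, and fall back to the initial structure rather than returning NULL) can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions, not the patch itself: the struct layouts and the `enabled` field are hypothetical stand-ins, and the READ_ONCE()/WRITE_ONCE() and smp_wmb() ordering mentioned in v5 is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, NOT the kernel's real definitions. */
struct binfmt_namespace {
    int enabled;                        /* placeholder for per-namespace state */
};

struct user_namespace {
    struct user_namespace *parent;      /* NULL for the initial namespace */
    struct binfmt_namespace *binfmt_ns; /* NULL until binfmt_misc is mounted here */
};

/* Table used by the initial user namespace. */
static struct binfmt_namespace init_binfmt_ns = { .enabled = 1 };

/*
 * Walk from ns towards the root and return the first binfmt table found.
 * A namespace that never mounted binfmt_misc therefore inherits the table
 * of its closest ancestor that did.
 */
static struct binfmt_namespace *binfmt_ns(const struct user_namespace *ns)
{
    while (ns != NULL) {
        if (ns->binfmt_ns != NULL)
            return ns->binfmt_ns;
        ns = ns->parent;
    }
    /* v6 behaviour: should never be reached, but return a usable value. */
    return &init_binfmt_ns;
}
```

With this shape, a child namespace sees its parent's entries until it mounts the filesystem itself, at which point its own pointer is set and the walk stops there.
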
On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
> Hi,
>
> Any comment on this last version?
>
> Any chance to be merged?

I've got a use case for this: I went to one of the Graphene talks in
Edinburgh and it struck me that we seem to keep reinventing the type of
sandboxing that qemu-user already does. However if you want to do an
x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
because that has you running *every* binary on the system emulated.
Doing it per user namespace fixes this problem and allows us to at
least cut down on all the pointless duplication.

James

On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
> > Hi,
> >
> > Any comment on this last version?
> >
> > Any chance to be merged?
>
> I've got a use case for this: I went to one of the Graphene talks in
> Edinburgh and it struck me that we seem to keep reinventing the type of
> sandboxing that qemu-user already does. However if you want to do an
> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
> because that has you running *every* binary on the system emulated.
> Doing it per user namespace fixes this problem and allows us to at
> least cut down on all the pointless duplication.

Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
your code slower and *LESS* secure. As far as I know, qemu-user is
only intended for purposes like development and testing.

On 01/11/2018 04:51, Jann Horn wrote:
> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>>
>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
>>> Hi,
>>>
>>> Any comment on this last version?
>>>
>>> Any chance to be merged?
>>
>> I've got a use case for this: I went to one of the Graphene talks in
>> Edinburgh and it struck me that we seem to keep reinventing the type of
>> sandboxing that qemu-user already does. However if you want to do an
>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
>> because that has you running *every* binary on the system emulated.
>> Doing it per user namespace fixes this problem and allows us to at
>> least cut down on all the pointless duplication.
>
> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
> your code slower and *LESS* secure. As far as I know, qemu-user is
> only intended for purposes like development and testing.
>

I think the idea here is not to run qemu, but to use an interpreter
(something like gVisor) in a container to control the execution of
binaries inside the container without using this interpreter on the
host itself (container and host share the same binfmt_misc
magic/mask).

Thanks,
Laurent

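To make the magic/mask point concrete: a binfmt_misc "M" entry matches a file when every header byte, after masking, equals the corresponding magic byte. The sketch below applies that rule to the magic and mask from the `--load-interp` string in the cover letter. The function and array names are hypothetical, and the code is a userspace illustration of the matching rule rather than the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Magic and mask bytes taken from the qemu-ppc --load-interp example. */
static const unsigned char ppc_magic[20] = {
    0x7f, 'E',  'L',  'F',  0x01, 0x02, 0x01, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x02, 0x00, 0x14
};
static const unsigned char ppc_mask[20] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xfe, 0xff, 0xff
};

/*
 * Return 1 when hdr matches magic under mask, 0 otherwise.
 * A bit only takes part in the comparison if it is set in the mask.
 */
static int magic_matches(const unsigned char *hdr,
                         const unsigned char *magic,
                         const unsigned char *mask, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((hdr[i] ^ magic[i]) & mask[i])
            return 0;
    }
    return 1;
}
```

Because the match depends only on the file header, a host and a container that register the same magic and mask select the same binaries. Running an interpreter only inside the container therefore needs a separate entry table for the container, which is what the per-namespace approach provides.
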
On Thu, 2018-11-01 at 04:51 +0100, Jann Horn wrote:
> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> >
> > On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
> > > Hi,
> > >
> > > Any comment on this last version?
> > >
> > > Any chance to be merged?
> >
> > I've got a use case for this: I went to one of the Graphene talks
> > in Edinburgh and it struck me that we seem to keep reinventing the
> > type of sandboxing that qemu-user already does. However if you
> > want to do an x86 on x86 sandbox, you can't currently use the
> > binfmt_misc mechanism because that has you running *every* binary
> > on the system emulated. Doing it per user namespace fixes this
> > problem and allows us to at least cut down on all the pointless
> > duplication.
>
> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
> your code slower and *LESS* secure. As far as I know, qemu-user is
> only intended for purposes like development and testing.

Sandboxing is about protecting the cloud service provider (and other
tenants) from horizontal attack by reducing calls to the shared kernel.
I think it's pretty indisputable that full emulation is an effective
sandbox in that regard.

We can argue about bugginess vs completeness, but technologically
qemu-user already has most of the system calls, which seems to be a
significant problem with other sandboxes. I also can't dispute it's
slower, but that's a tradeoff for people to make.

James

Laurent Vivier <laurent@vivier.eu> writes:

> On 01/11/2018 04:51, Jann Horn wrote:
>> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>>>
>>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
>>>> Hi,
>>>>
>>>> Any comment on this last version?
>>>>
>>>> Any chance to be merged?
>>>
>>> I've got a use case for this: I went to one of the Graphene talks in
>>> Edinburgh and it struck me that we seem to keep reinventing the type of
>>> sandboxing that qemu-user already does. However if you want to do an
>>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
>>> because that has you running *every* binary on the system emulated.
>>> Doing it per user namespace fixes this problem and allows us to at
>>> least cut down on all the pointless duplication.
>>
>> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
>> your code slower and *LESS* secure. As far as I know, qemu-user is
>> only intended for purposes like development and testing.
>>
>
> I think the idea here is not to run qemu, but to use an interpreter
> (something like gVisor) in a container to control the execution of
> binaries inside the container without using this interpreter on the
> host itself (container and host share the same binfmt_misc
> magic/mask).

Please remind me of this patchset after the merge window is over, and if
there are no issues I will take it via my user namespace branch.

Last I looked I had a concern that some of the permission check issues
were being papered over by using override_creds() instead of fixing the
deeper code. Sometimes work-arounds are necessary, but seeing them
instead of fixes for problems tends to be a maintenance issue, possibly
with security consequences. Best is if everyone agrees on how all
of the interfaces work so there are no surprises.

Eric

On Thu, Nov 1, 2018 at 3:10 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Thu, 2018-11-01 at 04:51 +0100, Jann Horn wrote:
> > On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > >
> > > On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
> > > > Hi,
> > > >
> > > > Any comment on this last version?
> > > >
> > > > Any chance to be merged?
> > >
> > > I've got a use case for this: I went to one of the Graphene talks
> > > in Edinburgh and it struck me that we seem to keep reinventing the
> > > type of sandboxing that qemu-user already does. However if you
> > > want to do an x86 on x86 sandbox, you can't currently use the
> > > binfmt_misc mechanism because that has you running *every* binary
> > > on the system emulated. Doing it per user namespace fixes this
> > > problem and allows us to at least cut down on all the pointless
> > > duplication.
> >
> > Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
> > your code slower and *LESS* secure. As far as I know, qemu-user is
> > only intended for purposes like development and testing.
>
> Sandboxing is about protecting the cloud service provider (and other
> tenants) from horizontal attack by reducing calls to the shared kernel.
> I think it's pretty indisputable that full emulation is an effective
> sandbox in that regard.
>
> We can argue about bugginess vs completeness, but technologically
> qemu-user already has most of the system calls, which seems to be a
> significant problem with other sandboxes. I also can't dispute it's
> slower, but that's a tradeoff for people to make.

I'm pretty sure you don't understand how qemu-user works. When the
emulated code makes a syscall, QEMU just forwards the syscall to the
native kernel. QEMU doesn't even prevent you from accessing the
address space used by the emulation logic.

qemu-user is not for sandboxing. qemu-user is not for security.
qemu-user is for running binaries from architecture A on architecture
B, with as much direct access to the kernel's syscall surface as
possible.

An example:

$ cat blah.c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
int main(void) {
  open("/foo/bar/blah", O_RDONLY);
  char c;
  printf("ptr is %p\n", &c);
  read(1337, &c, 1);
  *(volatile char *)0x13371338;
}
$ aarch64-linux-gnu-gcc -static -o blah blah.c && strace -f qemu-aarch64 ./blah
[...]
[pid 14181] openat(AT_FDCWD, "/foo/bar/blah", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 14181] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 93), ...}) = 0
[pid 14181] write(1, "ptr is 0x40007fff2f\n", 20ptr is 0x40007fff2f
) = 20
[pid 14181] read(1337, 0x40007fff2f, 1) = -1 EBADF (Bad file descriptor)
[pid 14181] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x13371338} ---
[...]

On 01/11/2018 at 15:16, Eric W. Biederman wrote:
> Laurent Vivier <laurent@vivier.eu> writes:
>
>> On 01/11/2018 04:51, Jann Horn wrote:
>>> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
>>> <James.Bottomley@hansenpartnership.com> wrote:
>>>>
>>>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
>>>>> Hi,
>>>>>
>>>>> Any comment on this last version?
>>>>>
>>>>> Any chance to be merged?
>>>>
>>>> I've got a use case for this: I went to one of the Graphene talks in
>>>> Edinburgh and it struck me that we seem to keep reinventing the type of
>>>> sandboxing that qemu-user already does. However if you want to do an
>>>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
>>>> because that has you running *every* binary on the system emulated.
>>>> Doing it per user namespace fixes this problem and allows us to at
>>>> least cut down on all the pointless duplication.
>>>
>>> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
>>> your code slower and *LESS* secure. As far as I know, qemu-user is
>>> only intended for purposes like development and testing.
>>>
>>
>> I think the idea here is not to run qemu, but to use an interpreter
>> (something like gVisor) in a container to control the execution of
>> binaries inside the container without using this interpreter on the
>> host itself (container and host share the same binfmt_misc
>> magic/mask).
>
> Please remind me of this patchset after the merge window is over, and if
> there are no issues I will take it via my user namespace branch.
>
> Last I looked I had a concern that some of the permission check issues
> were being papered over by using override_creds() instead of fixing the
> deeper code. Sometimes work-arounds are necessary, but seeing them
> instead of fixes for problems tends to be a maintenance issue, possibly
> with security consequences. Best is if everyone agrees on how all
> of the interfaces work so there are no surprises.

I don't know where we are in the merge window, but is there something I
can do to have this merged?

Thanks,
Laurent

Ping

Thanks,
Laurent

On 29/11/2018 at 14:05, Laurent Vivier wrote:
> On 01/11/2018 at 15:16, Eric W. Biederman wrote:
>> Laurent Vivier <laurent@vivier.eu> writes:
>>
>>> On 01/11/2018 04:51, Jann Horn wrote:
>>>> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
>>>> <James.Bottomley@hansenpartnership.com> wrote:
>>>>>
>>>>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Any comment on this last version?
>>>>>>
>>>>>> Any chance to be merged?
>>>>>
>>>>> I've got a use case for this: I went to one of the Graphene talks in
>>>>> Edinburgh and it struck me that we seem to keep reinventing the type of
>>>>> sandboxing that qemu-user already does. However if you want to do an
>>>>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
>>>>> because that has you running *every* binary on the system emulated.
>>>>> Doing it per user namespace fixes this problem and allows us to at
>>>>> least cut down on all the pointless duplication.
>>>>
>>>> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
>>>> your code slower and *LESS* secure. As far as I know, qemu-user is
>>>> only intended for purposes like development and testing.
>>>>
>>>
>>> I think the idea here is not to run qemu, but to use an interpreter
>>> (something like gVisor) in a container to control the execution of
>>> binaries inside the container without using this interpreter on the
>>> host itself (container and host share the same binfmt_misc
>>> magic/mask).
>>
>> Please remind me of this patchset after the merge window is over, and if
>> there are no issues I will take it via my user namespace branch.
>>
>> Last I looked I had a concern that some of the permission check issues
>> were being papered over by using override_creds() instead of fixing the
>> deeper code. Sometimes work-arounds are necessary, but seeing them
>> instead of fixes for problems tends to be a maintenance issue, possibly
>> with security consequences. Best is if everyone agrees on how all
>> of the interfaces work so there are no surprises.
>
> I don't know where we are in the merge window, but is there something I
> can do to have this merged?
>
> Thanks,
> Laurent
>