[v6,0/1] ns: introduce binfmt_misc namespace

Message ID 20181010161430.11633-1-laurent@vivier.eu (mailing list archive)

Message

Laurent Vivier Oct. 10, 2018, 4:14 p.m. UTC
v6: Return &init_binfmt_ns instead of NULL in binfmt_ns()
    This should never happen, but to stay safe return a
    value we can use.
    change subject from "RFC" to "PATCH"

v5: Use READ_ONCE()/WRITE_ONCE()
    move mount pointer struct init to bm_fill_super() and add smp_wmb()
    remove useless NULL value init
    add WARN_ON_ONCE()

v4: the first user namespace is initialized with &init_binfmt_ns,
    all new user namespaces are initialized with NULL and use
    the one of the first parent that is not NULL. The pointer
    is initialized to a valid value the first time the binfmt_misc
    fs is mounted in the current user namespace.
    This keeps the previous behaviour unchanged:
    a new ns inherits values from its parent, and if the parent value is
    modified (or the parent creates its own binfmt entry by mounting the fs)
    the child inherits it (unless it has itself mounted the fs).

v3: create a structure to store binfmt_misc data,
    add a pointer to this structure in the user_namespace structure,
    in init_user_ns structure this pointer points to an init_binfmt_ns
    structure. And all new user namespaces point to this init structure.
    A new binfmt namespace structure is allocated if the binfmt_misc
    filesystem is mounted in a user namespace that is not the initial
    one but its binfmt namespace pointer points to the initial one.
    add override_creds()/revert_creds() around open_exec() in
    bm_register_write()

v2: no new namespace, binfmt_misc data are now part of
    the mount namespace
    I put this in the mount namespace instead of the user namespace
    because the mount namespace is already needed and
    I don't want to force the use of a user namespace for that.
    As this is a filesystem, it seems logical to have it here.

This allows a new interpreter to be defined for each new container.

But the main goal is to be able to chroot to a directory
using a binfmt_misc interpreter without being root.

I have a modified version of unshare at:

  git@github.com:vivier/util-linux.git branch unshare-chroot

with some new options to unshare binfmt_misc namespace and to chroot
to a directory.

If you have a directory /chroot/powerpc/jessie containing Debian binaries
for powerpc and a qemu-ppc interpreter, you can do, for instance:

 $ uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
 $ ./unshare --map-root-user --fork --pid \
   --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
   --root=/chroot/powerpc/jessie /bin/bash -l
 # uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
 # id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
 # ls -l
total 5940
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
-rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var

If you want to use the qemu binary provided by your distro, you can use

    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"

With the 'F' flag, qemu-ppc-static will then be loaded from the main root
filesystem before switching to the chroot.

Laurent Vivier (1):
  ns: add binfmt_misc to the user namespace

 fs/binfmt_misc.c               | 111 ++++++++++++++++++++++++---------
 include/linux/user_namespace.h |  15 +++++
 kernel/user.c                  |  14 +++++
 kernel/user_namespace.c        |   3 +
 4 files changed, 115 insertions(+), 28 deletions(-)

Comments

Laurent Vivier Oct. 16, 2018, 9:52 a.m. UTC | #1
Hi,

Any comment on this last version?

Any chance of it being merged?

Thanks,
Laurent

James Bottomley Nov. 1, 2018, 2:59 a.m. UTC | #2
On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
> Hi,
> 
> Any comment on this last version?
> 
> Any chance to be merged?

I've got a use case for this:  I went to one of the Graphene talks in
Edinburgh and it struck me that we seem to keep reinventing the type of
sandboxing that qemu-user already does.  However if you want to do an
x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
because that has you running *every* binary on the system emulated. 
Doing it per user namespace fixes this problem and allows us to at
least cut down on all the pointless duplication.

James
Jann Horn Nov. 1, 2018, 3:51 a.m. UTC | #3
On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
> > Hi,
> >
> > Any comment on this last version?
> >
> > Any chance to be merged?
>
> I've got a use case for this:  I went to one of the Graphene talks in
> Edinburgh and it struck me that we seem to keep reinventing the type of
> sandboxing that qemu-user already does.  However if you want to do an
> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
> because that has you running *every* binary on the system emulated.
> Doing it per user namespace fixes this problem and allows us to at
> least cut down on all the pointless duplication.

Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
your code slower and *LESS* secure. As far as I know, qemu-user is
only intended for purposes like development and testing.
Laurent Vivier Nov. 1, 2018, 12:28 p.m. UTC | #4
On 01/11/2018 04:51, Jann Horn wrote:
> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>>
>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
>>> Hi,
>>>
>>> Any comment on this last version?
>>>
>>> Any chance to be merged?
>>
>> I've got a use case for this:  I went to one of the Graphene talks in
>> Edinburgh and it struck me that we seem to keep reinventing the type of
>> sandboxing that qemu-user already does.  However if you want to do an
>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
>> because that has you running *every* binary on the system emulated.
>> Doing it per user namespace fixes this problem and allows us to at
>> least cut down on all the pointless duplication.
> 
> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
> your code slower and *LESS* secure. As far as I know, qemu-user is
> only intended for purposes like development and testing.
> 

I think the idea here is not to run qemu, but to use an interpreter
(something like gVisor) inside a container to control the execution of
binaries in that container, without using this interpreter on the host
itself (container and host share the same binfmt_misc magic/mask).

Thanks,
Laurent
James Bottomley Nov. 1, 2018, 2:10 p.m. UTC | #5
On Thu, 2018-11-01 at 04:51 +0100, Jann Horn wrote:
> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
> > > Hi,
> > > 
> > > Any comment on this last version?
> > > 
> > > Any chance to be merged?
> > 
> > I've got a use case for this:  I went to one of the Graphene talks
> > in Edinburgh and it struck me that we seem to keep reinventing the
> > type of sandboxing that qemu-user already does.  However if you
> > want to do an x86 on x86 sandbox, you can't currently use the
> > binfmt_misc mechanism because that has you running *every* binary
> > on the system emulated. Doing it per user namespace fixes this
> > problem and allows us to at least cut down on all the pointless
> > duplication.
> 
> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
> your code slower and *LESS* secure. As far as I know, qemu-user is
> only intended for purposes like development and testing.

Sandboxing is about protecting the cloud service provider (and other
tenants) from horizontal attack by reducing calls to the shared kernel.
I think it's pretty indisputable that full emulation is an effective
sandbox in that regard.

We can argue about bugginess vs completeness, but technologically
qemu-user already has most of the system calls, which seems to be a
significant problem with other sandboxes.  I also can't dispute it's
slower, but that's a tradeoff for people to make.

James
Eric W. Biederman Nov. 1, 2018, 2:16 p.m. UTC | #6
Laurent Vivier <laurent@vivier.eu> writes:

> On 01/11/2018 04:51, Jann Horn wrote:
>> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>>>
>>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
>>>> Hi,
>>>>
>>>> Any comment on this last version?
>>>>
>>>> Any chance to be merged?
>>>
>>> I've got a use case for this:  I went to one of the Graphene talks in
>>> Edinburgh and it struck me that we seem to keep reinventing the type of
>>> sandboxing that qemu-user already does.  However if you want to do an
>>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
>>> because that has you running *every* binary on the system emulated.
>>> Doing it per user namespace fixes this problem and allows us to at
>>> least cut down on all the pointless duplication.
>> 
>> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
>> your code slower and *LESS* secure. As far as I know, qemu-user is
>> only intended for purposes like development and testing.
>> 
>
> I think the idea here is not to run qemu, but to use an interpreter
> (something like gVisor) into a container to control the binaries
> execution inside the container without using this interpreter on the
> host itself (container and host shares the same binfmt_misc
> magic/mask).

Please remind me of this patchset after the merge window is over, and if
there are no issues I will take it via my user namespace branch.

Last I looked I had a concern that some of the permission check issues
were being papered over by using override_creds() instead of fixing the
deeper code.  Sometimes they are necessary, but seeing workarounds
instead of fixes for problems tends to be a maintenance issue, possibly
with security consequences.  Best is if everyone agrees on how all
of the interfaces work so there are no surprises.

Eric
Jann Horn Nov. 1, 2018, 2:44 p.m. UTC | #7
On Thu, Nov 1, 2018 at 3:10 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Thu, 2018-11-01 at 04:51 +0100, Jann Horn wrote:
> > On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > >
> > > On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
> > > > Hi,
> > > >
> > > > Any comment on this last version?
> > > >
> > > > Any chance to be merged?
> > >
> > > I've got a use case for this:  I went to one of the Graphene talks
> > > in Edinburgh and it struck me that we seem to keep reinventing the
> > > type of sandboxing that qemu-user already does.  However if you
> > > want to do an x86 on x86 sandbox, you can't currently use the
> > > binfmt_misc mechanism because that has you running *every* binary
> > > on the system emulated. Doing it per user namespace fixes this
> > > problem and allows us to at least cut down on all the pointless
> > > duplication.
> >
> > Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
> > your code slower and *LESS* secure. As far as I know, qemu-user is
> > only intended for purposes like development and testing.
>
> Sandboxing is about protecting the cloud service provider (and other
> tenants) from horizontal attack by reducing calls to the shared kernel.
> I think it's pretty indisputable that full emulation is an effective
> sandbox in that regard.
>
> We can argue about bugginess vs completeness, but technologically
> qemu-user already has most of the system calls, which seems to be a
> significant problem with other sandboxes.  I also can't dispute it's
> slower, but that's a tradeoff for people to make.

I'm pretty sure you don't understand how qemu-user works.

When the emulated code makes a syscall, QEMU just forwards the syscall
to the native kernel.

QEMU doesn't even prevent you from accessing the address space used by
the emulation logic.

qemu-user is not for sandboxing. qemu-user is not for security.
qemu-user is for running binaries from architecture A on architecture
B, with as much direct access to the kernel's syscall surface as
possible.


An example:

$ cat blah.c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
int main(void) {
  open("/foo/bar/blah", O_RDONLY);
  char c;
  printf("ptr is %p\n", &c);
  read(1337, &c, 1);
  *(volatile char *)0x13371338;
}
$ aarch64-linux-gnu-gcc -static -o blah blah.c && strace -f qemu-aarch64 ./blah
[...]
[pid 14181] openat(AT_FDCWD, "/foo/bar/blah", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 14181] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 93), ...}) = 0
[pid 14181] write(1, "ptr is 0x40007fff2f\n", 20ptr is 0x40007fff2f
) = 20
[pid 14181] read(1337, 0x40007fff2f, 1) = -1 EBADF (Bad file descriptor)
[pid 14181] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x13371338} ---
[...]
Laurent Vivier Nov. 29, 2018, 1:05 p.m. UTC | #8
On 01/11/2018 at 15:16, Eric W. Biederman wrote:
> Laurent Vivier <laurent@vivier.eu> writes:
> 
>> On 01/11/2018 04:51, Jann Horn wrote:
>>> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
>>> <James.Bottomley@hansenpartnership.com> wrote:
>>>>
>>>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:
>>>>> Hi,
>>>>>
>>>>> Any comment on this last version?
>>>>>
>>>>> Any chance to be merged?
>>>>
>>>> I've got a use case for this:  I went to one of the Graphene talks in
>>>> Edinburgh and it struck me that we seem to keep reinventing the type of
>>>> sandboxing that qemu-user already does.  However if you want to do an
>>>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism
>>>> because that has you running *every* binary on the system emulated.
>>>> Doing it per user namespace fixes this problem and allows us to at
>>>> least cut down on all the pointless duplication.
>>>
>>> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
>>> your code slower and *LESS* secure. As far as I know, qemu-user is
>>> only intended for purposes like development and testing.
>>>
>>
>> I think the idea here is not to run qemu, but to use an interpreter
>> (something like gVisor) into a container to control the binaries
>> execution inside the container without using this interpreter on the
>> host itself (container and host shares the same binfmt_misc
>> magic/mask).
> 
> Please remind me of this patchset after the merge window is over, and if
> there are no issues I will take it via my user namespace branch.
> 
> Last I looked I had a concern that some of the permission check issues
> were being papered over by using override_creds() instead of fixing the
> deeper code.  Sometimes they are necessary, but seeing workarounds
> instead of fixes for problems tends to be a maintenance issue, possibly
> with security consequences.  Best is if everyone agrees on how all
> of the interfaces work so there are no surprises.

I don't know where we are in the merge window, but is there something I
can do to have this merged?

Thanks,
Laurent
Laurent Vivier Dec. 29, 2018, 3:41 p.m. UTC | #9
Ping

Thanks,
Laurent
