[RFC,v2,v2,0/1] ns: introduce binfmt_misc namespace


Message ID

20181002102054.13245-1-laurent@vivier.eu (mailing list archive)

Headers

From: Laurent Vivier <laurent@vivier.eu>
To: linux-kernel@vger.kernel.org
Cc: Dmitry Safonov <dima@arista.com>, Andrei Vagin <avagin@openvz.org>,
        Eric Biederman <ebiederm@xmission.com>,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        James Bottomley <James.Bottomley@HansenPartnership.com>,
        containers@lists.linux-foundation.org,
        linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
        Laurent Vivier <laurent@vivier.eu>
Subject: [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace
Date: Tue,  2 Oct 2018 12:20:53 +0200
Message-Id: <20181002102054.13245-1-laurent@vivier.eu>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Message

Laurent Vivier Oct. 2, 2018, 10:20 a.m. UTC

v2: no new namespace; binfmt_misc data are now part of
    the mount namespace.
    I put this in the mount namespace instead of the user namespace
    because the mount namespace is already needed and
    I don't want to require a user namespace for that.
    As this is a filesystem, it seems logical to have it here.

This allows a new interpreter to be defined for each new container.

But the main goal is to be able to chroot to a directory
using a binfmt_misc interpreter without being root.

I have a modified version of unshare at:

  git@github.com:vivier/util-linux.git branch unshare-chroot

with some new options to unshare the binfmt_misc namespace and to chroot
to a directory.

If you have a directory /chroot/powerpc/jessie containing Debian binaries
for powerpc and a qemu-ppc interpreter, you can do, for instance:

 $ uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
 $ ./unshare --map-root-user --fork --pid \
   --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
   --root=/chroot/powerpc/jessie /bin/bash -l
 # uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
 # id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
 # ls -l
total 5940
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
-rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var
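
For reference, the magic and mask in the rule above are applied byte by
byte to the start of the file: a byte matches when it equals the magic
byte on every bit set in the mask. In this rule the mask clears the
EI_OSABI byte (offset 7) and the low bit of e_type, so both ET_EXEC (2)
and ET_DYN (3) binaries match, and e_machine 0x0014 is EM_PPC. The
following is only a sketch of that comparison in shell, not the kernel
code; the function name masked_match and its argument format
(space-separated two-digit hex bytes, all three lists of equal length)
are assumptions made for illustration.

```shell
#!/bin/sh
# masked_match HEADER MAGIC MASK
# Hypothetical helper: succeeds when every header byte equals the
# corresponding magic byte on the bits selected by the mask.
# Each argument is a list of two-digit hex bytes separated by single
# spaces; the three lists are assumed to have the same length.
masked_match() {
    mm_hdr=$1 mm_magic=$2 mm_mask=$3
    while [ -n "$mm_magic" ]; do
        # Peel the first byte off each list.
        mm_h=${mm_hdr%% *};   mm_hdr=${mm_hdr#"$mm_h"};     mm_hdr=${mm_hdr# }
        mm_m=${mm_magic%% *}; mm_magic=${mm_magic#"$mm_m"}; mm_magic=${mm_magic# }
        mm_k=${mm_mask%% *};  mm_mask=${mm_mask#"$mm_k"};   mm_mask=${mm_mask# }
        # Bits that differ between header and magic must all be masked out.
        [ $(( (0x$mm_h ^ 0x$mm_m) & 0x$mm_k )) -eq 0 ] || return 1
    done
    return 0
}
```

With the magic and mask from the qemu-ppc rule written as hex bytes, a
32-bit big-endian PowerPC header matches whatever its OS ABI byte is,
while an x86_64 header is rejected at the EI_CLASS byte.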

If you want to use the qemu binary provided by your distro, you can use:

    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"

With the 'F' flag, qemu-ppc-static is then loaded from the main root
filesystem before switching to the chroot.
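
The rules passed to --load-interp use the usual binfmt_misc registration
format, :name:type:offset:magic:mask:interpreter:flags, where the flags
used here are 'O' (open-binary), 'C' (credentials) and 'F' (fix binary).
As a rough illustration, the sketch below splits such a rule into its
fields. The helper name binfmt_fields is hypothetical, and it assumes ':'
is the delimiter, as in the examples above.

```shell
#!/bin/sh
# binfmt_fields RULE
# Hypothetical helper: print the seven fields of a binfmt_misc
# registration string, one per line. Assumes ':' is the delimiter.
binfmt_fields() {
    printf '%s\n' "$1" | {
        # The leading ':' produces an empty first field, dropped in "skip".
        # read -r keeps the backslashes of the \x.. escapes intact.
        IFS=: read -r skip name type offset magic mask interp flags
        printf 'name=%s\ntype=%s\noffset=%s\nmagic=%s\nmask=%s\ninterpreter=%s\nflags=%s\n' \
            "$name" "$type" "$offset" "$magic" "$mask" "$interp" "$flags"
    }
}
```

Calling it on the second rule above would report
interpreter=/bin/qemu-ppc-static and flags=OCF, with an empty offset
field (an empty offset means the magic is matched from the start of the
file).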

Laurent Vivier (1):
  ns: add binfmt_misc to the mount namespace

 fs/binfmt_misc.c | 50 +++++++++++++++++++++++++-----------------------
 fs/mount.h       |  8 ++++++++
 fs/namespace.c   |  6 ++++++
 3 files changed, 40 insertions(+), 24 deletions(-)

Comments

James Bottomley Oct. 2, 2018, 4:13 p.m. UTC | #1

On Tue, 2018-10-02 at 12:20 +0200, Laurent Vivier wrote:
> v2: no new namespace, binfmt_misc data are now part of
>     the mount namespace
>     I put this in mount namespace instead of user namespace
>     because the mount namespace is already needed and
>     I don't want to force to have the user namespace for that.
>     As this is a filesystem, it seems logic to have it here.
> 
> This allows to define a new interpreter for each new container.
> 
> But the main goal is to be able to chroot to a directory
> using a binfmt_misc interpreter without being root.

Reading all this, I don't quite understand why this works for me and
not for you (I think I get from your explanation that it doesn't work
for you, but I might have missed something):

jejb@jarvis:~> uname -m
x86_64
jejb@jarvis:~> unshare -r -m
root@jarvis:~# chroot /home/jejb/containers/aarch64
jarvis:/ # uname -m
aarch64

Of course, to get that to work I have an 'F' entry in
/etc/binfmt.d/qemu-aarch64.conf

This means I'm running the host emulator in the container, which is
what I want to do.  I think another goal of the patches might be to use
different emulators for different aarch64 containers?  Do you have a
use case for this?  Right now, for arch emulation containers, I think
a single host-wide entry per static emulator is the right approach.

James

Laurent Vivier Oct. 2, 2018, 4:47 p.m. UTC | #2

On 02/10/2018 at 18:13, James Bottomley wrote:
> On Tue, 2018-10-02 at 12:20 +0200, Laurent Vivier wrote:
>> v2: no new namespace, binfmt_misc data are now part of
>>     the mount namespace
>>     I put this in mount namespace instead of user namespace
>>     because the mount namespace is already needed and
>>     I don't want to force to have the user namespace for that.
>>     As this is a filesystem, it seems logic to have it here.
>>
>> This allows to define a new interpreter for each new container.
>>
>> But the main goal is to be able to chroot to a directory
>> using a binfmt_misc interpreter without being root.
> 
> Reading all this, I don't quite understand why this works for me and
> not for you (I think I get from your explanation that it doesn't work
> for you, but I might have missed something):
> 
> jejb@jarvis:~> uname -m
> x86_64
> jejb@jarvis:~> unshare -r -m
> root@jarvis:~# chroot /home/jejb/containers/aarch64
> jarvis:/ # uname -m
> aarch64
> 
> Of course to get that to work I have an 'F' entry in
> /etc/binfmt.d/qemu-aarch64.conf
> 

I'd like to configure the interpreter without being root.

Since an ordinary user can run a VM with a full system inside, I'd like
to be able to start a container/chroot without having to configure
anything at the host level.

For instance, I'd like to give "someone" (with no admin rights) a tar
file containing an OS environment for a given target along with the
interpreter, and let them run the binaries inside with a single simple
command (as they could with qemu-system-XXX -hda my.img).

It's also useful for testing: I can test different interpreters for the
same target concurrently without modifying the target root filesystem
(as with the 'F' flag, but on a per-directory basis) or the host
configuration.

Another case: we can't configure the qemu-mips/qemu-mipsel (old kernel
API) and qemu-mipsn32/qemu-mipsn32el (new kernel API) interpreters on
the same system because they share the same ELF signature (to be honest,
qemu should have a single binary for the old and new interfaces and
switch dynamically according to the ELF binary that is loaded, as is
done for ARM).

But if no one thinks it's useful, I don't want to push this any
further...

Thanks,
Laurent

James Bottomley Oct. 3, 2018, 10:13 a.m. UTC | #3

On Tue, 2018-10-02 at 18:47 +0200, Laurent Vivier wrote:
> On 02/10/2018 at 18:13, James Bottomley wrote:
> > On Tue, 2018-10-02 at 12:20 +0200, Laurent Vivier wrote:
> > > v2: no new namespace, binfmt_misc data are now part of
> > >     the mount namespace
> > >     I put this in mount namespace instead of user namespace
> > >     because the mount namespace is already needed and
> > >     I don't want to force to have the user namespace for that.
> > >     As this is a filesystem, it seems logic to have it here.
> > > 
> > > This allows to define a new interpreter for each new container.
> > > 
> > > But the main goal is to be able to chroot to a directory
> > > using a binfmt_misc interpreter without being root.
> > 
> > Reading all this, I don't quite understand why this works for me
> > and
> > not for you (I think I get from your explanation that it doesn't
> > work
> > for you, but I might have missed something):
> > 
> > jejb@jarvis:~> uname -m
> > x86_64
> > jejb@jarvis:~> unshare -r -m
> > root@jarvis:~# chroot /home/jejb/containers/aarch64
> > jarvis:/ # uname -m
> > aarch64
> > 
> > Of course to get that to work I have an 'F' entry in
> > /etc/binfmt.d/qemu-aarch64.conf
> > 
> 
> I'd like to configure the interpreter without being root.
> 
> As a simple user can run a VM and a full system inside, I'd like to
> be
> able to start a container/chroot without having to configure
> something
> at the host level.
> 
> For instance, I'd like to provide to "someone" (with no admin rights)
> a tar file with inside an OS environment for a given target and the
> interpreter, and allow him to run the binaries inside just by running
> a simple command (like qemu-system-XXX -hda my.img)

OK, since trying to persuade the distros to add the 'F' flag has been
challenging, I certainly buy this use case.

There is a security risk to allowing an unprivileged user to supply an
arbitrary interpreter (suid and sgid binaries), but as long as
whatever's agreed requires root in the user namespace, I'm happy we
have the security issue confined.

James


> It's also interesting for a test purpose: I can test concurrently
> different interpreters for the same target without modifying the
> target root filesystem (with the 'F' flag but on a per directory
> basis) or the host configuration.
> 
> Another case is we can't configure qemu-mips/qemu-mipsel (old kernel
> API) and qemu-mipsn32/qemu-mipsne32el (new kernel API) interpreters
> on the same system because they share the same ELF signature (to be
> honest qemu should have only one binary for the old and the new
> interface and dynamically change it according to the ELF binary that
> is loaded, as it is done for ARM).
> 
> But if no one thinks it's useful, I don't want to push this more than
> that...
> 
> Thanks,
> Laurent