[v2,00/28] user_namespace: introduce fsid mappings

Message ID 20200214183554.1133805-1-christian.brauner@ubuntu.com (mailing list archive)

Christian Brauner Feb. 14, 2020, 6:35 p.m. UTC
Hey everyone,

This is v2 with various fixes after discussions with Jann.

Judging from pings, off-list questions, and discussions at the Google
Container Security Summit, there seems to be quite a lot of interest in
this patchset, with use-cases ranging from layer sharing for app
containers and k8s to data sharing between containers with different id
mappings. I haven't Cced everyone because I don't have all the email
addresses at hand but I've at least added Phil now. :)

This is the implementation of shiftfs which was cooked up during lunch at
Linux Plumbers 2019 the day after the containers microconference. The
idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
we all were quite busy with other work and couldn't really sit down and
implement it. But I took a few days last week to do this work, including
demos and performance testing.
This implementation does not require us to touch the vfs substantially
at all. Instead, we implement shiftfs via fsid mappings.
With this patch, it took me 20 mins to port both LXD and LXC to support
shiftfs via fsid mappings.

For anyone wanting to play with this the branch can be pulled from:
https://github.com/brauner/linux/tree/fsid_mappings
https://gitlab.com/brauner/linux/-/tree/fsid_mappings
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings

The main use case for shiftfs for us is in allowing multiple containers
with non-overlapping id mappings to share writable storage.
In such a scenario you want the fsids to be valid and identical in both
containers for the shared mount. A demo for this exists in [3].
If you don't want to read on, go straight to the other demos below in
[1] and [2].

People not as familiar with user namespaces might not be aware that fsid
mappings already exist. Right now, fsid mappings are always identical to
id mappings. Specifically, the kernel will look up fsuids in the uid
mappings and fsgids in the gid mappings of the relevant user namespace.

With this patch series we simply introduce the ability to create fsid
mappings that are different from the id mappings of a user namespace.
The whole feature set is placed under a config option that defaults to
false.

In the usual case of running an unprivileged container we will have
set up an id mapping, e.g. 0 100000 100000. The on-disk mapping will
correspond to this id mapping, i.e. all files which we want to appear as
0:0 inside the user namespace will be chowned to 100000:100000 on the
host. This works because whenever the kernel needs to do a filesystem
access it will look up the corresponding uid and gid in the id mapping
tables of the container.
Now think about the case where we want to have an id mapping of 0 100000
100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
share a single on-disk mapping with multiple containers that all have
different id mappings.
This will be problematic. Whenever a filesystem access is requested, the
kernel will now try to look up a mapping for 300000 in the id mapping
tables of the user namespace but since there is none the files will
appear to be owned by the overflow id, i.e. usually 65534:65534 or
nobody:nogroup.

With fsid mappings we can solve this by writing an id mapping of 0
100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
access the kernel will now look up the mapping for 300000 in the fsid
mapping tables of the user namespace. And since such a mapping exists,
the corresponding files will have correct ownership.
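
To make the two lookups above concrete, here is a minimal userspace
sketch of the translation, assuming a mapping is a list of
"first lower_first count" lines. The struct, the function name and the
hard-coded overflow id are illustrative only and are not taken from the
patches:

```c
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_ID 65534u /* usual overflow uid/gid, i.e. nobody:nogroup */

/* One mapping line: "first lower_first count" (hypothetical struct). */
struct extent {
	uint32_t first;       /* first id as seen inside the user namespace */
	uint32_t lower_first; /* first id on the host / on disk */
	uint32_t count;       /* number of ids covered by this line */
};

/*
 * Translate a host-side (on-disk) id into the user namespace.
 * Returns the overflow id when no mapping line covers the id.
 */
static uint32_t map_id_up(const struct extent *map, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].lower_first &&
		    id - map[i].lower_first < map[i].count)
			return map[i].first + (id - map[i].lower_first);
	}
	return OVERFLOW_ID;
}
```

With an id mapping of { 0, 100000, 100000 } a file owned by 300000 on
disk translates to 65534, whereas with an fsid mapping of
{ 0, 300000, 100000 } the same file translates to 0.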

A note on proc (and sys): the proc filesystem is special insofar as it
only has a single superblock that is (currently, though this might be
about to change) visible in all user namespaces (same goes for sys). This
means it has special semantics in many ways, including how file ownership
and access works. The fsid mapping implementation does not alter how proc
(and sys) ownership works. proc and sys will both continue to look up
filesystem access in id mapping tables.

When writing fsid mappings the same rules apply as when writing id
mappings so I won't reiterate them here. The limit for fsid mappings is
the same as for id mappings, i.e. 340 lines.
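
For readers who don't have those rules at hand, the following is a
simplified sketch of the kind of checks applied to id mappings today
(at most 340 lines, non-zero counts, no wrap-around, no overlapping
ranges). It is a userspace illustration with made-up names, assuming
the fsid files are validated the same way; it is not the code from this
series:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAP_EXTENTS 340 /* line limit shared by id and fsid mappings */

/* One mapping line: "first lower_first count" (hypothetical struct). */
struct map_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* True if [a, a + a_count) and [b, b + b_count) share at least one id. */
static bool ranges_overlap(uint32_t a, uint32_t a_count,
			   uint32_t b, uint32_t b_count)
{
	uint64_t a_end = (uint64_t)a + a_count; /* exclusive end */
	uint64_t b_end = (uint64_t)b + b_count; /* exclusive end */

	return a < b_end && b < a_end;
}

/* Simplified validity check for a whole mapping table. */
static bool map_is_valid(const struct map_extent *m, size_t n)
{
	if (n == 0 || n > MAX_MAP_EXTENTS)
		return false;

	for (size_t i = 0; i < n; i++) {
		/* Reject empty ranges and ranges that wrap around. */
		if (m[i].count == 0 ||
		    (uint64_t)m[i].first + m[i].count > UINT32_MAX ||
		    (uint64_t)m[i].lower_first + m[i].count > UINT32_MAX)
			return false;

		/* Ranges must not overlap on either side of the mapping. */
		for (size_t j = 0; j < i; j++) {
			if (ranges_overlap(m[i].first, m[i].count,
					   m[j].first, m[j].count) ||
			    ranges_overlap(m[i].lower_first, m[i].count,
					   m[j].lower_first, m[j].count))
				return false;
		}
	}
	return true;
}
```

The single-line fsid mapping 0 300000 100000 from the example above
passes these checks.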

# Performance
Back when I extended the range of possible id mappings to 340 I did
performance testing by booting into single user mode, creating 1,000,000
files, fstat()ing them, and calculating the mean fstat() time per file.
(Back when Linux was still fast. I won't mention that the stat
 numbers have (thanks microcode!) doubled since then...)
I did the same test for this patchset: one vanilla kernel, one kernel
with my fsid mapping patches but CONFIG_USER_NS_FSID set to n and one
with fsid mappings patches enabled. I then ran the same test on all
three kernels and compared the numbers. The implementation does not
introduce overhead. That's all I can say. Here are the numbers:

             | vanilla v5.5 | fsid mappings       | fsid mappings      | fsid mappings      |
             |              | disabled in Kconfig | enabled in Kconfig | enabled in Kconfig |
             |              |                     | and unset for all  | and set for all    |
             |              |                     | test cases         | test cases         |
-------------|--------------|---------------------|--------------------|--------------------|
 0  mappings |       367 ns |              365 ns |             365 ns |             N/A    |
 1  mappings |       362 ns |              367 ns |             363 ns |             363 ns |
 2  mappings |       361 ns |              369 ns |             363 ns |             364 ns |
 3  mappings |       361 ns |              368 ns |             366 ns |             365 ns |
 5  mappings |       365 ns |              368 ns |             363 ns |             365 ns |
 10 mappings |       391 ns |              388 ns |             387 ns |             389 ns |
 50 mappings |       395 ns |              398 ns |             401 ns |             397 ns |
100 mappings |       400 ns |              405 ns |             399 ns |             399 ns |
200 mappings |       404 ns |              407 ns |             430 ns |             404 ns |
300 mappings |       492 ns |              494 ns |             432 ns |             413 ns |
340 mappings |       495 ns |              497 ns |             500 ns |             484 ns |
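
For anyone wanting to reproduce numbers of this kind, a rough sketch of
such a measurement loop could look as follows. This is not the program
used for the table above; the function name, the file naming scheme and
the choice to time each fstat() individually with CLOCK_MONOTONIC are
assumptions made for illustration:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Create nfiles files in dir, fstat() each one and return the mean
 * fstat() time in nanoseconds, or -1.0 on error. Files are removed
 * again as we go so the directory is left as it was found.
 */
static double mean_fstat_ns(const char *dir, long nfiles)
{
	char path[4096];
	struct timespec t0, t1;
	struct stat st;
	double total_ns = 0.0;

	if (nfiles <= 0)
		return -1.0;

	for (long i = 0; i < nfiles; i++) {
		snprintf(path, sizeof(path), "%s/fstat-bench-%ld", dir, i);

		/* O_EXCL so that we never reuse or clobber an existing file. */
		int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
		if (fd < 0)
			return -1.0;

		/* Only the fstat() call itself is inside the timed region. */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		int ret = fstat(fd, &st);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		close(fd);
		unlink(path);
		if (ret < 0)
			return -1.0;

		total_ns += (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
			    (double)(t1.tv_nsec - t0.tv_nsec);
	}
	return total_ns / (double)nfiles;
}
```

Note that the cost of clock_gettime() is included in every sample, so
absolute values from a loop like this are only meaningful when compared
across kernels on the same machine, with the loop run from inside a
user namespace that has the desired number of mapping lines.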

# Demos
[1]: Create a container with different id and fsid mappings.
     https://asciinema.org/a/300233 
[2]: Create a container with id mappings but without fsid mappings.
     https://asciinema.org/a/300234
[3]: Share storage between multiple containers with non-overlapping id
     mappings.
     https://asciinema.org/a/300235

Thanks!
Christian

Christian Brauner (28):
  user_namespace: introduce fsid mappings infrastructure
  proc: add /proc/<pid>/fsuid_map
  proc: add /proc/<pid>/fsgid_map
  fsuidgid: add fsid mapping helpers
  proc: task_state(): use from_kfs{g,u}id_munged
  cred: add kfs{g,u}id
  sys: __sys_setfsuid(): handle fsid mappings
  sys: __sys_setfsgid(): handle fsid mappings
  sys:__sys_setuid(): handle fsid mappings
  sys:__sys_setgid(): handle fsid mappings
  sys:__sys_setreuid(): handle fsid mappings
  sys:__sys_setregid(): handle fsid mappings
  sys:__sys_setresuid(): handle fsid mappings
  sys:__sys_setresgid(): handle fsid mappings
  fs: add is_userns_visible() helper
  namei: may_{o_}create(): handle fsid mappings
  inode: inode_owner_or_capable(): handle fsid mappings
  capability: privileged_wrt_inode_uidgid(): handle fsid mappings
  stat: handle fsid mappings
  open: handle fsid mappings
  posix_acl: handle fsid mappings
  attr: notify_change(): handle fsid mappings
  commoncap: cap_bprm_set_creds(): handle fsid mappings
  commoncap: cap_task_fix_setuid(): handle fsid mappings
  commoncap: handle fsid mappings with vfs caps
  exec: bprm_fill_uid(): handle fsid mappings
  ptrace: adapt ptrace_may_access() to always uses unmapped fsids
  devpts: handle fsid mappings

 fs/attr.c                      |  23 ++-
 fs/devpts/inode.c              |   7 +-
 fs/exec.c                      |  25 ++-
 fs/inode.c                     |   7 +-
 fs/namei.c                     |  36 +++-
 fs/open.c                      |  16 +-
 fs/posix_acl.c                 |  21 +--
 fs/proc/array.c                |   5 +-
 fs/proc/base.c                 |  34 ++++
 fs/stat.c                      |  48 ++++--
 include/linux/cred.h           |   4 +
 include/linux/fs.h             |   5 +
 include/linux/fsuidgid.h       | 122 +++++++++++++
 include/linux/stat.h           |   1 +
 include/linux/user_namespace.h |  10 ++
 init/Kconfig                   |  11 ++
 kernel/capability.c            |  10 +-
 kernel/ptrace.c                |   4 +-
 kernel/sys.c                   | 106 +++++++++---
 kernel/user.c                  |  22 +++
 kernel/user_namespace.c        | 303 ++++++++++++++++++++++++++++++++-
 security/commoncap.c           |  35 ++--
 22 files changed, 757 insertions(+), 98 deletions(-)
 create mode 100644 include/linux/fsuidgid.h


base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9

Comments

Florian Weimer Feb. 16, 2020, 3:55 p.m. UTC | #1
* Christian Brauner:

> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping exists,
> the corresponding files will have correct ownership.

I'm worried that this is a bit of a management nightmare because the
data about the mapping does not live within the file system (it's
externally determined, static, but crucial to the interpretation of
file system content).  I expect that many organizations have
centralized allocation of user IDs, but centralized allocation of the
static mapping does not appear feasible.

Have you considered a more complex design, where untranslated nested
user IDs are stored in a file attribute (or something like that)?  This
way, any existing user ID infrastructure can be carried over largely
unchanged.
Christian Brauner Feb. 16, 2020, 4:40 p.m. UTC | #2
On Sun, Feb 16, 2020 at 04:55:49PM +0100, Florian Weimer wrote:
> * Christian Brauner:
> 
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
> 
> I'm worried that this is a bit of a management nightmare because the
> data about the mapping does not live within the file system (it's
> externally determined, static, but crucial to the interpretation of
> file system content).  I expect that many organizations have

Iiuc, that's already the case with user namespaces right now e.g. when
you have an on-disk mapping that doesn't match your user namespace
mapping.

> centralized allocation of user IDs, but centralized allocation of the
> static mapping does not appear feasible.

I thought we're working on this right now with the new nss
infrastructure to register id mappings aka the shadow discussion we've
been having.

> 
> Have you considered a more complex design, where untranslated nested
> user IDs are store in a file attribute (or something like that)?  This

That doesn't sound like it would be feasible, especially in the nesting
case wrt. performance.

Christian
James Bottomley Feb. 17, 2020, 9:06 p.m. UTC | #3
On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
[...]
> People not as familiar with user namespaces might not be aware that
> fsid mappings already exist. Right now, fsid mappings are always
> identical to id mappings. Specifically, the kernel will lookup fsuids
> in the uid mappings and fsgids in the gid mappings of the relevant
> user namespace.

This isn't actually entirely true: today we have the superblock user
namespace, which can be used for fsid remapping on filesystems that
support it (currently f2fs and fuse).  Since this is a single shift,
how is it going to play with s_user_ns?  Do you have to understand the
superblock mapping to use this shift, or are we simply using this to
replace s_user_ns?

James
James Bottomley Feb. 17, 2020, 9:11 p.m. UTC | #4
On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
[...]
> With this patch series we simply introduce the ability to create fsid
> mappings that are different from the id mappings of a user namespace.
> The whole feature set is placed under a config option that defaults
> to false.
> 
> In the usual case of running an unprivileged container we will have
> setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> correspond to this id mapping, i.e. all files which we want to appear
> as 0:0 inside the user namespace will be chowned to 100000:100000 on
> the host. This works, because whenever the kernel needs to do a
> filesystem access it will lookup the corresponding uid and gid in the
> idmapping tables of the container.
> Now think about the case where we want to have an id mapping of 0
> 100000 100000 but an on-disk mapping of 0 300000 100000 which is
> needed to e.g. share a single on-disk mapping with multiple
> containers that all have different id mappings.
> This will be problematic. Whenever a filesystem access is requested,
> the kernel will now try to lookup a mapping for 300000 in the id
> mapping tables of the user namespace but since there is none the
> files will appear to be owned by the overflow id, i.e. usually
> 65534:65534 or nobody:nogroup.
> 
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping
> exists, the corresponding files will have correct ownership.

How do we parametrise this new fsid shift for the unprivileged use
case?  For newuidmap/newgidmap, it's easy because each user gets a
dedicated range and everything "just works (tm)".  However, for the
fsid mapping, assuming some newfsuid/newfsgid tool to help, that tool
has to know not only your allocated uid/gid chunk, but also the offset
map of the image.  The former is easy, but the latter is going to vary
by the actual image ... well unless we standardise some accepted shift
for images and it simply becomes a known static offset.

James
Christian Brauner Feb. 17, 2020, 9:20 p.m. UTC | #5
On Mon, Feb 17, 2020 at 01:06:08PM -0800, James Bottomley wrote:
> On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> [...]
> > People not as familiar with user namespaces might not be aware that
> > fsid mappings already exist. Right now, fsid mappings are always
> > identical to id mappings. Specifically, the kernel will lookup fsuids
> > in the uid mappings and fsgids in the gid mappings of the relevant
> > user namespace.
> 
> This isn't actually entirely true: today we have the superblock user
> namespace, which can be used for fsid remapping on filesystems that
> support it (currently f2fs and fuse).  Since this is a single shift,

Note that this states "the relevant" user namespace not the caller's
user namespace. And the point is true even for such filesystems. fuse
does call make_kuid(fc->user_ns, attr->uid) and hence looks up the
mapping in the id mappings. This would be replaced by make_kfsuid().

> how is it going to play with s_user_ns?  Do you have to understand the
> superblock mapping to use this shift, or are we simply using this to
> replace s_user_ns?

I'm not sure what you mean by understand the superblock mapping. The
case is not different from the devpts patch in this series.
Fuse needs to be changed to call make_kfsuid() since it is mountable
inside user namespaces at which point everything just works.
Stéphane Graber Feb. 17, 2020, 10:02 p.m. UTC | #6
And re-sending, this time hopefully actually in plain text mode.
Sorry about that, my e-mail client isn't behaving today...

Stéphane

On Mon, Feb 17, 2020 at 4:57 PM Stéphane Graber <stgraber@ubuntu.com> wrote:
>
> On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
>>
>> On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
>> [...]
>> > With this patch series we simply introduce the ability to create fsid
>> > mappings that are different from the id mappings of a user namespace.
>> > The whole feature set is placed under a config option that defaults
>> > to false.
>> >
>> > In the usual case of running an unprivileged container we will have
>> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
>> > correspond to this id mapping, i.e. all files which we want to appear
>> > as 0:0 inside the user namespace will be chowned to 100000:100000 on
>> > the host. This works, because whenever the kernel needs to do a
>> > filesystem access it will lookup the corresponding uid and gid in the
>> > idmapping tables of the container.
>> > Now think about the case where we want to have an id mapping of 0
>> > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
>> > needed to e.g. share a single on-disk mapping with multiple
>> > containers that all have different id mappings.
>> > This will be problematic. Whenever a filesystem access is requested,
>> > the kernel will now try to lookup a mapping for 300000 in the id
>> > mapping tables of the user namespace but since there is none the
>> > files will appear to be owned by the overflow id, i.e. usually
>> > 65534:65534 or nobody:nogroup.
>> >
>> > With fsid mappings we can solve this by writing an id mapping of 0
>> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
>> > access the kernel will now lookup the mapping for 300000 in the fsid
>> > mapping tables of the user namespace. And since such a mapping
>> > exists, the corresponding files will have correct ownership.
>>
>> How do we parametrise this new fsid shift for the unprivileged use
>> case?  For newuidmap/newgidmap, it's easy because each user gets a
>> dedicated range and everything "just works (tm)".  However, for the
>> fsid mapping, assuming some newfsuid/newfsgid tool to help, that tool
>> has to know not only your allocated uid/gid chunk, but also the offset
>> map of the image.  The former is easy, but the latter is going to vary
>> by the actual image ... well unless we standardise some accepted shift
>> for images and it simply becomes a known static offset.
>
>
> For unprivileged runtimes, I would expect images to be unshifted and be
> unpacked from within a userns. So your unprivileged user would be allowed
> a uid/gid range through /etc/subuid and /etc/subgid and allowed to use
> them through newuidmap/newgidmap. In that namespace, you can then pull
> and unpack any images/layers you may want and the resulting fs tree will
> look correct from within that namespace.
>
> All that is possible today and is how for example unprivileged LXC works
> right now.
>
> What this patchset then allows is for containers to have differing
> uid/gid maps while still being based off the same image or layers.
> In this scenario, you would carve a subset of your main uid/gid map for
> each container you run and run them in a child user namespace while
> setting up a fsuid/fsgid map such that their filesystem accesses do not
> follow their uid/gid map. This then results in proper isolation for
> processes, networks, ... as everything runs as different kuid/kgid but
> the VFS view will be the same in all containers.
>
> Shared storage between those otherwise isolated containers would also
> work just fine by simply bind-mounting the same path into two or more
> containers.
>
>
> Now one additional thing that would be safe for a setuid wrapper to
> allow would be for arbitrary mapping of any of the uid/gid that the user
> owns to be used within the fsuid/fsgid map. One potential use for this
> would be to create any number of user namespaces, each with their own
> mapping for uid 0 while still having all VFS access be mapped to the
> user that spawned them (say uid=1000, gid=1000).
>
>
> Note that in our case, the intended use for this is from a privileged runtime
> where our images would be unshifted as would be the container storage
> and any shared storage for containers. The security model effectively relies
> on properly configured filesystem permissions and mount namespaces such
> that the content of those paths can never be seen by anyone but root outside
> of those containers (and therefore avoids all the issues around setuid/setgid/fscaps).
>
> We will then be able to allocate distinct, random, ranges of 65536 uids/gids (or more)
> for each container without ever having to do any uid/gid shifting at the filesystem layer
> or run into issues when having to setup shared storage between containers or attaching
> external storage volumes to those containers.
>
>> James
>
>
> Stéphane
James Bottomley Feb. 17, 2020, 10:35 p.m. UTC | #7
On Mon, 2020-02-17 at 22:20 +0100, Christian Brauner wrote:
> On Mon, Feb 17, 2020 at 01:06:08PM -0800, James Bottomley wrote:
> > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > [...]
> > > People not as familiar with user namespaces might not be aware
> > > that fsid mappings already exist. Right now, fsid mappings are
> > > always identical to id mappings. Specifically, the kernel will
> > > lookup fsuids in the uid mappings and fsgids in the gid mappings
> > > of the relevant user namespace.
> > 
> > This isn't actually entirely true: today we have the superblock
> > user namespace, which can be used for fsid remapping on filesystems
> > that support it (currently f2fs and fuse).  Since this is a single
> > shift,
> 
> Note that this states "the relevant" user namespace not the caller's
> user namespace. And the point is true even for such filesystems. fuse
> does call make_kuid(fc->user_ns, attr->uid) and hence looks up the
> mapping in the id mappings.. This would be replaced by make_kfsuid().
> 
> > how is it going to play with s_user_ns?  Do you have to understand
> > the superblock mapping to use this shift, or are we simply using
> > this to replace s_user_ns?
> 
> I'm not sure what you mean by understand the superblock mapping. The
> case is not different from the devpts patch in this series.

So since devpts wasn't originally a s_user_ns consumer, I assume you're
thinking that this patch series just replaces the whole of s_user_ns
for fuse and f2fs and we can remove it?

> Fuse needs to be changed to call make_kfsuid() since it is mountable
> inside user namespaces at which point everthing just works.

The fuse case is slightly more complicated because there are sound
reasons to run the daemon in a separate user namespace regardless of
where the end fuse mount is.

James
James Bottomley Feb. 17, 2020, 11:03 p.m. UTC | #8
On Mon, 2020-02-17 at 16:57 -0500, Stéphane Graber wrote:
> On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <
> James.Bottomley@hansenpartnership.com> wrote:
> 
> > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > [...]
> > > With this patch series we simply introduce the ability to create
> > > fsid mappings that are different from the id mappings of a user
> > > namespace. The whole feature set is placed under a config option
> > > that defaults to false.
> > > 
> > > In the usual case of running an unprivileged container we will
> > > have setup an id mapping, e.g. 0 100000 100000. The on-disk
> > > mapping will correspond to this id mapping, i.e. all files which
> > > we want to appear as 0:0 inside the user namespace will be
> > > chowned to 100000:100000 on the host. This works, because
> > > whenever the kernel needs to do a filesystem access it will
> > > lookup the corresponding uid and gid in the idmapping tables of
> > > the container.
> > > Now think about the case where we want to have an id mapping of 0
> > > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
> > > needed to e.g. share a single on-disk mapping with multiple
> > > containers that all have different id mappings.
> > > This will be problematic. Whenever a filesystem access is
> > > requested, the kernel will now try to lookup a mapping for 300000
> > > in the id mapping tables of the user namespace but since there is
> > > none the files will appear to be owned by the overflow id, i.e.
> > > usually 65534:65534 or nobody:nogroup.
> > > 
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now lookup the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> > 
> > How do we parametrise this new fsid shift for the unprivileged use
> > case?  For newuidmap/newgidmap, it's easy because each user gets a
> > dedicated range and everything "just works (tm)".  However, for the
> > fsid mapping, assuming some newfsuid/newfsgid tool to help, that
> > tool has to know not only your allocated uid/gid chunk, but also
> > the offset map of the image.  The former is easy, but the latter is
> > going to vary by the actual image ... well unless we standardise
> > some accepted shift for images and it simply becomes a known static
> > offset.
> > 
> 
> For unprivileged runtimes, I would expect images to be unshifted and
> be unpacked from within a userns.

For images whose resting format is an archive like tar, I concur.

>  So your unprivileged user would be allowed a uid/gid range through
> /etc/subuid and /etc/subgid and allowed to use them through
> newuidmap/newgidmap.In that namespace, you can then pull
> and unpack any images/layers you may want and the resulting fs tree
> will look correct from within that namespace.
> 
> All that is possible today and is how for example unprivileged LXC
> works right now.

I do have a counter example, but it might be more esoteric: I do use
unprivileged architecture emulation containers to maintain actual
physical system boot environments.  These are stored as mountable disk
images, not as archives, so I do need a simple remapping ... however, I
think this use case is simple: it's a back shift along my owned uid/gid
range, so tools for allowing unprivileged use can easily cope with this
use case, so the use is either fsid identity or fsid back along
existing user_ns mapping.

> What this patchset then allows is for containers to have differing
> uid/gid maps while still being based off the same image or layers.
> In this scenario, you would carve a subset of your main uid/gid map
> for each container you run and run them in a child user namespace
> while setting up a fsuid/fsgid map such that their filesystem access
> do not follow their uid/gid map. This then results in proper
> isolation for processes, networks, ... as everything runs as
> different kuid/kgid but the VFS view will be the same in all
> containers.

Who owns the shifted range of the image ... all tenants or none?

> Shared storage between those otherwise isolated containers would also
> work just fine by simply bind-mounting the same path into two or more
> containers.
> 
> 
> Now one additional thing that would be safe for a setuid wrapper to
> allow would be for arbitrary mapping of any of the uid/gid that the
> user owns to be used within the fsuid/fsgid map. One potential use
> for this would be to create any number of user namespaces, each with
> their own mapping for uid 0 while still having all VFS access be
> mapped to the user that spawned them (say uid=1000, gid=1000).
> 
> 
> Note that in our case, the intended use for this is from a privileged
> runtime where our images would be unshifted as would be the container
> storage and any shared storage for containers. The security model
> effectively relying on properly configured filesystem permissions and
> mount namespaces such that the content of those paths can never be
> seen by anyone but root outside of those containers (and therefore
> avoids all the issues around setuid/setgid/fscaps).

Yes, I understand ... all orchestration systems are currently hugely
privileged.  However, there is interest in getting them down to only
"slightly privileged".

James


> We will then be able to allocate distinct, random, ranges of 65536
> uids/gids (or more) for each container without ever having to do any
> uid/gid shifting at the filesystem layer or run into issues when
> having to setup shared storage between containers or attaching
> external storage volumes to those containers.
Christian Brauner Feb. 17, 2020, 11:05 p.m. UTC | #9
On Mon, Feb 17, 2020 at 02:35:38PM -0800, James Bottomley wrote:
> On Mon, 2020-02-17 at 22:20 +0100, Christian Brauner wrote:
> > On Mon, Feb 17, 2020 at 01:06:08PM -0800, James Bottomley wrote:
> > > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > > [...]
> > > > People not as familiar with user namespaces might not be aware
> > > > that fsid mappings already exist. Right now, fsid mappings are
> > > > always identical to id mappings. Specifically, the kernel will
> > > > lookup fsuids in the uid mappings and fsgids in the gid mappings
> > > > of the relevant user namespace.
> > > 
> > > This isn't actually entirely true: today we have the superblock
> > > user namespace, which can be used for fsid remapping on filesystems
> > > that support it (currently f2fs and fuse).  Since this is a single
> > > shift,
> > 
> > Note that this states "the relevant" user namespace not the caller's
> > user namespace. And the point is true even for such filesystems. fuse
> > does call make_kuid(fc->user_ns, attr->uid) and hence looks up the
> > mapping in the id mappings.. This would be replaced by make_kfsuid().
> > 
> > > how is it going to play with s_user_ns?  Do you have to understand
> > > the superblock mapping to use this shift, or are we simply using
> > > this to replace s_user_ns?
> > 
> > I'm not sure what you mean by understand the superblock mapping. The
> > case is not different from the devpts patch in this series.
> 
> So since devpts wasn't originally a s_user_ns consumer, I assume you're
> thinking that this patch series just replaces the whole of s_user_ns
> for fuse and f2fs and we can remove it?

No, as I said it's just about replacing make_kuid() with make_kfsuid().
This doesn't change anything for all cases where id mappings equal fsid
mappings and if there are separate id mappings it will look at the fsid
mappings for the user namespace in struct fuse_conn.

> 
> > Fuse needs to be changed to call make_kfsuid() since it is mountable
> > inside user namespaces at which point everthing just works.
> 
> The fuse case is slightly more complicated because there are sound
> reasons to run the daemon in a separate user namespace regardless of
> where the end fuse mount is.

I'm curious how you're doing that today as it's usually
tricky to mount across mount namespaces? In any case, this patchset
doesn't change any of that fuse logic, so things will keep working as
they do today.
Stéphane Graber Feb. 17, 2020, 11:11 p.m. UTC | #10
On Mon, Feb 17, 2020 at 6:03 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Mon, 2020-02-17 at 16:57 -0500, Stéphane Graber wrote:
> > On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <
> > James.Bottomley@hansenpartnership.com> wrote:
> >
> > > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > > [...]
> > > > With this patch series we simply introduce the ability to create
> > > > fsid mappings that are different from the id mappings of a user
> > > > namespace. The whole feature set is placed under a config option
> > > > that defaults to false.
> > > >
> > > > In the usual case of running an unprivileged container we will
> > > > > have set up an id mapping, e.g. 0 100000 100000. The on-disk
> > > > mapping will correspond to this id mapping, i.e. all files which
> > > > we want to appear as 0:0 inside the user namespace will be
> > > > chowned to 100000:100000 on the host. This works, because
> > > > whenever the kernel needs to do a filesystem access it will
> > > > lookup the corresponding uid and gid in the idmapping tables of
> > > > the container.
> > > > Now think about the case where we want to have an id mapping of 0
> > > > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
> > > > needed to e.g. share a single on-disk mapping with multiple
> > > > containers that all have different id mappings.
> > > > This will be problematic. Whenever a filesystem access is
> > > > requested, the kernel will now try to lookup a mapping for 300000
> > > > in the id mapping tables of the user namespace but since there is
> > > > none the files will appear to be owned by the overflow id, i.e.
> > > > usually 65534:65534 or nobody:nogroup.
> > > >
> > > > With fsid mappings we can solve this by writing an id mapping of
> > > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > > filesystem access the kernel will now lookup the mapping for
> > > > 300000 in the fsid mapping tables of the user namespace. And
> > > > since such a mapping exists, the corresponding files will have
> > > > correct ownership.
> > >
> > > How do we parametrise this new fsid shift for the unprivileged use
> > > case?  For newuidmap/newgidmap, it's easy because each user gets a
> > > dedicated range and everything "just works (tm)".  However, for the
> > > fsid mapping, assuming some newfsuid/newfsgid tool to help, that
> > > tool has to know not only your allocated uid/gid chunk, but also
> > > the offset map of the image.  The former is easy, but the latter is
> > > going to vary by the actual image ... well unless we standardise
> > > some accepted shift for images and it simply becomes a known static
> > > offset.
> > >
> >
> > For unprivileged runtimes, I would expect images to be unshifted and
> > be unpacked from within a userns.
>
> For images whose resting format is an archive like tar, I concur.
>
> >  So your unprivileged user would be allowed a uid/gid range through
> > /etc/subuid and /etc/subgid and allowed to use them through
> > > newuidmap/newgidmap. In that namespace, you can then pull
> > and unpack any images/layers you may want and the resulting fs tree
> > will look correct from within that namespace.
> >
> > All that is possible today and is how for example unprivileged LXC
> > works right now.
>
> I do have a counter example, but it might be more esoteric: I do use
> unprivileged architecture emulation containers to maintain actual
> physical system boot environments.  These are stored as mountable disk
> images, not as archives, so I do need a simple remapping ... however, I
> think this use case is simple: it's a back shift along my owned uid/gid
> range, so tools for allowing unprivileged use can easily cope with this
> use case, so the use is either fsid identity or fsid back along
> existing user_ns mapping.
>
> > What this patchset then allows is for containers to have differing
> > uid/gid maps while still being based off the same image or layers.
> > In this scenario, you would carve a subset of your main uid/gid map
> > for each container you run and run them in a child user namespace
> > while setting up a fsuid/fsgid map such that their filesystem accesses
> > do not follow their uid/gid map. This then results in proper
> > isolation for processes, networks, ... as everything runs as
> > different kuid/kgid but the VFS view will be the same in all
> > containers.
>
> Who owns the shifted range of the image ... all tenants or none?

I would expect the most common case to be none of them.
So you'd have a uid/gid range carved out of your own allocation which is
used to unpack images, let's call that the image map.

Your containers would then use a uid/gid map which is distinct from that map
and distinct from each other but all using the image map as their
fsuid/fsgid map.

This will make the VFS behave in a normal way and would also allow for
shared paths between those containers by using a shared directory
through bind-mount which is also owned by a uid/gid in that image range.
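
To make that layout concrete, here is a small userspace model in Python. It
is a sketch under assumptions: the ranges (300000 for the image map, 100000
and 200000 for the two containers, each 65536 long), the container names, and
the helper functions are all made up for illustration and are not taken from
the patchset.

```python
OVERFLOW_ID = 65534  # what unmapped owners show up as (usually "nobody")


def map_down(extent, nsid):
    """In-namespace id -> kernel id; None if the id is not mapped."""
    first, lower_first, count = extent
    if first <= nsid < first + count:
        return lower_first + (nsid - first)
    return None


def map_up(extent, kid):
    """Kernel id -> in-namespace id; the overflow id if not mapped."""
    first, lower_first, count = extent
    if lower_first <= kid < lower_first + count:
        return first + (kid - lower_first)
    return OVERFLOW_ID


# Range used only to unpack images (hypothetical numbers).
image_map = (0, 300000, 65536)

# Two containers with distinct uid maps, both using the image map
# as their fsuid map.
containers = {
    "c1": {"uid_map": (0, 100000, 65536), "fsuid_map": image_map},
    "c2": {"uid_map": (0, 200000, 65536), "fsuid_map": image_map},
}

# A file unpacked as uid 1000 through the image map lands on disk as:
on_disk_owner = map_down(image_map, 1000)


def seen_owner(name):
    """Owner of that file as seen from inside the named container."""
    return map_up(containers[name]["fsuid_map"], on_disk_owner)


def process_kuid(name, uid):
    """Kernel uid a process with the given in-container uid runs as."""
    return map_down(containers[name]["uid_map"], uid)
```

Both containers see the file as owned by uid 1000, while root in c1 and root
in c2 run as different kernel uids. Looking the same file up through c1's
uid map instead of its fsuid map yields the overflow id, which is the case
described in the cover letter.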

> > Shared storage between those otherwise isolated containers would also
> > work just fine by simply bind-mounting the same path into two or more
> > containers.
> >
> >
> > Now one additional thing that would be safe for a setuid wrapper to
> > allow would be for arbitrary mapping of any of the uid/gid that the
> > user owns to be used within the fsuid/fsgid map. One potential use
> > for this would be to create any number of user namespaces, each with
> > their own mapping for uid 0 while still having all VFS access be
> > mapped to the user that spawned them (say uid=1000, gid=1000).
> >
> >
> > Note that in our case, the intended use for this is from a privileged
> > runtime where our images would be unshifted as would be the container
> > storage and any shared storage for containers. The security model
> > effectively relying on properly configured filesystem permissions and
> > mount namespaces such that the content of those paths can never be
> > seen by anyone but root outside of those containers (and therefore
> > avoids all the issues around setuid/setgid/fscaps).
>
> Yes, I understand ... all orchestration systems are currently hugely
> privileged.  However, there is interest in getting them down to only
> "slightly privileged".
>
> James
>
>
> > We will then be able to allocate distinct, random, ranges of 65536
> > uids/gids (or more) for each container without ever having to do any
> > uid/gid shifting at the filesystem layer or run into issues when
> > having to set up shared storage between containers or attaching
> > external storage volumes to those containers.