Message ID: 20201029003252.2128653-1-christian.brauner@ubuntu.com
Series: fs: idmapped mounts
On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> Hey everyone,
>
> I vanished for a little while to focus on this work here, so sorry for
> not being available by mail for a while.
>
> For quite a long time we have had issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
>
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay-based filesystem, and an approach that
> calls override creds in the vfs. None of these solutions has covered all
> of the above use-cases.
>
> The solution proposed here has its origins in multiple discussions
> during Linux Plumbers 2017, during and after the end of the containers
> microconference. To the best of my knowledge this involved Aleksa,
> Stéphane, Eric, David, James, and myself. A variant of the solution
> proposed here has also been discussed, again to the best of my
> knowledge, after a Linux conference in St. Petersburg, Russia between
> Christoph, Tycho, and myself in 2017 after Linux Plumbers. I've taken
> the time to finally implement a working version of this solution over
> the last weeks to the best of my abilities. Tycho has signed up for this
> slightly crazy endeavour as well, and he has helped with the conversion
> of the xattr codepaths.
>
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying them to a process being inside of a user namespace,
> which has been the case for all other proposed approaches. It means that
> idmappings become a property of bind-mounts, i.e. each bind-mount can
> have a separate idmapping. This has the obvious advantage that idmapped
> mounts can be created inside of the initial user namespace, i.e. on the
> host itself, instead of requiring the caller to be located inside of a
> user namespace. This enables use-cases such as making a usb stick
> available in multiple locations with different idmappings (see the vfat
> port that is part of this patch series).
>
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem, or privileged with respect
> to the user namespace a mount has been idmapped with, can create a new
> bind-mount and mark it with a user namespace. The user namespace the
> mount will be marked with can be specified by passing a file descriptor
> referring to the user namespace as an argument to the new
> mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. By
> default vfsmounts are marked with the initial user namespace, and no
> behavioral or performance changes should be observed. All mapping
> operations are no-ops for the initial user namespace.
>
> When a file/inode is accessed through an idmapped mount, the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller, they will similarly be remapped according
> to the user namespace of the mount they are created from.
>
> This means the user namespace of the mount needs to be passed down into
> a few relevant inode_operations. This mostly includes inode operations
> that create filesystem objects or change file attributes.

That's really quite ... messy.
Maybe I'm missing something, but if you have the user_ns to be used for the
VFS operation we are about to execute, then why can't we use the same model
as current_fsuid()/current_fsgid() for passing the filesystem credentials
down to the filesystem operations? i.e. attach it to
current->cred->fs_userns, and then the filesystem code that actually needs
to know the current userns can call current_fs_user_ns() instead of
current_user_ns(), i.e.:

	#define current_fs_user_ns() \
		(current->cred->fs_userns ? current->cred->fs_userns \
					  : current->cred->userns)

At this point, the filesystem will now always have the correct userns it is
supposed to use for mapping the uid/gid, right?

Also, if we are passing work off to worker threads, duplicating the current
creds will capture this information and won't leave random landmines where
stuff doesn't work as it should because the worker thread is unaware of the
userns that it is supposed to be doing filesystem operations under...

Cheers,

Dave.
On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> Hey everyone,
>
> [...]
>
> This means the user namespace of the mount needs to be passed down into
> a few relevant inode_operations. This mostly includes inode operations
> that create filesystem objects or change file attributes. Some of them,
> such as ->getattr(), don't even need to change, since they pass down a
> struct path and thus the struct vfsmount is already available. Other
> inode operations need to be adapted to pass down the user namespace the
> vfsmount has been marked with. Al was nice enough to point out that he
> will not tolerate struct vfsmount being passed to filesystems and that I
> should pass down the user namespace directly; which is what I did. The
> inode struct itself is never altered whenever the i_uid and i_gid need
> to be mapped, i.e. i_uid and i_gid are only remapped at the time of the
> check. An inode once initialized (during lookup or object creation) is
> never altered when accessed through an idmapped mount.
>
> To limit the amount of noise in this first iteration we have not changed
> the existing inode operations but rather introduced a few new struct
> inode operation methods, such as ->mkdir_mapped, which pass down the
> user namespace of the mount they have been called from. Should this
> solution be worth pursuing, we have no problem adapting the existing
> inode operations instead.
>
> In order to support idmapped mounts, filesystems need to be changed and
> mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. In this first
> iteration I tried to illustrate this by changing three different
> filesystems with different levels of complexity, with some bias towards
> urgent use-cases and filesystems I was at least a little more familiar
> with. However, Tycho and I (and others) have no problem converting each
> filesystem one-by-one. This first iteration includes fat (msdos and
> vfat), ext4, and overlayfs (both with idmapped lower and upper
> directories and idmapped merged directories). I'm sure I haven't gotten
> everything right for all three of them in the first version of this
> patch.

Thanks for this patchset. It's been a long time coming. I'm curious how much, for most cases, the new fs mount APIs help, and whether focusing on those could solve the problem for everything other than bind mounts?
Specifically, the idea of doing fsopen (creation of the fs_context) under the user namespace in question, and relying on a user with CAP_SYS_ADMIN to call fsmount [1]. I think this is actually especially valuable for places like overlayfs that use the entire cred object, as opposed to just the uid/gid. I imagine that soon most filesystems will support the new mount APIs, and won't set the global flag if they don't need to. How popular is the "vfsmount (bind mounts) needs different uid mappings" use case?

The other thing I worry about is the "What UID are you really?" game that's been a thing recently. For example, you can have a different user namespace UID mapping for your network namespace that netfilter checks [2], a different one for your mount namespace, and a different one that the process is actually in. This proliferation of different mappings makes auditing, and doing things like writing perf tooling, more difficult (since I think bpf_get_current_uid_gid still uses the initial user namespace [3]).

[1]: https://lore.kernel.org/linux-nfs/20201016123745.9510-4-sargun@sargun.me/T/#u
[2]: https://elixir.bootlin.com/linux/v5.9.1/source/net/netfilter/xt_owner.c#L37
[3]: https://elixir.bootlin.com/linux/v5.9.1/source/kernel/bpf/helpers.c#L196
Christian Brauner <christian.brauner@ubuntu.com> writes:
> Hey everyone,
>
> [...]
>
> For quite a long time we have had issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.

Can you walk us through the motivating use case?

As of this year's LPC I had the distinct impression that the primary use case for such a feature was due to the RLIMIT_NPROC problem, where two containers with the same users still wanted different uid mappings to the disk because the users were conflicting with each other because of the per-user rlimits. Fixing rlimits is straightforward to implement, and easier to manage for implementations and administrators.

Reading up on systemd-homed, it appears to be a way to have encrypted home directories. Those home directories can be encrypted either at the fs or at the block level, and they appear to have the goal of being luggable between systems. If the systems in question don't have common administration of uids and gids, then after lugging your encrypted home directory to another system, chowning the files is required. Is that the use case you are looking at: removing the need for systemd-homed to chown after lugging encrypted home directories from one system to another? Why would it be desirable to avoid the chown?

If the goal is to solve fragmented administration of uid assignment, I suggest that it might be better to solve the administration problem so that all of the uids of interest get assigned the same way on all of the systems of interest. To the extent it is possible to solve the uid assignment problem, that would seem to have fewer long-term administrative problems.

Eric
On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> > [...]
>
> Can you walk us through the motivating use case?
>
> As of this year's LPC I had the distinct impression that the primary use
> case for such a feature was due to the RLIMIT_NPROC problem where two
> containers with the same users still wanted different uid mappings to
> the disk because the users were conflicting with each other because of
> the per user rlimits.
>
> Fixing rlimits is straight forward to implement, and easier to manage
> for implementations and administrators.

This is separate to the question of "isolated user namespaces" and managing different mappings between containers. This patchset is solving the same problem that shiftfs solved -- sharing a single directory tree between containers that have different ID mappings. Neither rlimits nor any of the other proposals we discussed at LPC will help with this problem.
On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote:
> Is that the use case you are looking at removing the need for
> systemd-homed to avoid chowning after lugging encrypted home directories
> from one system to another? Why would it be desirable to avoid the
> chown?

Yes, I am very interested in seeing Christian's work succeed, for the use case in systemd-homed. In systemd-homed each user gets their own private file system, and these filesystems shall be owned by the user's local UID, regardless of which system they are used on. The UID should be an artifact of the local, individual system in this model: the UID of the same user/home might be picked as 1010 on system A, as 1543 on another, and as 1323 on a third, and it shouldn't matter. This way, home directories become migratable without having to universally sync UID assignments: it doesn't matter anymore what the local UID is.

Right now we do a recursive chown() at login time to ensure the home dir is properly owned. This has two disadvantages:

1. It's slow. In particular on large home dirs, it takes a while to go
   through the whole user's homedir tree and chown/adjust ACLs for
   everything.

2. Because it is so slow, we take a shortcut right now: if the top-level
   home dir inode itself is owned by the correct user, we skip the
   recursive chowning. This means in the typical case, where a user uses
   the same system most of the time and thus the UID is stable, we can
   avoid the slowness. But this comes at a drawback: if the user for some
   reason ends up with files in their homedir owned by an unrelated user,
   then we'll never notice or readjust.

> If the goal is to solve fragmented administration of uid assignment I
> suggest that it might be better to solve the administration problem so
> that all of the uids of interest get assigned the same way on all of the
> systems of interest.

Well, the goal is to make things simple and be able to use the home dir everywhere without any prior preparation, without a central UID assignment authority. The goal is to have a scheme that requires no administration, by making the UID management problem go away. Hence, if you suggest solving this by having a central administrative authority: this is exactly what the model wants to get away from.

Or to say this differently: just because I personally use three different computers, I certainly don't want to set up LDAP or sync UIDs manually.

Lennart

--
Lennart Poettering, Berlin
On Thu, Oct 29, 2020 at 01:27:33PM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> > Hey everyone,
> >
> > [...]
> >
> > This means the user namespace of the mount needs to be passed down into
> > a few relevant inode_operations. This mostly includes inode operations
> > that create filesystem objects or change file attributes.
>
> That's really quite ... messy.

I don't agree. It changes things all across the vfs, but it's not hacky in any way, since it cleanly passes down an additional argument (I'm biased, of course).

> Maybe I'm missing something, but if you have the user_ns to be used
> for the VFS operation we are about to execute then why can't we use
> the same model as current_fsuid/current_fsgid() for passing the
> filesystem credentials down to the filesystem operations? i.e.
> attach it to the current->cred->fs_userns, and then the filesystem
> code that actually needs to know the current userns can call
> current_fs_user_ns() instead of current_user_ns(). i.e.
>
> #define current_fs_user_ns() \
> 	(current->cred->fs_userns ? current->cred->fs_userns \
> 				  : current->cred->userns)
>
> At this point, the filesystem will now always have the correct
> userns it is supposed to use for mapping the uid/gid, right?

Thanks for this interesting idea! I have some trouble with it, though. This approach has (always) seemed conceptually wrong to me. Like Tycho said elsewhere, this would basically act like a global variable, which isn't great. There's also a substantial difference in that the current fsuid and fsgid are an actual property of the caller's creds, so having them in there makes perfect sense. But the user namespace of the vfsmount is a property of the mount, and as such gluing it to the caller's creds when calling into the vfs is just weird, and I would very much like to avoid it. If inodes didn't have an i_sb member, we wouldn't suddenly start to pass down the s_user_ns to the filesystems via the caller's creds.

I'm also not fond of having to call prepare_creds() and override_creds() all across the vfs. It's messy, and prepare_creds() is especially problematic during RCU pathwalk, where we can't call it.
We could call prepare_creds() in path_init() at the start of every lookup operation, override the creds when we need to switch the fs_userns global variable, and then put the creds at the end of every path walk in terminate_walk(). But this means penalizing every lookup operation with an additional prepare_creds() call, which needs to happen at least once, I think. Then, during lookup, we would need to override/change this new global fs_userns variable, potentially at each mountpoint crossing, to switch back to the correct fs_userns for idmapped and non-idmapped mounts. We'd also need to rearrange a bunch of terminate_walk() calls, and we would likely end up with a comparable amount of changes, only that they would indeed be messier, since we're strapping fs_userns to the caller's creds.

An alternative might be a combined approach where we pass the user namespace around in the vfs but use the fs_userns global variable approach when calling into the filesystem. But I would very much prefer to avoid this and instead cleanly pass down the user namespace correctly. That's more work and it'll take longer, but I and others are around to see these changes through.

> Also, if we are passing work off to worker threads, duplicating
> the current creds will capture this information and won't leave
> random landmines where stuff doesn't work as it should because the
> worker thread is unaware of the userns that it is supposed to be
> doing filesystem operations under...

That seems like a problem that can be handled by simply keeping the userns around, similar to how we already need to keep creds around.

Christian
On Thu, Oct 29, 2020 at 10:12:31AM -0600, Tycho Andersen wrote:
> Hi Eric,
>
> On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
> > Christian Brauner <christian.brauner@ubuntu.com> writes:
> > > [...]
> >
> > Can you walk us through the motivating use case?
> >
> > As of this year's LPC I had the distinct impression that the primary use
> > case for such a feature was due to the RLIMIT_NPROC problem, where two
> > containers with the same users still wanted different uid mappings to
> > the disk because the users were conflicting with each other because of
> > the per-user rlimits.
> >
> > Fixing rlimits is straightforward to implement, and easier to manage
> > for implementations and administrators.
>
> Our use case is to have the same directory exposed to several
> different containers which each have disjoint ID mappings.
>
> > [...]
> >
> > Is that the use case you are looking at: removing the need for
> > systemd-homed to chown after lugging encrypted home directories
> > from one system to another? Why would it be desirable to avoid the
> > chown?
>
> Not just systemd-homed, but LXD has to do this, as does our
> application at Cisco, and presumably others.
>
> Several reasons:
>
> * the chown is slow
> * the chown requires somewhere to write the delta in metadata (e.g. an
>   overlay workdir, or an LV or something), and there are N copies of
>   this delta, one for each container
> * it means we need to have a +w filesystem at some point during
>   execution
> * it's ugly :). Conceptually, the kernel solves the uid shifting
>   problem for us for most other kernel subsystems (including, in a
>   limited way, fscaps) by configuring a user namespace. It feels like
>   we should be able to do the same with the VFS.

And chown prevents the same inode from being shared by different containers through different id mappings. You can overlay, but then they can't actually share updates.

-serge
On Thu, Oct 29, 2020 at 05:05:02PM +0100, Lennart Poettering wrote:
> On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote:
> > [...]
>
> [...]
>
> Or to say this differently: just because I personally use three
> different computers, I certainly don't want to set up LDAP or sync
> UIDs manually.

Can you help me understand systemd-homed a little bit? The man page says:

    systemd-homed is a system service that may be used to create, remove,
    change or inspect home areas (directories and network mounts and real
    or loopback block devices with a filesystem, optionally encrypted).

It seems that the "underlay" (if you'll call it that; maybe there is a better term) can either be a standalone block device (this sounds close to systemd-machined?), a btrfs subvolume (which receives its own superblock, IIRC — I might be wrong, it's been a while since I've used btrfs), or just a directory that's mapped. What decides whether it's just a directory and bind-mounted (or a similar vfsmount), or an actual superblock?

How does the mapping of "real UIDs" to "namespace UIDs" work when it's just a bind mount? From the perspective of multiple user namespaces, are all "underlying" UIDs mapped through, or if I try to look at another user's home directory will they not show up?

Is there a reason you can't / don't / won't use overlayfs instead of bind mounts?
Aleksa Sarai <cyphar@cyphar.com> writes: > On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Christian Brauner <christian.brauner@ubuntu.com> writes: > >> > >> > Hey everyone, > >> > > >> > I vanished for a little while to focus on this work here so sorry for > >> > not being available by mail for a while. > >> > > >> > Since quite a long time we have issues with sharing mounts between > >> > multiple unprivileged containers with different id mappings, sharing a > >> > rootfs between multiple containers with different id mappings, and also > >> > sharing regular directories and filesystems between users with different > >> > uids and gids. The latter use-cases have become even more important with > >> > the availability and adoption of systemd-homed (cf. [1]) to implement > >> > portable home directories. > >> > >> Can you walk us through the motivating use case? > >> > >> As of this year's LPC I had the distinct impression that the primary use > >> case for such a feature was due to the RLIMIT_NPROC problem where two > >> containers with the same users still wanted different uid mappings to > >> the disk because the users were conflicting with each other because of > >> the per user rlimits. > >> > >> Fixing rlimits is straight forward to implement, and easier to manage > >> for implementations and administrators. > > > > This is separate to the question of "isolated user namespaces" and > > managing different mappings between containers. This patchset is solving > > the same problem that shiftfs solved -- sharing a single directory tree > > between containers that have different ID mappings. rlimits (nor any of > > the other proposals we discussed at LPC) will help with this problem. First and foremost: A uid shift on write to a filesystem is a security bug waiting to happen. This is especially in the context of facilities like io_uring, that play very aggressive games with how process context makes it to system calls. 
The only reason containers were not immediately exploitable when io_uring was introduced is because the mechanisms are built so that even if something escapes containment the security properties still apply. Changes to the uid when writing to the filesystem do not have that property. The tiniest slip in containment will be a security issue. This is not even the least bit theoretical. I have seen reports of how shiftfs+overlayfs created a situation where anyone could read /etc/shadow. If you are going to write using the same uid to disk from different containers the question becomes why can't those containers configure those users to use the same kuid? What fixing rlimits does is it fixes one of the reasons that different containers could not share the same kuid for users that want to write to disk with the same uid. I humbly suggest that it will be more secure, and easier to maintain for both developers and users if we fix the reasons people want different containers to have the same user running with different kuids. If not what are the reasons we fundamentally need the same on-disk user using multiple kuids in the kernel? Eric
Tycho Andersen <tycho@tycho.pizza> writes: > Hi Eric, > > On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote: >> Christian Brauner <christian.brauner@ubuntu.com> writes: >> >> > Hey everyone, >> > >> > I vanished for a little while to focus on this work here so sorry for >> > not being available by mail for a while. >> > >> > Since quite a long time we have issues with sharing mounts between >> > multiple unprivileged containers with different id mappings, sharing a >> > rootfs between multiple containers with different id mappings, and also >> > sharing regular directories and filesystems between users with different >> > uids and gids. The latter use-cases have become even more important with >> > the availability and adoption of systemd-homed (cf. [1]) to implement >> > portable home directories. >> >> Can you walk us through the motivating use case? >> >> As of this year's LPC I had the distinct impression that the primary use >> case for such a feature was due to the RLIMIT_NPROC problem where two >> containers with the same users still wanted different uid mappings to >> the disk because the users were conflicting with each other because of >> the per user rlimits. >> >> Fixing rlimits is straight forward to implement, and easier to manage >> for implementations and administrators. > > Our use case is to have the same directory exposed to several > different containers which each have disjoint ID mappings. Why do you have disjoint ID mappings for the users that are writing to disk with the same ID? >> Reading up on systemd-homed it appears to be a way to have encrypted >> home directories. Those home directories can either be encrypted at the >> fs or at the block level. Those home directories appear to have the >> goal of being luggable between systems. If the systems in question >> don't have common administration of uids and gids after lugging your >> encrypted home directory to another system chowning the files is >> required. 
>> >> Is that the use case you are looking at removing the need for >> systemd-homed to avoid chowning after lugging encrypted home directories >> from one system to another? Why would it be desirable to avoid the >> chown? > > Not just systemd-homed, but LXD has to do this, I asked why the same disk users are assigned different kuids and the only reason I have heard that LXD does this is the RLIMIT_NPROC problem. Perhaps there is another reason. In part this is why I am eager to hear people's use cases, and why I was trying very hard to make certain we get the requirements. I want the real requirements though and some thought, not just we did this and it hurts. Changing the uids on write is a very hard problem, and not just in implementing it but also in maintaining and understanding what is going on. Eric
Lennart Poettering <lennart@poettering.net> writes: > On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote: > >> Is that the use case you are looking at removing the need for >> systemd-homed to avoid chowning after lugging encrypted home directories >> from one system to another? Why would it be desirable to avoid the >> chown? > > Yes, I am very interested in seeing Christian's work succeed, for the > usecase in systemd-homed. In systemd-homed each user gets their own > private file system, and these fs shall be owned by the user's local > UID, regardless of which system it is used on. The UID should be an > artifact of the local, individual system in this model, and thus > the UID of the same user/home on system A might be picked as 1010 > and on another as 1543, and on a third as 1323, and it shouldn't > matter. This way, home directories become migratable without having to > universally sync UID assignments: it doesn't matter anymore what the > local UID is. > > Right now we do a recursive chown() at login time to ensure the home > dir is properly owned. This has two disadvantages: > > 1. It's slow. In particular on large home dirs, it takes a while to go > through the whole user's homedir tree and chown/adjust ACLs for > everything. > > 2. Because it is so slow we take a shortcut right now: if the > top-level home dir inode itself is owned by the correct user, we > skip the recursive chowning. This means in the typical case where a > user uses the same system most of the time, and thus the UID is > stable we can avoid the slowness. But this comes at a drawback: if > the user for some reason ends up with files in their homedir owned > by an unrelated user, then we'll never notice or readjust. The classic solutions to this problem for removable media are the uid=XXX and gid=XXX mount options. I suspect a similar solution can apply here. I don't think you need a solution that requires different kuids to be able to write to the same filesystem uid. 
>> If the goal is to solve fragmented administration of uid assignment I >> suggest that it might be better to solve the administration problem so >> that all of the uids of interest get assigned the same way on all of the >> systems of interest. > > Well, the goal is to make things simple and be able to use the home > dir everywhere without any prior preparation, without central UID > assignment authority. > > The goal is to have a scheme that requires no administration, by > making the UID management problem go away. Hence, if you suggest > solving this by having a central administrative authority: this is > exactly what the model wants to get away from. For files that can be accessed by more than a single user this is fundamentally necessary. Otherwise group permissions and ACLs cannot work. They wind up as meaningless garbage, because without some kind of synchronization those other users and groups simply cannot be represented. > Or to say this differently: just because I personally use three > different computers, I certainly don't want to set up LDAP or sync > UIDs manually. If they are single-user systems why should you need to? But if permissions on files are going to be at all meaningful it is fundamentally a requirement that there be no confusion about which party the other parties are talking about. To the best of my knowledge syncing uids/usernames between machines is as simple as it can get. Eric
On Thu, Oct 29, 2020 at 12:45 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Tycho Andersen <tycho@tycho.pizza> writes: > > > Hi Eric, > > > > On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote: > >> Christian Brauner <christian.brauner@ubuntu.com> writes: > >> > >> > Hey everyone, > >> > > >> > I vanished for a little while to focus on this work here so sorry for > >> > not being available by mail for a while. > >> > > >> > Since quite a long time we have issues with sharing mounts between > >> > multiple unprivileged containers with different id mappings, sharing a > >> > rootfs between multiple containers with different id mappings, and also > >> > sharing regular directories and filesystems between users with different > >> > uids and gids. The latter use-cases have become even more important with > >> > the availability and adoption of systemd-homed (cf. [1]) to implement > >> > portable home directories. > >> > >> Can you walk us through the motivating use case? > >> > >> As of this year's LPC I had the distinct impression that the primary use > >> case for such a feature was due to the RLIMIT_NPROC problem where two > >> containers with the same users still wanted different uid mappings to > >> the disk because the users were conflicting with each other because of > >> the per user rlimits. > >> > >> Fixing rlimits is straight forward to implement, and easier to manage > >> for implementations and administrators. > > > > Our use case is to have the same directory exposed to several > > different containers which each have disjoint ID mappings. > > Why do the you have disjoint ID mappings for the users that are writing > to disk with the same ID? > > >> Reading up on systemd-homed it appears to be a way to have encrypted > >> home directories. Those home directories can either be encrypted at the > >> fs or at the block level. Those home directories appear to have the > >> goal of being luggable between systems. 
If the systems in question > >> don't have common administration of uids and gids after lugging your > >> encrypted home directory to another system chowning the files is > >> required. > >> > >> Is that the use case you are looking at removing the need for > >> systemd-homed to avoid chowning after lugging encrypted home directories > >> from one system to another? Why would it be desirable to avoid the > >> chown? > > > > Not just systemd-homed, but LXD has to do this, > > I asked why the same disk users are assigned different kuids and the > only reason I have heard that LXD does this is the RLIMIT_NPROC problem. > > Perhaps there is another reason. > > In part this is why I am eager to hear people's use cases, and why I was > trying very hard to make certain we get the requirements. > > I want the real requirements though and some thought, not just we did > this and it hurts. Changing the uids on write is a very hard problem, > and not just in implementing it but also in maintaining and > understanding what is going on. The most common cases where shiftfs is used or where folks would like to use it today are (in order of importance): - Fast container creation (by not having to uid/gid shift all files in the downloaded image) - Sharing data between the host system and a container (some paths under /home being the most common) - Sharing data between unprivileged containers with a disjointed map - Sharing data between multiple containers, some privileged, some unprivileged Fixing the ulimit issue only takes care of one of those (the 3rd item); it does not solve any of the other cases. The first item on there alone can be quite significant. 
Creation and startup of a regular Debian container on my system takes around 500ms when shiftfs is used (btrfs/lvm/zfs copy-on-write clone of the image, setup shiftfs, start container) compared to 2-3s when running without it (same clone, followed by rewrite of all uid/gid present on the fs, including acls and capabilities, then start container). And that's on a fast system with an NVME SSD and a small rootfs. We have had reports of a few users running on slow spinning rust with large containers where shifting can take several minutes. The second item can technically be worked around without shifted bind-mounts by doing userns map hole punching, mapping the user's uid/gid from the host straight into the container. The downside to this is that another shifting pass becomes needed for any file outside of the bind-mounted path (or it would become owned by -1/-1) and it's very much not dynamic, requiring the container be stopped, config updated by the user, /etc/subuid and subgid maps being updated and container started back up. If you need another user/group be exposed, start all over again... This is far more complex, slow and disruptive than the shifted approach where we just need to do: lxc config device add MY-CONTAINER home disk source=/home path=/home shift=true To inject a new mount of /home from the host into the container with a shifting layer in place, no need to reconfig subuid/subgid, no need to re-create the userns to update the mapping and no need to go through the container's rootfs for any file which may now need remapping because of the map change. Stéphane > Eric > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers
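The "rewrite of all uid/gid present on the fs" that Stéphane describes is, per id, simple arithmetic against the container's delegated range; the expense lies entirely in having to touch every inode (plus ACLs and fscaps). A sketch of the per-id math, assuming a single contiguous range as in a typical /etc/subuid entry like `root:100000:65536` (the function names are invented for illustration):

```c
/* Sketch of the uid arithmetic behind LXD-style "shifting" of a rootfs:
 * every on-disk id is rewritten by the container's subuid base offset.
 * Assumption: one contiguous delegated range; real LXD also rewrites
 * ACLs and file capabilities, which is part of why it is slow. */
#include <assert.h>
#include <sys/types.h>

#define SUBUID_BASE  100000u  /* first host uid of the container's range */
#define SUBUID_COUNT 65536u   /* number of ids in the range */

/* Map a container-side uid to the host uid written to disk.
 * Returns (uid_t)-1 if the id falls outside the delegated range. */
uid_t shift_to_host(uid_t container_uid)
{
    if (container_uid >= SUBUID_COUNT)
        return (uid_t)-1;     /* would show up as nobody/-1 */
    return SUBUID_BASE + container_uid;
}

/* Inverse mapping: host uid on disk back to the container's view. */
uid_t shift_to_container(uid_t host_uid)
{
    if (host_uid < SUBUID_BASE || host_uid >= SUBUID_BASE + SUBUID_COUNT)
        return (uid_t)-1;
    return host_uid - SUBUID_BASE;
}
```

With a shifted mount the same translation happens lazily at access time instead of being baked into every inode up front, which is where the 500ms-vs-minutes difference Stéphane quotes comes from.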
> On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote: > > Hey everyone, > > I vanished for a little while to focus on this work here so sorry for > not being available by mail for a while. > > Since quite a long time we have issues with sharing mounts between > multiple unprivileged containers with different id mappings, sharing a > rootfs between multiple containers with different id mappings, and also > sharing regular directories and filesystems between users with different > uids and gids. The latter use-cases have become even more important with > the availability and adoption of systemd-homed (cf. [1]) to implement > portable home directories. > > The solutions we have tried and proposed so far include the introduction > of fsid mappings, a tiny overlay based filesystem, and an approach to > call override creds in the vfs. None of these solutions have covered all > of the above use-cases. > > The solution proposed here has it's origins in multiple discussions > during Linux Plumbers 2017 during and after the end of the containers > microconference. > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David, > James, and myself. A variant of the solution proposed here has also been > discussed, again to the best of my knowledge, after a Linux conference > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017 > after Linux Plumbers. > I've taken the time to finally implement a working version of this > solution over the last weeks to the best of my abilities. Tycho has > signed up for this sligthly crazy endeavour as well and he has helped > with the conversion of the xattr codepaths. > > The core idea is to make idmappings a property of struct vfsmount > instead of tying it to a process being inside of a user namespace which > has been the case for all other proposed approaches. > It means that idmappings become a property of bind-mounts, i.e. each > bind-mount can have a separate idmapping. 
This has the obvious advantage > that idmapped mounts can be created inside of the initial user > namespace, i.e. on the host itself instead of requiring the caller to be > located inside of a user namespace. This enables such use-cases as e.g. > making a usb stick available in multiple locations with different > idmappings (see the vfat port that is part of this patch series). > > The vfsmount struct gains a new struct user_namespace member. The > idmapping of the user namespace becomes the idmapping of the mount. A > caller that is either privileged with respect to the user namespace of > the superblock of the underlying filesystem or a caller that is > privileged with respect to the user namespace a mount has been idmapped > with can create a new bind-mount and mark it with a user namespace. So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside. For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege. Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this. Inside the namespace, the user creates a suid-root file. Now, outside the namespace, the user has privilege over the namespace. (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.) So the user makes a new bind mount and idmaps it to the init namespace. Game over. 
So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace. We already do the latter for the vfsmnt’s mntns’s userns. Hmm. What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns? Is that too limiting? I hope that whatever solution gets used is straightforward enough to wrap one’s head around. > When a file/inode is accessed through an idmapped mount the i_uid and > i_gid of the inode will be remapped according to the user namespace the > mount has been marked with. When a new object is created based on the > fsuid and fsgid of the caller they will similarly be remapped according > to the user namespace of the mount they are created from. By “mapped according to”, I presume you mean that the on-disk uid/gid is the uid/gid as seen in the user namespace in question.
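The remapping described in the quoted text is the same kind of range table that /proc/<pid>/uid_map uses: each entry gives a first id on one side, the corresponding first id on the other side, and a count. A toy userspace model of the lookup, not the kernel's implementation (the struct and function names here are invented for illustration):

```c
/* Toy model of an idmapping table as in /proc/<pid>/uid_map: each
 * extent is <id-inside> <id-outside> <count>. An idmapped mount
 * applies this kind of table when presenting i_uid/i_gid.
 * Illustration only, not the kernel's data structures. */
#include <assert.h>
#include <sys/types.h>

struct id_extent {
    uid_t inside;   /* first id as seen through the mapping */
    uid_t outside;  /* corresponding first id on the other side */
    uid_t count;    /* length of the extent */
};

/* Translate an "outside" (e.g. on-disk) id to its "inside" view.
 * Returns (uid_t)-1, the overflow id, when no extent matches. */
uid_t map_id_down(const struct id_extent *map, int nr, uid_t outside)
{
    for (int i = 0; i < nr; i++)
        if (outside >= map[i].outside &&
            outside - map[i].outside < map[i].count)
            return map[i].inside + (outside - map[i].outside);
    return (uid_t)-1;
}
```

Unmapped ids falling through to the overflow id is exactly the "owned by -1/-1" behavior mentioned elsewhere in this thread for files outside the mapped range.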
On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote: > Aleksa Sarai <cyphar@cyphar.com> writes: > > > On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Christian Brauner <christian.brauner@ubuntu.com> writes: > >> > >> > Hey everyone, > >> > > >> > I vanished for a little while to focus on this work here so sorry for > >> > not being available by mail for a while. > >> > > >> > Since quite a long time we have issues with sharing mounts between > >> > multiple unprivileged containers with different id mappings, sharing a > >> > rootfs between multiple containers with different id mappings, and also > >> > sharing regular directories and filesystems between users with different > >> > uids and gids. The latter use-cases have become even more important with > >> > the availability and adoption of systemd-homed (cf. [1]) to implement > >> > portable home directories. > >> > >> Can you walk us through the motivating use case? > >> > >> As of this year's LPC I had the distinct impression that the primary use > >> case for such a feature was due to the RLIMIT_NPROC problem where two > >> containers with the same users still wanted different uid mappings to > >> the disk because the users were conflicting with each other because of > >> the per user rlimits. > >> > >> Fixing rlimits is straight forward to implement, and easier to manage > >> for implementations and administrators. > > > > This is separate to the question of "isolated user namespaces" and > > managing different mappings between containers. This patchset is solving > > the same problem that shiftfs solved -- sharing a single directory tree > > between containers that have different ID mappings. rlimits (nor any of > > the other proposals we discussed at LPC) will help with this problem. > > First and foremost: A uid shift on write to a filesystem is a security > bug waiting to happen. 
This is especially in the context of facilities > like io_uring, that play very aggressive games with how process context > makes it to system calls. > > The only reason containers were not immediately exploitable when io_uring > was introduced is because the mechanisms are built so that even if > something escapes containment the security properties still apply. > Changes to the uid when writing to the filesystem do not have that > property. The tiniest slip in containment will be a security issue. > > This is not even the least bit theoretical. I have seen reports of how > shiftfs+overlayfs created a situation where anyone could read > /etc/shadow. > > If you are going to write using the same uid to disk from different > containers the question becomes why can't those containers configure > those users to use the same kuid? Because if user 'myapp' in two otherwise isolated containers has the same kuid in both, so that they can write to a shared directory, then root in container 1 has privilege over all files owned by 'myapp' in container 2. Whereas if they each have distinct kuids, but when writing to the shared fs use a shared uid not otherwise belonging to either container, their rootfses can remain completely off limits to each other. > What fixing rlimits does is it fixes one of the reasons that different > containers could not share the same kuid for users that want to write to > disk with the same uid. > > > I humbly suggest that it will be more secure, and easier to maintain for > both developers and users if we fix the reasons people want different > containers to have the same user running with different kuids. > > If not what are the reasons we fundamentally need the same on-disk user > using multiple kuids in the kernel? > > Eric
On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote: > > > > On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote: > > > > Hey everyone, > > > > I vanished for a little while to focus on this work here so sorry for > > not being available by mail for a while. > > > > Since quite a long time we have issues with sharing mounts between > > multiple unprivileged containers with different id mappings, sharing a > > rootfs between multiple containers with different id mappings, and also > > sharing regular directories and filesystems between users with different > > uids and gids. The latter use-cases have become even more important with > > the availability and adoption of systemd-homed (cf. [1]) to implement > > portable home directories. > > > > The solutions we have tried and proposed so far include the introduction > > of fsid mappings, a tiny overlay based filesystem, and an approach to > > call override creds in the vfs. None of these solutions have covered all > > of the above use-cases. > > > > The solution proposed here has it's origins in multiple discussions > > during Linux Plumbers 2017 during and after the end of the containers > > microconference. > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David, > > James, and myself. A variant of the solution proposed here has also been > > discussed, again to the best of my knowledge, after a Linux conference > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017 > > after Linux Plumbers. > > I've taken the time to finally implement a working version of this > > solution over the last weeks to the best of my abilities. Tycho has > > signed up for this sligthly crazy endeavour as well and he has helped > > with the conversion of the xattr codepaths. 
> > > > The core idea is to make idmappings a property of struct vfsmount > > instead of tying it to a process being inside of a user namespace which > > has been the case for all other proposed approaches. > > It means that idmappings become a property of bind-mounts, i.e. each > > bind-mount can have a separate idmapping. This has the obvious advantage > > that idmapped mounts can be created inside of the initial user > > namespace, i.e. on the host itself instead of requiring the caller to be > > located inside of a user namespace. This enables such use-cases as e.g. > > making a usb stick available in multiple locations with different > > idmappings (see the vfat port that is part of this patch series). > > > > The vfsmount struct gains a new struct user_namespace member. The > > idmapping of the user namespace becomes the idmapping of the mount. A > > caller that is either privileged with respect to the user namespace of > > the superblock of the underlying filesystem or a caller that is > > privileged with respect to the user namespace a mount has been idmapped > > with can create a new bind-mount and mark it with a user namespace. > > So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside. > > For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege. > > Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this. 
Inside the namespace, the user creates a suid-root file. > > Now, outside the namespace, the user has privilege over the namespace. (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.) So the user makes a new bind mount and idmaps it to the init namespace. Game over. > > So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace. We already do the latter for the vfsmnt’s mntns’s userns. With this series, in order to create an idmapped mount the user must either have cap_sys_admin in the user namespace of the superblock of the underlying filesystem or, if the mount is already idmapped and they want to create another idmapped mount from it, they must have cap_sys_admin in the userns that the mount is currently marked with. It is also not possible to change an idmapped mount once it has been idmapped, i.e. the user must create a new detached bind-mount first. > > Hmm. What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns? Is that too limiting? > > I hope that whatever solution gets used is straightforward enough to wrap one’s head around. > > > When a file/inode is accessed through an idmapped mount the i_uid and > > i_gid of the inode will be remapped according to the user namespace the > > mount has been marked with. When a new object is created based on the > > fsuid and fsgid of the caller they will similarly be remapped according > > to the user namespace of the mount they are created from. > > By “mapped according to”, I presume you mean that the on-disk uid/gid is the uid/gid as seen in the user namespace in question. If I understand you correctly, then yes.
On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote: > First and foremost: A uid shift on write to a filesystem is a security > bug waiting to happen. This is especially in the context of facilities > like io_uring, that play very aggressive games with how process context > makes it to system calls. > > The only reason containers were not immediately exploitable when io_uring > was introduced is because the mechanisms are built so that even if > something escapes containment the security properties still apply. > Changes to the uid when writing to the filesystem do not have that > property. The tiniest slip in containment will be a security issue. > > This is not even the least bit theoretical. I have seen reports of how > shiftfs+overlayfs created a situation where anyone could read > /etc/shadow. This bug was the result of a complex interaction with several contributing factors. It's fair to say that one component was overlayfs writing through an id-shifted mount, but the primary cause was related to how copy-up was done coupled with allowing unprivileged overlayfs mounts in a user ns. Checks that the mounter had access to the lower fs file were not done before copying data up, and so the file was copied up temporarily to the id shifted upperdir. Even though it was immediately removed, other factors made it possible for the user to get the file contents from the upperdir. Regardless, I do think you raise a good point. We need to be wary of any place the kernel could open files through a shifted mount, especially when the open could be influenced by userspace. Perhaps kernel file opens through shifted mounts should be opt-in. I.e. unless a flag is passed, or a different open interface used, the open will fail if the dentry being opened is subject to id shifting. 
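Seth's opt-in idea can be sketched as a simple gate on kernel-internal opens. Everything in this sketch is hypothetical: `FMODE_IDMAP_OK` and the function names are invented here, and no such flag exists in the series; it only illustrates the shape of the check being proposed:

```c
/* Hedged sketch of the proposed opt-in: a kernel-internal open through
 * an idmapped mount fails unless the call site explicitly opted in.
 * FMODE_IDMAP_OK and these helpers are invented for illustration. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define FMODE_IDMAP_OK 0x1u  /* hypothetical opt-in flag */

/* Stand-in for "does this vfsmount carry an idmapping?". */
static bool mnt_is_idmapped(int has_mnt_userns)
{
    return has_mnt_userns != 0;
}

/* Returns 0 if the open may proceed, -EPERM otherwise. */
int check_kernel_open(int has_mnt_userns, unsigned int fmode)
{
    if (mnt_is_idmapped(has_mnt_userns) && !(fmode & FMODE_IDMAP_OK))
        return -EPERM;
    return 0;
}
```

The effect is that code paths never audited for id shifting fail closed, while audited callers pass the flag deliberately.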
On Fri, Oct 30, 2020 at 10:07:48AM -0500, Seth Forshee wrote: > On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote: > > First and foremost: A uid shift on write to a filesystem is a security > > bug waiting to happen. This is especially in the context of facilities > > like io_uring, that play very aggressive games with how process context > > makes it to system calls. > > > > The only reason containers were not immediately exploitable when io_uring > > was introduced is because the mechanisms are built so that even if > > something escapes containment the security properties still apply. > > Changes to the uid when writing to the filesystem do not have that > > property. The tiniest slip in containment will be a security issue. > > > > This is not even the least bit theoretical. I have seen reports of how > > shiftfs+overlayfs created a situation where anyone could read > > /etc/shadow. > > This bug was the result of a complex interaction with several > contributing factors. It's fair to say that one component was overlayfs > writing through an id-shifted mount, but the primary cause was related > to how copy-up was done coupled with allowing unprivileged overlayfs > mounts in a user ns. Checks that the mounter had access to the lower fs > file were not done before copying data up, and so the file was copied up > temporarily to the id shifted upperdir. Even though it was immediately > removed, other factors made it possible for the user to get the file > contents from the upperdir. > > Regardless, I do think you raise a good point. We need to be wary of any > place the kernel could open files through a shifted mount, especially > when the open could be influenced by userspace. > > Perhaps kernel file opens through shifted mounts should be opt-in. > I.e. unless a flag is passed, or a different open interface used, the > open will fail if the dentry being opened is subject to id shifting. 
> This way any kernel writes which would be subject to id shifting will > only happen through code which has been written to take it into account. For my use cases, it would be fine to require opt-in at original fs mount time by an init_user_ns admin. I.e. mount -o allow_idmap /dev/mapper/whoozit /whatzit I'm quite certain I would always be sharing a separate LV or loopback or tmpfs. -serge
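The two opt-ins discussed above (a mount-time switch set by an init_user_ns admin, plus a per-open flag for kernel-internal opens) can be modeled in a few lines of userspace C. All flag and type names here are illustrative stand-ins, not actual kernel identifiers:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flags only. SB_ALLOW_IDMAP models Serge's
 * "mount -o allow_idmap" opt-in by an init_user_ns admin;
 * OPEN_IDMAP_OK models Seth's per-open opt-in. */
#define SB_ALLOW_IDMAP 0x01
#define MNT_IDMAPPED   0x02
#define OPEN_IDMAP_OK  0x04

struct sb  { unsigned flags; };
struct mnt { struct sb *sb; unsigned flags; };

/* Would a kernel-internal open through @mnt be permitted? */
static bool kernel_open_permitted(const struct mnt *mnt, unsigned open_flags)
{
	if (!(mnt->flags & MNT_IDMAPPED))
		return true;                      /* no id shifting involved */
	if (!(mnt->sb->flags & SB_ALLOW_IDMAP))
		return false;                     /* fs not opted in at mount time */
	return (open_flags & OPEN_IDMAP_OK) != 0; /* caller must opt in too */
}
```

Under this scheme, a kernel write through a shifted mount only succeeds when both the filesystem was mounted with the opt-in and the opening code explicitly passed the flag, which is the "written to take it into account" property Seth asks for.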
On Fri, Oct 30, 2020 at 01:01:57PM +0100, Christian Brauner wrote: > On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote: > > > > > > > On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote: > > > > > > Hey everyone, > > > > > > I vanished for a little while to focus on this work here so sorry for > > > not being available by mail for a while. > > > > > > For quite a long time we have had issues with sharing mounts between > > > multiple unprivileged containers with different id mappings, sharing a > > > rootfs between multiple containers with different id mappings, and also > > > sharing regular directories and filesystems between users with different > > > uids and gids. The latter use-cases have become even more important with > > > the availability and adoption of systemd-homed (cf. [1]) to implement > > > portable home directories. > > > > > > The solutions we have tried and proposed so far include the introduction > > > of fsid mappings, a tiny overlay based filesystem, and an approach to > > > call override creds in the vfs. None of these solutions have covered all > > > of the above use-cases. > > > > > > The solution proposed here has its origins in multiple discussions > > > during Linux Plumbers 2017 during and after the end of the containers > > > microconference. > > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David, > > > James, and myself. A variant of the solution proposed here has also been > > > discussed, again to the best of my knowledge, after a Linux conference > > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017 > > > after Linux Plumbers. > > > I've taken the time to finally implement a working version of this > > > solution over the last weeks to the best of my abilities. Tycho has > > > signed up for this slightly crazy endeavour as well and he has helped > > > with the conversion of the xattr codepaths. 
> > > > > > The core idea is to make idmappings a property of struct vfsmount > > > instead of tying it to a process being inside of a user namespace which > > > has been the case for all other proposed approaches. > > > It means that idmappings become a property of bind-mounts, i.e. each > > > bind-mount can have a separate idmapping. This has the obvious advantage > > > that idmapped mounts can be created inside of the initial user > > > namespace, i.e. on the host itself instead of requiring the caller to be > > > located inside of a user namespace. This enables such use-cases as e.g. > > > making a usb stick available in multiple locations with different > > > idmappings (see the vfat port that is part of this patch series). > > > > > > The vfsmount struct gains a new struct user_namespace member. The > > > idmapping of the user namespace becomes the idmapping of the mount. A > > > caller that is either privileged with respect to the user namespace of > > > the superblock of the underlying filesystem or a caller that is > > > privileged with respect to the user namespace a mount has been idmapped > > > with can create a new bind-mount and mark it with a user namespace. > > > > So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside. > > > > For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege. > > > > Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. 
by inserting an ext4 usb stick or using whatever container management tool does this. Inside the namespace, the user creates a suid-root file. > > > > Now, outside the namespace, the user has privilege over the namespace. (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.). So the user makes a new bind mount and idmaps it to the init namespace. Game over. > > > > So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace. We already do the latter for the vfsmnt’s mntns’s userns. > > With this series, in order to create an idmapped mount the user must > either have cap_sys_admin in the user namespace of the superblock of the > underlying filesystem or, if the mount is already idmapped and they want > to create another idmapped mount from it, they must have cap_sys_admin > in the userns that the mount is currently marked with. It is also not > possible to change an idmapped mount once it has been idmapped, i.e. the > user must create a new detached bind-mount first. Yeah I spent quite some time last night trying to figure out the scenario you were presenting, but I failed. Andy, could you either rephrase or give a more concrete end-to-end attack scenario?
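The creation rule Christian states can be written out as a small userspace model. This is a toy sketch of the described policy, not kernel code; userns identity stands in for holding CAP_SYS_ADMIN over that namespace, and all names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct user_ns { int id; };

struct mount {
	struct user_ns *sb_userns;  /* user namespace of the underlying sb */
	struct user_ns *mnt_userns; /* idmapping mark, NULL if not idmapped */
};

/* Stand-in for ns_capable(ns, CAP_SYS_ADMIN). */
static bool capable_in(const struct user_ns *caller, const struct user_ns *ns)
{
	return caller == ns;
}

/* May @caller mark a fresh detached bind-mount taken from @src with an
 * idmapping? Per the rule above: capable over the superblock's userns,
 * or, if @src is already idmapped, over the userns it is marked with. */
static bool may_idmap_bind_mount(const struct mount *src,
				 const struct user_ns *caller)
{
	if (capable_in(caller, src->sb_userns))
		return true;
	return src->mnt_userns && capable_in(caller, src->mnt_userns);
}

/* An existing idmapped mount itself can never be re-marked. */
static bool may_change_idmapping(const struct mount *m)
{
	return m->mnt_userns == NULL;
}
```

The second helper captures the immutability point: to get a different mapping you do not modify the idmapped mount, you create a new detached bind-mount and mark that.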
On Fri, Oct 30, 2020 at 5:02 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > [...] > With this series, in order to create an idmapped mount the user must > either have cap_sys_admin in the user namespace of the superblock of the > underlying filesystem or, if the mount is already idmapped and they want > to create another idmapped mount from it, they must have cap_sys_admin > in the userns that the mount is currently marked with. It is also not > possible to change an idmapped mount once it has been idmapped, i.e. the > user must create a new detached bind-mount first. I think my attack might not work, but I also think I didn't explain it very well. Let me try again. I'll also try to lay out what I understand the rules of idmaps to be so that you can correct me when I'm inevitably wrong :) First, background: there are a bunch of user namespaces around. Every superblock has one, every idmapped mount has one, and every vfsmnt also (indirectly) has one: mnt->mnt_ns->user_ns. So, if you're looking at a given vfsmnt, you have three user namespaces that are relevant, in addition to whatever namespaces are active for the task (or kernel thread) accessing that mount. 
I'm wondering whether mnt_user_ns() should possibly have a name that makes it clear that it refers to the idmap namespace and not mnt->mnt_ns->user_ns. So here's the attack. An attacker with uid=1000 creates a userns N (so the attacker owns the ns and 1000 outside maps to 0 inside). N is a child of init_user_ns. Now the attacker creates a mount namespace M inside the userns and, potentially with the help of a container management tool, creates an idmapped filesystem mount F inside M. So, playing fast and loose with my ampersands: F->mnt_ns == M F->mnt_ns->user_ns == N mnt_user_ns(F) == N I expect that this wouldn't be a particularly uncommon setup. Now the user has the ability to create files with inode->uid == 0 and the SUID bit set on their filesystem. This isn't terribly different from FUSE, except that the mount won't have nosuid set, whereas at least many uses of unprivileged FUSE would have nosuid set. So this is the thing that makes me a little bit nervous. But it actually seems likely that I was wrong and this is okay. Specifically, to exploit this using kernel mechanisms, one would need to pass a mnt_may_suid() check, which means that one would need to acquire a mount of F in one's current mount namespace, and one would need one's current user namespace to be init_ns (or something else sensitive). But you already need to own the namespace to create mounts, unless you have a way to confuse some existing user tooling. You would also need to be in F's superblock's user_ns (second line of mnt_may_suid()), which totally kills this type of attack if F's superblock is in the container's user_ns, but I wouldn't count on that. So maybe this is all fine. I'll continue to try to poke holes in it, but perhaps there aren't any holes to poke. I'll also continue to try to see if I can state the security properties of idmap in a way that is clear and obviously has nice properties. 
Why are you allowing the creation of a new idmapped mount if you have cap_sys_admin over an existing idmap userns but not over the superblock's userns? I assume this is for a nested container use case, but can you spell out a specific example usage? --Andy
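The mnt_may_suid() check Andy walks through (mount not nosuid, mount attached to the caller's mount namespace, caller inside the superblock's user namespace) can be modeled roughly as follows. This is a simplified userspace sketch of the logic, not the actual fs/namespace.c implementation; userns descendancy is flattened to a parent-chain walk:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MNT_NOSUID 0x02

struct user_ns { struct user_ns *parent; };
struct mnt_ns  { struct user_ns *user_ns; };

struct mnt {
	unsigned flags;
	struct mnt_ns *mnt_ns;     /* mount namespace holding this mount */
	struct user_ns *sb_userns; /* the superblock's user namespace */
};

/* Is @ns equal to, or a descendant of, @ancestor?
 * (models current_in_userns()) */
static bool in_userns(const struct user_ns *ns, const struct user_ns *ancestor)
{
	for (; ns; ns = ns->parent)
		if (ns == ancestor)
			return true;
	return false;
}

/* Rough model of mnt_may_suid(): suid/sgid is only honored if the mount
 * is not nosuid, is attached to the caller's mount namespace (the
 * check_mnt() leg), and the caller's userns sits below the sb's userns. */
static bool mnt_may_suid_model(const struct mnt *m,
			       const struct mnt_ns *caller_mnt_ns,
			       const struct user_ns *caller_userns)
{
	return !(m->flags & MNT_NOSUID) &&
	       m->mnt_ns == caller_mnt_ns &&
	       in_userns(caller_userns, m->sb_userns);
}
```

In Andy's scenario, F->mnt_ns == M, so a task in the init mount namespace fails the middle leg unless it can acquire a mount of F there; and if F's superblock were owned by the container's userns, a caller in init_user_ns would fail the last leg as well.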
On Thu, Oct 29, 2020 at 5:37 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Aleksa Sarai <cyphar@cyphar.com> writes: > > > On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Christian Brauner <christian.brauner@ubuntu.com> writes: > >> > >> > Hey everyone, > >> > > >> > I vanished for a little while to focus on this work here so sorry for > >> > not being available by mail for a while. > >> > > >> > For quite a long time we have had issues with sharing mounts between > >> > multiple unprivileged containers with different id mappings, sharing a > >> > rootfs between multiple containers with different id mappings, and also > >> > sharing regular directories and filesystems between users with different > >> > uids and gids. The latter use-cases have become even more important with > >> > the availability and adoption of systemd-homed (cf. [1]) to implement > >> > portable home directories. > >> > >> Can you walk us through the motivating use case? > >> > >> As of this year's LPC I had the distinct impression that the primary use > >> case for such a feature was due to the RLIMIT_NPROC problem where two > >> containers with the same users still wanted different uid mappings to > >> the disk because the users were conflicting with each other because of > >> the per user rlimits. > >> > >> Fixing rlimits is straightforward to implement, and easier to manage > >> for implementations and administrators. > > > > This is separate to the question of "isolated user namespaces" and > > managing different mappings between containers. This patchset is solving > > the same problem that shiftfs solved -- sharing a single directory tree > > between containers that have different ID mappings. Neither rlimits nor > > any of the other proposals we discussed at LPC will help with this problem. > > First and foremost: A uid shift on write to a filesystem is a security > bug waiting to happen. 
This is especially in the context of facilities > like io_uring, that play very aggressive games with how process context > makes it to system calls. > > The only reason containers were not immediately exploitable when io_uring > was introduced is because the mechanisms are built so that even if > something escapes containment the security properties still apply. > Changes to the uid when writing to the filesystem do not have that > property. The tiniest slip in containment will be a security issue. > > This is not even the least bit theoretical. I have seen reports of how > shiftfs+overlayfs created a situation where anyone could read > /etc/shadow. > > If you are going to write using the same uid to disk from different > containers the question becomes why can't those containers configure > those users to use the same kuid? > > What fixing rlimits does is fix one of the reasons that different > containers could not share the same kuid for users that want to write to > disk with the same uid. > > > I humbly suggest that it will be more secure, and easier to maintain for > both developers and users if we fix the reasons people want different > containers to have the same user running with different kuids. > > If not, what are the reasons we fundamentally need the same on-disk user > using multiple kuids in the kernel? I would like to use this patch set in the context of Kubernetes. I described my two possible setups in https://www.spinics.net/lists/linux-containers/msg36537.html: 1. Each Kubernetes pod has its own userns but with the same user id mapping 2. Each Kubernetes pod has its own userns with non-overlapping user id mapping (providing additional isolation between pods) But even in the setup where all pods run with the same id mappings, this patch set is still useful to me for 2 reasons: 1. To avoid the expensive recursive chown of the rootfs. 
We cannot necessarily extract the tarball directly with the right uids because we might use the same container image for privileged containers (with the host userns) and unprivileged containers (with a new userns), so we have at least 2 “mappings” (taking more time and resulting in more storage space). Although the “metacopy” mount option in overlayfs helps to make the recursive chown faster, it can still take time with large container images with lots of files. I’d like to use this patch set to set up the root fs in constant time. 2. To manage large external volumes (NFS or other filesystems). Even if admins can decide to use the same kuid on all the nodes of the Kubernetes cluster, this is impractical for migration. People can have existing Kubernetes clusters (currently without using user namespaces) and large NFS volumes. If they want to switch to a new version of Kubernetes with the user namespace feature enabled, they would need to recursively chown all the files on the NFS shares. This could take time on large filesystems and realistically, we want to support rolling updates where some nodes use the previous version without user namespaces and new nodes are progressively migrated to the new userns with the new id mapping. If both sets of nodes use the same NFS share, that can’t work. Alban
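What makes the recursive chown unnecessary is the extent translation familiar from /proc/<pid>/uid_map: each extent maps a contiguous range of on-disk ids to in-namespace ids, so an idmapped mount remaps ownership at access time instead of rewriting every inode. A small sketch of that arithmetic (the same idea as the kernel's internal mapping helpers, but written here as an illustrative userspace function):

```c
#include <assert.h>

/* One line of /proc/<pid>/uid_map: "first lower_first count" maps the
 * host/on-disk range [lower_first, lower_first+count) to the
 * in-namespace range [first, first+count). */
struct uid_extent { unsigned first, lower_first, count; };

/* Translate host id @id into the namespace; -1 if unmapped. */
static long map_id_up_model(const struct uid_extent *ext, int nr, unsigned id)
{
	for (int i = 0; i < nr; i++)
		if (id >= ext[i].lower_first &&
		    id - ext[i].lower_first < ext[i].count)
			return ext[i].first + (id - ext[i].lower_first);
	return -1;
}
```

With the common single extent "0 100000 65536", on-disk uid 100000 is seen as 0 inside a pod; pointing another idmapped mount at the same tree with a different extent table changes what newer nodes see without ever touching the files on the NFS share, which is what makes the rolling update workable.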