diff mbox

[RFC,v2,0/8] VFS:userns: support portable root filesystems

Message ID 1462923416.14896.10.camel@HansenPartnership.com (mailing list archive)
State New, archived
Headers show

Commit Message

James Bottomley May 10, 2016, 11:36 p.m. UTC
On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
> On Thu, 2016-05-05 at 22:49 +0100, Djalal Harouni wrote:
> > On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> > > On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > > > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley
> > > > wrote:
[...]
> > > > > So this option was discussed at the recent LSF/MM summit. 
> > > > >  The 
> > > > > most supported suggestion was that you'd use a new internal
> > > > > fs 
> > > > > type that had a struct mount with a new superblock and would 
> > > > > copy the underlying inodes but substitute it's own with 
> > > > >  modified  ->getatrr/->setattr calls that did the uid shift. 
> > > > >  In many ways it would be a remapping bind which would look 
> > > > > similar to overlayfs but be a lot simpler.
> > > > 
> > > > Hmm, it's not only about ->getattr and ->setattr, you have all 
> > > > the other file system operations that need access too...
> > > 
> > > Why?  Or perhaps we should more cogently define the actual 
> > > problem.   My problem is simply mounting image volumes that were 
> > > created with real uids at user namespace shifted uids because I'm
> > >  downshifting the privileged ids in the container.  I actually 
> > > *only* need the uid/gids on the attributes shifted because that's
> > > what I need to manipulate the
> > >   
> > We need them obviously for read/write/creation... ?!
> 
> OK, so the way attributes are populated on an inode is via getattr. 
>  You intercept that, you change the inode owner and group that are
> installed on the inode.  That means that when you list the directory,
> you see the shift and the shifted uid/gid are used to check 
> permissions for vfs_open().

Just to illustrate how this could be done, here's a functional proof of
concept for a uid/gid shifting bind mount equivalent.  It's not
actually a proper bind mount because it has to manufacture its own
inodes.  As you can see, it can only be used by root, it will shift all
the uid/gid bits as well as the permission comparisons.  It operates on
subtrees, so it can shift the uids/gids on any filesystem or part of
one and because the shifts are per superblock, it could actually shift
the same subtree for multiple users on different shifts.  Best of all,
it requires no vfs changes at all, being entirely implemented inside
its own filesystem type.

You use it just like bind mount:

mount -t shiftfs <source> <target>

except that it takes uidshift=x:y:z and gidshift=x:y:z multiple times
as options.  It's currently not recursive and it definitely needs
polishing to show things like mount options and be properly Kconfig
using.

There's a bit of an open question of whether it should have vfs
changes: the way the struct file f_inode and f_ops are hijacked is a
bit nasty and perhaps d_select_inode() could be made a bit cleverer to
help us here instead.

James

---

 fs/Makefile                |   1 +
 fs/shiftfs.c               | 790 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/magic.h |   2 +
 3 files changed, 793 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Al Viro May 11, 2016, 12:38 a.m. UTC | #1
On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:

> mount -t shiftfs <source> <target>

Note to self: do not eat while reading l-k...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro May 11, 2016, 12:53 a.m. UTC | #2
On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> +static int shiftfs_rename2(struct inode *olddir, struct dentry *old,
> +			   struct inode *newdir, struct dentry *new,
> +			   unsigned int flags)
> +{
> +	struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
> +		*realold = old->d_inode->i_private,
> +		*realnew = new->d_inode->i_private;
> +	struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
> +	const struct inode_operations *iop = realolddir->i_op;
> +	int err;
> +	const struct cred *oldcred, *newcred;
> +
> +	oldcred = shiftfs_new_creds(&newcred, old->d_sb);
> +	err = iop->rename2(realolddir, realold, realnewdir, realnew, flags);
> +	shiftfs_old_creds(oldcred, &newcred);

... and you've just violated all locking rules for ->rename2().

> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
> +				     unsigned int flags)
> +{
> +	struct dentry *real = dir->i_private, *new;
> +	struct inode *reali = real->d_inode, *newi;
> +	const struct cred *oldcred, *newcred;
> +
> +	/* note: violation of usual fs rules here: dentries are never
> +	 * added with d_add.  This is because we want no dentry cache
> +	 * for shiftfs.  All lookups proceed through the dentry cache
> +	 * of the underlying filesystem, meaning we always see any
> +	 * changes in the underlying */

Bloody wonderful.  So
	* we lose caching the negative lookups
	* we've got buggered hardlinks (different inodes for those)
	* it has never, ever been tried on -next (would do rather nasty
things on that d_instantiate())

> +
> +	kfree(sfc);
> +
> +	return err;
> +}

> +	file->f_op = &sfc->fop;

Lovely - now try that with underlying fs something built modular.

Or try to use it on top of something with non-trivial dentry_operations
(hell, on top of itself, for starters).
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley May 11, 2016, 3:47 a.m. UTC | #3
On Wed, 2016-05-11 at 01:53 +0100, Al Viro wrote:
> On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> > +static int shiftfs_rename2(struct inode *olddir, struct dentry
> > *old,
> > +			   struct inode *newdir, struct dentry
> > *new,
> > +			   unsigned int flags)
> > +{
> > +	struct dentry *rodd = olddir->i_private, *rndd = newdir
> > ->i_private,
> > +		*realold = old->d_inode->i_private,
> > +		*realnew = new->d_inode->i_private;
> > +	struct inode *realolddir = rodd->d_inode, *realnewdir =
> > rndd->d_inode;
> > +	const struct inode_operations *iop = realolddir->i_op;
> > +	int err;
> > +	const struct cred *oldcred, *newcred;
> > +
> > +	oldcred = shiftfs_new_creds(&newcred, old->d_sb);
> > +	err = iop->rename2(realolddir, realold, realnewdir,
> > realnew, flags);
> > +	shiftfs_old_creds(oldcred, &newcred);
> 
> ... and you've just violated all locking rules for ->rename2().

Yes, sorry, somehow I missed that when I converted everything else to
the vfs_ functions.

> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct
> > dentry *dentry,
> > +				     unsigned int flags)
> > +{
> > +	struct dentry *real = dir->i_private, *new;
> > +	struct inode *reali = real->d_inode, *newi;
> > +	const struct cred *oldcred, *newcred;
> > +
> > +	/* note: violation of usual fs rules here: dentries are
> > never
> > +	 * added with d_add.  This is because we want no dentry
> > cache
> > +	 * for shiftfs.  All lookups proceed through the dentry
> > cache
> > +	 * of the underlying filesystem, meaning we always see any
> > +	 * changes in the underlying */
> 
> Bloody wonderful.  So
> 	* we lose caching the negative lookups

We do?  They should be cached in the underlying layer's dcache. If
that's not enough, I can hash them, but I was trying to avoid doubling
the dcache size.

> 	* we've got buggered hardlinks (different inodes for those)

Yes, had a note to do the lookup, but forgot.

> 	* it has never, ever been tried on -next (would do rather nasty
> things on that d_instantiate())

So this is just a proof of concept; I figured it was best to do it
against current rather than have people who wanted to try it pull in
your tree.  I can respin it after the merge window closes.

> 
> > +
> > +	kfree(sfc);
> > +
> > +	return err;
> > +}
> 
> > +	file->f_op = &sfc->fop;
> 
> Lovely - now try that with underlying fs something built modular.
> 
> Or try to use it on top of something with non-trivial
> dentry_operations
> (hell, on top of itself, for starters).

So if I add the missing fops_get/put, you're happy with the way this
hijacks f_op and f_inode?

James


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Djalal Harouni May 11, 2016, 4:42 p.m. UTC | #4
On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
[...]
> > 
> > OK, so the way attributes are populated on an inode is via getattr. 
> >  You intercept that, you change the inode owner and group that are
> > installed on the inode.  That means that when you list the directory,
> > you see the shift and the shifted uid/gid are used to check 
> > permissions for vfs_open().
> 
> Just to illustrate how this could be done, here's a functional proof of
> concept for a uid/gid shifting bind mount equivalent.  It's not
> actually a proper bind mount because it has to manufacture its own
> inodes.  As you can see, it can only be used by root, it will shift all
> the uid/gid bits as well as the permission comparisons.  It operates on
> subtrees, so it can shift the uids/gids on any filesystem or part of
> one and because the shifts are per superblock, it could actually shift
> the same subtree for multiple users on different shifts.  Best of all,
> it requires no vfs changes at all, being entirely implemented inside
> its own filesystem type.

First, I guess this should be in a separate thread.. this way this RFC
was just hijacked!

Obviously as you say later in your response it may require a VFS
change... 

You have just consumed all inodes... what about containers or small apps
that are spawned quickly... it can even used maybe as a DoS...  maybe you
endup reporting different inode numbers... ?


> You use it just like bind mount:
> 
> mount -t shiftfs <source> <target>
> 
> except that it takes uidshift=x:y:z and gidshift=x:y:z multiple times
> as options.  It's currently not recursive and it definitely needs
> polishing to show things like mount options and be properly Kconfig
> using.

why it's not recursive ? and what if you have circular bind mounts ? 

Hmm anyway you are mounting this on behalf of filesystems, so if you add
the recursive thing, you will just probably make everything worse, by
making any /proc, /sys dentry that's under that path shiftable, and
unprivileged users can just create user namespaces and read /proc/*
and all the other stuff that doesn't have capable() related to the
init_user_ns host...

  what if you have paths like /filesystem0/uidshiftedY/dir,
/filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
where some of them are also bind mounts that point to same dentry ?


Also, you create a totally new user namespace interface here! by making
your own new interface we just lose the notion of init_user_ns and its
children and mapping ?

I'm not sure of the implication of all this... your user namespace
mapping is not related at all to init_user_ns! it seems that it has
its own init_user_ns ?   does a capable() check now on a shifted
filesystem relates to that and hence to your mapping or to the real
init_user_ns ?


> There's a bit of an open question of whether it should have vfs
> changes: the way the struct file f_inode and f_ops are hijacked is a
> bit nasty and perhaps d_select_inode() could be made a bit cleverer to
> help us here instead.

I'm not sure if this PoC works... but you sure you didn't introduce
a serious vulnerability here ? you use a new mapping and you update
current_fsuid() creds up, which is global on any fs operation, so may
be: lets operate on any inode, update our current_fsuid()... and
access the rest of *unshifted filesystems*... !?

The worst thing is that current_fsuid() does not follow now the
/proc/self/uid_map interface! this is a serious vulnerability and a mix
of the current semantics... it's updated but using other rules...?

For overlayfs I did write an expriment but for me it's not an overlayfs
or another new filesystem problem... we are manipulating UID/GID
identities...

It would have been better if you did send this as a separate thread.
It was a vfs:userns RFC fix which if we continue we turn it into a
complicated thing! implement another new light filesystem with
userns... (overlayfs...)

Will follow up if the appropriate thread is created, not here, I guess
it's ok ?

> James
> 

Thank you for your feedback!
James Bottomley May 11, 2016, 6:33 p.m. UTC | #5
On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> > On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
> [...]
> > > 
> > > OK, so the way attributes are populated on an inode is via 
> > > getattr.  You intercept that, you change the inode owner and 
> > > group that are installed on the inode.  That means that when you 
> > > list the directory, you see the shift and the shifted uid/gid are 
> > > used to check permissions for vfs_open().
> > 
> > Just to illustrate how this could be done, here's a functional 
> > proof of concept for a uid/gid shifting bind mount equivalent. 
> >  It's not actually a proper bind mount because it has to 
> > manufacture its own inodes.  As you can see, it can only be used by 
> > root, it will shift all the uid/gid bits as well as the permission 
> > comparisons.  It operates on subtrees, so it can shift the 
> > uids/gids on any filesystem or part of one and because the shifts 
> > are per superblock, it could actually shift the same subtree for 
> > multiple users on different shifts.  Best of all, it requires no 
> > vfs changes at all, being entirely implemented inside its own
> > filesystem type.
> 
> First, I guess this should be in a separate thread.. this way this 
> RFC was just hijacked!
> 
> Obviously as you say later in your response it may require a VFS
> change... 

I thought it may but viro didn't rip my head off for shifting the file
operations and inode, so perhaps it's OK as is.

> You have just consumed all inodes... what about containers or small 
> apps that are spawned quickly... it can even used maybe as a DoS... 
>  maybe you endup reporting different inode numbers... ?

Please explain?  Shiftfs deliberately doesn't populate its dentry
cache, so it basically has the same number inodes and dentries in use
as the lower filesystem would ordinarily have.

> 
> > You use it just like bind mount:
> > 
> > mount -t shiftfs <source> <target>
> > 
> > except that it takes uidshift=x:y:z and gidshift=x:y:z multiple
> > times
> > as options.  It's currently not recursive and it definitely needs
> > polishing to show things like mount options and be properly Kconfig
> > using.
> 
> why it's not recursive ? and what if you have circular bind mounts ? 

Because, as I said, it's a proof of concept.  It can easily have MS_REC
semantics added.

> Hmm anyway you are mounting this on behalf of filesystems, so if you 
> add the recursive thing, you will just probably make everything 
> worse, by making any /proc, /sys dentry that's under that path 
> shiftable, and unprivileged users can just create user namespaces and 
> read /proc/* and all the other stuff that doesn't have capable() 
> related to the init_user_ns host...

That's up to the admin who does the shifting.  Recursive would be an
option if added.

>   what if you have paths like /filesystem0/uidshiftedY/dir,
> /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> where some of them are also bind mounts that point to same dentry ?

Without recursive semantics, you see the underlying inode.  With them,
you see the upper vfsmnts.  Shiftfs isn't idempotent, so you would need
to be careful about nesting.  However, that's an admin problem.

> Also, you create a totally new user namespace interface here! by 
> making your own new interface we just lose the notion of init_user_ns 
> and its children and mapping ?

I don't quite understand this; the only use of the init_user_ns is the
capable(CAP_SYS_ADMIN) in fill_super which is how only the real admin
can mount at a shifted uid/gid.  Otherwise, there's no need to see into
the userns because filesystems see the kuid_t/kgid_t which is what I'm
shifting.

> I'm not sure of the implication of all this... your user namespace
> mapping is not related at all to init_user_ns! it seems that it has
> its own init_user_ns ?   does a capable() check now on a shifted
> filesystem relates to that and hence to your mapping or to the real
> init_user_ns ?

capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)

Or is there a misunderstanding here about how user namespaces work
inside the kernel?  The design is that the ID shift is done as you
cross the kernel boundary, so a filesystem, being usually all in-kernel
operating via the VFS interfaces, ideally never needs to make any
from_kuid/make_kuid calls.  However, there are ways filesystems can
send data across the kernel/user bounary outside of the usual vfs
interfaces (ioctls being the most usual one) so in that specific code,
they have to do the kuid_t to uid_t changes themselves.  Shiftfs never
sends data to the user outside of the VFS so it never needs to do this
and can operate entirely on kuid_ts.

> > There's a bit of an open question of whether it should have vfs
> > changes: the way the struct file f_inode and f_ops are hijacked is 
> > a bit nasty and perhaps d_select_inode() could be made a bit 
> > cleverer to help us here instead.
> 
> I'm not sure if this PoC works... but you sure you didn't introduce
> a serious vulnerability here ? you use a new mapping and you update
> current_fsuid() creds up, which is global on any fs operation, so may
> be: lets operate on any inode, update our current_fsuid()... and
> access the rest of *unshifted filesystems*... !?

The credentials are per thread, so it's a standard way of doing
credential shifting and no other threads of execution in the same task
get access. As long as you bound the override_creds()/revert_creds()
pairs within the kernel, you're safe.

> The worst thing is that current_fsuid() does not follow now the
> /proc/self/uid_map interface! this is a serious vulnerability and a 
> mix of the current semantics... it's updated but using other
> rules...?

current_fsuid() is aready mapped via the userns; it's already a kuid_t
at its final value.  Shifting that is what you want to remap underlying
volume uid/gid's.  The uidmap/gidmap inputs to this are shifts on the
final underlying uid/gids.

So, if I've got a uid_map in a userns of 0:100000:1000 which remaps all
the privileged ids down to 100000, but I have a volume which still has
realids, I can mount that volume using shiftfs with
uidmap=0:100000:1000 and it will allow this userns to read and write
the volume through its remapped ids.

> For overlayfs I did write an expriment but for me it's not an 
> overlayfs or another new filesystem problem... we are manipulating 
> UID/GID identities...
> 
> It would have been better if you did send this as a separate thread.
> It was a vfs:userns RFC fix which if we continue we turn it into a
> complicated thing! implement another new light filesystem with
> userns... (overlayfs...)
> 
> Will follow up if the appropriate thread is created, not here, I 
> guess it's ok ?

Well, I can resend the patch as a separate thread when I've fixed some
of the problems viro pointed out.

James

> > James
> > 
> 
> Thank you for your feedback!
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Djalal Harouni May 12, 2016, 7:55 p.m. UTC | #6
On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
[...]
> > Hmm anyway you are mounting this on behalf of filesystems, so if you 
> > add the recursive thing, you will just probably make everything 
> > worse, by making any /proc, /sys dentry that's under that path 
> > shiftable, and unprivileged users can just create user namespaces and 
> > read /proc/* and all the other stuff that doesn't have capable() 
> > related to the init_user_ns host...
> 
> That's up to the admin who does the shifting.  Recursive would be an
> option if added.

Hmm, not sure if you get my point... you just made it an admin problem
where admins want to mount an image downloaded verify it and use it for
their container with /proc...! that's another problem!


> >   what if you have paths like /filesystem0/uidshiftedY/dir,
> > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> > where some of them are also bind mounts that point to same dentry ?
> 
> Without recursive semantics, you see the underlying inode.  With them,
> you see the upper vfsmnts.  Shiftfs isn't idempotent, so you would need
> to be careful about nesting.  However, that's an admin problem.
> 
> > Also, you create a totally new user namespace interface here! by 
> > making your own new interface we just lose the notion of init_user_ns 
> > and its children and mapping ?
> 
> I don't quite understand this; the only use of the init_user_ns is the
> capable(CAP_SYS_ADMIN) in fill_super which is how only the real admin
> can mount at a shifted uid/gid.  Otherwise, there's no need to see into
> the userns because filesystems see the kuid_t/kgid_t which is what I'm
> shifting.
> 
> > I'm not sure of the implication of all this... your user namespace
> > mapping is not related at all to init_user_ns! it seems that it has
> > its own init_user_ns ?   does a capable() check now on a shifted
> > filesystem relates to that and hence to your mapping or to the real
> > init_user_ns ?
> 
> capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> 
> Or is there a misunderstanding here about how user namespaces work
> inside the kernel?  The design is that the ID shift is done as you
> cross the kernel boundary, so a filesystem, being usually all in-kernel
> operating via the VFS interfaces, ideally never needs to make any
> from_kuid/make_kuid calls.  However, there are ways filesystems can
> send data across the kernel/user bounary outside of the usual vfs
> interfaces (ioctls being the most usual one) so in that specific code,
> they have to do the kuid_t to uid_t changes themselves.  Shiftfs never
> sends data to the user outside of the VFS so it never needs to do this
> and can operate entirely on kuid_ts.
> 
> > > There's a bit of an open question of whether it should have vfs
> > > changes: the way the struct file f_inode and f_ops are hijacked is 
> > > a bit nasty and perhaps d_select_inode() could be made a bit 
> > > cleverer to help us here instead.
> > 
> > I'm not sure if this PoC works... but you sure you didn't introduce
> > a serious vulnerability here ? you use a new mapping and you update
> > current_fsuid() creds up, which is global on any fs operation, so may
> > be: lets operate on any inode, update our current_fsuid()... and
> > access the rest of *unshifted filesystems*... !?
> 
> The credentials are per thread, so it's a standard way of doing
> credential shifting and no other threads of execution in the same task
> get access. As long as you bound the override_creds()/revert_creds()
> pairs within the kernel, you're safe.

No, and here sorry I mean shifted.

current_fsuid() is global through all fs operations which means it
crosses user namespaces... it was safe the days of only init_user_ns,
not anymore... You give a mapping inside containers to fsuid where they
don't want to have it... this allows to operate on inodes inside other
containers... update current_fsuid() even if we want that user to be
nobody inside the container... and later it can access the inodes of
the shifted fs... and by same current of course...



> > The worst thing is that current_fsuid() does not follow now the
> > /proc/self/uid_map interface! this is a serious vulnerability and a 
> > mix of the current semantics... it's updated but using other
> > rules...?
> 
> current_fsuid() is aready mapped via the userns; it's already a kuid_t
> at its final value.  Shifting that is what you want to remap underlying
> volume uid/gid's.  The uidmap/gidmap inputs to this are shifts on the
> final underlying uid/gids.

=> some points:
Changing setfsuid() its interfaces and rules... or an indrect way to
break another syscall...

The userns used for *mapping* is totatly different and not standard...
losing "init_user_ns and its decendents userns *semantics*...", a
yet a totatly unlinked mapping...


Breaking current_uid(),current_euid(),current_fsuid() which are mapped
but in *different* user namespaces... hence different values inside
namespaces... you can change your userns mapping but that current_fsuid
specific one will always be remapped to some other value inside
even if you don't want it...
It crosses user namespaces...  uid and euid are remapped according to
/proc/self/uid_map, fsuid is remapped according to this new interface...

Hard coding the mapping, nested containers/apps may *share* fsuid and
can't get rid of it even if they change the inside userns mapping to
disable, split, reduce mapped users or offer better isolation they
can't... no way to make private inodes inside containers if they share
the final fsuid, inside container mapping is ignored...

...

> the privileged ids down to 100000, but I have a volume which still has
> realids, I can mount that volume using shiftfs with
> uidmap=0:100000:1000 and it will allow this userns to read and write
> the volume through its remapped ids.
> 
> > For overlayfs I did write an expriment but for me it's not an 
> > overlayfs or another new filesystem problem... we are manipulating 
> > UID/GID identities...
> > 
> > It would have been better if you did send this as a separate thread.
> > It was a vfs:userns RFC fix which if we continue we turn it into a
> > complicated thing! implement another new light filesystem with
> > userns... (overlayfs...)
> > 
> > Will follow up if the appropriate thread is created, not here, I 
> > guess it's ok ?
> 
> Well, I can resend the patch as a separate thread when I've fixed some
> of the problems viro pointed out.
> 
> James
> 
> > > James
> > > 
> > 
> > Thank you for your feedback!
> > 
> > 
> 

Thanks!
James Bottomley May 12, 2016, 10:24 p.m. UTC | #7
On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> [...]
> > > Hmm anyway you are mounting this on behalf of filesystems, so if
> > > you 
> > > add the recursive thing, you will just probably make everything 
> > > worse, by making any /proc, /sys dentry that's under that path 
> > > shiftable, and unprivileged users can just create user namespaces
> > > and 
> > > read /proc/* and all the other stuff that doesn't have capable() 
> > > related to the init_user_ns host...
> > 
> > That's up to the admin who does the shifting.  Recursive would be
> > an
> > option if added.
> 
> Hmm, not sure if you get my point... you just made it an admin 
> problem where admins want to mount an image downloaded verify it and 
> use it for their container with /proc...! that's another problem!

You can't allow unprivileged containers to shift uids on arbitrary
filesystems, so the admin always has to do something for the initial
setup.

> > >   what if you have paths like /filesystem0/uidshiftedY/dir,
> > > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> > > where some of them are also bind mounts that point to same dentry
> > > ?
> > 
> > Without recursive semantics, you see the underlying inode.  With 
> > them, you see the upper vfsmnts.  Shiftfs isn't idempotent, so you 
> > would need to be careful about nesting.  However, that's an admin
> > problem.
> > 
> > > Also, you create a totally new user namespace interface here! by 
> > > making your own new interface we just lose the notion of 
> > > init_user_ns and its children and mapping ?
> > 
> > I don't quite understand this; the only use of the init_user_ns is 
> > the capable(CAP_SYS_ADMIN) in fill_super which is how only the real
> > admin can mount at a shifted uid/gid.  Otherwise, there's no need 
> > to see into the userns because filesystems see the kuid_t/kgid_t 
> > which is what I'm shifting.
> > 
> > > I'm not sure of the implication of all this... your user 
> > > namespace mapping is not related at all to init_user_ns! it seems 
> > > that it has its own init_user_ns ?   does a capable() check now 
> > > on a shifted filesystem relates to that and hence to your mapping 
> > > or to the real init_user_ns ?
> > 
> > capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> > 
> > Or is there a misunderstanding here about how user namespaces work
> > inside the kernel?  The design is that the ID shift is done as you
> > cross the kernel boundary, so a filesystem, being usually all in
> > -kernel operating via the VFS interfaces, ideally never needs to 
> > make any from_kuid/make_kuid calls.  However, there are ways 
> > filesystems can send data across the kernel/user bounary outside of 
> > the usual vfs interfaces (ioctls being the most usual one) so in 
> > that specific code, they have to do the kuid_t to uid_t changes 
> > themselves.  Shiftfs never sends data to the user outside of the 
> > VFS so it never needs to do this and can operate entirely on
> > kuid_ts.
> > 
> > > > There's a bit of an open question of whether it should have vfs
> > > > changes: the way the struct file f_inode and f_ops are hijacked 
> > > > is a bit nasty and perhaps d_select_inode() could be made a bit
> > > > cleverer to help us here instead.
> > > 
> > > I'm not sure if this PoC works... but you sure you didn't 
> > > introduce a serious vulnerability here ? you use a new mapping 
> > > and you update current_fsuid() creds up, which is global on any 
> > > fs operation, so may be: lets operate on any inode, update our 
> > > current_fsuid()... and access the rest of *unshifted filesystems*
> > > ... !?
> > 
> > The credentials are per thread, so it's a standard way of doing
> > credential shifting and no other threads of execution in the same 
> > task get access. As long as you bound the override_creds()/revert_c
> > reds() pairs within the kernel, you're safe.
> 
> No, and here sorry I mean shifted.
> 
> current_fsuid() is global through all fs operations which means it
> crosses user namespaces... it was safe the days of only init_user_ns,
> not anymore... You give a mapping inside containers to fsuid where 
> they don't want to have it... this allows to operate on inodes inside
> other containers... update current_fsuid() even if we want that user 
> to be nobody inside the container... and later it can access the 
> inodes of the shifted fs... and by same current of course...

OK, I still don't understand what you're getting at.  There are three
per-thread uids: uid, euid and fsuid (real, effective and filesystem). 
 They're all either settable via syscall or inherited on fork.  They're
all kernel side, meaning they're kuid_t.  Their values stay invariant
as you move through namespaces.  They change (and get mapped according
to the current user namespace setting) when you call set[fe]uid() So
when I enter a user namespace with mapping

0 100000 1000

and call setuid(0) (which sets all three). they all pick up the kuid_t
of 100000.  This means that writing a file inside the user namespace
after calling setuid(0) appears as real uid 100000 on the medium even
though if I call getuid() from the namespace, I get back 0.  What
shiftfs does is hijack temporarily the kernel fsuid/fsgid for
permission checks, so you can remap to any old uid on the medium
(although usually you'd pass in uidmap=0:100000:1000") it maps back
from kuid_t 100000 to kuid_t 0, which is why the container can now read
and write the underlying medium at on-media id 0 even through root
inside the container has kuid_t 100000.  There's no permanent change of
fsuid and it stays at its invariant value for the thread except as a
temporary measure to do the permission checks on the underlying of the
shifted filesystem.

> > > The worst thing is that current_fsuid() does not follow now the
> > > /proc/self/uid_map interface! this is a serious vulnerability and 
> > > a mix of the current semantics... it's updated but using other
> > > rules...?
> > 
> > current_fsuid() is aready mapped via the userns; it's already a 
> > kuid_t at its final value.  Shifting that is what you want to remap
> > underlying volume uid/gid's.  The uidmap/gidmap inputs to this are 
> > shifts on the final underlying uid/gids.
> 
> => some points:
> Changing setfsuid() its interfaces and rules... or an indrect way to
> break another syscall...

There is no change to setfsuid().

> The userns used for *mapping* is totatly different and not standard..
> . losing "init_user_ns and its decendents userns *semantics*...", a
> yet a totatly unlinked mapping...

There is no user namespace mapping at all.  This is a simple shift,
kernel side, of uids and gids at their kuid_t values.

> Breaking current_uid(),current_euid(),current_fsuid() which are
> mapped but in *different* user namespaces... hence different values
> inside namespaces... you can change your userns mapping but that
> current_fsuid specific one will always be remapped to some other 
> value inside even if you don't want it... It crosses user 
> namespaces...  uid and euid are remapped according to /proc/self/uid_
> map, fsuid is remapped according to this new interface...
> 
> Hard coding the mapping, nested containers/apps may *share* fsuid and
> can't get rid of it even if they change the inside userns mapping to
> disable, split, reduce mapped users or offer better isolation they
> can't... no way to make private inodes inside containers if they 
> share the final fsuid, inside container mapping is ignored...
> ...

OK, I think there's a misunderstanding about how credential overrides
work.  They're not permanent changes to the credentials, they're
temporary ones to get stuff done within the kernel at a temporary
privilege.  You can make credentials permanent if you go through
prepare_creds()/commit_creds(), but for making them temporary you do
prepare_creds()/override_creds() and then revert_creds() once you're
done using them.

If you want to see a current use of this, try fs/open.c:faccessat. 
 What it's doing is temporarily overriding fsuid with the real uid to
check the permissions before reverting the credentials and returning to
the user.

James

> > the privileged ids down to 100000, but I have a volume which still 
> > has realids, I can mount that volume using shiftfs with
> > uidmap=0:100000:1000 and it will allow this userns to read and 
> > write the volume through its remapped ids.
> > 
> > > For overlayfs I did write an expriment but for me it's not an 
> > > overlayfs or another new filesystem problem... we are 
> > > manipulating UID/GID identities...
> > > 
> > > It would have been better if you did send this as a separate 
> > > thread. It was a vfs:userns RFC fix which if we continue we turn 
> > > it into a complicated thing! implement another new light 
> > > filesystem with userns... (overlayfs...)
> > > 
> > > Will follow up if the appropriate thread is created, not here, I 
> > > guess it's ok ?
> > 
> > Well, I can resend the patch as a separate thread when I've fixed 
> > some of the problems viro pointed out.
> > 
> > James
> > 
> > > > James
> > > > 
> > > 
> > > Thank you for your feedback!
> > > 
> > > 
> > 
> 
> Thanks!
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Djalal Harouni May 14, 2016, 9:53 a.m. UTC | #8
On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
[...]

> > > 
> > > The credentials are per thread, so it's a standard way of doing
> > > credential shifting and no other threads of execution in the same 
> > > task get access. As long as you bound the override_creds()/revert_c
> > > reds() pairs within the kernel, you're safe.
> > 
> > No, and here sorry I mean shifted.
> > 
> > current_fsuid() is global through all fs operations which means it
> > crosses user namespaces... it was safe the days of only init_user_ns,
> > not anymore... You give a mapping inside containers to fsuid where 
> > they don't want to have it... this allows to operate on inodes inside
> > other containers... update current_fsuid() even if we want that user 
> > to be nobody inside the container... and later it can access the 
> > inodes of the shifted fs... and by same current of course...
> 
> OK, I still don't understand what you're getting at.  There are three
> per-thread uids: uid, euid and fsuid (real, effective and filesystem). 
>  They're all either settable via syscall or inherited on fork.  They're
> all kernel side, meaning they're kuid_t.  Their values stay invariant
> as you move through namespaces.  They change (and get mapped according
> to the current user namespace setting) when you call set[fe]uid() So
> when I enter a user namespace with mapping
> 
> 0 100000 1000
> 
> and call setuid(0) (which sets all three). they all pick up the kuid_t
> of 100000.  This means that writing a file inside the user namespace
> after calling setuid(0) appears as real uid 100000 on the medium even
> though if I call getuid() from the namespace, I get back 0.  What
> shiftfs does is hijack temporarily the kernel fsuid/fsgid for
> permission checks, so you can remap to any old uid on the medium
> (although usually you'd pass in uidmap=0:100000:1000") it maps back
> from kuid_t 100000 to kuid_t 0, which is why the container can now read
> and write the underlying medium at on-media id 0 even through root
> inside the container has kuid_t 100000.  There's no permanent change of
> fsuid and it stays at its invariant value for the thread except as a
> temporary measure to do the permission checks on the underlying of the
> shifted filesystem.
> 
> > > > The worst thing is that current_fsuid() does not follow now the
> > > > /proc/self/uid_map interface! this is a serious vulnerability and 
> > > > a mix of the current semantics... it's updated but using other
> > > > rules...?
> > > 
> > > current_fsuid() is aready mapped via the userns; it's already a 
> > > kuid_t at its final value.  Shifting that is what you want to remap
> > > underlying volume uid/gid's.  The uidmap/gidmap inputs to this are 
> > > shifts on the final underlying uid/gids.
> > 
> > => some points:
> > Changing setfsuid() its interfaces and rules... or an indrect way to
> > break another syscall...
> 
> There is no change to setfsuid().
> 
> > The userns used for *mapping* is totatly different and not standard..
> > . losing "init_user_ns and its decendents userns *semantics*...", a
> > yet a totatly unlinked mapping...
> 
> There is no user namespace mapping at all.  This is a simple shift,
> kernel side, of uids and gids at their kuid_t values.
> 
> > Breaking current_uid(),current_euid(),current_fsuid() which are
> > mapped but in *different* user namespaces... hence different values
> > inside namespaces... you can change your userns mapping but that
> > current_fsuid specific one will always be remapped to some other 
> > value inside even if you don't want it... It crosses user 
> > namespaces...  uid and euid are remapped according to /proc/self/uid_
> > map, fsuid is remapped according to this new interface...
> > 
> > Hard coding the mapping, nested containers/apps may *share* fsuid and
> > can't get rid of it even if they change the inside userns mapping to
> > disable, split, reduce mapped users or offer better isolation they
> > can't... no way to make private inodes inside containers if they 
> > share the final fsuid, inside container mapping is ignored...
> > ...
> 
> OK, I think there's a misunderstanding about how credential overrides
> work.  They're not permanent changes to the credentials, they're
> temporary ones to get stuff done within the kernel at a temporary
> privilege.  You can make credentials permanent if you go through
> prepare_creds()/commit_creds(), but for making them temporary you do
> prepare_creds()/override_creds() and then revert_creds() once you're
> done using them.
> 
> If you want to see a current use of this, try fs/open.c:faccessat. 
>  What it's doing is temporarily overriding fsuid with the real uid to
> check the permissions before reverting the credentials and returning to
> the user.

Thank you for explaining things, but I think you should take the time to
read this RFC and understand some problems. This is a quick dump of some
problems that it avoids...:

In this series we don't hijack setfsuid() in an indirect way, setfsuid
maps UIDs into current userns according to rules set by parent.
Changing current_fsuid() to some other mapping is a way to allow
processes to bypass that and use it to access other inodes...
This should not change and fsuid should continue to follow these
rules...

A cred->fsuid solution is safe or used to be safe only inside
init_user_ns where there is always a mapping or in context of current
user namespace. In an other user namespace with 0:1000:1 mapping,  you
can't set it to arbitrary mapping like 0:4000:1... It will give confined
processes access to inodes that satisfy the kuid_t 4000 mapping and
which the app/container wants to deny, they only want 0:1000:1. ..

We don't cross user namespaces, we don't use different mappings for
cred->uid, cred->fsuid...  A clean solution is to shift inodes UID/GID
and not change fsuid to cross namespaces. Not to mention how it may
interact with capabilities...

We follow user namespace rules and we keep "the parent defines a range
that the children can't escape" semantics.  There is a clear relation
between user namespaces that should not be broken.

We explicitly don't define a new user namespace mapping nor add a new
interface for the simple reason it's: *too complicated*. We can do that,
but no thanks! May be in future if there is a real need or things are
clear...
The current user namespace interface is getting standard and stable, so
we just keep it that way and make it consistant inside VFS.

We give VFS control of that, and we make mount namespaces the central
part of this whole logic.

We make admins life easier where they can pull container images, root
filesystems from containers/apps hubs... verify the signature and start
them with different mappings according to host resources... We don't
want them to do anything.
The design was planned to make it easier for users, it should work out
of the box, and it can be used to handle complex stuff too, since it's
flexible.

Able to support most filesystems including on-disk filesystems natively.

Able to support disk quota according to the shifted UID/GID on-disk
values. Especially during inode creation...

Able to support ACL if requested.

The user namespace mapping is kept a runtime configure option, we don't
pin a special mapping at any time, and of course parent creator of user
namespace is the one that can manipulate it, at the same time the
mapping is restricted according to grandpa rules and so on...

It allows unprivileged to use the VFS UID/GID shift without the
intervention of a privileged process each time.
The real privileged process sets the filesystem and the mount namespace
the first time, then it should work for all nested namespaces and
containers. It does not need the intervation of  init_user_ns root to
set the mapping and make it work, you don't have to go in and go out to
setup the thing, etc.

We don't do this on behalf of filesystems, they should explicitly
support it. procfs and other host resource virtual filesystems are safe
and currently they don't need shifting.

We try to fix the problem where it should be fixed, and not hide it...
James Bottomley May 14, 2016, 1:46 p.m. UTC | #9
On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> > On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> [...]
> 
> > > > 
> > > > The credentials are per thread, so it's a standard way of doing
> > > > credential shifting and no other threads of execution in the
> > > > same 
> > > > task get access. As long as you bound the
> > > > override_creds()/revert_c
> > > > reds() pairs within the kernel, you're safe.
> > > 
> > > No, and here sorry I mean shifted.
> > > 
> > > current_fsuid() is global through all fs operations which means
> > > it
> > > crosses user namespaces... it was safe the days of only
> > > init_user_ns,
> > > not anymore... You give a mapping inside containers to fsuid
> > > where 
> > > they don't want to have it... this allows to operate on inodes
> > > inside
> > > other containers... update current_fsuid() even if we want that
> > > user 
> > > to be nobody inside the container... and later it can access the 
> > > inodes of the shifted fs... and by same current of course...
> > 
> > OK, I still don't understand what you're getting at.  There are
> > three
> > per-thread uids: uid, euid and fsuid (real, effective and
> > filesystem). 
> >  They're all either settable via syscall or inherited on fork. 
> >  They're
> > all kernel side, meaning they're kuid_t.  Their values stay
> > invariant
> > as you move through namespaces.  They change (and get mapped
> > according
> > to the current user namespace setting) when you call set[fe]uid()
> > So
> > when I enter a user namespace with mapping
> > 
> > 0 100000 1000
> > 
> > and call setuid(0) (which sets all three). they all pick up the
> > kuid_t
> > of 100000.  This means that writing a file inside the user
> > namespace
> > after calling setuid(0) appears as real uid 100000 on the medium
> > even
> > though if I call getuid() from the namespace, I get back 0.  What
> > shiftfs does is hijack temporarily the kernel fsuid/fsgid for
> > permission checks, so you can remap to any old uid on the medium
> > (although usually you'd pass in uidmap=0:100000:1000") it maps back
> > from kuid_t 100000 to kuid_t 0, which is why the container can now
> > read
> > and write the underlying medium at on-media id 0 even through root
> > inside the container has kuid_t 100000.  There's no permanent
> > change of
> > fsuid and it stays at its invariant value for the thread except as
> > a
> > temporary measure to do the permission checks on the underlying of
> > the
> > shifted filesystem.
> > 
> > > > > The worst thing is that current_fsuid() does not follow now
> > > > > the
> > > > > /proc/self/uid_map interface! this is a serious vulnerability
> > > > > and 
> > > > > a mix of the current semantics... it's updated but using
> > > > > other
> > > > > rules...?
> > > > 
> > > > current_fsuid() is aready mapped via the userns; it's already a
> > > > kuid_t at its final value.  Shifting that is what you want to
> > > > remap
> > > > underlying volume uid/gid's.  The uidmap/gidmap inputs to this
> > > > are 
> > > > shifts on the final underlying uid/gids.
> > > 
> > > => some points:
> > > Changing setfsuid() its interfaces and rules... or an indrect way
> > > to
> > > break another syscall...
> > 
> > There is no change to setfsuid().
> > 
> > > The userns used for *mapping* is totatly different and not
> > > standard..
> > > . losing "init_user_ns and its decendents userns *semantics*...",
> > > a
> > > yet a totatly unlinked mapping...
> > 
> > There is no user namespace mapping at all.  This is a simple shift,
> > kernel side, of uids and gids at their kuid_t values.
> > 
> > > Breaking current_uid(),current_euid(),current_fsuid() which are
> > > mapped but in *different* user namespaces... hence different
> > > values
> > > inside namespaces... you can change your userns mapping but that
> > > current_fsuid specific one will always be remapped to some other 
> > > value inside even if you don't want it... It crosses user 
> > > namespaces...  uid and euid are remapped according to
> > > /proc/self/uid_
> > > map, fsuid is remapped according to this new interface...
> > > 
> > > Hard coding the mapping, nested containers/apps may *share* fsuid
> > > and
> > > can't get rid of it even if they change the inside userns mapping
> > > to
> > > disable, split, reduce mapped users or offer better isolation
> > > they
> > > can't... no way to make private inodes inside containers if they 
> > > share the final fsuid, inside container mapping is ignored...
> > > ...
> > 
> > OK, I think there's a misunderstanding about how credential
> > overrides
> > work.  They're not permanent changes to the credentials, they're
> > temporary ones to get stuff done within the kernel at a temporary
> > privilege.  You can make credentials permanent if you go through
> > prepare_creds()/commit_creds(), but for making them temporary you
> > do
> > prepare_creds()/override_creds() and then revert_creds() once
> > you're
> > done using them.
> > 
> > If you want to see a current use of this, try fs/open.c:faccessat. 
> >  What it's doing is temporarily overriding fsuid with the real uid
> > to
> > check the permissions before reverting the credentials and
> > returning to
> > the user.
> 
> Thank you for explaining things, but I think you should take the time 
> to read this RFC and understand some problems. This is a quick dump 
> of some problems that it avoids...:

I did.  The problem is how to get the userns to read and write files at
the interior not the exterior id.  Your solution is to thread the
mapping through the VFS and even on to the filesystems themselves to
get the mount option.  I already commented that this is a bit ugly and
couldn't it be encapsulated in a filesystem.  The way I approached the
problem is from the base that I do have build container roots with
shifted uid/gids because I installed them that way.  So, if it already
works, one possible solution is to have a filesystem which does the
shift and mounts the shifted root somewhere in the mount tree for the
namespace to access.  The point about doing it this way is that the
filesystem that does it needs no user namespace knowledge.  All it does
is remap from one on disk id to another using a map function.  How it
gets the map was left up to the admin in the implementation.

> In this series we don't hijack setfsuid() in an indirect way, 
> setfsuid maps UIDs into current userns according to rules set by 
> parent. Changing current_fsuid() to some other mapping is a way to 
> allow processes to bypass that and use it to access other inodes...
> This should not change and fsuid should continue to follow these
> rules...

Both solutions do this

> A cred->fsuid solution is safe or used to be safe only inside
> init_user_ns where there is always a mapping or in context of current
> user namespace. In an other user namespace with 0:1000:1 mapping, 
>  you can't set it to arbitrary mapping like 0:4000:1... It will give
> confined processes access to inodes that satisfy the kuid_t 4000 
> mapping and which the app/container wants to deny, they only want
> 0:1000:1. ..

OK, so both solutions are safe here too.  Your safety comes from only
remapping in the userns; mine comes from the normal filesystem acl
rules: either the userns for different users all have disjoint ids
regulated by /etc/subuidmap or they're all using the same one (like
docker 1.10) in either case, you could regulate by having the mount
under a directory which is accessible only to the userns owner.

> We don't cross user namespaces, we don't use different mappings for
> cred->uid, cred->fsuid...  A clean solution is to shift inodes 
> UID/GID and not change fsuid to cross namespaces. Not to mention how 
> it may interact with capabilities...

This is a subjective question on what constitutes "clean".  I think we
both think the other solution isn't clean, so that's for others to
adjudicate.

> We follow user namespace rules and we keep "the parent defines a 
> range that the children can't escape" semantics.  There is a clear 
> relation between user namespaces that should not be broken.

OK, so I separated the problem into a userns one, which remaps for the
processes in user space, and a vfs one which remaps the on-disk id. 
 However, they could be combined by allowing the userns to mount
shiftfs but only on designated filesystems and setting the uidmappings
to the same ones as the userns.

> We explicitly don't define a new user namespace mapping nor add a new
> interface for the simple reason it's: *too complicated*. We can do 
> that, but no thanks! May be in future if there is a real need or 
> things are clear... The current user namespace interface is getting 
> standard and stable, so we just keep it that way and make it
> consistant inside VFS.

I don't accept the too complicated point.  For fully unprivileged
containers, the host admin already has to set up the subuid/subgid map
files which is most of the complexity.  Once that's done, the same maps
can be used to shift mount.  Once it's all set up, no further
intervention is required.

> We give VFS control of that, and we make mount namespaces the central
> part of this whole logic.

Right, that's what causes the logic to thread throughout the entire vfs
and into the fs layer.  The fundamental point of difference is that I'd
like a solution which encapsulates the problem rather than exposing it
to the vfs.

> We make admins life easier where they can pull container images, root
> filesystems from containers/apps hubs... verify the signature and 
> start them with different mappings according to host resources... We 
> don't want them to do anything. The design was planned to make it 
> easier for users, it should work out of the box, and it can be used 
> to handle complex stuff too, since it's flexible.

Either works easily for users.  Setting stuff up is always the job of
the admin in both solutions.

> Able to support most filesystems including on-disk filesystems
> natively.

Shiftfs does this.  More importantly it supports subtrees, so I can
unpack an image root on to an existing filesystem and remap it into a
container.

> Able to support disk quota according to the shifted UID/GID on-disk
> values. Especially during inode creation...

Quota can be shifted, I just wasn't sure it was necessary.  If the
usual use case is for unpacked roots, chances are you want the
remapping to use the group quota of the userns owner, which they'd get
naturally so, while it's possible to remap projid, I didn't think it
needed to be done.

> Able to support ACL if requested.

Both do this.

> The user namespace mapping is kept a runtime configure option, we 
> don't pin a special mapping at any time, and of course parent creator 
> of user namespace is the one that can manipulate it, at the same time 
> the mapping is restricted according to grandpa rules and so on...
> 
> It allows unprivileged to use the VFS UID/GID shift without the
> intervention of a privileged process each time. The real privileged 
> process sets the filesystem and the mount namespace the first time, 
> then it should work for all nested namespaces and containers. It does 
> not need the intervation of  init_user_ns root to set the mapping and 
> make it work, you don't have to go in and go out to setup the thing,
> etc.

Both solutions work like this.  When I use this for shifted roots of
emulation containers, it's set up once at start of day.  I then build
the containers unprivileged using newsubuid/newsubgid as I'm using
them.  Once the shifts are done at start of day, no other admin support
is required.

James


> We don't do this on behalf of filesystems, they should explicitly
> support it. procfs and other host resource virtual filesystems are 
> safe and currently they don't need shifting.
> 
> We try to fix the problem where it should be fixed, and not hide 
> it...


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman May 15, 2016, 2:21 a.m. UTC | #10
James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:

Just a couple of quick comments from a very high level design point.

- I think a shiftfs is valuable in the same way that overlayfs is
  valuable.

  Esepcially in the Docker case where a lot of containers want a shared
  base image (for efficiency), but it is desirable to run those
  containers in different user namespaces for safety.

- It is also the plan to make it possible to mount a filesystem where
  the uids and gids of that filesystem on disk do not have a one to one
  mapping to kernel uids and gids.  99% of the work has already be done,
  for all filesystem except XFS.

  That said there are some significant issues to work through, before
  something like that can be enabled.

  * Handling of uids/gids on disk that don't map into a kuid/kgid.
  * Safety from poisoned filesystem images.

  I have slowly been working with Seth Forshee on these issues as
  the last thing I want is to introduce more security bugs right now.
  Seth being a braver man than I am has already merged his changes into
  the Ubuntu kernel.

  Right now we are targeting fuse, because fuse is already designed to
  handle poisoned filesystem images.  So to safely enable this kind of
  mapping for fuse is not a giant step.

  The big thing from my point of view is to get the VFS interfaces
  correct so that the VFS handles all of the weird cases that come up
  with uids and gids that don't map, and any other weird cases.  Keeping
  the weird bits out of the filesystems.

James, Djalal  I regert I have not been able to read through either of
your patches cloesely yet.  From a high level view I believe there are
use cases for both approaches, and the use cases do not necessarily
overlap.

Djalal I think you are seeing the upsides and not the practical dangers
of poisoned filesystem images.

James I think you are missing the fact that all filesystems already have
the make_kuid and make_kgid calls right where the data comes off disk,
and the from_kuid and from_kgid calls right where the on-disk data is
being created just before it goes on disk.  Which means that the actual
impact on filesystems of the translation is trivial.

Where the actual impact of filesystems is much higher is the
infrastructure needed to ensure poisoned filesystem images do not cause
a kernel compromise.  That extends to the filesystem testing and code
review process beyond and is more than just a kernel problem.  Hardening
that attack surface of the disk side of filesystems is difficult
especially when not impacting filesystem performance.


So I don't think it makes sense to frame this as an either/or situation.
I think there is a need for both solutions.

Djalal if you could work with Seth I think that would be very useful.  I
know I am dragging my heels there but I really hope I can dig in and get
everything reviewed and merged soonish.

James if you could see shiftfs with a different set of merits than what
to Djalal is doing I think that would be useful.  As it would allow
everyone to concentrate on getting the bugs out of their solutions.



That said I am not certain shiftfs makes sense without Seth's patches to
handle the weird cases at the VFS level.    What do you do with uids and
gids that don't map?  You can reinvent how to handle the strange cases
in shfitfs or we can work on solving this problem at the VFS level so
people don't have to go through the error prone work of reinventing
solutions.


The big ugly nasty in all of this is that we are fundamentally dealing
with uids and gids which are security identifiers.  Practically any bug
is exploitable and CVE worthy.  So it make sense to tread very
carefully.  Even with care it can takes months if not years to get
the number of bugs down to a level where you are not the favorite target
of people looking for exploitable kernel bugs.
 
Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley May 15, 2016, 3:04 p.m. UTC | #11
On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
> James if you could see shiftfs with a different set of merits than 
> what to Djalal is doing I think that would be useful.  As it would 
> allow everyone to concentrate on getting the bugs out of their
> solutions.

Just to reply to this specific point.  Djalal's patches can't actually
work for me because I use subtree based roots rather than whole fs
roots ... it's mostly because I work with image directories, not the
full mounted images themselves.  For stuff I unpack into /home, I could
see having /home on a separate directory and adding the vfs_shift_
flags.  however, I'm not doing (and it would be really unsafe to do)
that for / to get my images that unpack in /var/tmp (like the obs build
roots).

However, half the ugliness of the patch set is that it needs lower
layer FS support because vfs_shift_ are mount flags in the superblock. 
 If they were made subtree flags instead (so MNT_ flags), I think you
could eliminate the need to modify any underlying filesystems and they
would allow us to mark subtrees for shifting.  the mount command would
need modifying to add them (like it was for --shared and --private) so
we'd need an additional --vfs-shift --ufs-shift to mark the subtree but
then the series would work for bind mounting subtrees, which is what I
need.  And they would work for *any* filesystem without modification.

This would probably be the better of both worlds because it will work
for the docker case as well.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Seth Forshee May 16, 2016, 2:12 p.m. UTC | #12
On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.

Maybe not quite so brave as you think. I also threw on a patch to
disable the feature unless explicitly enabled by a sys admin.

> James I think you are missing the fact that all filesystems already have
> the make_kuid and make_kgid calls right where the data comes off disk,
> and the from_kuid and from_kgid calls right where the on-disk data is
> being created just before it goes on disk.  Which means that the actual
> impact on filesystems of the translation is trivial.

It is fairly simple but a there's bit more that just id conversions to
change. With ext4 I found that there were mount options which needed to
be restricted, some capability checks to update, and access to external
journal devices must be checked. In all it wasn't a whole lot of changes
to the filesystem though. Fuse was a bit more involved, but the
complexities there won't apply to other filesystems.

> Djalal if you could work with Seth I think that would be very useful.  I
> know I am dragging my heels there but I really hope I can dig in and get
> everything reviewed and merged soonish.

That would make me very happy :-)

I'm happy to look with Djalal for commonalities. I did skim his patches
before, and based on that all I really expect to find are things related
to permission checks when ids don't map. The rest seems fundamentally
different.

Seth
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman May 16, 2016, 4:42 p.m. UTC | #13
Seth Forshee <seth.forshee@canonical.com> writes:

> On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
>>   I have slowly been working with Seth Forshee on these issues as
>>   the last thing I want is to introduce more security bugs right now.
>>   Seth being a braver man than I am has already merged his changes into
>>   the Ubuntu kernel.
>
> Maybe not quite so brave as you think. I also threw on a patch to
> disable the feature unless explicitly enabled by a sys admin.
>
>> James I think you are missing the fact that all filesystems already have
>> the make_kuid and make_kgid calls right where the data comes off disk,
>> and the from_kuid and from_kgid calls right where the on-disk data is
>> being created just before it goes on disk.  Which means that the actual
>> impact on filesystems of the translation is trivial.
>
> It is fairly simple but a there's bit more that just id conversions to
> change. With ext4 I found that there were mount options which needed to
> be restricted, some capability checks to update, and access to external
> journal devices must be checked. In all it wasn't a whole lot of changes
> to the filesystem though. Fuse was a bit more involved, but the
> complexities there won't apply to other filesystems.
>
>> Djalal if you could work with Seth I think that would be very useful.  I
>> know I am dragging my heels there but I really hope I can dig in and get
>> everything reviewed and merged soonish.
>
> That would make me very happy :-)

It has missed this merge window :( But I am hoping with am aiming to
review them and get your patches (or modified versions of your patches)
into my tree as soon after rc1 as humanly possible.

Part of that will have to be the fix for mqueuefs, that Docker just hit.

> I'm happy to look with Djalal for commonalities. I did skim his patches
> before, and based on that all I really expect to find are things related
> to permission checks when ids don't map. The rest seems fundamentally
> different.

Hmm.  Then I may have to look closer at what Djalal is doing then.  It
sounded like what you were doing and if not, I will scratch my head.

That said yes.  The biggy is getting the VFS changes to handle all of
the weird translation corner cases etc (that are part of your patches).

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Seth Forshee May 16, 2016, 6:25 p.m. UTC | #14
On Mon, May 16, 2016 at 11:42:46AM -0500, Eric W. Biederman wrote:
> Seth Forshee <seth.forshee@canonical.com> writes:
> 
> > On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
> >>   I have slowly been working with Seth Forshee on these issues as
> >>   the last thing I want is to introduce more security bugs right now.
> >>   Seth being a braver man than I am has already merged his changes into
> >>   the Ubuntu kernel.
> >
> > Maybe not quite so brave as you think. I also threw on a patch to
> > disable the feature unless explicitly enabled by a sys admin.
> >
> >> James I think you are missing the fact that all filesystems already have
> >> the make_kuid and make_kgid calls right where the data comes off disk,
> >> and the from_kuid and from_kgid calls right where the on-disk data is
> >> being created just before it goes on disk.  Which means that the actual
> >> impact on filesystems of the translation is trivial.
> >
> > It is fairly simple but a there's bit more that just id conversions to
> > change. With ext4 I found that there were mount options which needed to
> > be restricted, some capability checks to update, and access to external
> > journal devices must be checked. In all it wasn't a whole lot of changes
> > to the filesystem though. Fuse was a bit more involved, but the
> > complexities there won't apply to other filesystems.
> >
> >> Djalal if you could work with Seth I think that would be very useful.  I
> >> know I am dragging my heels there but I really hope I can dig in and get
> >> everything reviewed and merged soonish.
> >
> > That would make me very happy :-)
> 
> It has missed this merge window :( But I am hoping with am aiming to
> review them and get your patches (or modified versions of your patches)
> into my tree as soon after rc1 as humanly possible.
> 
> Part of that will have to be the fix for mqueuefs, that Docker just hit.

Yeah, I've got a patch that's been tested to fix the bug, so I'll send
new patches which include that before long.

Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley May 16, 2016, 7:13 p.m. UTC | #15
On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> 
> Just a couple of quick comments from a very high level design point.
> 
> - I think a shiftfs is valuable in the same way that overlayfs is
>   valuable.
> 
>   Esepcially in the Docker case where a lot of containers want a shared
>   base image (for efficiency), but it is desirable to run those
>   containers in different user namespaces for safety.
> 
> - It is also the plan to make it possible to mount a filesystem where
>   the uids and gids of that filesystem on disk do not have a one to one
>   mapping to kernel uids and gids.  99% of the work has already be done,
>   for all filesystem except XFS.

Can you elaborate a bit more on why we want to do this?  I think only
having a single shift of uid_t to kuid_t across the kernel to user
boundary is a nice feature of user namespaces.  Architecturally, it's
not such a big thing to do it as the data goes on to the disk as well,
but what's the use case for it?

>   That said there are some significant issues to work through, before
>   something like that can be enabled.
> 
>   * Handling of uids/gids on disk that don't map into a kuid/kgid.

So I think this is nicely handled in the capability checks in
generic_permission() (capable_wrt_inode_uidgid()) is there a need to
make it more complex (and thus more error prone)?

>   * Safety from poisoned filesystem images.

By poisoned FS image, you mean an image over whose internal data the
user has control?  The basic problem of how do we give users write
access to data devices they can then cause to be mounted as
filesystems?

>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.
> 
>   Right now we are targeting fuse, because fuse is already designed to
>   handle poisoned filesystem images.  So to safely enable this kind of
>   mapping for fuse is not a giant step.
> 
>   The big thing from my point of view is to get the VFS interfaces
>   correct so that the VFS handles all of the weird cases that come up
>   with uids and gids that don't map, and any other weird cases.  Keeping
>   the weird bits out of the filesystems.

If by VFS interfaces, you mean where we've already got the mapping 
confined, absolutely.

> James I think you are missing the fact that all filesystems already 
> have the make_kuid and make_kgid calls right where the data comes off
> disk,

I beg to differ: they certainly don't.  The underlying filesystem
populates the inode in ->lookup with the data off the disk which goes
into the inode as a kuid_t/kgid_t  It remains forever in the inode as
that.  We convert it as it goes out of the kernel in the stat calls
(actually stat.c:cp_old/new_stat())

>  and the from_kuid and from_kgid calls right where the on-disk data
> is being created just before it goes on disk.  Which means that the
> actual impact on filesystems of the translation is trivial.

Are you looking at a different tree from me?  I'm actually just looking
at Linus git head.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Djalal Harouni May 17, 2016, 11:42 a.m. UTC | #16
Hi Eric,

On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> 
> Just a couple of quick comments from a very high level design point.
> 
> - I think a shiftfs is valuable in the same way that overlayfs is
>   valuable.
> 
>   Esepcially in the Docker case where a lot of containers want a shared
>   base image (for efficiency), but it is desirable to run those
>   containers in different user namespaces for safety.
> 
> - It is also the plan to make it possible to mount a filesystem where
>   the uids and gids of that filesystem on disk do not have a one to one
>   mapping to kernel uids and gids.  99% of the work has already be done,
>   for all filesystem except XFS.
> 
>   That said there are some significant issues to work through, before
>   something like that can be enabled.
> 
>   * Handling of uids/gids on disk that don't map into a kuid/kgid.
>   * Safety from poisoned filesystem images.
> 
>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.
> 
>   Right now we are targeting fuse, because fuse is already designed to
>   handle poisoned filesystem images.  So to safely enable this kind of
>   mapping for fuse is not a giant step.

Alright!

>   The big thing from my point of view is to get the VFS interfaces
>   correct so that the VFS handles all of the weird cases that come up
>   with uids and gids that don't map, and any other weird cases.  Keeping
>   the weird bits out of the filesystems.

Indeed, I totally agree here.


> James, Djalal  I regert I have not been able to read through either of
> your patches cloesely yet.  From a high level view I believe there are
> use cases for both approaches, and the use cases do not necessarily
> overlap.
> 
> Djalal I think you are seeing the upsides and not the practical dangers
> of poisoned filesystem images.

Thanks for your reply Eric, I will let you sleep on the approach. Yes
it's totatly different thing, I think you can consider it as a first
step to use filesystems inside user namespace safely. Real root is still
the only one who mounts and sets the mount namespace shift flag that can
be inherited by unprivlieged userns users.. So real root is *still* in
control of things. The solution is flexible. At the same time you have
the fuse patches for ones that want to use it for unprivileged mounts, and
later and it depends on the future and the state of art or how things
are and improve...

The real problem seems poisoned filesystem images, ok I agree. However
this series considers at the moment only real root is the one who has to
mount filesystems that will be used for user namespaces.

So nothing real changes, just consider it like this:
1) root of init_user_ns mounts filesystems with mount shift flags and
create shift mount namespace.
2) then give access for inodes that have inode->{uid/gid} that match
the inside mapping of the calling process. This is like real root doing
recursive chown of files to give rwx permission but without hitting the
real disk. Every thing is virtual.

So nothing really changes for poisoned filesystems since unprivileged
users can't mount them, only real is able to do so, and he can verify
the image before doing so...

Now, the problem that I can see is if there is some special inodes
related to these filesystems and host resources that are marked 0400
only for real root, in this case we have to add the needed capability
check, capable in init_user_ns. For ioctl I guess they are already safe
since they should have the appropriate capable check, but I will check
it of course.

Now, as Seth has been working with fuse mounts, and I guess they will be
merged, I will of course check with him so everything is synced and that
this approach will continue to work after his patches are merged.


> James I think you are missing the fact that all filesystems already have
> the make_kuid and make_kgid calls right where the data comes off disk,
> and the from_kuid and from_kgid calls right where the on-disk data is
> being created just before it goes on disk.  Which means that the actual
> impact on filesystems of the translation is trivial.
> 
> Where the actual impact of filesystems is much higher is the
> infrastructure needed to ensure poisoned filesystem images do not cause
> a kernel compromise.  That extends to the filesystem testing and code
> review process beyond and is more than just a kernel problem.  Hardening
> that attack surface of the disk side of filesystems is difficult
> especially when not impacting filesystem performance.
> 
> 
> So I don't think it makes sense to frame this as an either/or situation.
> I think there is a need for both solutions.
> 
> Djalal if you could work with Seth I think that would be very useful.  I
> know I am dragging my heels there but I really hope I can dig in and get
> everything reviewed and merged soonish.

Alright!


> James if you could see shiftfs with a different set of merits than what
> to Djalal is doing I think that would be useful.  As it would allow
> everyone to concentrate on getting the bugs out of their solutions.
> 
> 
> 
> That said I am not certain shiftfs makes sense without Seth's patches to
> handle the weird cases at the VFS level.    What do you do with uids and
> gids that don't map?  You can reinvent how to handle the strange cases
> in shfitfs or we can work on solving this problem at the VFS level so
> people don't have to go through the error prone work of reinventing
> solutions.
> 
> 
> The big ugly nasty in all of this is that we are fundamentally dealing
> with uids and gids which are security identifiers.  Practically any bug
> is exploitable and CVE worthy.  So it make sense to tread very
> carefully.  Even with care it can takes months if not years to get
> the number of bugs down to a level where you are not the favorite target
> of people looking for exploitable kernel bugs.

I totally share this concern, that's why this RFC was designed like this,
when you have time please check it, thanks!

Here just for the record, I had a series that works with overlayfs that
updated current_fsuid() to match inodes to give access, and later drop
it for another better solution, but in the end I'm pretty sure that
this should be handled inside VFS, and do not mess with creds or
current_fsuid since they are global values, they cross user namespaces.

> Eric
Djalal Harouni May 17, 2016, 3:42 p.m. UTC | #17
On Sat, May 14, 2016 at 06:46:54AM -0700, James Bottomley wrote:
> On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> > On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> > > On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > > > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:

[...]
> > In this series we don't hijack setfsuid() in an indirect way, 
> > setfsuid maps UIDs into current userns according to rules set by 
> > parent. Changing current_fsuid() to some other mapping is a way to 
> > allow processes to bypass that and use it to access other inodes...
> > This should not change and fsuid should continue to follow these
> > rules...
> 
> Both solutions do this

James, I don't update current_fsuid() nor any other creds field in this
RFC. For the reason that if I've a pinned mapping of 0:100000:65536 that
containers or apps want to use for their own purpose, an app X started
by privileged process and sets global uid to 100000 and its current user
namespace 0:100000:65536, and that app X forks another app Y with global
uid 100000 sandbox it, hide other processes, sets its user namespace
mapping to 1000:100000:1 for app Y, same thing for app Z  2000:100000:1
restrict the set of syscalls for both Y and Z... even with all this they
will be able keep their access to inode->i_uid == 0 where we don't want
that since we don't give a mapping to 0... we just want them to access
inode->i_uid == 1000 for app Y and 2000 for app Z... they cross user
namespaces... they use another mapping... and if Y forks to another app
and even if it sets a new userns mapping with a new restricted range, it
will continue to use the old range 65536 and inodes will show up with
real uids instead of nobody..


> > A cred->fsuid solution is safe or used to be safe only inside
> > init_user_ns where there is always a mapping or in context of current
> > user namespace. In an other user namespace with 0:1000:1 mapping, 
> >  you can't set it to arbitrary mapping like 0:4000:1... It will give
> > confined processes access to inodes that satisfy the kuid_t 4000 
> > mapping and which the app/container wants to deny, they only want
> > 0:1000:1. ..
> 
> OK, so both solutions are safe here too.  Your safety comes from only
> remapping in the userns; mine comes from the normal filesystem acl
> rules: either the userns for different users all have disjoint ids
> regulated by /etc/subuidmap or they're all using the same one (like
> docker 1.10) in either case, you could regulate by having the mount
> under a directory which is accessible only to the userns owner.

Please see above comment. Nested unprivileged apps may want to restrict
syscall operations and access to inodes, maybe we don't want the forked
sandboxed app to have access to inodes, and it will be hard if not
impossible if you update global creds each time...


> > We don't cross user namespaces, we don't use different mappings for
> > cred->uid, cred->fsuid...  A clean solution is to shift inodes 
> > UID/GID and not change fsuid to cross namespaces. Not to mention how 
> > it may interact with capabilities...
> 
> This is a subjective question on what constitutes "clean".  I think we
> both think the other solution isn't clean, so that's for others to
> adjudicate.

If you see it that way :-) , I just want to access from user namespace
in the safest way as possible, if there is a better solution or if my
patches are buggy, I'll drop them... no problem!


> > We follow user namespace rules and we keep "the parent defines a 
> > range that the children can't escape" semantics.  There is a clear 
> > relation between user namespaces that should not be broken.
> 
> OK, so I separated the problem into a userns one, which remaps for the
> processes in user space, and a vfs one which remaps the on-disk id. 
>  However, they could be combined by allowing the userns to mount
> shiftfs but only on designated filesystems and setting the uidmappings
> to the same ones as the userns.
> 
> > We explicitly don't define a new user namespace mapping nor add a new
> > interface for the simple reason it's: *too complicated*. We can do 
> > that, but no thanks! May be in future if there is a real need or 
> > things are clear... The current user namespace interface is getting 
> > standard and stable, so we just keep it that way and make it
> > consistant inside VFS.
> 
> I don't accept the too complicated point.  For fully unprivileged
> containers, the host admin already has to set up the subuid/subgid map
> files which is most of the complexity.  Once that's done, the same maps
> can be used to shift mount.  Once it's all set up, no further
> intervention is required.

Well, please check my first comment. In this RFC you don't have to be
always the real root or a privileged parent to do so... it allows nesting
since it seems that the maintainers want nesting support.


> > We give VFS control of that, and we make mount namespaces the central
> > part of this whole logic.
> 
> Right, that's what causes the logic to thread throughout the entire vfs
> and into the fs layer.  The fundamental point of difference is that I'd
> like a solution which encapsulates the problem rather than exposing it
> to the vfs.
> 
> > We make admins life easier where they can pull container images, root
> > filesystems from containers/apps hubs... verify the signature and 
> > start them with different mappings according to host resources... We 
> > don't want them to do anything. The design was planned to make it 
> > easier for users, it should work out of the box, and it can be used 
> > to handle complex stuff too, since it's flexible.
> 
> Either works easily for users.  Setting stuff up is always the job of
> the admin in both solutions.


Hmm, I don't agree here, things should be safe by default and work out
of the box without the intervention of the admin.


> > Able to support most filesystems including on-disk filesystems
> > natively.
> 
> Shiftfs does this.  More importantly it supports subtrees, so I can
> unpack an image root on to an existing filesystem and remap it into a
> container.

This RFC should support subtrees of course, the mapping is done in the
context of the mount namespace of the caller.


> > Able to support disk quota according to the shifted UID/GID on-disk
> > values. Especially during inode creation...
> 
> Quota can be shifted, I just wasn't sure it was necessary.  If the
> usual use case is for unpacked roots, chances are you want the
> remapping to use the group quota of the userns owner, which they'd get
> naturally so, while it's possible to remap projid, I didn't think it
> needed to be done.
> 
> > Able to support ACL if requested.
> 
> Both do this.
> 
> > The user namespace mapping is kept a runtime configure option, we 
> > don't pin a special mapping at any time, and of course parent creator 
> > of user namespace is the one that can manipulate it, at the same time 
> > the mapping is restricted according to grandpa rules and so on...
> > 
> > It allows unprivileged to use the VFS UID/GID shift without the
> > intervention of a privileged process each time. The real privileged 
> > process sets the filesystem and the mount namespace the first time, 
> > then it should work for all nested namespaces and containers. It does 
> > not need the intervation of  init_user_ns root to set the mapping and 
> > make it work, you don't have to go in and go out to setup the thing,
> > etc.
> 
> Both solutions work like this.  When I use this for shifted roots of
> emulation containers, it's set up once at start of day.  I then build
> the containers unprivileged using newsubuid/newsubgid as I'm using
> them.  Once the shifts are done at start of day, no other admin support
> is required.

This RFC does not require the intervention of the admin or real root
process to adapt the mapping, when the filesystem is mounted with
shifted options and the mount namespace is created, everything is
inherited and you can use real separation for new nested containers/apps
if you want... you don't need the intervention of a privileged entity to
adapt the mapping at the start or after...

Just take a stock rootfs or an image and use it.

Please note that the approach this RFC takes was never discussed... I'll
let everyone sleep on it and see later after the merge window. Thanks!


> James
> 
> 
> > We don't do this on behalf of filesystems, they should explicitly
> > support it. procfs and other host resource virtual filesystems are 
> > safe and currently they don't need shifting.
> > 
> > We try to fix the problem where it should be fixed, and not hide 
> > it...
> 
>
Eric W. Biederman May 17, 2016, 10:40 p.m. UTC | #18
James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
>> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>> 
>> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
>> 
>> Just a couple of quick comments from a very high level design point.
>> 
>> - I think a shiftfs is valuable in the same way that overlayfs is
>>   valuable.
>> 
>>   Esepcially in the Docker case where a lot of containers want a shared
>>   base image (for efficiency), but it is desirable to run those
>>   containers in different user namespaces for safety.
>> 
>> - It is also the plan to make it possible to mount a filesystem where
>>   the uids and gids of that filesystem on disk do not have a one to one
>>   mapping to kernel uids and gids.  99% of the work has already be done,
>>   for all filesystem except XFS.
>
> Can you elaborate a bit more on why we want to do this?  I think only
> having a single shift of uid_t to kuid_t across the kernel to user
> boundary is a nice feature of user namespaces.  Architecturally, it's
> not such a big thing to do it as the data goes on to the disk as well,
> but what's the use case for it?

fuse/nfs or just plain sanity.  As the data comes off disk we convert it
into the kernel internal form kuid_t and kgid_t.   For shiftfs this
would be converting the uids when they come from your underlying
filesystem to the upper level vfs abstractions.

Converting to the kernel form for a filesystem such as fuse that is does
all that is necessary to keep evil users from breaking the kernel means
that we call allow users in a user namespace to mount fuse themselves.
Supply whatever uids and gids they want in the fuse messages.  If the
uids/gids don't map from the mounting users user namespace into the
kernel then we set inode->i_uid to INVALID_UID.

That is all we ask of a filesystem, and we are sorting out the rest in
the VFS as nothing sets INVALID_UID in inode->i_uid today.


>>   That said there are some significant issues to work through, before
>>   something like that can be enabled.
>> 
>>   * Handling of uids/gids on disk that don't map into a kuid/kgid.
>
> So I think this is nicely handled in the capability checks in
> generic_permission() (capable_wrt_inode_uidgid()) is there a need to
> make it more complex (and thus more error prone)?

No just a need to handle INVALID_UID, and INVALID_GID which we don't
handle today.

>>   * Safety from poisoned filesystem images.
>
> By poisoned FS image, you mean an image over whose internal data the
> user has control?  The basic problem of how do we give users write
> access to data devices they can then cause to be mounted as
> filesystems?

Yes.  For fuse except for uids and gids this is already solved for most
other filesystems it is a whole new world of horror.

The general case of evil usb devices (think android) that look like
block devices but can return whatever they want already exists in the
wild.

>>   I have slowly been working with Seth Forshee on these issues as
>>   the last thing I want is to introduce more security bugs right now.
>>   Seth being a braver man than I am has already merged his changes into
>>   the Ubuntu kernel.
>> 
>>   Right now we are targeting fuse, because fuse is already designed to
>>   handle poisoned filesystem images.  So to safely enable this kind of
>>   mapping for fuse is not a giant step.
>> 
>>   The big thing from my point of view is to get the VFS interfaces
>>   correct so that the VFS handles all of the weird cases that come up
>>   with uids and gids that don't map, and any other weird cases.  Keeping
>>   the weird bits out of the filesystems.
>
> If by VFS interfaces, you mean where we've already got the mapping 
> confined, absolutely.

Yes.  It is just making certain we handle INVALID_UID and INVALID_GID
that results from a mapping failure.  As we don't handle that in 4.6.0.

>> James I think you are missing the fact that all filesystems already 
>> have the make_kuid and make_kgid calls right where the data comes off
>> disk,
>
> I beg to differ: they certainly don't.  The underlying filesystem
> populates the inode in ->lookup with the data off the disk which goes
> into the inode as a kuid_t/kgid_t  It remains forever in the inode as
> that.  We convert it as it goes out of the kernel in the stat calls
> (actually stat.c:cp_old/new_stat())

They do.  i_uid_write calls make_kuid to map the in comming uid from
disk into a kuid_t.  That is all I was referring to.

The only thing I am looking at infrastructure wise it to make it so that
we cleanly handle when the first parameter to make_kuid is not
&init_user_ns.  That is the core point of Seths work.

>>  and the from_kuid and from_kgid calls right where the on-disk data
>> is being created just before it goes on disk.  Which means that the
>> actual impact on filesystems of the translation is trivial.
>
> Are you looking at a different tree from me?  I'm actually just looking
> at Linus git head.

Take a look at i_uid_read and i_gid_read.  They are inline functions in
fs.h

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/Makefile b/fs/Makefile
index 85b6e13..bad03b2 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@  obj-y				+= exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-m				+= shiftfs.o
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
new file mode 100644
index 0000000..b40cdfe
--- /dev/null
+++ b/fs/shiftfs.c
@@ -0,0 +1,790 @@ 
+#include <linux/cred.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/magic.h>
+#include <linux/parser.h>
+#include <linux/statfs.h>
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+#include <linux/uidgid.h>
+
+struct shiftfs_super_info {
+	struct vfsmount *mnt;
+	struct uid_gid_map uid_map, gid_map;
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry);
+
+enum {
+	OPT_UIDMAP,
+	OPT_GIDMAP,
+	OPT_LAST,
+};
+
+/* global filesystem options */
+static const match_table_t tokens = {
+	{ OPT_UIDMAP, "uidmap=%u:%u:%u" },
+	{ OPT_GIDMAP, "gidmap=%u:%u:%u" },
+	{ OPT_LAST, NULL }
+};
+
+/*
+ * code stolen from user_namespace.c ... except that these functions
+ * return the same id back if unmapped ... should probably have a
+ * library?
+ */
+static u32 map_id_down(struct uid_gid_map *map, u32 id)
+{
+	unsigned idx, extents;
+	u32 first, last;
+
+	/* Find the matching extent */
+	extents = map->nr_extents;
+	smp_rmb();
+	for (idx = 0; idx < extents; idx++) {
+		first = map->extent[idx].first;
+		last = first + map->extent[idx].count - 1;
+		if (id >= first && id <= last)
+			break;
+	}
+	/* Map the id or note failure */
+	if (idx < extents)
+		id = (id - first) + map->extent[idx].lower_first;
+
+	return id;
+}
+
+static u32 map_id_up(struct uid_gid_map *map, u32 id)
+{
+	unsigned idx, extents;
+	u32 first, last;
+
+	/* Find the matching extent */
+	extents = map->nr_extents;
+	smp_rmb();
+	for (idx = 0; idx < extents; idx++) {
+		first = map->extent[idx].lower_first;
+		last = first + map->extent[idx].count - 1;
+		if (id >= first && id <= last)
+			break;
+	}
+	/* Map the id or note failure */
+	if (idx < extents)
+		id = (id - first) + map->extent[idx].first;
+
+	return id;
+}
+
+static bool mappings_overlap(struct uid_gid_map *new_map,
+			     struct uid_gid_extent *extent)
+{
+	u32 upper_first, lower_first, upper_last, lower_last;
+	unsigned idx;
+
+	upper_first = extent->first;
+	lower_first = extent->lower_first;
+	upper_last = upper_first + extent->count - 1;
+	lower_last = lower_first + extent->count - 1;
+
+	for (idx = 0; idx < new_map->nr_extents; idx++) {
+		u32 prev_upper_first, prev_lower_first;
+		u32 prev_upper_last, prev_lower_last;
+		struct uid_gid_extent *prev;
+
+		prev = &new_map->extent[idx];
+
+		prev_upper_first = prev->first;
+		prev_lower_first = prev->lower_first;
+		prev_upper_last = prev_upper_first + prev->count - 1;
+		prev_lower_last = prev_lower_first + prev->count - 1;
+
+		/* Does the upper range intersect a previous extent? */
+		if ((prev_upper_first <= upper_last) &&
+		    (prev_upper_last >= upper_first))
+			return true;
+
+		/* Does the lower range intersect a previous extent? */
+		if ((prev_lower_first <= lower_last) &&
+		    (prev_lower_last >= lower_first))
+			return true;
+	}
+	return false;
+}
+/* end code stolen from user_namespace.c */
+
+static const struct cred *shiftfs_get_up_creds(struct super_block *sb)
+{
+	struct cred *cred = prepare_creds();
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	if (!cred)
+		return NULL;
+
+	cred->fsuid = KUIDT_INIT(map_id_up(&ssi->uid_map, __kuid_val(cred->fsuid)));
+	cred->fsgid = KGIDT_INIT(map_id_up(&ssi->gid_map, __kgid_val(cred->fsgid)));
+
+	return cred;
+}
+
+static const struct cred *shiftfs_new_creds(const struct cred **newcred,
+					    struct super_block *sb)
+{
+	const struct cred *cred = shiftfs_get_up_creds(sb);
+
+	*newcred = cred;
+
+	if (cred)
+		cred = override_creds(cred);
+	else
+		printk(KERN_ERR "Credential override failed: no memory\n");
+
+	return cred;
+}
+
+static void shiftfs_old_creds(const struct cred *oldcred,
+			      const struct cred **newcred)
+{
+	if (!*newcred)
+		return;
+
+	revert_creds(oldcred);
+	put_cred(*newcred);
+}
+
+static int shiftfs_parse_options(struct shiftfs_super_info *ssi, char *options)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int from, to, count;
+	struct uid_gid_map *map, *maps[2] = {
+		[OPT_UIDMAP] = &ssi->uid_map,
+		[OPT_GIDMAP] = &ssi->gid_map,
+	};
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+		struct uid_gid_extent ext;
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		if (token != OPT_UIDMAP && token != OPT_GIDMAP)
+			return -EINVAL;
+		if (match_int(&args[0], &from) ||
+		    match_int(&args[1], &to) ||
+		    match_int(&args[2], &count))
+			return -EINVAL;
+		map = maps[token];
+		if (map->nr_extents >= UID_GID_MAP_MAX_EXTENTS)
+			return -EINVAL;
+		ext.first = from;
+		ext.lower_first = to;
+		ext.count = count;
+		if (mappings_overlap(map, &ext))
+			return -EINVAL;
+		map->extent[map->nr_extents++] = ext;
+	}
+	return 0;
+}
+
+static void shiftfs_d_iput(struct dentry *dentry, struct inode *inode)
+{
+	struct dentry *real = inode->i_private;
+
+	dput(real);
+	iput(inode);
+}
+
+static const struct dentry_operations shiftfs_dentry_ops = {
+	.d_iput		= shiftfs_d_iput,
+};
+
+static int shiftfs_readlink(struct dentry *dentry, char __user *data,
+			    int flags)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->readlink)
+		return iop->readlink(real, data, flags);
+
+	return -EINVAL;
+}
+
+static const char *shiftfs_get_link(struct dentry *dentry, struct inode *inode,
+				    struct delayed_call *done)
+{
+	if (dentry) {
+		struct dentry *real = dentry->d_inode->i_private;
+		struct inode *reali = real->d_inode;
+		const struct inode_operations *iop = reali->i_op;
+		const char *res = ERR_PTR(-EPERM);
+
+		if (iop->get_link)
+			res = iop->get_link(real, reali, done);
+
+		return res;
+	} else {
+		/* RCU lookup not supported */
+		return ERR_PTR(-ECHILD);
+	}
+}
+
+static int shiftfs_setxattr(struct dentry *dentry, const char *name,
+			    const void *value, size_t size, int flags)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+	int err = -EOPNOTSUPP;
+
+	if (iop->setxattr) {
+		const struct cred *oldcred, *newcred;
+
+		oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+		err = iop->setxattr(real, name, value, size, flags);
+		shiftfs_old_creds(oldcred, &newcred);
+	}
+
+	return err;
+}
+
+static ssize_t shiftfs_getxattr(struct dentry *dentry, const char *name,
+				void *value, size_t size)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+	int err = -EOPNOTSUPP;
+
+	if (iop->getxattr) {
+		const struct cred *oldcred, *newcred;
+
+		oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+		err = iop->getxattr(real, name, value, size);
+		shiftfs_old_creds(oldcred, &newcred);
+	}
+
+	return err;
+}
+
+static ssize_t shiftfs_listxattr(struct dentry *dentry, char *list,
+				 size_t size)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->listxattr)
+		return iop->listxattr(real, list, size);
+
+	return -EINVAL;
+}
+
+static int shiftfs_removexattr(struct dentry *dentry, const char *name)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->removexattr)
+		return iop->removexattr(real, name);
+
+	return -EINVAL;
+}
+
+static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
+{
+	struct inode *reali;
+	struct shiftfs_super_info *ssi = inode->i_sb->s_fs_info;
+
+	if (!dentry)
+		return;
+
+	reali = dentry->d_inode;
+
+	if (!reali->i_op->get_link)
+		inode->i_opflags |= IOP_NOFOLLOW;
+
+	inode->i_mapping = reali->i_mapping;
+	inode->i_private = dentry;
+
+	inode->i_uid = KUIDT_INIT(map_id_down(&ssi->uid_map, __kuid_val(reali->i_uid)));
+	inode->i_gid = KGIDT_INIT(map_id_down(&ssi->gid_map, __kgid_val(reali->i_gid)));
+}
+
+static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
+			       umode_t mode, const char *symlink,
+			       struct dentry *hardlink, bool excl)
+{
+	struct dentry *real = dir->i_private, *new;
+	struct inode *reali = real->d_inode, *newi;
+	const struct inode_operations *iop = reali->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+	bool op_ok = false;
+
+	if (hardlink) {
+		op_ok = iop->link;
+	} else {
+		switch (mode & S_IFMT) {
+		case S_IFDIR:
+			op_ok = iop->mkdir;
+			break;
+		case S_IFREG:
+			op_ok = iop->create;
+			break;
+		case S_IFLNK:
+			op_ok = iop->symlink;
+		}
+	}
+	if (!op_ok)
+		return -EINVAL;
+
+
+	newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+	if (!newi)
+		return -ENOMEM;
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+	inode_lock_nested(reali, I_MUTEX_PARENT);
+	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+	err = PTR_ERR(new);
+	if (IS_ERR(new))
+		goto out_unlock;
+
+	if (hardlink) {
+		struct dentry *realhardlink = hardlink->d_inode->i_private;
+
+		err = vfs_link(new, reali, realhardlink, NULL);
+	} else {
+		switch (mode & S_IFMT) {
+		case S_IFDIR:
+			err = vfs_mkdir(reali, new, mode);
+			break;
+		case S_IFREG:
+			err = vfs_create(reali, new, mode, excl);
+			break;
+		case S_IFLNK:
+			err = vfs_symlink(reali, new, symlink);
+		}
+	}
+
+	shiftfs_old_creds(oldcred, &newcred);
+
+	if (err)
+		goto out_dput;
+
+	shiftfs_fill_inode(newi, new);
+
+	d_instantiate(dentry, newi);
+
+	new = NULL;
+	newi = NULL;
+
+ out_dput:
+	dput(new);
+ out_unlock:
+	iput(newi);
+	inode_unlock(reali);
+
+	return err;
+}
+
+static int shiftfs_create(struct inode *dir, struct dentry *dentry,
+			  umode_t mode,  bool excl)
+{
+	mode |= S_IFREG;
+
+	return shiftfs_make_object(dir, dentry, mode, NULL, NULL, excl);
+}
+
+static int shiftfs_mkdir(struct inode *dir, struct dentry *dentry,
+			 umode_t mode)
+{
+	mode |= S_IFDIR;
+
+	return shiftfs_make_object(dir, dentry, mode, NULL, NULL, false);
+}
+
+static int shiftfs_link(struct dentry *dentry, struct inode *dir,
+			struct dentry *hardlink)
+{
+	return shiftfs_make_object(dir, dentry, 0, NULL, hardlink, false);
+}
+
+static int shiftfs_symlink(struct inode *dir, struct dentry *dentry,
+			   const char *symlink)
+{
+	return shiftfs_make_object(dir, dentry, S_IFLNK, symlink, NULL, false);
+}
+
+static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
+{
+	struct dentry *real = dir->i_private, *new;
+	struct inode *reali = real->d_inode;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	inode_lock_nested(reali, I_MUTEX_PARENT);
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+	err = PTR_ERR(new);
+	if (IS_ERR(new))
+		goto out_unlock;
+
+	if (rmdir)
+		err = vfs_rmdir(reali, new);
+	else
+		err = vfs_unlink(reali, new, NULL);
+
+	dput(new);
+
+ out_unlock:
+	shiftfs_old_creds(oldcred, &newcred);
+	inode_unlock(reali);
+
+	return err;
+}
+
+static int shiftfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	return shiftfs_rm(dir, dentry, false);
+}
+
+static int shiftfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	return shiftfs_rm(dir, dentry, true);
+}
+
+static int shiftfs_rename2(struct inode *olddir, struct dentry *old,
+			   struct inode *newdir, struct dentry *new,
+			   unsigned int flags)
+{
+	struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
+		*realold = old->d_inode->i_private,
+		*realnew = new->d_inode->i_private;
+	struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
+	const struct inode_operations *iop = realolddir->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, old->d_sb);
+	err = iop->rename2(realolddir, realold, realnewdir, realnew, flags);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct dentry *real = dir->i_private, *new;
+	struct inode *reali = real->d_inode, *newi;
+	const struct cred *oldcred, *newcred;
+
+	/* note: violation of usual fs rules here: dentries are never
+	 * added with d_add.  This is because we want no dentry cache
+	 * for shiftfs.  All lookups proceed through the dentry cache
+	 * of the underlying filesystem, meaning we always see any
+	 * changes in the underlying */
+
+	inode_lock(reali);
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+	shiftfs_old_creds(oldcred, &newcred);
+	inode_unlock(reali);
+
+	if (IS_ERR(new) || !new)
+		return new;
+
+	if (!new->d_inode) {
+		dput(new);
+		return NULL;
+	}
+
+	newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
+	if (!newi) {
+		dput(new);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	d_instantiate(dentry, newi);
+
+	return NULL;
+}
+
+static int shiftfs_permission(struct inode *inode, int mask)
+{
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
+	if (iop->permission)
+		err = iop->permission(reali, mask);
+	else
+		err = generic_permission(reali, mask);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	struct iattr newattr = *attr;
+	const struct cred *oldcred, *newcred;
+	struct shiftfs_super_info *ssi = dentry->d_sb->s_fs_info;
+	int err;
+
+	newattr.ia_uid = KUIDT_INIT(map_id_up(&ssi->uid_map, __kuid_val(attr->ia_uid)));
+	newattr.ia_gid = KGIDT_INIT(map_id_up(&ssi->gid_map, __kgid_val(attr->ia_gid)));
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	if (iop->setattr)
+		err = iop->setattr(real, &newattr);
+	else
+		err = simple_setattr(real, &newattr);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
+			   struct kstat *stat)
+{
+	struct inode *inode = dentry->d_inode;
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err = 0;
+
+	mnt = dentry->d_sb->s_fs_info;
+
+	if (iop->getattr)
+		err = iop->getattr(mnt, real, stat);
+	else
+		generic_fillattr(reali, stat);
+
+	if (err)
+		return err;
+
+	stat->uid = inode->i_uid;
+	stat->gid = inode->i_gid;
+	return 0;
+}
+
+struct shiftfs_fop_carrier {
+	struct inode *inode;
+	int (*release)(struct inode *, struct file *);
+	struct file_operations fop;
+};
+
+static int shiftfs_release(struct inode *inode, struct file *file)
+{
+	struct shiftfs_fop_carrier *sfc;
+	int err = 0;
+
+	sfc = container_of(file->f_op, struct shiftfs_fop_carrier, fop);
+
+	if (sfc->release)
+		err = sfc->release(inode, file);
+
+	file->f_inode = sfc->inode;
+	file->f_op = sfc->inode->i_fop;
+
+	kfree(sfc);
+
+	return err;
+}
+
+static int shiftfs_open(struct inode *inode, struct file *file)
+{
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct file_operations *fop;
+	struct shiftfs_fop_carrier *sfc;
+	int err = 0;
+
+	sfc = kmalloc(sizeof(*sfc), GFP_KERNEL);
+	if (!sfc)
+		return -ENOMEM;
+
+	if (real->d_flags & DCACHE_OP_SELECT_INODE)
+		reali = real->d_op->d_select_inode(real, file->f_flags);
+
+	fop = reali->i_fop;
+	sfc->inode = inode;
+	memcpy(&sfc->fop, fop, sizeof(*fop));
+	sfc->release = sfc->fop.release;
+	sfc->fop.release = shiftfs_release;
+
+	file->f_op = &sfc->fop;
+	file->f_inode = reali;
+
+	if (fop->open)
+		err = fop->open(reali, file);
+
+	return err;
+}
+
+static const struct inode_operations shiftfs_inode_ops = {
+	/* intercepted */
+	.lookup		= shiftfs_lookup,
+	.getattr	= shiftfs_getattr,
+	.setattr	= shiftfs_setattr,
+	.permission	= shiftfs_permission,
+
+	/*pass though */
+	.mkdir		= shiftfs_mkdir,
+	.symlink	= shiftfs_symlink,
+	.get_link	= shiftfs_get_link,
+	.readlink	= shiftfs_readlink,
+	.unlink		= shiftfs_unlink,
+	.rmdir		= shiftfs_rmdir,
+	.rename2	= shiftfs_rename2,
+	.link		= shiftfs_link,
+	.create		= shiftfs_create,
+	.mknod		= NULL,	/* no special files currently */
+	.setxattr	= shiftfs_setxattr,
+	.getxattr	= shiftfs_getxattr,
+	.listxattr	= shiftfs_listxattr,
+	.removexattr	= shiftfs_removexattr,
+};
+
+static const struct file_operations shiftfs_file_ops = {
+	.open		= shiftfs_open,
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry)
+{
+	struct inode *inode;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return NULL;
+
+	mode &= S_IFMT;
+
+	inode->i_ino = get_next_ino();
+	inode->i_mode = mode;
+	inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+	inode->i_op = &shiftfs_inode_ops;
+	inode->i_fop = &shiftfs_file_ops;
+
+	shiftfs_fill_inode(inode, dentry);
+
+	return inode;
+}
+
+static void shiftfs_put_super(struct super_block *sb)
+{
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	mntput(ssi->mnt);
+	kfree(ssi);
+}
+
+static const struct super_operations shiftfs_super_ops = {
+	.put_super	= shiftfs_put_super,
+};
+
+struct shiftfs_data {
+	void *data;
+	const char *path;
+};
+
+static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
+			      int silent)
+{
+	struct shiftfs_data *data = raw_data;
+	char *name = kstrdup(data->path, GFP_KERNEL);
+	int err = -ENOMEM;
+	struct shiftfs_super_info *ssi = NULL;
+	struct path path;
+
+	if (!name)
+		goto out;
+
+	ssi = kzalloc(sizeof(*ssi), GFP_KERNEL);
+	if (!ssi)
+		goto out;
+
+	err = -EPERM;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	err = shiftfs_parse_options(ssi, data->data);
+	if (err)
+		goto out;
+
+	err = kern_path(name, LOOKUP_FOLLOW, &path);
+	if (err)
+		goto out;
+
+	if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
+		err = -ENOTDIR;
+		goto out_put;
+	}
+	ssi->mnt = path.mnt;
+
+	sb->s_fs_info = ssi;
+	sb->s_magic = SHIFTFS_MAGIC;
+	sb->s_op = &shiftfs_super_ops;
+	sb->s_d_op = &shiftfs_dentry_ops;
+	sb->s_root = d_make_root(shiftfs_new_inode(sb, S_IFDIR, path.dentry));
+
+	return 0;
+
+ out_put:
+	path_put(&path);
+ out:
+	kfree(name);
+	if (err)
+		kfree(ssi);
+	return err;
+}
+
+static struct dentry *shiftfs_mount(struct file_system_type *fs_type,
+				    int flags, const char *dev_name, void *data)
+{
+	struct shiftfs_data d = { data, dev_name };
+
+	return mount_nodev(fs_type, flags, &d, shiftfs_fill_super);
+}
+
+static struct file_system_type shiftfs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "shiftfs",
+	.mount		= shiftfs_mount,
+	.kill_sb	= kill_anon_super,
+};
+
+static int __init shiftfs_init(void)
+{
+	return register_filesystem(&shiftfs_type);
+}
+
+static void __exit shiftfs_exit(void)
+{
+	unregister_filesystem(&shiftfs_type);
+}
+
+MODULE_ALIAS_FS("shiftfs");
+MODULE_AUTHOR("James Bottomley");
+MODULE_DESCRIPTION("uid/gid shifting bind filesystem");
+MODULE_LICENSE("GPL v2");
+module_init(shiftfs_init)
+module_exit(shiftfs_exit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 0de181a..d7992f5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -79,4 +79,6 @@ 
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
 
+#define SHIFTFS_MAGIC		0x6a656a62
+
 #endif /* __LINUX_MAGIC_H__ */