[v9,05/10] namei: O_BENEATH-style path resolution flags
diff mbox series

Message ID 20190706145737.5299-6-cyphar@cyphar.com
State New
Headers show
Series
  • namei: openat2(2) path resolution restrictions
Related show

Commit Message

Aleksa Sarai July 6, 2019, 2:57 p.m. UTC
Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.

These flags are exposed to userspace in a later patchset.

* LOOKUP_XDEV: Disallow mount-point crossing (both *down* into one, or
  *up* from one). The primary "scoping" use is to blocking resolution
  that crosses a bind-mount, which has a similar property to a symlink
  (in the way that it allows for escape from the starting-point). Since
  it is not possible to differentiate bind-mounts However since
  bind-mounting requires privileges (in ways symlinks don't) this has
  been split from LOOKUP_BENEATH. The naming is based on "find -xdev" as
  well as -EXDEV (though find(1) doesn't walk upwards, the semantics
  seem obvious).

* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" jumping. This is
  a very specific restriction, and it exists because /proc/$pid/fd/...
  "symlinks" allow for access outside nd->root and pose risk to
  container runtimes that don't want to be tricked into accessing a host
  path (but do want to allow no-funny-business symlink resolution).

* LOOKUP_NO_SYMLINKS: Disallows symlink jumping *of any kind*. Implies
  LOOKUP_NO_MAGICLINKS (obviously).

* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all magic-link jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of magic-link jumping is done because it is not clear
  whether semantically they should be allowed -- while some magic-links
  are safe there are many that can cause escapes (and once a
  resolution is outside of the root, O_BENEATH will no longer detect
  it). Future patches may re-enable magic-link jumping when such jumps
  would remain inside the root.

The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would
violates their requirement, while the others all return -EXDEV.

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Christian Brauner <christian@brauner.io>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 74 ++++++++++++++++++++++++++++++++++++-------
 include/linux/namei.h |  7 ++++
 2 files changed, 70 insertions(+), 11 deletions(-)

Comments

Al Viro July 12, 2019, 4:33 a.m. UTC | #1
On Sun, Jul 07, 2019 at 12:57:32AM +1000, Aleksa Sarai wrote:

> @@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  	struct inode *inode = nd->inode;
>  
>  	while (1) {
> -		if (path_equal(&nd->path, &nd->root))
> +		if (path_equal(&nd->path, &nd->root)) {
> +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> +				return -EXDEV;

> @@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  				return -ECHILD;
>  			if (&mparent->mnt == nd->path.mnt)
>  				break;
> +			if (unlikely(nd->flags & LOOKUP_XDEV))
> +				return -EXDEV;
>  			/* we know that mountpoint was pinned */
>  			nd->path.dentry = mountpoint;
>  			nd->path.mnt = &mparent->mnt;
> @@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  			return -ECHILD;
>  		if (!mounted)
>  			break;
> +		if (unlikely(nd->flags & LOOKUP_XDEV))
> +			return -EXDEV;

Are you sure these failure exits in follow_dotdot_rcu() won't give
suprious hard errors?

> +	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
> +		error = dirfd_path_init(nd);
> +		if (unlikely(error))
> +			return ERR_PTR(error);
> +		nd->root = nd->path;
> +		if (!(nd->flags & LOOKUP_RCU))
> +			path_get(&nd->root);
> +	}
>  	if (*s == '/') {
>  		if (likely(!nd->root.mnt))
>  			set_root(nd);
> @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  			s = ERR_PTR(error);
>  		return s;
>  	}
> -	error = dirfd_path_init(nd);
> -	if (unlikely(error))
> -		return ERR_PTR(error);
> +	if (likely(!nd->path.mnt)) {

Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
Aleksa Sarai July 12, 2019, 10:57 a.m. UTC | #2
On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Jul 07, 2019 at 12:57:32AM +1000, Aleksa Sarai wrote:
> > @@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  	struct inode *inode = nd->inode;
> >  
> >  	while (1) {
> > -		if (path_equal(&nd->path, &nd->root))
> > +		if (path_equal(&nd->path, &nd->root)) {
> > +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> > +				return -EXDEV;
> 
> > @@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  				return -ECHILD;
> >  			if (&mparent->mnt == nd->path.mnt)
> >  				break;
> > +			if (unlikely(nd->flags & LOOKUP_XDEV))
> > +				return -EXDEV;
> >  			/* we know that mountpoint was pinned */
> >  			nd->path.dentry = mountpoint;
> >  			nd->path.mnt = &mparent->mnt;
> > @@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  			return -ECHILD;
> >  		if (!mounted)
> >  			break;
> > +		if (unlikely(nd->flags & LOOKUP_XDEV))
> > +			return -EXDEV;
> 
> Are you sure these failure exits in follow_dotdot_rcu() won't give
> suprious hard errors?

I could switch to -ECHILD for the *_rcu() checks if you'd prefer that.
Though, I'd have (probably naively) thought that you'd have already
gotten -ECHILD from the seqlock checks if there was a race during ".."
handling.

> > +	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
> > +		error = dirfd_path_init(nd);
> > +		if (unlikely(error))
> > +			return ERR_PTR(error);
> > +		nd->root = nd->path;
> > +		if (!(nd->flags & LOOKUP_RCU))
> > +			path_get(&nd->root);
> > +	}
> >  	if (*s == '/') {
> >  		if (likely(!nd->root.mnt))
> >  			set_root(nd);
> > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >  			s = ERR_PTR(error);
> >  		return s;
> >  	}
> > -	error = dirfd_path_init(nd);
> > -	if (unlikely(error))
> > -		return ERR_PTR(error);
> > +	if (likely(!nd->path.mnt)) {
> 
> Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?

Yes. I did it to be more consistent with the other "have we got the
root" checks elsewhere. Is there another way you'd prefer I do it?
Al Viro July 12, 2019, 12:39 p.m. UTC | #3
On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:

> > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  			s = ERR_PTR(error);
> > >  		return s;
> > >  	}
> > > -	error = dirfd_path_init(nd);
> > > -	if (unlikely(error))
> > > -		return ERR_PTR(error);
> > > +	if (likely(!nd->path.mnt)) {
> > 
> > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
> 
> Yes. I did it to be more consistent with the other "have we got the
> root" checks elsewhere. Is there another way you'd prefer I do it?

"Have we got the root" checks are inevitable evil; here you are making the
control flow in a single function hard to follow.

I *think* what you are doing is
	absolute pathname, no LOOKUP_BENEATH:
		set_root
		error = nd_jump_root(nd)
	else
		error = dirfd_path_init(nd)
	return unlikely(error) ? ERR_PTR(error) : s;
which should be a lot easier to follow (not to mention shorter), but I might
be missing something in all of that.
Al Viro July 12, 2019, 12:55 p.m. UTC | #4
On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
> On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
> 
> > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > >  			s = ERR_PTR(error);
> > > >  		return s;
> > > >  	}
> > > > -	error = dirfd_path_init(nd);
> > > > -	if (unlikely(error))
> > > > -		return ERR_PTR(error);
> > > > +	if (likely(!nd->path.mnt)) {
> > > 
> > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
> > 
> > Yes. I did it to be more consistent with the other "have we got the
> > root" checks elsewhere. Is there another way you'd prefer I do it?
> 
> "Have we got the root" checks are inevitable evil; here you are making the
> control flow in a single function hard to follow.
> 
> I *think* what you are doing is
> 	absolute pathname, no LOOKUP_BENEATH:
> 		set_root
> 		error = nd_jump_root(nd)
> 	else
> 		error = dirfd_path_init(nd)
> 	return unlikely(error) ? ERR_PTR(error) : s;
> which should be a lot easier to follow (not to mention shorter), but I might
> be missing something in all of that.

PS: if that's what's going on, I would be tempted to turn the entire
path_init() part into this:
	if (flags & LOOKUP_BENEATH)
		while (*s == '/')
			s++;
in the very beginning (plus the handling of nd_jump_root() prototype
change, but that belongs with nd_jump_root() change itself, obviously).
Again, I might be missing something here...
Al Viro July 12, 2019, 1:25 p.m. UTC | #5
On Fri, Jul 12, 2019 at 01:55:52PM +0100, Al Viro wrote:
> On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
> > 
> > > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > > >  			s = ERR_PTR(error);
> > > > >  		return s;
> > > > >  	}
> > > > > -	error = dirfd_path_init(nd);
> > > > > -	if (unlikely(error))
> > > > > -		return ERR_PTR(error);
> > > > > +	if (likely(!nd->path.mnt)) {
> > > > 
> > > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
> > > 
> > > Yes. I did it to be more consistent with the other "have we got the
> > > root" checks elsewhere. Is there another way you'd prefer I do it?
> > 
> > "Have we got the root" checks are inevitable evil; here you are making the
> > control flow in a single function hard to follow.
> > 
> > I *think* what you are doing is
> > 	absolute pathname, no LOOKUP_BENEATH:
> > 		set_root
> > 		error = nd_jump_root(nd)
> > 	else
> > 		error = dirfd_path_init(nd)
> > 	return unlikely(error) ? ERR_PTR(error) : s;
> > which should be a lot easier to follow (not to mention shorter), but I might
> > be missing something in all of that.
> 
> PS: if that's what's going on, I would be tempted to turn the entire
> path_init() part into this:
> 	if (flags & LOOKUP_BENEATH)
> 		while (*s == '/')
> 			s++;
> in the very beginning (plus the handling of nd_jump_root() prototype
> change, but that belongs with nd_jump_root() change itself, obviously).
> Again, I might be missing something here...

Argh... I am, at that - you have setting path->root (and grabbing it)
in LOOKUP_BENEATH cases and you do it after dirfd_path_init().  So
how about
	if (flags & LOOKUP_BENEATH)
		while (*s == '/')
			s++;
before the whole thing and
        if (*s == '/') { /* can happen only without LOOKUP_BENEATH */
                set_root(nd);
		error = nd_jump_root(nd);
		if (unlikely(error))
			return ERR_PTR(error);
        } else if (nd->dfd == AT_FDCWD) {
                if (flags & LOOKUP_RCU) {
                        struct fs_struct *fs = current->fs;
                        unsigned seq;

                        do {
                                seq = read_seqcount_begin(&fs->seq);
                                nd->path = fs->pwd;
                                nd->inode = nd->path.dentry->d_inode;
                                nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
                        } while (read_seqcount_retry(&fs->seq, seq));
                } else {
                        get_fs_pwd(current->fs, &nd->path);
                        nd->inode = nd->path.dentry->d_inode;
                }  
        } else {
                /* Caller must check execute permissions on the starting path component */
                struct fd f = fdget_raw(nd->dfd);
                struct dentry *dentry;

                if (!f.file)
                        return ERR_PTR(-EBADF);

                dentry = f.file->f_path.dentry;

                if (*s && unlikely(!d_can_lookup(dentry))) {
                        fdput(f);
                        return ERR_PTR(-ENOTDIR);
                }

                nd->path = f.file->f_path;
                if (flags & LOOKUP_RCU) {
                        nd->inode = nd->path.dentry->d_inode;
                        nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
                } else {
                        path_get(&nd->path);
                        nd->inode = nd->path.dentry->d_inode;
                }
                fdput(f);
        }
	if (flags & LOOKUP_BENEATH) {
		nd->root = nd->path;
		if (!(flags & LOOKUP_RCU))
			path_get(&nd->root);
		else
			nd->root_seq = nd->seq;
	}
	return s;
replacing the part in the end?  Makes for much smaller change; it might
very well still make sense to add dirfd_path_init() as a separate
cleanup (perhaps with the *s == '/' case included), though.
Al Viro July 12, 2019, 3 p.m. UTC | #6
On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:

> 	if (flags & LOOKUP_BENEATH) {
> 		nd->root = nd->path;
> 		if (!(flags & LOOKUP_RCU))
> 			path_get(&nd->root);
> 		else
> 			nd->root_seq = nd->seq;

BTW, this assignment is needed for LOOKUP_RCU case.  Without it
you are pretty much guaranteed that lazy pathwalk will fail,
when it comes to complete_walk().

Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH
combination would someday get passed?
Al Viro July 13, 2019, 2:41 a.m. UTC | #7
On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
> On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
> 
> > 	if (flags & LOOKUP_BENEATH) {
> > 		nd->root = nd->path;
> > 		if (!(flags & LOOKUP_RCU))
> > 			path_get(&nd->root);
> > 		else
> > 			nd->root_seq = nd->seq;
> 
> BTW, this assignment is needed for LOOKUP_RCU case.  Without it
> you are pretty much guaranteed that lazy pathwalk will fail,
> when it comes to complete_walk().
> 
> Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH
> combination would someday get passed?

I don't understand what's going on with ->r_seq in there - your
call of path_is_under() is after having (re-)sampled rename_lock,
but if that was the only .. in there, who's going to recheck
the value?  For that matter, what's to guarantee that the thing
won't get moved just as you are returning from handle_dots()?

IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
Al Viro July 14, 2019, 3:58 a.m. UTC | #8
On Sat, Jul 13, 2019 at 03:41:53AM +0100, Al Viro wrote:
> On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
> > 
> > > 	if (flags & LOOKUP_BENEATH) {
> > > 		nd->root = nd->path;
> > > 		if (!(flags & LOOKUP_RCU))
> > > 			path_get(&nd->root);
> > > 		else
> > > 			nd->root_seq = nd->seq;
> > 
> > BTW, this assignment is needed for LOOKUP_RCU case.  Without it
> > you are pretty much guaranteed that lazy pathwalk will fail,
> > when it comes to complete_walk().
> > 
> > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH
> > combination would someday get passed?
> 
> I don't understand what's going on with ->r_seq in there - your
> call of path_is_under() is after having (re-)sampled rename_lock,
> but if that was the only .. in there, who's going to recheck
> the value?  For that matter, what's to guarantee that the thing
> won't get moved just as you are returning from handle_dots()?
> 
> IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?

Sigh...  Usual effects of trying to document things:

1) LOOKUP_NO_EVAL looks bogus.  It had been introduced by commit 57d4657716ac
(audit: ignore fcaps on umount) and AFAICS it's crap.  It is set in
ksys_umount() and nowhere else.  It's ignored by everything except
filename_mountpoint().  The thing is, call graph for filename_mountpoint()
is
	filename_mountpoint()
		<- user_path_mountpoint_at()
			<- ksys_umount()
		<- kern_path_mountpoint()
			<- autofs_dev_ioctl_ismountpoint()
			<- find_autofs_mount()
				<- autofs_dev_ioctl_open_mountpoint()
				<- autofs_dev_ioctl_requester()
				<- autofs_dev_ioctl_ismountpoint()
In other words, that flag is basically "was filename_mountpoint()
been called by umount(2) or has it come from an autofs ioctl?".
And looking at the rationale in that commit, autofs ioctls need
it just as much as umount(2) does.  Why is it not set for those
as well?  And why is it conditional at all?

1b) ... because audit_inode() wants LOOKUP_... as the last argument,
only to remap it into AUDIT_..., that's why.  So audit needs something
guaranteed not to conflict with LOOKUP_PARENT (another flag getting
remapped).  So why do we bother with remapping those, anyway?  Let's look
at the callers:

fs/namei.c:933: audit_inode(nd->name, nd->stack[0].link.dentry, 0);
fs/namei.c:2353:                audit_inode(name, path->dentry, flags & LOOKUP_PARENT);
fs/namei.c:2394:                audit_inode(name, parent->dentry, LOOKUP_PARENT);
fs/namei.c:2721:                audit_inode(name, path->dentry, flags & LOOKUP_NO_EVAL);
fs/namei.c:3302:                audit_inode(nd->name, dir, LOOKUP_PARENT);
fs/namei.c:3336:                audit_inode(nd->name, file->f_path.dentry, 0);
fs/namei.c:3371:        audit_inode(nd->name, path.dentry, 0);
fs/namei.c:3389:        audit_inode(nd->name, nd->path.dentry, 0);
fs/namei.c:3490:        audit_inode(nd->name, child, 0);
fs/namei.c:3509:                audit_inode(nd->name, path.dentry, 0);
ipc/mqueue.c:788:       audit_inode(name, dentry, 0);

In all but two of those we have a nice constant value - 0 or AUDIT_INODE_PARENT.
One of two exceptions is in filename_mountpoint(), and there we want
unconditional AUDIT_INODE_NOEVAL (see above).  What of the other?  It's
        if (likely(!retval))
                audit_inode(name, path->dentry, flags & LOOKUP_PARENT);
in filename_lookup().  And that is bogus as well.  filename_lookupat() would
better *NOT* get LOOKUP_PARENT in flags.  And it doesn't - not since
commit 8bcb77fabd7c (namei: split off filename_lookupat() with LOOKUP_PARENT)
back in 2015.  In filename_parentat() introduced there we have
                audit_inode(name, parent->dentry, LOOKUP_PARENT);
and at the same point the call in filename_lookupat() should've become
                audit_inode(name, path->dentry, 0);
It hadn't; my fault.  And after fixing that everything becomes nice and
unconditional - the last argument of audit_inode() is always an AUDIT_...
constant or zero.  Moving AUDIT_... definitions outside of ifdef on
CONFIG_AUDITSYSCALL, getting rid of remapping in audit_inode() and
passing the right values in 3 callers that don't pass 0 and LOOKUP_NO_EVAL
can go to hell.

Any objections from audit folks?

2) comment in namei.h is seriously out of sync with reality.  To quote:
 *  - follow links at the end
OK, that's LOOKUP_FOLLOW (1)
 *  - require a directory
... and LOOKUP_DIRECTORY (2)
 *  - ending slashes ok even for nonexistent files
... used to be about LOOKUP_CONTINUE (eight years dead now)
 *  - internal "there are more path components" flag
... LOOKUP_PARENT (16)
 *  - dentry cache is untrusted; force a real lookup
... LOOKUP_REVAL (32)
 *  - suppress terminal automount
... used to be LOOKUP_NO_AUTOMOUNT (128), except that it's been
replaced with LOOKUP_AUTOMOUNT (at 4) almost eight years ago.  And
the meaning of LOOKUP_AUTOMOUNT is opposite to the comment,
of course.
 *  - skip revalidation
... LOOKUP_NO_REVAL (128)
 *  - don't fetch xattrs on audit_inode
... and that's about soon-to-be dead LOOKUP_NO_EVAL (256)

Note that LOOKUP_RCU (at 64) is quietly skipped and so's the tail
of the list.  If not for "suppress terminal automount" bit, I wouldn't
really care, but that one makes for a really nasty trap for readers.
I'm going to convert that to (accurate) comments next to actual defines...

3) while looking through LOOKUP_AUTOMOUNT users,
in aa_bind_mount() we have
        error = kern_path(dev_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &old_path);
matching do_loopback(), while tomoyo_mount_acl() has
                if (!dev_name || kern_path(dev_name, LOOKUP_FOLLOW, &path)) {
And yes, that *is* hit on mount --bind.  As well as on new mounts, where
apparmor (and bdev_lookup()) has plain LOOKUP_FOLLOW.

->sb_mount() is garbage by design (not the least because of the need to
have pathname lookups in the first place, as well as having to duplicate the
demultiplexing parts of do_mount() without fucking it up)...
Aleksa Sarai July 14, 2019, 7 a.m. UTC | #9
On 2019-07-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
> > 
> > > 	if (flags & LOOKUP_BENEATH) {
> > > 		nd->root = nd->path;
> > > 		if (!(flags & LOOKUP_RCU))
> > > 			path_get(&nd->root);
> > > 		else
> > > 			nd->root_seq = nd->seq;
> > 
> > BTW, this assignment is needed for LOOKUP_RCU case.  Without it
> > you are pretty much guaranteed that lazy pathwalk will fail,
> > when it comes to complete_walk().
> > 
> > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH
> > combination would someday get passed?
> 
> I don't understand what's going on with ->r_seq in there - your
> call of path_is_under() is after having (re-)sampled rename_lock,
> but if that was the only .. in there, who's going to recheck
> the value?  For that matter, what's to guarantee that the thing
> won't get moved just as you are returning from handle_dots()?
> 
> IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?

I tried to explain this in the commit message for "namei: aggressively
check for nd->root escape on ".." resolution", but I probably could've
explained it better.

The basic property being guaranteed by LOOKUP_IN_ROOT is that it will
not result in resolution of a path component which was not inside the
root of the dirfd tree at some point during resolution (and that all
absolute symlink and ".." resolution will be done relative to the
dirfd). This may smell slightly of chroot(2), because unfortunately it
is a similar concept -- the reason for this is to allow for a more
efficient way to safely resolve paths inside a rootfs than spawning a
separate process to then pass back the fd to the caller.

We don't want to do a path_is_under() check for every ".." (otherwise
lookups have a quadratic slowdown when doing many ".."s), so I instead
only do a check after a rename or a mount (which are the only operations
which could change what ".." points to). And since we do the
path_is_under() check if m_seq or r_seq need a retry, we can re-take
them[+].

The main reason why I don't re-check path_is_under() after handle_dots()
is that there is no way to be sure that a racing rename didn't happen
after your last path_is_under() check. The rename could happen after the
syscall returns, after all.

So, the main purpose of the check is to make sure that a ".."s after a
rename doesn't result in an escape. If the rename happens after we've
traversed through a ".." that means that the ".." was inside the root in
the first place (a root ".." is handled by follow_dotdot). If the rename
happens after we've gone through handle_dots() and there is no
subsequent ".." then to userspace it looks identical to the rename
occurring after the syscall has returned. If there is a subsequent ".."
after a racing rename then we may have moved into a path that wasn't
path_is_under() and so we have to check it.

The only way I could see you could solve the race completely is if you
had a way for userspace to lock things from being able to be renamed (or
MS_MOVE'd). And that feels like a really bad idea to me.

[+]: You asked why don't I re-take m_seq. The reason is that I don't
	 fully understand all the other m_seq checks being done during
	 resolution, and we aren't definitely doing them all in
	 handle_dots(). So I assumed re-taking it could result in me
	 breaking RCU-walk which obviously would be bad. Since I am the only
	 thing using nd->r_seq, I can re-take it without issue.
Aleksa Sarai July 14, 2019, 10:31 a.m. UTC | #10
On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jul 12, 2019 at 01:55:52PM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
> > > On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
> > > 
> > > > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > > > >  			s = ERR_PTR(error);
> > > > > >  		return s;
> > > > > >  	}
> > > > > > -	error = dirfd_path_init(nd);
> > > > > > -	if (unlikely(error))
> > > > > > -		return ERR_PTR(error);
> > > > > > +	if (likely(!nd->path.mnt)) {
> > > > > 
> > > > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
> > > > 
> > > > Yes. I did it to be more consistent with the other "have we got the
> > > > root" checks elsewhere. Is there another way you'd prefer I do it?
> > > 
> > > "Have we got the root" checks are inevitable evil; here you are making the
> > > control flow in a single function hard to follow.
> > > 
> > > I *think* what you are doing is
> > > 	absolute pathname, no LOOKUP_BENEATH:
> > > 		set_root
> > > 		error = nd_jump_root(nd)
> > > 	else
> > > 		error = dirfd_path_init(nd)
> > > 	return unlikely(error) ? ERR_PTR(error) : s;
> > > which should be a lot easier to follow (not to mention shorter), but I might
> > > be missing something in all of that.
> > 
> > PS: if that's what's going on, I would be tempted to turn the entire
> > path_init() part into this:
> > 	if (flags & LOOKUP_BENEATH)
> > 		while (*s == '/')
> > 			s++;
> > in the very beginning (plus the handling of nd_jump_root() prototype
> > change, but that belongs with nd_jump_root() change itself, obviously).
> > Again, I might be missing something here...
> 
> Argh... I am, at that - you have setting path->root (and grabbing it)
> in LOOKUP_BENEATH cases and you do it after dirfd_path_init().  So
> how about
> 	if (flags & LOOKUP_BENEATH)
> 		while (*s == '/')
> 			s++;

I can do this for LOOKUP_IN_ROOT, but currently the semantics for
LOOKUP_BENEATH is that absolute paths will return -EXDEV
indiscriminately (nd_jump_root() errors out with LOOKUP_BENEATH). To be
honest, the check could actually just be:

  if (flags & LOOKUP_BENEATH)
    if (*s == '/')
	  return ERR_PTR(-EXDEV);

(Though we'd still need -EXDEV in nd_jump_root() for obvious reasons.)

The logic being that an absolute path means that the resolution starts
out without being "beneath" the starting point -- thus violating the
contract of LOOKUP_BENEATH. And since the "handle absolute paths like
they're scoped to the root" is only implemented for LOOKUP_IN_ROOT, I'd
think it's a bit odd to have LOOKUP_BENEATH do it too for absolute
paths.

I'll be honest, this patchset is more confusing to both of us because of
LOOKUP_BENEATH -- I've only kept it since it was part of the original
patchset (O_BENEATH). Personally I think more people will be far more
interested in LOOKUP_IN_ROOT. Does anyone mind if I drop the
LOOKUP_BENEATH parts of this series, and only keep LOOKUP_NO_* and
LOOKUP_IN_ROOT?

I make a change as you outlined for LOOKUP_IN_ROOT, though.
Al Viro July 14, 2019, 2:36 p.m. UTC | #11
On Sun, Jul 14, 2019 at 05:00:29PM +1000, Aleksa Sarai wrote:

> The basic property being guaranteed by LOOKUP_IN_ROOT is that it will
> not result in resolution of a path component which was not inside the
> root of the dirfd tree at some point during resolution (and that all
> absolute symlink and ".." resolution will be done relative to the
> dirfd). This may smell slightly of chroot(2), because unfortunately it
> is a similar concept -- the reason for this is to allow for a more
> efficient way to safely resolve paths inside a rootfs than spawning a
> separate process to then pass back the fd to the caller.

IDGI...  If attacker can modify your subtree, you have already lost -
after all, they can make anything appear inside that tree just before
your syscall is made and bring it back out immediately afterwards.
And if they can't, what is the race you are trying to protect against?
Confused...
Aleksa Sarai July 16, 2019, 8:03 a.m. UTC | #12
On 2019-07-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jul 13, 2019 at 03:41:53AM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
> > > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
> > > 
> > > > 	if (flags & LOOKUP_BENEATH) {
> > > > 		nd->root = nd->path;
> > > > 		if (!(flags & LOOKUP_RCU))
> > > > 			path_get(&nd->root);
> > > > 		else
> > > > 			nd->root_seq = nd->seq;
> > > 
> > > BTW, this assignment is needed for LOOKUP_RCU case.  Without it
> > > you are pretty much guaranteed that lazy pathwalk will fail,
> > > when it comes to complete_walk().
> > > 
> > > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH
> > > combination would someday get passed?
> > 
> > I don't understand what's going on with ->r_seq in there - your
> > call of path_is_under() is after having (re-)sampled rename_lock,
> > but if that was the only .. in there, who's going to recheck
> > the value?  For that matter, what's to guarantee that the thing
> > won't get moved just as you are returning from handle_dots()?
> > 
> > IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
> 
> Sigh...  Usual effects of trying to document things:
> 
> 1) LOOKUP_NO_EVAL looks bogus.  It had been introduced by commit 57d4657716ac
> (audit: ignore fcaps on umount) and AFAICS it's crap.  It is set in
> ksys_umount() and nowhere else.  It's ignored by everything except
> filename_mountpoint().  The thing is, call graph for filename_mountpoint()
> is
> 	filename_mountpoint()
> 		<- user_path_mountpoint_at()
> 			<- ksys_umount()
> 		<- kern_path_mountpoint()
> 			<- autofs_dev_ioctl_ismountpoint()
> 			<- find_autofs_mount()
> 				<- autofs_dev_ioctl_open_mountpoint()
> 				<- autofs_dev_ioctl_requester()
> 				<- autofs_dev_ioctl_ismountpoint()
> In other words, that flag is basically "was filename_mountpoint()
> been called by umount(2) or has it come from an autofs ioctl?".
> And looking at the rationale in that commit, autofs ioctls need
> it just as much as umount(2) does.  Why is it not set for those
> as well?  And why is it conditional at all?

In addition, LOOKUP_NO_EVAL == LOOKUP_OPEN (0x100). Is that meant to be
the case? Also I just saw you have a patch in work.namei that fixes this
up -- do you want me to rebase on top of that?
Aleksa Sarai July 18, 2019, 3:17 a.m. UTC | #13
On 2019-07-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Jul 14, 2019 at 05:00:29PM +1000, Aleksa Sarai wrote:
> > The basic property being guaranteed by LOOKUP_IN_ROOT is that it will
> > not result in resolution of a path component which was not inside the
> > root of the dirfd tree at some point during resolution (and that all
> > absolute symlink and ".." resolution will be done relative to the
> > dirfd). This may smell slightly of chroot(2), because unfortunately it
> > is a similar concept -- the reason for this is to allow for a more
> > efficient way to safely resolve paths inside a rootfs than spawning a
> > separate process to then pass back the fd to the caller.
> 
> IDGI...  If attacker can modify your subtree, you have already lost -
> after all, they can make anything appear inside that tree just before
> your syscall is made and bring it back out immediately afterwards.
> And if they can't, what is the race you are trying to protect against?
> Confused...

I'll be honest, this code mostly exists because Jann Horn said that it
was necessary in order for this interface to be safe against those kinds
of attacks. Though, it's also entirely possible I just am
mis-remembering the attack scenario he described when I posted v1 of
this series last year.

The use-case I need this functionality for (as do other container
runtimes) is one where you are trying to safely interact with a
directory tree that is a (malicious) container's root filesystem -- so
the container won't be able to move the directory tree root, nor can
they move things outside the rootfs into it (or the reverse). Users
dealing with FTP, web, or file servers probably have similar
requirements.

There is an obvious race condition if you allow the attacker to move the
root (I give an example and test-case of it in the last patch in the
series), and given that it is fairly trivial to defend against I don't
see the downside in including it? But it's obviously your call -- and
maybe Jann Horn can explain the reasoning behind this much better than I
can.

Patch
diff mbox series

diff --git a/fs/namei.c b/fs/namei.c
index b490bcf855f8..9c3ed597466b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -852,6 +852,13 @@  static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_BENEATH))
+		return -EXDEV;
+	if (unlikely(nd->flags & LOOKUP_XDEV)) {
+		/* Absolute path arguments to path_init() are allowed. */
+		if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -1104,6 +1111,9 @@  const char *get_link(struct nameidata *nd, bool trailing)
 	int error;
 	const char *res;
 
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return ERR_PTR(-ELOOP);
+
 	if (!(nd->flags & LOOKUP_RCU)) {
 		touch_atime(&last->link);
 		cond_resched();
@@ -1136,6 +1146,11 @@  const char *get_link(struct nameidata *nd, bool trailing)
 		}
 		/* If we just jumped it was because of a magic-link. */
 		if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+				return ERR_PTR(-ELOOP);
+			/* Not currently safe. */
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return ERR_PTR(-EXDEV);
 			/*
 			 * For trailing_symlink we check whether the symlink's
 			 * mode allows us to do what we want through acc_mode.
@@ -1178,8 +1193,9 @@  const char *get_link(struct nameidata *nd, bool trailing)
 	if (*res == '/') {
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (unlikely(nd_jump_root(nd)))
-			return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1360,12 +1376,16 @@  static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1422,6 +1442,8 @@  static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1442,8 +1464,11 @@  static int follow_dotdot_rcu(struct nameidata *nd)
 	struct inode *inode = nd->inode;
 
 	while (1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			struct dentry *old = nd->path.dentry;
 			struct dentry *parent = old->d_parent;
@@ -1468,6 +1493,8 @@  static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1482,6 +1509,8 @@  static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1570,8 +1599,11 @@  static int path_parent_directory(struct path *path)
 static int follow_dotdot(struct nameidata *nd)
 {
 	while(1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			int ret = path_parent_directory(&nd->path);
 			if (ret)
@@ -1580,6 +1612,8 @@  static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
@@ -1794,6 +1828,13 @@  static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
+		/*
+		 * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
+		 * cause our parent to have moved outside of the root and us to skip
+		 * over it.
+		 */
+		if (unlikely(nd->flags & LOOKUP_BENEATH))
+			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
 		if (nd->flags & LOOKUP_RCU) {
@@ -2342,6 +2383,15 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
+
+	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+		nd->root = nd->path;
+		if (!(nd->flags & LOOKUP_RCU))
+			path_get(&nd->root);
+	}
 	if (*s == '/') {
 		if (likely(!nd->root.mnt))
 			set_root(nd);
@@ -2350,9 +2400,11 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 			s = ERR_PTR(error);
 		return s;
 	}
-	error = dirfd_path_init(nd);
-	if (unlikely(error))
-		return ERR_PTR(error);
+	if (likely(!nd->path.mnt)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
 	return s;
 }
 
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..7bc819ad0cd3 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -50,6 +50,13 @@  enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_EMPTY		0x4000
 #define LOOKUP_DOWN		0x8000
 
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH		0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV		0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
+					    Implies LOOKUP_NO_MAGICLINKS. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);