[git,pull] vfs fixes

Message ID	20170403022059.GM29622@ZenIV.linux.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Mon, 3 Apr 2017 03:21:00 +0100 From: Al Viro <viro@ZenIV.linux.org.uk> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Eric Biederman <ebiederm@xmission.com> Subject: Re: [git pull] vfs fixes Message-ID: <20170403022059.GM29622@ZenIV.linux.org.uk> References: <20170402170148.GI29622@ZenIV.linux.org.uk> <CA+55aFymH_GwFfh=PhHOmiT5qzr6dwdt3GfmUFJrccLgdhPjTg@mail.gmail.com> <CA+55aFwxjEjV++j-WRXomZrjBR6Dhx9nM-SOwM7pD=zCnMspKQ@mail.gmail.com> <20170403003045.GK29622@ZenIV.linux.org.uk> <20170403004314.GL29622@ZenIV.linux.org.uk> <CA+55aFxevVjhjnxMoLY9OtpGc81ZNUYV4NcFN0VPX7pQY1pxXQ@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <CA+55aFxevVjhjnxMoLY9OtpGc81ZNUYV4NcFN0VPX7pQY1pxXQ@mail.gmail.com> User-Agent: Mutt/1.7.1 (2016-10-04) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Message ID

20170403022059.GM29622@ZenIV.linux.org.uk (mailing list archive)

State

New, archived

Headers

Date: Mon, 3 Apr 2017 03:21:00 +0100
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Eric Biederman <ebiederm@xmission.com>
Subject: Re: [git pull] vfs fixes
Message-ID: <20170403022059.GM29622@ZenIV.linux.org.uk>
References: <20170402170148.GI29622@ZenIV.linux.org.uk>
	<CA+55aFymH_GwFfh=PhHOmiT5qzr6dwdt3GfmUFJrccLgdhPjTg@mail.gmail.com>
	<CA+55aFwxjEjV++j-WRXomZrjBR6Dhx9nM-SOwM7pD=zCnMspKQ@mail.gmail.com>
	<20170403003045.GK29622@ZenIV.linux.org.uk>
	<20170403004314.GL29622@ZenIV.linux.org.uk>
	<CA+55aFxevVjhjnxMoLY9OtpGc81ZNUYV4NcFN0VPX7pQY1pxXQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CA+55aFxevVjhjnxMoLY9OtpGc81ZNUYV4NcFN0VPX7pQY1pxXQ@mail.gmail.com>
User-Agent: Mutt/1.7.1 (2016-10-04)
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Commit Message

Al Viro April 3, 2017, 2:21 a.m. UTC

On Sun, Apr 02, 2017 at 05:58:41PM -0700, Linus Torvalds wrote:

> I had to go and double-check that "DCACHE_DIRECTORY_TYPE" is what
> d_can_lookup() actually checks, so _that_ part is perhaps a bit
> subtle, and might be worth noting in that comment that you edited.
> 
> So the real "rule" ends up being that we only ever look up things from
> dentries of type DCACHE_DIRECTORY_TYPE set, and those had better have
> DCACHE_RCUACCESS bit set.
> 
> And the only reason path_init() only checks it for that case is that
> nd->root and nd->pwd both have to be of type d_can_lookup().
> 
> Do we check that when we set it? I hope/assume we do.

For chdir()/chroot()/pivot_root() it's done by LOOKUP_DIRECTORY in lookup
flags; fchdir() is slightly different - there we check S_ISDIR of inode
of opened file.  Which is almost the same thing, except for
kinda-sorta directories that have no ->lookup() - we have them for
NFS referral points.  It should be impossible to end up with
one of those opened - not even with O_PATH; follow_managed() will be called
and we'll either fail or cross into whatever ends up overmounting them.
Still, it might be cleaner to turn that check into
	d_can_lookup(f.file->f_path.dentry)
simply for consistency sake.

The thing I really don't like is mntns_install().  With sufficiently
nasty nfsroot setup it might be possible to end up with referral point
as one's root/pwd; getting out of such state would be interesting...
Smells like that place should be a solitary follow_down(), not a loop
of follow_down_one(), but I want Eric's opinion on that one; userns stuff
is weird.

Comments

Eric W. Biederman April 3, 2017, 6 a.m. UTC | #1

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sun, Apr 02, 2017 at 05:58:41PM -0700, Linus Torvalds wrote:
>
>> I had to go and double-check that "DCACHE_DIRECTORY_TYPE" is what
>> d_can_lookup() actually checks, so _that_ part is perhaps a bit
>> subtle, and might be worth noting in that comment that you edited.
>> 
>> So the real "rule" ends up being that we only ever look up things from
>> dentries of type DCACHE_DIRECTORY_TYPE set, and those had better have
>> DCACHE_RCUACCESS bit set.
>> 
>> And the only reason path_init() only checks it for that case is that
>> nd->root and nd->pwd both have to be of type d_can_lookup().
>> 
>> Do we check that when we set it? I hope/assume we do.
>
> For chdir()/chroot()/pivot_root() it's done by LOOKUP_DIRECTORY in lookup
> flags; fchdir() is slightly different - there we check S_ISDIR of inode
> of opened file.  Which is almost the same thing, except for
> kinda-sorta directories that have no ->lookup() - we have them for
> NFS referral points.  It should be impossible to end up with
> one of those opened - not even with O_PATH; follow_managed() will be called
> and we'll either fail or cross into whatever ends up overmounting them.
> Still, it might be cleaner to turn that check into
> 	d_can_lookup(f.file->f_path.dentry)
> simply for consistency sake.
>
> The thing I really don't like is mntns_install().  With sufficiently
> nasty nfsroot setup it might be possible to end up with referral point
> as one's root/pwd; getting out of such state would be interesting...
> Smells like that place should be a solitary follow_down(), not a loop
> of follow_down_one(), but I want Eric's opinion on that one; userns stuff
> is weird.

If I read the conversation correctly the concern is that we might
initialize a pwd or root with something that is almost but not quite a
directory in mntns_install.

Refereshing my memory.  d_automount mounts things and is what
is used for nfs referrals.  d_manage blocks waiting for
an automounts to complete or expire.  follow_down just calls d_manage,
follow_manage calls both d_manage and d_automount as appropriate.

If the concern is nfs referral points calling follow_down is wrong and
what is wanted is follow_managed.

The only thing that follow_down prevents is changing onto directories
that are only half mounted, and not really directories yet.  Which
is certainly part of the invarient we want to preserve.

The intent of the logic in mntns_install is to just pick a reasonable
looking place somewhere in that mount namespace to use as a root
directory.  I arbitrarily picked the top of the mount stack on "/".  Which
is typically used as the root directory.  If people really care where
their root is they save a directory file descriptor off somewhere and
call chroot.  So there is a little wiggle room exactly what the code
does.

There is a secondary use of mntns_install which is to give you a way to
access what is under "/" if you are so foolish as to umount "/".  I keep
thinking setns to your own mount namespace would be a handy way to get
back to the rootfs and to use it for something during system shutdown.
I don't know if anyone has actually used setns to your own mount
namespace for that.

The worst case I can see from the proposed change is we would
not be able to umount all of the way down to rootfs.  That
would be a self inflicted wound so I don't care.

I can't imagine anyone mounting an automount point deliberately on /
except as way to confuse the vfs.  Though I can almost imagine getting
there by accident if an automount expires.

So yes please let's change the follow_down_one loop to follow_managed
to preserve the invariant that we always have a directory that
supports d_can_lookup to pass to set_fs_pwd and set_fs_root.

Eric

> diff --git a/fs/dcache.c b/fs/dcache.c
> index 95d71eda8142..05550139a8a6 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1757,7 +1757,13 @@ static unsigned d_flags_for_inode(struct inode *inode)
>  		return DCACHE_MISS_TYPE;
>  
>  	if (S_ISDIR(inode->i_mode)) {
> -		add_flags = DCACHE_DIRECTORY_TYPE;
> +		/*
> +		 * Any potential starting point of lookup should have
> +		 * DCACHE_RCUACCESS; currently directory dentries
> +		 * come from d_alloc() anyway, but it costs us nothing
> +		 * to enforce it here.
> +		 */
> +		add_flags = DCACHE_DIRECTORY_TYPE | DCACHE_RCUACCESS;
>  		if (unlikely(!(inode->i_opflags & IOP_LOOKUP))) {
>  			if (unlikely(!inode->i_op->lookup))
>  				add_flags = DCACHE_AUTODIR_TYPE;
> diff --git a/fs/namei.c b/fs/namei.c
> index d41fab78798b..19dcf62133cc 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2145,6 +2145,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  	int retval = 0;
>  	const char *s = nd->name->name;
>  
> +	if (!*s)
> +		flags &= ~LOOKUP_RCU;
> +
>  	nd->last_type = LAST_ROOT; /* if there are only slashes... */
>  	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
>  	nd->depth = 0;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index cc1375eff88c..31ec9a79d2d4 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3467,6 +3467,7 @@ static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
>  	struct fs_struct *fs = current->fs;
>  	struct mnt_namespace *mnt_ns = to_mnt_ns(ns);
>  	struct path root;
> +	int err;
>  
>  	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
>  	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> @@ -3484,15 +3485,14 @@ static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
>  	root.mnt    = &mnt_ns->root->mnt;
>  	root.dentry = mnt_ns->root->mnt.mnt_root;
>  	path_get(&root);
> -	while(d_mountpoint(root.dentry) && follow_down_one(&root))
> -		;
> -
> -	/* Update the pwd and root */
> -	set_fs_pwd(fs, &root);
> -	set_fs_root(fs, &root);
> -
> +	err = follow_down(&root);
> +	if (likely(!err)) {
> +		/* Update the pwd and root */
> +		set_fs_pwd(fs, &root);
> +		set_fs_root(fs, &root);
> +	}
>  	path_put(&root);
> -	return 0;
> +	return err;
>  }
>  
>  static struct user_namespace *mntns_owner(struct ns_common *ns)
> diff --git a/fs/open.c b/fs/open.c
> index 949cef29c3bb..217b5db588c8 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -459,20 +459,17 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
>  SYSCALL_DEFINE1(fchdir, unsigned int, fd)
>  {
>  	struct fd f = fdget_raw(fd);
> -	struct inode *inode;
>  	int error = -EBADF;
>  
>  	error = -EBADF;
>  	if (!f.file)
>  		goto out;
>  
> -	inode = file_inode(f.file);
> -
>  	error = -ENOTDIR;
> -	if (!S_ISDIR(inode->i_mode))
> +	if (!d_can_lookup(f.file->f_path.dentry))
>  		goto out_putf;
>  
> -	error = inode_permission(inode, MAY_EXEC | MAY_CHDIR);
> +	error = inode_permission(file_inode(f.file), MAY_EXEC | MAY_CHDIR);
>  	if (!error)
>  		set_fs_pwd(current->fs, &f.file->f_path);
>  out_putf:

Al Viro April 3, 2017, 7:46 a.m. UTC | #2

On Mon, Apr 03, 2017 at 01:00:56AM -0500, Eric W. Biederman wrote:

> Refereshing my memory.  d_automount mounts things and is what
> is used for nfs referrals.  d_manage blocks waiting for
> an automounts to complete or expire.  follow_down just calls d_manage,
> follow_manage calls both d_manage and d_automount as appropriate.

D'oh...  Right.  What's more, by that point getting back to original
state on error is needed.

> If the concern is nfs referral points calling follow_down is wrong and
> what is wanted is follow_managed.

... except that follow_managed() takes nameidata and there is no way in
hell we are letting that animal out of fs/namei.c ever again.  Too low-level.

> The intent of the logic in mntns_install is to just pick a reasonable
> looking place somewhere in that mount namespace to use as a root
> directory.  I arbitrarily picked the top of the mount stack on "/".  Which
> is typically used as the root directory.  If people really care where
> their root is they save a directory file descriptor off somewhere and
> call chroot.  So there is a little wiggle room exactly what the code
> does.

Hmm...  If anything, I'm tempted to add LOOKUP_DOWN that would have
path_lookupat() do
	if (unlikely(flags & LOOKUP_DOWN)) {
		struct path path = nd->path;
		dget(nd->path.dentry);
		err = follow_managed(&path, nd);
		if (unlikely(err < 0))
			terminate_walk(nd);
			return err;
		}
		path_to_nameidate(&path, nd);
	}
right after path_init().  Then your stuff would've turned into

        get_mnt_ns(mnt_ns);
        old_mnt_ns = nsproxy->mnt_ns;
        nsproxy->mnt_ns = mnt_ns;

	/* Find the root */
	err = vfs_path_lookup(mnt_ns->root->mnt.mnt_root, &mnt_ns->root->mnt,
				"/", LOOKUP_DOWN, &root);
	if (err) {
		/* revert to old namespace */
		nsproxy->mnt_ns = old_mnt_ns;
		put_mnt_ns(mnt_ns);
		return err;
	}

        /* Update the pwd and root */
        set_fs_pwd(fs, &root);
        set_fs_root(fs, &root);

        path_put(&root);
	put_mnt_ns(old_mnt_ns);
	return 0;

This is absolutely untested, and I won't get around to testing it until
tomorrow, but something along those lines would IMO be better than
exposing a trimmed-down follow_managed(), not to mention struct nameidata
itself...

Ian Kent April 4, 2017, 12:22 a.m. UTC | #3

On Mon, 2017-04-03 at 01:00 -0500, Eric W. Biederman wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> > On Sun, Apr 02, 2017 at 05:58:41PM -0700, Linus Torvalds wrote:
> > 
> > > I had to go and double-check that "DCACHE_DIRECTORY_TYPE" is what
> > > d_can_lookup() actually checks, so _that_ part is perhaps a bit
> > > subtle, and might be worth noting in that comment that you edited.
> > > 
> > > So the real "rule" ends up being that we only ever look up things from
> > > dentries of type DCACHE_DIRECTORY_TYPE set, and those had better have
> > > DCACHE_RCUACCESS bit set.
> > > 
> > > And the only reason path_init() only checks it for that case is that
> > > nd->root and nd->pwd both have to be of type d_can_lookup().
> > > 
> > > Do we check that when we set it? I hope/assume we do.
> > 
> > For chdir()/chroot()/pivot_root() it's done by LOOKUP_DIRECTORY in lookup
> > flags; fchdir() is slightly different - there we check S_ISDIR of inode
> > of opened file.  Which is almost the same thing, except for
> > kinda-sorta directories that have no ->lookup() - we have them for
> > NFS referral points.  It should be impossible to end up with
> > one of those opened - not even with O_PATH; follow_managed() will be called
> > and we'll either fail or cross into whatever ends up overmounting them.
> > Still, it might be cleaner to turn that check into
> > 	d_can_lookup(f.file->f_path.dentry)
> > simply for consistency sake.
> > 
> > The thing I really don't like is mntns_install().  With sufficiently
> > nasty nfsroot setup it might be possible to end up with referral point
> > as one's root/pwd; getting out of such state would be interesting...
> > Smells like that place should be a solitary follow_down(), not a loop
> > of follow_down_one(), but I want Eric's opinion on that one; userns stuff
> > is weird.
> 
> If I read the conversation correctly the concern is that we might
> initialize a pwd or root with something that is almost but not quite a
> directory in mntns_install.
> 
> Refereshing my memory.  d_automount mounts things and is what
> is used for nfs referrals.  d_manage blocks waiting for
> an automounts to complete or expire.  follow_down just calls d_manage,
> follow_manage calls both d_manage and d_automount as appropriate.

AFAIK d_manage() is only defined by autofs.

It was needed by autofs because the the mount creation and addition is done by
another (user space) thread whereas "normal" file systems like NFS do all the
work in-kernel.

> 
> If the concern is nfs referral points calling follow_down is wrong and
> what is wanted is follow_managed.
> 
> The only thing that follow_down prevents is changing onto directories
> that are only half mounted, and not really directories yet.  Which
> is certainly part of the invarient we want to preserve.
> 
> 
> 
> The intent of the logic in mntns_install is to just pick a reasonable
> looking place somewhere in that mount namespace to use as a root
> directory.  I arbitrarily picked the top of the mount stack on "/".  Which
> is typically used as the root directory.  If people really care where
> their root is they save a directory file descriptor off somewhere and
> call chroot.  So there is a little wiggle room exactly what the code
> does.
> 
> There is a secondary use of mntns_install which is to give you a way to
> access what is under "/" if you are so foolish as to umount "/".  I keep
> thinking setns to your own mount namespace would be a handy way to get
> back to the rootfs and to use it for something during system shutdown.
> I don't know if anyone has actually used setns to your own mount
> namespace for that.
> 
> The worst case I can see from the proposed change is we would
> not be able to umount all of the way down to rootfs.  That
> would be a self inflicted wound so I don't care.
> 
> I can't imagine anyone mounting an automount point deliberately on /
> except as way to confuse the vfs.  Though I can almost imagine getting
> there by accident if an automount expires.
> 
> So yes please let's change the follow_down_one loop to follow_managed
> to preserve the invariant that we always have a directory that
> supports d_can_lookup to pass to set_fs_pwd and set_fs_root.
> 
> Eric
> 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 95d71eda8142..05550139a8a6 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1757,7 +1757,13 @@ static unsigned d_flags_for_inode(struct inode
> > *inode)
> >  		return DCACHE_MISS_TYPE;
> >  
> >  	if (S_ISDIR(inode->i_mode)) {
> > -		add_flags = DCACHE_DIRECTORY_TYPE;
> > +		/*
> > +		 * Any potential starting point of lookup should have
> > +		 * DCACHE_RCUACCESS; currently directory dentries
> > +		 * come from d_alloc() anyway, but it costs us nothing
> > +		 * to enforce it here.
> > +		 */
> > +		add_flags = DCACHE_DIRECTORY_TYPE | DCACHE_RCUACCESS;
> >  		if (unlikely(!(inode->i_opflags & IOP_LOOKUP))) {
> >  			if (unlikely(!inode->i_op->lookup))
> >  				add_flags = DCACHE_AUTODIR_TYPE;
> > diff --git a/fs/namei.c b/fs/namei.c
> > index d41fab78798b..19dcf62133cc 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2145,6 +2145,9 @@ static const char *path_init(struct nameidata *nd,
> > unsigned flags)
> >  	int retval = 0;
> >  	const char *s = nd->name->name;
> >  
> > +	if (!*s)
> > +		flags &= ~LOOKUP_RCU;
> > +
> >  	nd->last_type = LAST_ROOT; /* if there are only slashes... */
> >  	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
> >  	nd->depth = 0;
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index cc1375eff88c..31ec9a79d2d4 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -3467,6 +3467,7 @@ static int mntns_install(struct nsproxy *nsproxy,
> > struct ns_common *ns)
> >  	struct fs_struct *fs = current->fs;
> >  	struct mnt_namespace *mnt_ns = to_mnt_ns(ns);
> >  	struct path root;
> > +	int err;
> >  
> >  	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
> >  	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> > @@ -3484,15 +3485,14 @@ static int mntns_install(struct nsproxy *nsproxy,
> > struct ns_common *ns)
> >  	root.mnt    = &mnt_ns->root->mnt;
> >  	root.dentry = mnt_ns->root->mnt.mnt_root;
> >  	path_get(&root);
> > -	while(d_mountpoint(root.dentry) && follow_down_one(&root))
> > -		;
> > -
> > -	/* Update the pwd and root */
> > -	set_fs_pwd(fs, &root);
> > -	set_fs_root(fs, &root);
> > -
> > +	err = follow_down(&root);
> > +	if (likely(!err)) {
> > +		/* Update the pwd and root */
> > +		set_fs_pwd(fs, &root);
> > +		set_fs_root(fs, &root);
> > +	}
> >  	path_put(&root);
> > -	return 0;
> > +	return err;
> >  }
> >  
> >  static struct user_namespace *mntns_owner(struct ns_common *ns)
> > diff --git a/fs/open.c b/fs/open.c
> > index 949cef29c3bb..217b5db588c8 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -459,20 +459,17 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
> >  SYSCALL_DEFINE1(fchdir, unsigned int, fd)
> >  {
> >  	struct fd f = fdget_raw(fd);
> > -	struct inode *inode;
> >  	int error = -EBADF;
> >  
> >  	error = -EBADF;
> >  	if (!f.file)
> >  		goto out;
> >  
> > -	inode = file_inode(f.file);
> > -
> >  	error = -ENOTDIR;
> > -	if (!S_ISDIR(inode->i_mode))
> > +	if (!d_can_lookup(f.file->f_path.dentry))
> >  		goto out_putf;
> >  
> > -	error = inode_permission(inode, MAY_EXEC | MAY_CHDIR);
> > +	error = inode_permission(file_inode(f.file), MAY_EXEC | MAY_CHDIR);
> >  	if (!error)
> >  		set_fs_pwd(current->fs, &f.file->f_path);
> >  out_putf:

Ian Kent April 4, 2017, 12:47 a.m. UTC | #4

On Mon, 2017-04-03 at 01:00 -0500, Eric W. Biederman wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> > On Sun, Apr 02, 2017 at 05:58:41PM -0700, Linus Torvalds wrote:
> > 
> > > I had to go and double-check that "DCACHE_DIRECTORY_TYPE" is what
> > > d_can_lookup() actually checks, so _that_ part is perhaps a bit
> > > subtle, and might be worth noting in that comment that you edited.
> > > 
> > > So the real "rule" ends up being that we only ever look up things from
> > > dentries of type DCACHE_DIRECTORY_TYPE set, and those had better have
> > > DCACHE_RCUACCESS bit set.
> > > 
> > > And the only reason path_init() only checks it for that case is that
> > > nd->root and nd->pwd both have to be of type d_can_lookup().
> > > 
> > > Do we check that when we set it? I hope/assume we do.
> > 
> > For chdir()/chroot()/pivot_root() it's done by LOOKUP_DIRECTORY in lookup
> > flags; fchdir() is slightly different - there we check S_ISDIR of inode
> > of opened file.  Which is almost the same thing, except for
> > kinda-sorta directories that have no ->lookup() - we have them for
> > NFS referral points.  It should be impossible to end up with
> > one of those opened - not even with O_PATH; follow_managed() will be called
> > and we'll either fail or cross into whatever ends up overmounting them.
> > Still, it might be cleaner to turn that check into
> > 	d_can_lookup(f.file->f_path.dentry)
> > simply for consistency sake.
> > 
> > The thing I really don't like is mntns_install().  With sufficiently
> > nasty nfsroot setup it might be possible to end up with referral point
> > as one's root/pwd; getting out of such state would be interesting...
> > Smells like that place should be a solitary follow_down(), not a loop
> > of follow_down_one(), but I want Eric's opinion on that one; userns stuff
> > is weird.
> 
> If I read the conversation correctly the concern is that we might
> initialize a pwd or root with something that is almost but not quite a
> directory in mntns_install.
> 
> Refereshing my memory.  d_automount mounts things and is what
> is used for nfs referrals.  d_manage blocks waiting for
> an automounts to complete or expire.  follow_down just calls d_manage,
> follow_manage calls both d_manage and d_automount as appropriate.
> 
> If the concern is nfs referral points calling follow_down is wrong and
> what is wanted is follow_managed.

The case Al was concerned about (sounds like) where the root (or pwd) being
followed is an NFS referral (a similar case could be NFS file system migration
if (when?) being used, and that's probably more likely to be triggered from a
file system root than a referral).

I can't see how that could happen for a referral, but if it did the follow would
need to call d_automount(). It's unlikely ->d_manage() would factor into it but
it is available for use so should be part of it. 

So follow_down() rather than follow_down_one() sounds like the right thing to
do.

> 
> The only thing that follow_down prevents is changing onto directories
> that are only half mounted, and not really directories yet.  Which
> is certainly part of the invarient we want to preserve.
> 
> 
> 
> The intent of the logic in mntns_install is to just pick a reasonable
> looking place somewhere in that mount namespace to use as a root
> directory.  I arbitrarily picked the top of the mount stack on "/".  Which
> is typically used as the root directory.  If people really care where
> their root is they save a directory file descriptor off somewhere and
> call chroot.  So there is a little wiggle room exactly what the code
> does.
> 
> There is a secondary use of mntns_install which is to give you a way to
> access what is under "/" if you are so foolish as to umount "/".  I keep
> thinking setns to your own mount namespace would be a handy way to get
> back to the rootfs and to use it for something during system shutdown.
> I don't know if anyone has actually used setns to your own mount
> namespace for that.
> 
> The worst case I can see from the proposed change is we would
> not be able to umount all of the way down to rootfs.  That
> would be a self inflicted wound so I don't care.
> 
> I can't imagine anyone mounting an automount point deliberately on /
> except as way to confuse the vfs.  Though I can almost imagine getting
> there by accident if an automount expires.
> 
> So yes please let's change the follow_down_one loop to follow_managed
> to preserve the invariant that we always have a directory that
> supports d_can_lookup to pass to set_fs_pwd and set_fs_root.
> 
> Eric
> 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 95d71eda8142..05550139a8a6 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1757,7 +1757,13 @@ static unsigned d_flags_for_inode(struct inode
> > *inode)
> >  		return DCACHE_MISS_TYPE;
> >  
> >  	if (S_ISDIR(inode->i_mode)) {
> > -		add_flags = DCACHE_DIRECTORY_TYPE;
> > +		/*
> > +		 * Any potential starting point of lookup should have
> > +		 * DCACHE_RCUACCESS; currently directory dentries
> > +		 * come from d_alloc() anyway, but it costs us nothing
> > +		 * to enforce it here.
> > +		 */
> > +		add_flags = DCACHE_DIRECTORY_TYPE | DCACHE_RCUACCESS;
> >  		if (unlikely(!(inode->i_opflags & IOP_LOOKUP))) {
> >  			if (unlikely(!inode->i_op->lookup))
> >  				add_flags = DCACHE_AUTODIR_TYPE;
> > diff --git a/fs/namei.c b/fs/namei.c
> > index d41fab78798b..19dcf62133cc 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2145,6 +2145,9 @@ static const char *path_init(struct nameidata *nd,
> > unsigned flags)
> >  	int retval = 0;
> >  	const char *s = nd->name->name;
> >  
> > +	if (!*s)
> > +		flags &= ~LOOKUP_RCU;
> > +
> >  	nd->last_type = LAST_ROOT; /* if there are only slashes... */
> >  	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
> >  	nd->depth = 0;
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index cc1375eff88c..31ec9a79d2d4 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -3467,6 +3467,7 @@ static int mntns_install(struct nsproxy *nsproxy,
> > struct ns_common *ns)
> >  	struct fs_struct *fs = current->fs;
> >  	struct mnt_namespace *mnt_ns = to_mnt_ns(ns);
> >  	struct path root;
> > +	int err;
> >  
> >  	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
> >  	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> > @@ -3484,15 +3485,14 @@ static int mntns_install(struct nsproxy *nsproxy,
> > struct ns_common *ns)
> >  	root.mnt    = &mnt_ns->root->mnt;
> >  	root.dentry = mnt_ns->root->mnt.mnt_root;
> >  	path_get(&root);
> > -	while(d_mountpoint(root.dentry) && follow_down_one(&root))
> > -		;
> > -
> > -	/* Update the pwd and root */
> > -	set_fs_pwd(fs, &root);
> > -	set_fs_root(fs, &root);
> > -
> > +	err = follow_down(&root);
> > +	if (likely(!err)) {
> > +		/* Update the pwd and root */
> > +		set_fs_pwd(fs, &root);
> > +		set_fs_root(fs, &root);
> > +	}
> >  	path_put(&root);
> > -	return 0;
> > +	return err;
> >  }
> >  
> >  static struct user_namespace *mntns_owner(struct ns_common *ns)
> > diff --git a/fs/open.c b/fs/open.c
> > index 949cef29c3bb..217b5db588c8 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -459,20 +459,17 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
> >  SYSCALL_DEFINE1(fchdir, unsigned int, fd)
> >  {
> >  	struct fd f = fdget_raw(fd);
> > -	struct inode *inode;
> >  	int error = -EBADF;
> >  
> >  	error = -EBADF;
> >  	if (!f.file)
> >  		goto out;
> >  
> > -	inode = file_inode(f.file);
> > -
> >  	error = -ENOTDIR;
> > -	if (!S_ISDIR(inode->i_mode))
> > +	if (!d_can_lookup(f.file->f_path.dentry))
> >  		goto out_putf;
> >  
> > -	error = inode_permission(inode, MAY_EXEC | MAY_CHDIR);
> > +	error = inode_permission(file_inode(f.file), MAY_EXEC | MAY_CHDIR);
> >  	if (!error)
> >  		set_fs_pwd(current->fs, &f.file->f_path);
> >  out_putf:

diff --git a/fs/dcache.c b/fs/dcache.c
index 95d71eda8142..05550139a8a6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1757,7 +1757,13 @@  static unsigned d_flags_for_inode(struct inode *inode)
 		return DCACHE_MISS_TYPE;
 
 	if (S_ISDIR(inode->i_mode)) {
-		add_flags = DCACHE_DIRECTORY_TYPE;
+		/*
+		 * Any potential starting point of lookup should have
+		 * DCACHE_RCUACCESS; currently directory dentries
+		 * come from d_alloc() anyway, but it costs us nothing
+		 * to enforce it here.
+		 */
+		add_flags = DCACHE_DIRECTORY_TYPE | DCACHE_RCUACCESS;
 		if (unlikely(!(inode->i_opflags & IOP_LOOKUP))) {
 			if (unlikely(!inode->i_op->lookup))
 				add_flags = DCACHE_AUTODIR_TYPE;
diff --git a/fs/namei.c b/fs/namei.c
index d41fab78798b..19dcf62133cc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2145,6 +2145,9 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 	int retval = 0;
 	const char *s = nd->name->name;
 
+	if (!*s)
+		flags &= ~LOOKUP_RCU;
+
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
diff --git a/fs/namespace.c b/fs/namespace.c
index cc1375eff88c..31ec9a79d2d4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3467,6 +3467,7 @@  static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
 	struct fs_struct *fs = current->fs;
 	struct mnt_namespace *mnt_ns = to_mnt_ns(ns);
 	struct path root;
+	int err;
 
 	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
 	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
@@ -3484,15 +3485,14 @@  static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
 	root.mnt    = &mnt_ns->root->mnt;
 	root.dentry = mnt_ns->root->mnt.mnt_root;
 	path_get(&root);
-	while(d_mountpoint(root.dentry) && follow_down_one(&root))
-		;
-
-	/* Update the pwd and root */
-	set_fs_pwd(fs, &root);
-	set_fs_root(fs, &root);
-
+	err = follow_down(&root);
+	if (likely(!err)) {
+		/* Update the pwd and root */
+		set_fs_pwd(fs, &root);
+		set_fs_root(fs, &root);
+	}
 	path_put(&root);
-	return 0;
+	return err;
 }
 
 static struct user_namespace *mntns_owner(struct ns_common *ns)
diff --git a/fs/open.c b/fs/open.c
index 949cef29c3bb..217b5db588c8 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -459,20 +459,17 @@  SYSCALL_DEFINE1(chdir, const char __user *, filename)
 SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 {
 	struct fd f = fdget_raw(fd);
-	struct inode *inode;
 	int error = -EBADF;
 
 	error = -EBADF;
 	if (!f.file)
 		goto out;
 
-	inode = file_inode(f.file);
-
 	error = -ENOTDIR;
-	if (!S_ISDIR(inode->i_mode))
+	if (!d_can_lookup(f.file->f_path.dentry))
 		goto out_putf;
 
-	error = inode_permission(inode, MAY_EXEC | MAY_CHDIR);
+	error = inode_permission(file_inode(f.file), MAY_EXEC | MAY_CHDIR);
 	if (!error)
 		set_fs_pwd(current->fs, &f.file->f_path);
 out_putf:

[git,pull] vfs fixes

Commit Message

Comments

Patch