[RFC] nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed
diff mbox

Message ID 20180511211339.GG30522@ZenIV.linux.org.uk
State New
Headers show

Commit Message

Al Viro May 11, 2018, 9:13 p.m. UTC
[-stable fodder; as it is, one can e.g. add
/mnt/cgroup     localhost(rw,no_root_squash,fsid=4242)
to /etc/exports,
mount -t cgroup none /mnt/cgroup
mkdir /tmp/a
mount -t nfs localhost://mnt/cgroup /tmp/a
mkdir /tmp/a/foo

and have knfsd oops; the patch below deals with that.

Questions:
	1) is fh_update() use below legitimate, or should we
do fh_put/fh_compose instead?
	2) is nfserr_serverfail valid for situation when
directory created by such vfs_mkdir() manages to disappear
under us immediately afterwards?  Or should we return nfserr_stale
instead?

It's in vfs.git#for-linus, if you prefer to use git for review...
]

That can (and does, on some filesystems) happen - ->mkdir() (and thus
vfs_mkdir()) can legitimately leave its argument negative and just
unhash it, counting upon the lookup to pick the object we'd created
next time we try to look at that name.

Some vfs_mkdir() callers forget about that possibility...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bruce Fields May 14, 2018, 3:32 p.m. UTC | #1
On Fri, May 11, 2018 at 10:13:39PM +0100, Al Viro wrote:
> [-stable fodder; as it is, one can e.g. add
> /mnt/cgroup     localhost(rw,no_root_squash,fsid=4242)
> to /etc/exports,
> mount -t cgroup none /mnt/cgroup
> mkdir /tmp/a
> mount -t nfs localhost://mnt/cgroup /tmp/a
> mkdir /tmp/a/foo

How is the cgroup filesystem exportable?  That sounds like a bug in
itself.  We don't want people using NFS as some kind of weird remote
configuration protocol.

> and have knfsd oops; the patch below deals with that.
> 
> Questions:
> 	1) is fh_update() use below legitimate, or should we
> do fh_put/fh_compose instead?

fh_update looks OK to me, but do we need it here?  There's already a

	if (!err)
		err = fh_update(reshp);

at the end of nfsd_create_locked.

> 	2) is nfserr_serverfail valid for situation when
> directory created by such vfs_mkdir() manages to disappear
> under us immediately afterwards?  Or should we return nfserr_stale
> instead?

We just got a successful result on the create and the parent's still
locked, so if somebody hits this I think we want them reporting a bug,
not wasting time trying to find a mistake in their own logic.

> It's in vfs.git#for-linus, if you prefer to use git for review...
> ]
> 
> That can (and does, on some filesystems) happen - ->mkdir() (and thus
> vfs_mkdir()) can legitimately leave its argument negative and just
> unhash it, counting upon the lookup to pick the object we'd created
> next time we try to look at that name.
> 
> Some vfs_mkdir() callers forget about that possibility...

I'd rather have this in a helper function with a comment or two, but I
can do that as a followup patch.

--b.

> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 2410b093a2e6..b0555d7d8200 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1201,6 +1201,28 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		break;
>  	case S_IFDIR:
>  		host_err = vfs_mkdir(dirp, dchild, iap->ia_mode);
> +		if (!host_err && unlikely(d_unhashed(dchild))) {
> +			struct dentry *d;
> +			d = lookup_one_len(dchild->d_name.name,
> +					   dchild->d_parent,
> +					   dchild->d_name.len);
> +			if (IS_ERR(d)) {
> +				host_err = PTR_ERR(d);
> +				break;
> +			}
> +			if (unlikely(d_is_negative(d))) {
> +				dput(d);
> +				err = nfserr_serverfault;
> +				goto out;
> +			}
> +			dput(resfhp->fh_dentry);
> +			resfhp->fh_dentry = dget(d);
> +			err = fh_update(resfhp);
> +			dput(dchild);
> +			dchild = d;
> +			if (err)
> +				goto out;
> +		}
>  		break;
>  	case S_IFCHR:
>  	case S_IFBLK:
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro May 14, 2018, 3:45 p.m. UTC | #2
On Mon, May 14, 2018 at 11:32:16AM -0400, J. Bruce Fields wrote:
> On Fri, May 11, 2018 at 10:13:39PM +0100, Al Viro wrote:
> > [-stable fodder; as it is, one can e.g. add
> > /mnt/cgroup     localhost(rw,no_root_squash,fsid=4242)
> > to /etc/exports,
> > mount -t cgroup none /mnt/cgroup
> > mkdir /tmp/a
> > mount -t nfs localhost://mnt/cgroup /tmp/a
> > mkdir /tmp/a/foo
> 
> How is the cgroup filesystem exportable?  That sounds like a bug in
> itself.  We don't want people using NFS as some kind of weird remote
> configuration protocol.

You can't have open-by-fhandle without exportability.  And it's not
the only fs like that.

> > and have knfsd oops; the patch below deals with that.
> > 
> > Questions:
> > 	1) is fh_update() use below legitimate, or should we
> > do fh_put/fh_compose instead?
> 
> fh_update looks OK to me, but do we need it here?  There's already a
> 
> 	if (!err)
> 		err = fh_update(reshp);
> 
> at the end of nfsd_create_locked.

Might be too late for that, though - the trouble hits when we hit
nfsd_create_setattr().

> > 	2) is nfserr_serverfail valid for situation when
> > directory created by such vfs_mkdir() manages to disappear
> > under us immediately afterwards?  Or should we return nfserr_stale
> > instead?
> 
> We just got a successful result on the create and the parent's still
> locked, so if somebody hits this I think we want them reporting a bug,
> not wasting time trying to find a mistake in their own logic.

No.  Suppose it's NFS-over-NFS (and yes, I agree that it's a bad idea;
somebody *has* done that).  Lookup after successful mkdir can legitimately
fail if it's been removed server-side.

And we *will* need to allow nfs_mkdir() to return that way in some cases
(== success with dentry passed to it left unhashed negative), I'm afraid ;-/
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bruce Fields May 14, 2018, 4:27 p.m. UTC | #3
On Mon, May 14, 2018 at 04:45:51PM +0100, Al Viro wrote:
> On Mon, May 14, 2018 at 11:32:16AM -0400, J. Bruce Fields wrote:
> > On Fri, May 11, 2018 at 10:13:39PM +0100, Al Viro wrote:
> > > [-stable fodder; as it is, one can e.g. add
> > > /mnt/cgroup     localhost(rw,no_root_squash,fsid=4242)
> > > to /etc/exports,
> > > mount -t cgroup none /mnt/cgroup
> > > mkdir /tmp/a
> > > mount -t nfs localhost://mnt/cgroup /tmp/a
> > > mkdir /tmp/a/foo
> > 
> > How is the cgroup filesystem exportable?  That sounds like a bug in
> > itself.  We don't want people using NFS as some kind of weird remote
> > configuration protocol.
> 
> You can't have open-by-fhandle without exportability.  And it's not
> the only fs like that.

We could separate the two--add a flag to export_operations, if
necessary.  I haven't formulated a strong argument, but exporting those
filesystems makes me *really* uncomfortable.

Poking around....  Looks like this was added by aa8188253474 "kernfs:
add exportfs operations", and they really do claim a use case for lookup
by filehandle.

> > > and have knfsd oops; the patch below deals with that.
> > > 
> > > Questions:
> > > 	1) is fh_update() use below legitimate, or should we
> > > do fh_put/fh_compose instead?
> > 
> > fh_update looks OK to me, but do we need it here?  There's already a
> > 
> > 	if (!err)
> > 		err = fh_update(reshp);
> > 
> > at the end of nfsd_create_locked.
> 
> Might be too late for that, though - the trouble hits when we hit
> nfsd_create_setattr().

Oh, got it.  Could move the bottom fh_update to just above the
nfsd_create_setattr(), though?

> > > 	2) is nfserr_serverfail valid for situation when
> > > directory created by such vfs_mkdir() manages to disappear
> > > under us immediately afterwards?  Or should we return nfserr_stale
> > > instead?
> > 
> > We just got a successful result on the create and the parent's still
> > locked, so if somebody hits this I think we want them reporting a bug,
> > not wasting time trying to find a mistake in their own logic.
> 
> No.  Suppose it's NFS-over-NFS (and yes, I agree that it's a bad idea;
> somebody *has* done that).  Lookup after successful mkdir can legitimately
> fail if it's been removed server-side.
> 
> And we *will* need to allow nfs_mkdir() to return that way in some cases
> (== success with dentry passed to it left unhashed negative), I'm afraid ;-/

Thanks, makes sense.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro May 14, 2018, 4:47 p.m. UTC | #4
On Mon, May 14, 2018 at 04:45:51PM +0100, Al Viro wrote:

> > > 	2) is nfserr_serverfail valid for situation when
> > > directory created by such vfs_mkdir() manages to disappear
> > > under us immediately afterwards?  Or should we return nfserr_stale
> > > instead?
> > 
> > We just got a successful result on the create and the parent's still
> > locked, so if somebody hits this I think we want them reporting a bug,
> > not wasting time trying to find a mistake in their own logic.
> 
> No.  Suppose it's NFS-over-NFS (and yes, I agree that it's a bad idea;
> somebody *has* done that).  Lookup after successful mkdir can legitimately
> fail if it's been removed server-side.
> 
> And we *will* need to allow nfs_mkdir() to return that way in some cases
> (== success with dentry passed to it left unhashed negative), I'm afraid ;-/

Consider the situation when you have NFS reexported - there's server S1,
exporting a filesystem to S2, which reexports it for client C.

On S2, process A does stat foo, gets ENOENT and proceeds to mkdir foo.  
Dentry of foo is passed to nfs_mkdir(), which calls e.g. nfs3_proc_mkdir(),
which calls nfs3_do_create(), which requests S1 to create the damn thing.
Then, after that succeeds, it calls nfs_instantiate().  There we would 
proceed to get the inode and call d_add(dentry, inode).

In the meanwhile, C, having figured out the fhandle S2 would assign to
foo (e.g. having spied upon the traffic, etc.) sends that fhandle to
S2.  nfs_fh_to_dentry() is called by process B on S2 (either knfsd, or,
in case if C==S2 and attacker has done open-by-fhandle - the caller
of open_by_handle(2)).  It gets the inode - the damn thing has been
created on S1 already.  That inode still has no dentries attached to
it (B has just set it up), so d_obtain_alias() creates one and attaches
to it.

Now A gets around to nfs_fhget() and finds the same inode.  Voila -
d_add(dentry, inode) and we are fucked; two dentries over the same
directory inode.  Fun starts very shortly when fs/exportfs/* code
is used by B to reconnect its dentry - the things really get bogus
once that constraint is violated.

The obvious solution would be to replace that d_add() with
	res = d_splice_alias(inode, dentry);
	if (IS_ERR(res)) {
		error = PTR_ERR(res);
		goto out_error;
	}
	dput(res);
	
leaving the dentry passed to nfs_mkdir() unhashed and negative if
we hit that race.  That's fine - the next lookup (e.g. the one
done by exportfs to reconnect the sucker) will find the right
dentry in dcache; it's just that it won't the one passed to
vfs_mkdir().

That's different from the local case - there mkdir gets the inumber,
inserts locked in-core inode with that inumber into icache and
only then proceeds to set on-disk data up.  Anyone guessing the
inumber (and thus the fhandle) will either get -ESTALE (if they
come first, as in this scenario) or find the in-core inode mkdir
is setting up and wait for it to be unlocked, at which point
d_obtain_alias() will already find it attached to dentry passed
to mkdir.

But that critically relies upon the icache search key being known
to mkdir *BEFORE* the on-disk metadata starts looking acceptable.
For NFS we really can't do that - there the key is S1's fhandle
and we don't get that until S1 has created the damn thing.

I don't see any variation of the trick used by local filesystems
that would not have this race; we really can end up with B getting
there first and creating directory inode with dentry attached to
it before A gets through.  Which, AFAICS, leaves only one solution -
let A put the dentry created by B in place of what had been passed
to A (and leave its argument unhashed).  Which is trivial to
implement (see above); however, it means that NFS itself is in
the same class as cgroup - its ->mkdir() may, in some cases,
end up succeeding and leaving its argument unhashed negative.
Note that dcache is perfectly fine after that - we have hashed
positive dentry with the right name and right parent, over the
inode for directory we'd just created; everything's fine, except
that it's not the struct dentry originally passed to vfs_mkdir().
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro May 14, 2018, 5 p.m. UTC | #5
On Mon, May 14, 2018 at 12:27:04PM -0400, J. Bruce Fields wrote:
> > > 	if (!err)
> > > 		err = fh_update(reshp);
> > > 
> > > at the end of nfsd_create_locked.
> > 
> > Might be too late for that, though - the trouble hits when we hit
> > nfsd_create_setattr().
> 
> Oh, got it.  Could move the bottom fh_update to just above the
> nfsd_create_setattr(), though?

Probably, but I'm not comfortable enough with the nfsd guts wrt in-core
fhandle and assumptions made by users of such, so I'd rather bounce that
question to nfsd maintainers ;-)
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bruce Fields May 14, 2018, 5:56 p.m. UTC | #6
On Mon, May 14, 2018 at 06:00:41PM +0100, Al Viro wrote:
> On Mon, May 14, 2018 at 12:27:04PM -0400, J. Bruce Fields wrote:
> > > > 	if (!err)
> > > > 		err = fh_update(reshp);
> > > > 
> > > > at the end of nfsd_create_locked.
> > > 
> > > Might be too late for that, though - the trouble hits when we hit
> > > nfsd_create_setattr().
> > 
> > Oh, got it.  Could move the bottom fh_update to just above the
> > nfsd_create_setattr(), though?
> 
> Probably, but I'm not comfortable enough with the nfsd guts wrt in-core
> fhandle and assumptions made by users of such, so I'd rather bounce that
> question to nfsd maintainers ;-)

I'm fine with doing it as in your patch and then queuing up cleanup
later.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bruce Fields May 16, 2018, 2:09 p.m. UTC | #7
On Mon, May 14, 2018 at 05:47:43PM +0100, Al Viro wrote:
> On Mon, May 14, 2018 at 04:45:51PM +0100, Al Viro wrote:
> 
> > > > 	2) is nfserr_serverfail valid for situation when
> > > > directory created by such vfs_mkdir() manages to disappear
> > > > under us immediately afterwards?  Or should we return nfserr_stale
> > > > instead?
> > > 
> > > We just got a successful result on the create and the parent's still
> > > locked, so if somebody hits this I think we want them reporting a bug,
> > > not wasting time trying to find a mistake in their own logic.
> > 
> > No.  Suppose it's NFS-over-NFS (and yes, I agree that it's a bad idea;
> > somebody *has* done that).  Lookup after successful mkdir can legitimately
> > fail if it's been removed server-side.
> > 
> > And we *will* need to allow nfs_mkdir() to return that way in some cases
> > (== success with dentry passed to it left unhashed negative), I'm afraid ;-/
> 
> Consider the situation when you have NFS reexported - there's server S1,
> exporting a filesystem to S2, which reexports it for client C.

Thanks for the explanation!  I'd missed the connection between this and
the mkdir/filehandle-lookup races.

	Acked-by: J. Bruce Fields <bfields@redhat.com>

for the patch.

--b.

> 
> On S2, process A does stat foo, gets ENOENT and proceeds to mkdir foo.  
> Dentry of foo is passed to nfs_mkdir(), which calls e.g. nfs3_proc_mkdir(),
> which calls nfs3_do_create(), which requests S1 to create the damn thing.
> Then, after that succeeds, it calls nfs_instantiate().  There we would 
> proceed to get the inode and call d_add(dentry, inode).
> 
> In the meanwhile, C, having figured out the fhandle S2 would assign to
> foo (e.g. having spied upon the traffic, etc.) sends that fhandle to
> S2.  nfs_fh_to_dentry() is called by process B on S2 (either knfsd, or,
> in case if C==S2 and attacker has done open-by-fhandle - the caller
> of open_by_handle(2)).  It gets the inode - the damn thing has been
> created on S1 already.  That inode still has no dentries attached to
> it (B has just set it up), so d_obtain_alias() creates one and attaches
> to it.
> 
> Now A gets around to nfs_fhget() and finds the same inode.  Voila -
> d_add(dentry, inode) and we are fucked; two dentries over the same
> directory inode.  Fun starts very shortly when fs/exportfs/* code
> is used by B to reconnect its dentry - the things really get bogus
> once that constraint is violated.
> 
> The obvious solution would be to replace that d_add() with
> 	res = d_splice_alias(inode, dentry);
> 	if (IS_ERR(res)) {
> 		error = PTR_ERR(res);
> 		goto out_error;
> 	}
> 	dput(res);
> 	
> leaving the dentry passed to nfs_mkdir() unhashed and negative if
> we hit that race.  That's fine - the next lookup (e.g. the one
> done by exportfs to reconnect the sucker) will find the right
> dentry in dcache; it's just that it won't the one passed to
> vfs_mkdir().
> 
> That's different from the local case - there mkdir gets the inumber,
> inserts locked in-core inode with that inumber into icache and
> only then proceeds to set on-disk data up.  Anyone guessing the
> inumber (and thus the fhandle) will either get -ESTALE (if they
> come first, as in this scenario) or find the in-core inode mkdir
> is setting up and wait for it to be unlocked, at which point
> d_obtain_alias() will already find it attached to dentry passed
> to mkdir.
> 
> But that critically relies upon the icache search key being known
> to mkdir *BEFORE* the on-disk metadata starts looking acceptable.
> For NFS we really can't do that - there the key is S1's fhandle
> and we don't get that until S1 has created the damn thing.
> 
> I don't see any variation of the trick used by local filesystems
> that would not have this race; we really can end up with B getting
> there first and creating directory inode with dentry attached to
> it before A gets through.  Which, AFAICS, leaves only one solution -
> let A put the dentry created by B in place of what had been passed
> to A (and leave its argument unhashed).  Which is trivial to
> implement (see above); however, it means that NFS itself is in
> the same class as cgroup - its ->mkdir() may, in some cases,
> end up succeeding and leaving its argument unhashed negative.
> Note that dcache is perfectly fine after that - we have hashed
> positive dentry with the right name and right parent, over the
> inode for directory we'd just created; everything's fine, except
> that it's not the struct dentry originally passed to vfs_mkdir().
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2410b093a2e6..b0555d7d8200 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1201,6 +1201,28 @@  nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		break;
 	case S_IFDIR:
 		host_err = vfs_mkdir(dirp, dchild, iap->ia_mode);
+		if (!host_err && unlikely(d_unhashed(dchild))) {
+			struct dentry *d;
+			d = lookup_one_len(dchild->d_name.name,
+					   dchild->d_parent,
+					   dchild->d_name.len);
+			if (IS_ERR(d)) {
+				host_err = PTR_ERR(d);
+				break;
+			}
+			if (unlikely(d_is_negative(d))) {
+				dput(d);
+				err = nfserr_serverfault;
+				goto out;
+			}
+			dput(resfhp->fh_dentry);
+			resfhp->fh_dentry = dget(d);
+			err = fh_update(resfhp);
+			dput(dchild);
+			dchild = d;
+			if (err)
+				goto out;
+		}
 		break;
 	case S_IFCHR:
 	case S_IFBLK: