nfsd: Fix error return values for nfsd4_clone_file_range()

Message ID	20190121205838.18680-1-trond.myklebust@hammerspace.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nfs-owner@kernel.org> From: Trond Myklebust <trondmy@gmail.com> To: "J. Bruce Fields" <bfields@redhat.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>, linux-nfs@vger.kernel.org Subject: [PATCH] nfsd: Fix error return values for nfsd4_clone_file_range() Date: Mon, 21 Jan 2019 15:58:38 -0500 Message-Id: <20190121205838.18680-1-trond.myklebust@hammerspace.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk
Series	nfsd: Fix error return values for nfsd4_clone_file_range() \| expand nfsd: Fix error return values for nfsd4_clone_file_range()

Trond Myklebust Jan. 21, 2019, 8:58 p.m. UTC

If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
currently clobber all errors returned by vfs_clone_file_range() and
replace them with EINVAL.

Fixes: 42ec3d4c0218 ("vfs: make remap_file_range functions take and...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.20+
---
 fs/nfsd/vfs.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

J. Bruce Fields Jan. 25, 2019, 12:46 a.m. UTC | #1

On Mon, Jan 21, 2019 at 03:58:38PM -0500, Trond Myklebust wrote:
> If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
> currently clobber all errors returned by vfs_clone_file_range() and
> replace them with EINVAL.

Oops, thanks for the fix.  I'm still a little confused, though:

> 
> Fixes: 42ec3d4c0218 ("vfs: make remap_file_range functions take and...")
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> Cc: stable@vger.kernel.org # v4.20+
> ---
>  fs/nfsd/vfs.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 9824e32b2f23..7dc98e14655d 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -557,9 +557,11 @@ __be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst,
>  	loff_t cloned;
>  
>  	cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count, 0);
> +	if (cloned < 0)
> +		return nfserrno(cloned);
>  	if (count && cloned != count)
> -		cloned = -EINVAL;
> -	return nfserrno(cloned < 0 ? cloned : 0);
> +		return nfserrno(-EINVAL);
> +	return 0;

I still don't understand the cloned != count case.  I thought clone was
supposed to be all-or-nothing and atomic, can it really return a short
copy?  And how is that inval, shouldn't that be serverfault?

--b.

>  }
>  
>  ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> -- 
> 2.20.1

Trond Myklebust Jan. 25, 2019, 5:50 a.m. UTC | #2

On Thu, 2019-01-24 at 19:46 -0500, J. Bruce Fields wrote:
> On Mon, Jan 21, 2019 at 03:58:38PM -0500, Trond Myklebust wrote:
> > If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
> > currently clobber all errors returned by vfs_clone_file_range() and
> > replace them with EINVAL.
> 
> Oops, thanks for the fix.  I'm still a little confused, though:
> 
> > Fixes: 42ec3d4c0218 ("vfs: make remap_file_range functions take
> > and...")
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > Cc: stable@vger.kernel.org # v4.20+
> > ---
> >  fs/nfsd/vfs.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 9824e32b2f23..7dc98e14655d 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -557,9 +557,11 @@ __be32 nfsd4_clone_file_range(struct file
> > *src, u64 src_pos, struct file *dst,
> >  	loff_t cloned;
> >  
> >  	cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos,
> > count, 0);
> > +	if (cloned < 0)
> > +		return nfserrno(cloned);
> >  	if (count && cloned != count)
> > -		cloned = -EINVAL;
> > -	return nfserrno(cloned < 0 ? cloned : 0);
> > +		return nfserrno(-EINVAL);
> > +	return 0;
> 
> I still don't understand the cloned != count case.  I thought clone
> was
> supposed to be all-or-nothing and atomic, can it really return a
> short
> copy?  And how is that inval, shouldn't that be serverfault?

That, quite frankly, seems like more of a question for Darrick, not me.
I haven't changed that part of the code.

The main thing I care about is being able to correctly report
EOPNOTSUPP errors for the vast majority of filesystems that don't
support clone() or dedup().

Cheers
  Trond

J. Bruce Fields Jan. 25, 2019, 4:32 p.m. UTC | #3

On Fri, Jan 25, 2019 at 12:50:09AM -0500, Trond Myklebust wrote:
> On Thu, 2019-01-24 at 19:46 -0500, J. Bruce Fields wrote:
> > On Mon, Jan 21, 2019 at 03:58:38PM -0500, Trond Myklebust wrote:
> > > If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
> > > currently clobber all errors returned by vfs_clone_file_range() and
> > > replace them with EINVAL.
> > 
> > Oops, thanks for the fix.  I'm still a little confused, though:
...
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 9824e32b2f23..7dc98e14655d 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -557,9 +557,11 @@ __be32 nfsd4_clone_file_range(struct file
> > > *src, u64 src_pos, struct file *dst,
> > >  	loff_t cloned;
> > >  
> > >  	cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos,
> > > count, 0);
> > > +	if (cloned < 0)
> > > +		return nfserrno(cloned);
> > >  	if (count && cloned != count)
> > > -		cloned = -EINVAL;
> > > -	return nfserrno(cloned < 0 ? cloned : 0);
> > > +		return nfserrno(-EINVAL);
> > > +	return 0;
> > 
> > I still don't understand the cloned != count case.  I thought clone
> > was
> > supposed to be all-or-nothing and atomic, can it really return a
> > short
> > copy?  And how is that inval, shouldn't that be serverfault?
> 
> That, quite frankly, seems like more of a question for Darrick, not me.
> I haven't changed that part of the code.
> 
> The main thing I care about is being able to correctly report
> EOPNOTSUPP errors for the vast majority of filesystems that don't
> support clone() or dedup().

Makes sense, and I'm happy just to apply this and then sort out the rest in a
subsequent patch, but I'd really like to understand; Darrick?:

ioctl_file_clone also converts short copies to EINVAL:

	if (cloned < 0)
                ret = cloned;
        else if (olen && cloned != olen)
                ret = -EINVAL;
        else
                ret = 0;

Maybe that happens iff we hit EOF in the short file?

Does that mean we can successfully copy up to EOF and then return -EINVAL?
That sounds wrong.

There's a man page (IOCTL-FICLONERANGE(2)) but it doesn't cover this case.

--b.

Olga Kornievskaia Jan. 25, 2019, 4:42 p.m. UTC | #4

On Fri, Jan 25, 2019 at 11:32 AM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Fri, Jan 25, 2019 at 12:50:09AM -0500, Trond Myklebust wrote:
> > On Thu, 2019-01-24 at 19:46 -0500, J. Bruce Fields wrote:
> > > On Mon, Jan 21, 2019 at 03:58:38PM -0500, Trond Myklebust wrote:
> > > > If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
> > > > currently clobber all errors returned by vfs_clone_file_range() and
> > > > replace them with EINVAL.
> > >
> > > Oops, thanks for the fix.  I'm still a little confused, though:
> ...
> > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > > index 9824e32b2f23..7dc98e14655d 100644
> > > > --- a/fs/nfsd/vfs.c
> > > > +++ b/fs/nfsd/vfs.c
> > > > @@ -557,9 +557,11 @@ __be32 nfsd4_clone_file_range(struct file
> > > > *src, u64 src_pos, struct file *dst,
> > > >   loff_t cloned;
> > > >
> > > >   cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos,
> > > > count, 0);
> > > > + if (cloned < 0)
> > > > +         return nfserrno(cloned);
> > > >   if (count && cloned != count)
> > > > -         cloned = -EINVAL;
> > > > - return nfserrno(cloned < 0 ? cloned : 0);
> > > > +         return nfserrno(-EINVAL);
> > > > + return 0;
> > >
> > > I still don't understand the cloned != count case.  I thought clone
> > > was
> > > supposed to be all-or-nothing and atomic, can it really return a
> > > short
> > > copy?  And how is that inval, shouldn't that be serverfault?
> >
> > That, quite frankly, seems like more of a question for Darrick, not me.
> > I haven't changed that part of the code.
> >
> > The main thing I care about is being able to correctly report
> > EOPNOTSUPP errors for the vast majority of filesystems that don't
> > support clone() or dedup().
>
> Makes sense, and I'm happy just to apply this and then sort out the rest in a
> subsequent patch, but I'd really like to understand; Darrick?:
>
> ioctl_file_clone also converts short copies to EINVAL:
>
>         if (cloned < 0)
>                 ret = cloned;
>         else if (olen && cloned != olen)
>                 ret = -EINVAL;
>         else
>                 ret = 0;
>
> Maybe that happens iff we hit EOF in the short file?
>
> Does that mean we can successfully copy up to EOF and then return -EINVAL?
> That sounds wrong.
>
> There's a man page (IOCTL-FICLONERANGE(2)) but it doesn't cover this case.

I thought cloned by definition was all or nothing meaning there can't
be a "short" clone. If you allow for less then asked bytes to be
returned, then your next offsets might not be block aligned.

J. Bruce Fields Jan. 25, 2019, 8:10 p.m. UTC | #5

On Fri, Jan 25, 2019 at 11:42:17AM -0500, Olga Kornievskaia wrote:
> On Fri, Jan 25, 2019 at 11:32 AM J. Bruce Fields <bfields@fieldses.org> wrote:
> >
> > On Fri, Jan 25, 2019 at 12:50:09AM -0500, Trond Myklebust wrote:
> > > On Thu, 2019-01-24 at 19:46 -0500, J. Bruce Fields wrote:
> > > > On Mon, Jan 21, 2019 at 03:58:38PM -0500, Trond Myklebust wrote:
> > > > > If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
> > > > > currently clobber all errors returned by vfs_clone_file_range() and
> > > > > replace them with EINVAL.
> > > >
> > > > Oops, thanks for the fix.  I'm still a little confused, though:
> > ...
> > > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > > > index 9824e32b2f23..7dc98e14655d 100644
> > > > > --- a/fs/nfsd/vfs.c
> > > > > +++ b/fs/nfsd/vfs.c
> > > > > @@ -557,9 +557,11 @@ __be32 nfsd4_clone_file_range(struct file
> > > > > *src, u64 src_pos, struct file *dst,
> > > > >   loff_t cloned;
> > > > >
> > > > >   cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos,
> > > > > count, 0);
> > > > > + if (cloned < 0)
> > > > > +         return nfserrno(cloned);
> > > > >   if (count && cloned != count)
> > > > > -         cloned = -EINVAL;
> > > > > - return nfserrno(cloned < 0 ? cloned : 0);
> > > > > +         return nfserrno(-EINVAL);
> > > > > + return 0;
> > > >
> > > > I still don't understand the cloned != count case.  I thought clone
> > > > was
> > > > supposed to be all-or-nothing and atomic, can it really return a
> > > > short
> > > > copy?  And how is that inval, shouldn't that be serverfault?
> > >
> > > That, quite frankly, seems like more of a question for Darrick, not me.
> > > I haven't changed that part of the code.
> > >
> > > The main thing I care about is being able to correctly report
> > > EOPNOTSUPP errors for the vast majority of filesystems that don't
> > > support clone() or dedup().
> >
> > Makes sense, and I'm happy just to apply this and then sort out the rest in a
> > subsequent patch, but I'd really like to understand; Darrick?:
> >
> > ioctl_file_clone also converts short copies to EINVAL:
> >
> >         if (cloned < 0)
> >                 ret = cloned;
> >         else if (olen && cloned != olen)
> >                 ret = -EINVAL;
> >         else
> >                 ret = 0;
> >
> > Maybe that happens iff we hit EOF in the short file?
> >
> > Does that mean we can successfully copy up to EOF and then return -EINVAL?
> > That sounds wrong.
> >
> > There's a man page (IOCTL-FICLONERANGE(2)) but it doesn't cover this case.
> 
> I thought cloned by definition was all or nothing meaning there can't
> be a "short" clone. If you allow for less then asked bytes to be
> returned, then your next offsets might not be block aligned.

Yeah.  I was assuming it could happen in the case you ask to clone
beyond the end of the source file.  But looking at the code, there's a
check for that case in generic_remap_checks() before doing the clone,
and while holding a write lock on i_rwsem (I assume that's enough to
hold the file size constant).  At least that's true in the cases (btrfs
& xfs) that I checked.

So, I don't know, maybe that check is just dead code.

--b.

J. Bruce Fields Jan. 25, 2019, 8:15 p.m. UTC | #6

On Fri, Jan 25, 2019 at 03:10:37PM -0500, J. Bruce Fields wrote:
> Yeah.  I was assuming it could happen in the case you ask to clone
> beyond the end of the source file.  But looking at the code, there's a
> check for that case in generic_remap_checks() before doing the clone,
> and while holding a write lock on i_rwsem (I assume that's enough to
> hold the file size constant).  At least that's true in the cases (btrfs
> & xfs) that I checked.
> 
> So, I don't know, maybe that check is just dead code.

In the xfs case it looks like the main work of the clone is done in
xfs_reflink_remap_blocks(), where there's a loop like:

	while (len) {
		... mysterious code that clones range_len worth of
		    extents?
		if (fatal_signal_pending(current)) {
			error =-EINTR;
			break;
		}
		...
		len -= range_len;
		remapped_len += range_len;
	}

And then it ends up returning remapped_len if it's positive.

So it looks to me like if you do a big clone on xfs and kill the
process, it can clone part of the range, return the amount cloned, and
then the ioctl code will throw away that amount and just return EINVAL,
with the result that the application thinks the operation failed
completely actually it cloned a bunch of data.

--b.

Darrick J. Wong Jan. 25, 2019, 8:57 p.m. UTC | #7

On Fri, Jan 25, 2019 at 03:15:51PM -0500, J. Bruce Fields wrote:
> On Fri, Jan 25, 2019 at 03:10:37PM -0500, J. Bruce Fields wrote:
> > Yeah.  I was assuming it could happen in the case you ask to clone
> > beyond the end of the source file.  But looking at the code, there's a
> > check for that case in generic_remap_checks() before doing the clone,
> > and while holding a write lock on i_rwsem (I assume that's enough to
> > hold the file size constant).  At least that's true in the cases (btrfs
> > & xfs) that I checked.
> > 
> > So, I don't know, maybe that check is just dead code.
> 
> In the xfs case it looks like the main work of the clone is done in
> xfs_reflink_remap_blocks(), where there's a loop like:
> 
> 	while (len) {
> 		... mysterious code that clones range_len worth of
> 		    extents?
> 		if (fatal_signal_pending(current)) {
> 			error =-EINTR;
> 			break;
> 		}
> 		...
> 		len -= range_len;
> 		remapped_len += range_len;
> 	}
> 
> And then it ends up returning remapped_len if it's positive.

Hmm?  In xfs_reflink_remap_blocks, *remapped (an out parameter) is set
to remapped_len just prior to returning whatever error is.  The caller
(xfs_file_remap_range) can see how many bytes were remapped, as well as
any error that might have cut short the remapping process...

> So it looks to me like if you do a big clone on xfs and kill the
> process, it can clone part of the range, return the amount cloned, and
> then the ioctl code will throw away that amount and just return EINVAL,

...and so xfs_file_remap_range will return the number of bytes remapped
before it returns any error codes.  This was done so that
copy_file_range can call remap_file_range and report a short but
otherwise successful copy.

Yes, it's sort of dumb that we pass the "bytes remapped" information all
the way up the call stack only to have the clonerange ioctl spit back
EINVAL on a short remap, but we're stuck with that (poorly thought out)
artifact of btrfs.

Note that XFS can return cloned < count if (for example) it runs out of
space trying to expand the extent map btree to add more extents.

> with the result that the application thinks the operation failed
> completely actually it cloned a bunch of data.

(Yes, the ioctl is dumb; I would say that programs should use
copy_file_range instead as a less bad interface, but the splice copy
fallback is ...  yuck.)

--D

> --b.

J. Bruce Fields Jan. 26, 2019, 10:36 p.m. UTC | #8

On Fri, Jan 25, 2019 at 12:57:26PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 25, 2019 at 03:15:51PM -0500, J. Bruce Fields wrote:
> > On Fri, Jan 25, 2019 at 03:10:37PM -0500, J. Bruce Fields wrote:
> > > Yeah.  I was assuming it could happen in the case you ask to clone
> > > beyond the end of the source file.  But looking at the code, there's a
> > > check for that case in generic_remap_checks() before doing the clone,
> > > and while holding a write lock on i_rwsem (I assume that's enough to
> > > hold the file size constant).  At least that's true in the cases (btrfs
> > > & xfs) that I checked.
> > > 
> > > So, I don't know, maybe that check is just dead code.
> > 
> > In the xfs case it looks like the main work of the clone is done in
> > xfs_reflink_remap_blocks(), where there's a loop like:
> > 
> > 	while (len) {
> > 		... mysterious code that clones range_len worth of
> > 		    extents?
> > 		if (fatal_signal_pending(current)) {
> > 			error =-EINTR;
> > 			break;
> > 		}
> > 		...
> > 		len -= range_len;
> > 		remapped_len += range_len;
> > 	}
> > 
> > And then it ends up returning remapped_len if it's positive.
> 
> Hmm?  In xfs_reflink_remap_blocks, *remapped (an out parameter) is set
> to remapped_len just prior to returning whatever error is.  The caller
> (xfs_file_remap_range) can see how many bytes were remapped, as well as
> any error that might have cut short the remapping process...
> 
> > So it looks to me like if you do a big clone on xfs and kill the
> > process, it can clone part of the range, return the amount cloned, and
> > then the ioctl code will throw away that amount and just return EINVAL,
> 
> ...and so xfs_file_remap_range will return the number of bytes remapped
> before it returns any error codes.  This was done so that
> copy_file_range can call remap_file_range and report a short but
> otherwise successful copy.
> 
> Yes, it's sort of dumb that we pass the "bytes remapped" information all
> the way up the call stack only to have the clonerange ioctl spit back
> EINVAL on a short remap, but we're stuck with that (poorly thought out)
> artifact of btrfs.

Is there any real reason not to just remove it?  It looks to me like a
no-op in the btrfs case.

> Note that XFS can return cloned < count if (for example) it runs out of
> space trying to expand the extent map btree to add more extents.

Makes sense.  Uh, my fatal signal case was dumb, wasn't it, nobody's
going to see the return value from the system call then anyway.

Still, returning -EINVAL after some data was actually copied seems like
the kind of thing that could cause real bugs.

> > with the result that the application thinks the operation failed
> > completely actually it cloned a bunch of data.
> 
> (Yes, the ioctl is dumb; I would say that programs should use
> copy_file_range instead as a less bad interface, but the splice copy
> fallback is ...  yuck.)

I agree.

--b.

nfsd: Fix error return values for nfsd4_clone_file_range()

Commit Message

Comments

Patch