diff mbox series

[01/11] vfs: copy_file_range source range over EOF should fail

Message ID 20181203083416.28978-2-david@fromorbit.com (mailing list archive)
State New, archived
Headers show
Series fs: fixes for major copy_file_range() issues | expand

Commit Message

Dave Chinner Dec. 3, 2018, 8:34 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

The man page says:

EINVAL Requested range extends beyond the end of the source file

But the current behaviour is that copy_file_range does a short
copy up to the source file EOF. Fix the kernel behaviour to match
the behaviour described in the man page.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Amir Goldstein Dec. 3, 2018, 12:46 p.m. UTC | #1
On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The man page says:
>
> EINVAL Requested range extends beyond the end of the source file
>
> But the current behaviour is that copy_file_range does a short
> copy up to the source file EOF. Fix the kernel behaviour to match
> the behaviour described in the man page.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 4dae0399c75a..09d1816cf3cf 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1581,6 +1581,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>         if (len == 0)
>                 return 0;
>
> +       /* If the source range crosses EOF, fail the copy */
> +       if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
> +               return -EINVAL;
> +

i_size_read()...

Otherwise
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.


> --
> 2.19.1
>
Christoph Hellwig Dec. 4, 2018, 3:13 p.m. UTC | #2
On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The man page says:
> >
> > EINVAL Requested range extends beyond the end of the source file
> >
> > But the current behaviour is that copy_file_range does a short
> > copy up to the source file EOF. Fix the kernel behaviour to match
> > the behaviour described in the man page.

I think the behavior implemented is a lot more useful than the one
documented..

> > +       /* If the source range crosses EOF, fail the copy */
> > +       if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
> > +               return -EINVAL;
> > +
> 
> i_size_read()...
> 
> Otherwise
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Looks like this doesn't even compile?
Dave Chinner Dec. 4, 2018, 9:29 p.m. UTC | #3
On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > The man page says:
> > >
> > > EINVAL Requested range extends beyond the end of the source file
> > >
> > > But the current behaviour is that copy_file_range does a short
> > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > the behaviour described in the man page.
> 
> I think the behavior implemented is a lot more useful than the one
> documented..

The current behaviour is really nasty. Because copy_file_range() can
return short copies, the caller has to implement a loop to ensure
the range hey want get copied.  When the source range you are
trying to copy overlaps source EOF, this loop:

	while (len > 0) {
		ret = copy_file_range(... len ...)
		...
		off_in += ret;
		off_out += ret;
		len -= ret;
	}

Currently the fallback code copies up to the end of the source file
on the first copy and then fails the second copy with EINVAL because
the source range is now completely beyond EOF.

So, from an application perspective, did the copy succeed or did it
fail?

Existing tools that exercise copy_file_range (like xfs_io) consider
this a failure, because the second copy_file_range() call returns
EINVAL and not some "there is no more to copy" marker like read()
returning 0 bytes when attempting to read beyond EOF.

IOWs, we cannot tell the difference between a real error and a short
copy because the input range spans EOF and it was silently
shortened. That's the API problem we need to fix here - the existing
behaviour is really crappy for applications. Erroring out
immmediately is one solution, and it's what the man page says should
happen so that is what I implemented.

Realistically, though, I think an attempt to read beyond EOF for the
copy should result in behaviour like read() (i.e. return 0 bytes),
not EINVAL. The existing behaviour needs to change, though.

> > i_size_read()...
> > 
> > Otherwise
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> 
> Looks like this doesn't even compile?

It's fixed in a later patch that consolidates the checks into a
generic check function, but I'm not sure why my "compile every
patch" script didn't catch this.

Cheers,

-Dave.
Olga Kornievskaia Dec. 4, 2018, 9:47 p.m. UTC | #4
On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > The man page says:
> > > >
> > > > EINVAL Requested range extends beyond the end of the source file
> > > >
> > > > But the current behaviour is that copy_file_range does a short
> > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > the behaviour described in the man page.
> >
> > I think the behavior implemented is a lot more useful than the one
> > documented..
>
> The current behaviour is really nasty. Because copy_file_range() can
> return short copies, the caller has to implement a loop to ensure
> the range hey want get copied.  When the source range you are
> trying to copy overlaps source EOF, this loop:
>
>         while (len > 0) {
>                 ret = copy_file_range(... len ...)
>                 ...
>                 off_in += ret;
>                 off_out += ret;
>                 len -= ret;
>         }
>
> Currently the fallback code copies up to the end of the source file
> on the first copy and then fails the second copy with EINVAL because
> the source range is now completely beyond EOF.
>
> So, from an application perspective, did the copy succeed or did it
> fail?
>
> Existing tools that exercise copy_file_range (like xfs_io) consider
> this a failure, because the second copy_file_range() call returns
> EINVAL and not some "there is no more to copy" marker like read()
> returning 0 bytes when attempting to read beyond EOF.
>
> IOWs, we cannot tell the difference between a real error and a short
> copy because the input range spans EOF and it was silently
> shortened. That's the API problem we need to fix here - the existing
> behaviour is really crappy for applications. Erroring out
> immmediately is one solution, and it's what the man page says should
> happen so that is what I implemented.
>
> Realistically, though, I think an attempt to read beyond EOF for the
> copy should result in behaviour like read() (i.e. return 0 bytes),
> not EINVAL. The existing behaviour needs to change, though.

There are two checks to consider
1. pos_in >= EOF should return EINVAL
2. however what's perhaps should be relaxed is pos_in+len >= EOF
should return a short copy.

Having check#1 enforced allows to us to differentiate between a real
error and a short copy.

>
> > > i_size_read()...
> > >
> > > Otherwise
> > > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >
> > Looks like this doesn't even compile?
>
> It's fixed in a later patch that consolidates the checks into a
> generic check function, but I'm not sure why my "compile every
> patch" script didn't catch this.
>
> Cheers,
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
Dave Chinner Dec. 4, 2018, 10:31 p.m. UTC | #5
On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > >
> > > > > The man page says:
> > > > >
> > > > > EINVAL Requested range extends beyond the end of the source file
> > > > >
> > > > > But the current behaviour is that copy_file_range does a short
> > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > the behaviour described in the man page.
> > >
> > > I think the behavior implemented is a lot more useful than the one
> > > documented..
> >
> > The current behaviour is really nasty. Because copy_file_range() can
> > return short copies, the caller has to implement a loop to ensure
> > the range hey want get copied.  When the source range you are
> > trying to copy overlaps source EOF, this loop:
> >
> >         while (len > 0) {
> >                 ret = copy_file_range(... len ...)
> >                 ...
> >                 off_in += ret;
> >                 off_out += ret;
> >                 len -= ret;
> >         }
> >
> > Currently the fallback code copies up to the end of the source file
> > on the first copy and then fails the second copy with EINVAL because
> > the source range is now completely beyond EOF.
> >
> > So, from an application perspective, did the copy succeed or did it
> > fail?
> >
> > Existing tools that exercise copy_file_range (like xfs_io) consider
> > this a failure, because the second copy_file_range() call returns
> > EINVAL and not some "there is no more to copy" marker like read()
> > returning 0 bytes when attempting to read beyond EOF.
> >
> > IOWs, we cannot tell the difference between a real error and a short
> > copy because the input range spans EOF and it was silently
> > shortened. That's the API problem we need to fix here - the existing
> > behaviour is really crappy for applications. Erroring out
> > immmediately is one solution, and it's what the man page says should
> > happen so that is what I implemented.
> >
> > Realistically, though, I think an attempt to read beyond EOF for the
> > copy should result in behaviour like read() (i.e. return 0 bytes),
> > not EINVAL. The existing behaviour needs to change, though.
> 
> There are two checks to consider
> 1. pos_in >= EOF should return EINVAL
> 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> should return a short copy.
> 
> Having check#1 enforced allows to us to differentiate between a real
> error and a short copy.

That's what the code does right now and *exactly what I'm trying to
fix* because it EINVAL is ambiguous and not an indicator that we've
reached the end of the source file. EINVAL can indicate several
different errors, so it really has to be treated as a "copy failed"
error by applications.

Have a look at read/pread() - they return 0 in this case to indicate
a short read, and the value of zero is explicitly defined as meaning
"read position is beyond EOF".  Applications know straight away that
there is no more data to be read and there was no error, so can
terminate on a successful short read.

We need to allow applications to terminate copy loops on a
successful short copy. IOWs, applications need to either:

	- get an immediate error saying the range is invalid rather
	  than doing a short copy (as per the man page); or
	- have an explicit marker to say "no more data to be copied"

Applications need the "no more data to copy" case to be explicit and
unambiguous so they can make sane decisions about whether a short
copy was successful because the file was shorter than expected or
whether a short copy was a result of a real error being encountered.
The current behaviour is largely unusable for applications because
they have to guess at the reason for EINVAL part way through a
copy....

Cheers,

Dave.
Christoph Hellwig Dec. 5, 2018, 2:12 p.m. UTC | #6
> Realistically, though, I think an attempt to read beyond EOF for the
> copy should result in behaviour like read() (i.e. return 0 bytes),
> not EINVAL. The existing behaviour needs to change, though.

I agree with this statement.  So we don't we implement these semantics?
J. Bruce Fields Dec. 5, 2018, 4:51 p.m. UTC | #7
On Wed, Dec 05, 2018 at 09:31:02AM +1100, Dave Chinner wrote:
> That's what the code does right now and *exactly what I'm trying to
> fix* because it EINVAL is ambiguous and not an indicator that we've
> reached the end of the source file. EINVAL can indicate several
> different errors, so it really has to be treated as a "copy failed"
> error by applications.
> 
> Have a look at read/pread() - they return 0 in this case to indicate
> a short read, and the value of zero is explicitly defined as meaning
> "read position is beyond EOF".  Applications know straight away that
> there is no more data to be read and there was no error, so can
> terminate on a successful short read.
> 
> We need to allow applications to terminate copy loops on a
> successful short copy.

I'm a little confused by your definition of "short copy" and "short
read".  Are you using that to mean a copy/read that returns zero?  I
usually see it used to mean any successful call that returned less than
the requested amount.  I'd expect a zero return to terminate a copy
loop, but not any positive return.

--b.


> IOWs, applications need to either:
> 
> 	- get an immediate error saying the range is invalid rather
> 	  than doing a short copy (as per the man page); or
> 	- have an explicit marker to say "no more data to be copied"
> 
> Applications need the "no more data to copy" case to be explicit and
> unambiguous so they can make sane decisions about whether a short
> copy was successful because the file was shorter than expected or
> whether a short copy was a result of a real error being encountered.
> The current behaviour is largely unusable for applications because
> they have to guess at the reason for EINVAL part way through a
> copy....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Dec. 5, 2018, 9:08 p.m. UTC | #8
On Wed, Dec 05, 2018 at 06:12:52AM -0800, Christoph Hellwig wrote:
> > Realistically, though, I think an attempt to read beyond EOF for the
> > copy should result in behaviour like read() (i.e. return 0 bytes),
> > not EINVAL. The existing behaviour needs to change, though.
> 
> I agree with this statement.  So we don't we implement these semantics?

No, we don't.

I will rework the patch series to make attempts to copy beyond the
end of the source file return 0 to indicate that there is no more
data to copy rather than return an error.

Cheers,

Dave.
Christoph Hellwig Dec. 5, 2018, 9:30 p.m. UTC | #9
On Thu, Dec 06, 2018 at 08:08:24AM +1100, Dave Chinner wrote:
> On Wed, Dec 05, 2018 at 06:12:52AM -0800, Christoph Hellwig wrote:
> > > Realistically, though, I think an attempt to read beyond EOF for the
> > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > not EINVAL. The existing behaviour needs to change, though.
> > 
> > I agree with this statement.  So we don't we implement these semantics?
> 
> No, we don't.

Sorry - I was rushing that sentence out.  It should have been:

So why don't we implement these semantics?

> 
> I will rework the patch series to make attempts to copy beyond the
> end of the source file return 0 to indicate that there is no more
> data to copy rather than return an error.

Great, thanks!
Amir Goldstein May 20, 2019, 9:10 a.m. UTC | #10
On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > >
> > > > > > The man page says:
> > > > > >
> > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > >
> > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > the behaviour described in the man page.
> > > >
> > > > I think the behavior implemented is a lot more useful than the one
> > > > documented..
> > >
> > > The current behaviour is really nasty. Because copy_file_range() can
> > > return short copies, the caller has to implement a loop to ensure
> > > the range hey want get copied.  When the source range you are
> > > trying to copy overlaps source EOF, this loop:
> > >
> > >         while (len > 0) {
> > >                 ret = copy_file_range(... len ...)
> > >                 ...
> > >                 off_in += ret;
> > >                 off_out += ret;
> > >                 len -= ret;
> > >         }
> > >
> > > Currently the fallback code copies up to the end of the source file
> > > on the first copy and then fails the second copy with EINVAL because
> > > the source range is now completely beyond EOF.
> > >
> > > So, from an application perspective, did the copy succeed or did it
> > > fail?
> > >
> > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > this a failure, because the second copy_file_range() call returns
> > > EINVAL and not some "there is no more to copy" marker like read()
> > > returning 0 bytes when attempting to read beyond EOF.
> > >
> > > IOWs, we cannot tell the difference between a real error and a short
> > > copy because the input range spans EOF and it was silently
> > > shortened. That's the API problem we need to fix here - the existing
> > > behaviour is really crappy for applications. Erroring out
> > > immmediately is one solution, and it's what the man page says should
> > > happen so that is what I implemented.
> > >
> > > Realistically, though, I think an attempt to read beyond EOF for the
> > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > not EINVAL. The existing behaviour needs to change, though.
> >
> > There are two checks to consider
> > 1. pos_in >= EOF should return EINVAL
> > 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> > should return a short copy.
> >
> > Having check#1 enforced allows to us to differentiate between a real
> > error and a short copy.
>
> That's what the code does right now and *exactly what I'm trying to
> fix* because it EINVAL is ambiguous and not an indicator that we've
> reached the end of the source file. EINVAL can indicate several
> different errors, so it really has to be treated as a "copy failed"
> error by applications.
>
> Have a look at read/pread() - they return 0 in this case to indicate
> a short read, and the value of zero is explicitly defined as meaning
> "read position is beyond EOF".  Applications know straight away that
> there is no more data to be read and there was no error, so can
> terminate on a successful short read.
>
> We need to allow applications to terminate copy loops on a
> successful short copy. IOWs, applications need to either:
>
>         - get an immediate error saying the range is invalid rather
>           than doing a short copy (as per the man page); or
>         - have an explicit marker to say "no more data to be copied"
>
> Applications need the "no more data to copy" case to be explicit and
> unambiguous so they can make sane decisions about whether a short
> copy was successful because the file was shorter than expected or
> whether a short copy was a result of a real error being encountered.
> The current behaviour is largely unusable for applications because
> they have to guess at the reason for EINVAL part way through a
> copy....
>

Dave,

I went a head and implemented the desired behavior.
However, while testing I observed that the desired behavior is already
the existing behavior. For example, trying to copy 10 bytes from a 2 bytes file,
xfs_io copy loop ends as expected:
copy_file_range(4, [0], 3, [0], 10, 0)  = 2
copy_file_range(4, [2], 3, [2], 8, 0)   = 0

This was tested on ext4 and xfs with reflink on recent kernel as well as on
v4.20-rc1 (era of original patch set).

Where and how did you observe the EINVAL behavior described above?
(besides man page that is). There are even xfstests (which you modified)
that verify the return 0 for past EOF behavior.

For now, I am just dropping this patch from the patch series.
Let me know if I am missing something.

Thanks,
Amir.
Olga Kornievskaia May 20, 2019, 1:12 p.m. UTC | #11
On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > >
> > > > > > > The man page says:
> > > > > > >
> > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > >
> > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > the behaviour described in the man page.
> > > > >
> > > > > I think the behavior implemented is a lot more useful than the one
> > > > > documented..
> > > >
> > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > return short copies, the caller has to implement a loop to ensure
> > > > the range hey want get copied.  When the source range you are
> > > > trying to copy overlaps source EOF, this loop:
> > > >
> > > >         while (len > 0) {
> > > >                 ret = copy_file_range(... len ...)
> > > >                 ...
> > > >                 off_in += ret;
> > > >                 off_out += ret;
> > > >                 len -= ret;
> > > >         }
> > > >
> > > > Currently the fallback code copies up to the end of the source file
> > > > on the first copy and then fails the second copy with EINVAL because
> > > > the source range is now completely beyond EOF.
> > > >
> > > > So, from an application perspective, did the copy succeed or did it
> > > > fail?
> > > >
> > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > this a failure, because the second copy_file_range() call returns
> > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > returning 0 bytes when attempting to read beyond EOF.
> > > >
> > > > IOWs, we cannot tell the difference between a real error and a short
> > > > copy because the input range spans EOF and it was silently
> > > > shortened. That's the API problem we need to fix here - the existing
> > > > behaviour is really crappy for applications. Erroring out
> > > > immmediately is one solution, and it's what the man page says should
> > > > happen so that is what I implemented.
> > > >
> > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > not EINVAL. The existing behaviour needs to change, though.
> > >
> > > There are two checks to consider
> > > 1. pos_in >= EOF should return EINVAL
> > > 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> > > should return a short copy.
> > >
> > > Having check#1 enforced allows to us to differentiate between a real
> > > error and a short copy.
> >
> > That's what the code does right now and *exactly what I'm trying to
> > fix* because it EINVAL is ambiguous and not an indicator that we've
> > reached the end of the source file. EINVAL can indicate several
> > different errors, so it really has to be treated as a "copy failed"
> > error by applications.
> >
> > Have a look at read/pread() - they return 0 in this case to indicate
> > a short read, and the value of zero is explicitly defined as meaning
> > "read position is beyond EOF".  Applications know straight away that
> > there is no more data to be read and there was no error, so can
> > terminate on a successful short read.
> >
> > We need to allow applications to terminate copy loops on a
> > successful short copy. IOWs, applications need to either:
> >
> >         - get an immediate error saying the range is invalid rather
> >           than doing a short copy (as per the man page); or
> >         - have an explicit marker to say "no more data to be copied"
> >
> > Applications need the "no more data to copy" case to be explicit and
> > unambiguous so they can make sane decisions about whether a short
> > copy was successful because the file was shorter than expected or
> > whether a short copy was a result of a real error being encountered.
> > The current behaviour is largely unusable for applications because
> > they have to guess at the reason for EINVAL part way through a
> > copy....
> >
>
> Dave,
>
> I went a head and implemented the desired behavior.
> However, while testing I observed that the desired behavior is already
> the existing behavior. For example, trying to copy 10 bytes from a 2 bytes file,
> xfs_io copy loop ends as expected:
> copy_file_range(4, [0], 3, [0], 10, 0)  = 2
> copy_file_range(4, [2], 3, [2], 8, 0)   = 0
>
> This was tested on ext4 and xfs with reflink on recent kernel as well as on
> v4.20-rc1 (era of original patch set).
>
> Where and how did you observe the EINVAL behavior described above?
> (besides man page that is). There are even xfstests (which you modified)
> that verify the return 0 for past EOF behavior.
>
> For now, I am just dropping this patch from the patch series.
> Let me know if I am missing something.

The was fixing inconsistency in what the man page specified (ie., it
must fail with EINVAL if offsets are out of range) which was never
enforced by the code. The patch then could be to fix the existing
semantics (man page) of the system call.

Copy file range range is not only read and write but rather
lseek+read+write and if somebody specifies an incorrect offset to the
lseek the system call should fail. Thus I still think that copy file
range should enforce that specifying a source offset beyond the end of
the file should fail with EINVAL.

If the copy file range returned 0 bytes does it mean it's a stopping
condition, not according to the current semantics.
Amir Goldstein May 20, 2019, 1:36 p.m. UTC | #12
On Mon, May 20, 2019 at 4:12 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > >
> > > > > > > > The man page says:
> > > > > > > >
> > > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > > >
> > > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > > the behaviour described in the man page.
> > > > > >
> > > > > > I think the behavior implemented is a lot more useful than the one
> > > > > > documented..
> > > > >
> > > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > > return short copies, the caller has to implement a loop to ensure
> > > > > the range hey want get copied.  When the source range you are
> > > > > trying to copy overlaps source EOF, this loop:
> > > > >
> > > > >         while (len > 0) {
> > > > >                 ret = copy_file_range(... len ...)
> > > > >                 ...
> > > > >                 off_in += ret;
> > > > >                 off_out += ret;
> > > > >                 len -= ret;
> > > > >         }
> > > > >
> > > > > Currently the fallback code copies up to the end of the source file
> > > > > on the first copy and then fails the second copy with EINVAL because
> > > > > the source range is now completely beyond EOF.
> > > > >
> > > > > So, from an application perspective, did the copy succeed or did it
> > > > > fail?
> > > > >
> > > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > > this a failure, because the second copy_file_range() call returns
> > > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > > returning 0 bytes when attempting to read beyond EOF.
> > > > >
> > > > > IOWs, we cannot tell the difference between a real error and a short
> > > > > copy because the input range spans EOF and it was silently
> > > > > shortened. That's the API problem we need to fix here - the existing
> > > > > behaviour is really crappy for applications. Erroring out
> > > > > immmediately is one solution, and it's what the man page says should
> > > > > happen so that is what I implemented.
> > > > >
> > > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > > not EINVAL. The existing behaviour needs to change, though.
> > > >
> > > > There are two checks to consider
> > > > 1. pos_in >= EOF should return EINVAL
> > > > 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> > > > should return a short copy.
> > > >
> > > > Having check#1 enforced allows to us to differentiate between a real
> > > > error and a short copy.
> > >
> > > That's what the code does right now and *exactly what I'm trying to
> > > fix* because it EINVAL is ambiguous and not an indicator that we've
> > > reached the end of the source file. EINVAL can indicate several
> > > different errors, so it really has to be treated as a "copy failed"
> > > error by applications.
> > >
> > > Have a look at read/pread() - they return 0 in this case to indicate
> > > a short read, and the value of zero is explicitly defined as meaning
> > > "read position is beyond EOF".  Applications know straight away that
> > > there is no more data to be read and there was no error, so can
> > > terminate on a successful short read.
> > >
> > > We need to allow applications to terminate copy loops on a
> > > successful short copy. IOWs, applications need to either:
> > >
> > >         - get an immediate error saying the range is invalid rather
> > >           than doing a short copy (as per the man page); or
> > >         - have an explicit marker to say "no more data to be copied"
> > >
> > > Applications need the "no more data to copy" case to be explicit and
> > > unambiguous so they can make sane decisions about whether a short
> > > copy was successful because the file was shorter than expected or
> > > whether a short copy was a result of a real error being encountered.
> > > The current behaviour is largely unusable for applications because
> > > they have to guess at the reason for EINVAL part way through a
> > > copy....
> > >
> >
> > Dave,
> >
> > I went a head and implemented the desired behavior.
> > However, while testing I observed that the desired behavior is already
> > the existing behavior. For example, trying to copy 10 bytes from a 2 bytes file,
> > xfs_io copy loop ends as expected:
> > copy_file_range(4, [0], 3, [0], 10, 0)  = 2
> > copy_file_range(4, [2], 3, [2], 8, 0)   = 0
> >
> > This was tested on ext4 and xfs with reflink on recent kernel as well as on
> > v4.20-rc1 (era of original patch set).
> >
> > Where and how did you observe the EINVAL behavior described above?
> > (besides man page that is). There are even xfstests (which you modified)
> > that verify the return 0 for past EOF behavior.
> >
> > For now, I am just dropping this patch from the patch series.
> > Let me know if I am missing something.
>
> The was fixing inconsistency in what the man page specified (ie., it
> must fail with EINVAL if offsets are out of range) which was never
> enforced by the code. The patch then could be to fix the existing
> semantics (man page) of the system call.
>
> Copy file range range is not only read and write but rather
> lseek+read+write and if somebody specifies an incorrect offset to the

Nope. it is like either read+write or pread+pwrite.

> lseek the system call should fail. Thus I still think that copy file
> range should enforce that specifying a source offset beyond the end of
> the file should fail with EINVAL.

You appear to be out numbered by reviewers that think copy_file_range(2)
should behave like pread(2) and return 0 when offf_in >= size_in.

>
> If the copy file range returned 0 bytes does it mean it's a stopping
> condition, not according to the current semantics.

Yes. Same as read(2)/pread(2).

Thanks,
Amir.
Olga Kornievskaia May 20, 2019, 1:58 p.m. UTC | #13
On Mon, May 20, 2019 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, May 20, 2019 at 4:12 PM Olga Kornievskaia
> <olga.kornievskaia@gmail.com> wrote:
> >
> > On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > >
> > > > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > > >
> > > > > > > > > The man page says:
> > > > > > > > >
> > > > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > > > >
> > > > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > > > the behaviour described in the man page.
> > > > > > >
> > > > > > > I think the behavior implemented is a lot more useful than the one
> > > > > > > documented..
> > > > > >
> > > > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > > > return short copies, the caller has to implement a loop to ensure
> > > > > > the range hey want get copied.  When the source range you are
> > > > > > trying to copy overlaps source EOF, this loop:
> > > > > >
> > > > > >         while (len > 0) {
> > > > > >                 ret = copy_file_range(... len ...)
> > > > > >                 ...
> > > > > >                 off_in += ret;
> > > > > >                 off_out += ret;
> > > > > >                 len -= ret;
> > > > > >         }
> > > > > >
> > > > > > Currently the fallback code copies up to the end of the source file
> > > > > > on the first copy and then fails the second copy with EINVAL because
> > > > > > the source range is now completely beyond EOF.
> > > > > >
> > > > > > So, from an application perspective, did the copy succeed or did it
> > > > > > fail?
> > > > > >
> > > > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > > > this a failure, because the second copy_file_range() call returns
> > > > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > > > returning 0 bytes when attempting to read beyond EOF.
> > > > > >
> > > > > > IOWs, we cannot tell the difference between a real error and a short
> > > > > > copy because the input range spans EOF and it was silently
> > > > > > shortened. That's the API problem we need to fix here - the existing
> > > > > > behaviour is really crappy for applications. Erroring out
> > > > > > immmediately is one solution, and it's what the man page says should
> > > > > > happen so that is what I implemented.
> > > > > >
> > > > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > > > not EINVAL. The existing behaviour needs to change, though.
> > > > >
> > > > > There are two checks to consider
> > > > > 1. pos_in >= EOF should return EINVAL
> > > > > 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> > > > > should return a short copy.
> > > > >
> > > > > Having check#1 enforced allows to us to differentiate between a real
> > > > > error and a short copy.
> > > >
> > > > That's what the code does right now and *exactly what I'm trying to
> > > > fix* because it EINVAL is ambiguous and not an indicator that we've
> > > > reached the end of the source file. EINVAL can indicate several
> > > > different errors, so it really has to be treated as a "copy failed"
> > > > error by applications.
> > > >
> > > > Have a look at read/pread() - they return 0 in this case to indicate
> > > > a short read, and the value of zero is explicitly defined as meaning
> > > > "read position is beyond EOF".  Applications know straight away that
> > > > there is no more data to be read and there was no error, so can
> > > > terminate on a successful short read.
> > > >
> > > > We need to allow applications to terminate copy loops on a
> > > > successful short copy. IOWs, applications need to either:
> > > >
> > > >         - get an immediate error saying the range is invalid rather
> > > >           than doing a short copy (as per the man page); or
> > > >         - have an explicit marker to say "no more data to be copied"
> > > >
> > > > Applications need the "no more data to copy" case to be explicit and
> > > > unambiguous so they can make sane decisions about whether a short
> > > > copy was successful because the file was shorter than expected or
> > > > whether a short copy was a result of a real error being encountered.
> > > > The current behaviour is largely unusable for applications because
> > > > they have to guess at the reason for EINVAL part way through a
> > > > copy....
> > > >
> > >
> > > Dave,
> > >
> > > I went a head and implemented the desired behavior.
> > > However, while testing I observed that the desired behavior is already
> > > the existing behavior. For example, trying to copy 10 bytes from a 2 bytes file,
> > > xfs_io copy loop ends as expected:
> > > copy_file_range(4, [0], 3, [0], 10, 0)  = 2
> > > copy_file_range(4, [2], 3, [2], 8, 0)   = 0
> > >
> > > This was tested on ext4 and xfs with reflink on recent kernel as well as on
> > > v4.20-rc1 (era of original patch set).
> > >
> > > Where and how did you observe the EINVAL behavior described above?
> > > (besides man page that is). There are even xfstests (which you modified)
> > > that verify the return 0 for past EOF behavior.
> > >
> > > For now, I am just dropping this patch from the patch series.
> > > Let me know if I am missing something.
> >
> > The was fixing inconsistency in what the man page specified (ie., it
> > must fail with EINVAL if offsets are out of range) which was never
> > enforced by the code. The patch then could be to fix the existing
> > semantics (man page) of the system call.
> >
> > Copy file range range is not only read and write but rather
> > lseek+read+write and if somebody specifies an incorrect offset to the
>
> Nope. it is like either read+write or pread+pwrite.
>
> > lseek the system call should fail. Thus I still think that copy file
> > range should enforce that specifying a source offset beyond the end of
> > the file should fail with EINVAL.
>
> You appear to be out numbered by reviewers that think copy_file_range(2)
> should behave like pread(2) and return 0 when offf_in >= size_in.
>
> >
> > If the copy file range returned 0 bytes does it mean it's a stopping
> > condition, not according to the current semantics.
>
> Yes. Same as read(2)/pread(2).

If that's the case, then it's great. Perhaps it's the fact that the
copy_file_range man page doesn't talk about it that makes it
confusing.

>
> Thanks,
> Amir.
Amir Goldstein May 20, 2019, 2:02 p.m. UTC | #14
On Mon, May 20, 2019 at 4:58 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Mon, May 20, 2019 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, May 20, 2019 at 4:12 PM Olga Kornievskaia
> > <olga.kornievskaia@gmail.com> wrote:
> > >
> > > On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > > > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > >
> > > > > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > > > >
> > > > > > > > > > The man page says:
> > > > > > > > > >
> > > > > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > > > > >
> > > > > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > > > > the behaviour described in the man page.
> > > > > > > >
> > > > > > > > I think the behavior implemented is a lot more useful than the one
> > > > > > > > documented..
> > > > > > >
> > > > > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > > > > return short copies, the caller has to implement a loop to ensure
> > > > > > > the range hey want get copied.  When the source range you are
> > > > > > > trying to copy overlaps source EOF, this loop:
> > > > > > >
> > > > > > >         while (len > 0) {
> > > > > > >                 ret = copy_file_range(... len ...)
> > > > > > >                 ...
> > > > > > >                 off_in += ret;
> > > > > > >                 off_out += ret;
> > > > > > >                 len -= ret;
> > > > > > >         }
> > > > > > >
> > > > > > > Currently the fallback code copies up to the end of the source file
> > > > > > > on the first copy and then fails the second copy with EINVAL because
> > > > > > > the source range is now completely beyond EOF.
> > > > > > >
> > > > > > > So, from an application perspective, did the copy succeed or did it
> > > > > > > fail?
> > > > > > >
> > > > > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > > > > this a failure, because the second copy_file_range() call returns
> > > > > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > > > > returning 0 bytes when attempting to read beyond EOF.
> > > > > > >
> > > > > > > IOWs, we cannot tell the difference between a real error and a short
> > > > > > > copy because the input range spans EOF and it was silently
> > > > > > > shortened. That's the API problem we need to fix here - the existing
> > > > > > > behaviour is really crappy for applications. Erroring out
> > > > > > > immmediately is one solution, and it's what the man page says should
> > > > > > > happen so that is what I implemented.
> > > > > > >
> > > > > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > > > > not EINVAL. The existing behaviour needs to change, though.
> > > > > >
> > > > > > There are two checks to consider
> > > > > > 1. pos_in >= EOF should return EINVAL
> > > > > > 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> > > > > > should return a short copy.
> > > > > >
> > > > > > Having check#1 enforced allows to us to differentiate between a real
> > > > > > error and a short copy.
> > > > >
> > > > > That's what the code does right now and *exactly what I'm trying to
> > > > > fix* because it EINVAL is ambiguous and not an indicator that we've
> > > > > reached the end of the source file. EINVAL can indicate several
> > > > > different errors, so it really has to be treated as a "copy failed"
> > > > > error by applications.
> > > > >
> > > > > Have a look at read/pread() - they return 0 in this case to indicate
> > > > > a short read, and the value of zero is explicitly defined as meaning
> > > > > "read position is beyond EOF".  Applications know straight away that
> > > > > there is no more data to be read and there was no error, so can
> > > > > terminate on a successful short read.
> > > > >
> > > > > We need to allow applications to terminate copy loops on a
> > > > > successful short copy. IOWs, applications need to either:
> > > > >
> > > > >         - get an immediate error saying the range is invalid rather
> > > > >           than doing a short copy (as per the man page); or
> > > > >         - have an explicit marker to say "no more data to be copied"
> > > > >
> > > > > Applications need the "no more data to copy" case to be explicit and
> > > > > unambiguous so they can make sane decisions about whether a short
> > > > > copy was successful because the file was shorter than expected or
> > > > > whether a short copy was a result of a real error being encountered.
> > > > > The current behaviour is largely unusable for applications because
> > > > > they have to guess at the reason for EINVAL part way through a
> > > > > copy....
> > > > >
> > > >
> > > > Dave,
> > > >
> > > > I went a head and implemented the desired behavior.
> > > > However, while testing I observed that the desired behavior is already
> > > > the existing behavior. For example, trying to copy 10 bytes from a 2 bytes file,
> > > > xfs_io copy loop ends as expected:
> > > > copy_file_range(4, [0], 3, [0], 10, 0)  = 2
> > > > copy_file_range(4, [2], 3, [2], 8, 0)   = 0
> > > >
> > > > This was tested on ext4 and xfs with reflink on recent kernel as well as on
> > > > v4.20-rc1 (era of original patch set).
> > > >
> > > > Where and how did you observe the EINVAL behavior described above?
> > > > (besides man page that is). There are even xfstests (which you modified)
> > > > that verify the return 0 for past EOF behavior.
> > > >
> > > > For now, I am just dropping this patch from the patch series.
> > > > Let me know if I am missing something.
> > >
> > > The was fixing inconsistency in what the man page specified (ie., it
> > > must fail with EINVAL if offsets are out of range) which was never
> > > enforced by the code. The patch then could be to fix the existing
> > > semantics (man page) of the system call.
> > >
> > > Copy file range range is not only read and write but rather
> > > lseek+read+write and if somebody specifies an incorrect offset to the
> >
> > Nope. it is like either read+write or pread+pwrite.
> >
> > > lseek the system call should fail. Thus I still think that copy file
> > > range should enforce that specifying a source offset beyond the end of
> > > the file should fail with EINVAL.
> >
> > You appear to be out numbered by reviewers that think copy_file_range(2)
> > should behave like pread(2) and return 0 when offf_in >= size_in.
> >
> > >
> > > If the copy file range returned 0 bytes does it mean it's a stopping
> > > condition, not according to the current semantics.
> >
> > Yes. Same as read(2)/pread(2).
>
> If that's the case, then it's great. Perhaps it's the fact that the
> copy_file_range man page doesn't talk about it that makes it
> confusing.
>

We agreed that updating the man page is better, see:
https://github.com/amir73il/man-pages/commits/copy_file_range-v2

I'm currently testing reworked patches.
Will post them once they pass the tests.

Thanks,
Amir.
diff mbox series

Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index 4dae0399c75a..09d1816cf3cf 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1581,6 +1581,10 @@  ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (len == 0)
 		return 0;
 
+	/* If the source range crosses EOF, fail the copy */
+	if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
+		return -EINVAL;
+
 	file_start_write(file_out);
 
 	/*