diff mbox

[RFC,v1,01/19] fs: Don't copy beyond the end of the file

Message ID 20170302160123.30375-2-kolga@netapp.com (mailing list archive)
State New, archived
Headers show

Commit Message

Olga Kornievskaia March 2, 2017, 4:01 p.m. UTC
From: Anna Schumaker <Anna.Schumaker@Netapp.com>

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
---
 fs/read_write.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Christoph Hellwig March 2, 2017, 4:22 p.m. UTC | #1
On Thu, Mar 02, 2017 at 11:01:05AM -0500, Olga Kornievskaia wrote:
> +	if (pos_in >= i_size_read(inode_in))
> +		return -EINVAL;

That's not how the syscall is supposed to work, we'd rather do a
short read^^^^^copy.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia March 2, 2017, 4:34 p.m. UTC | #2
> On Mar 2, 2017, at 11:22 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Thu, Mar 02, 2017 at 11:01:05AM -0500, Olga Kornievskaia wrote:
>> +	if (pos_in >= i_size_read(inode_in))
>> +		return -EINVAL;
> 
> That's not how the syscall is supposed to work, we'd rather do a
> short read^^^^^copy.

This is when the offset from which to read is beyond the end of the file. I’m not sure what kind of short read are you expecting here.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 3, 2017, 8:47 p.m. UTC | #3
On Thu, Mar 02, 2017 at 08:22:21AM -0800, Christoph Hellwig wrote:
> On Thu, Mar 02, 2017 at 11:01:05AM -0500, Olga Kornievskaia wrote:
> > +	if (pos_in >= i_size_read(inode_in))
> > +		return -EINVAL;
> 
> That's not how the syscall is supposed to work, we'd rather do a
> short read^^^^^copy.

That's what I think too, but then is COPY(2) wrong?:

	EINVAL Requested  range  extends beyond the end of the source
	  file; or the flags argument is not 0.

Also, copy_file_range can be implemented by ->clone_file_range, where
these kinds of checks make more sense, I think; e.g. from btrfs:

	ret = -EINVAL;
	if (off + len > src->i_size || off + len < off)
		goto out_unlock;

Well, so the caller just has to be prepared for either behavior, I
guess, but that may make it more complicated to use.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia March 3, 2017, 9:08 p.m. UTC | #4
> On Mar 3, 2017, at 3:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Thu, Mar 02, 2017 at 08:22:21AM -0800, Christoph Hellwig wrote:
>> On Thu, Mar 02, 2017 at 11:01:05AM -0500, Olga Kornievskaia wrote:
>>> +	if (pos_in >= i_size_read(inode_in))
>>> +		return -EINVAL;
>> 
>> That's not how the syscall is supposed to work, we'd rather do a
>> short read^^^^^copy.
> 
> That's what I think too, but then is COPY(2) wrong?:
> 
> 	EINVAL Requested  range  extends beyond the end of the source
> 	  file; or the flags argument is not 0.
> 
> Also, copy_file_range can be implemented by ->clone_file_range, where
> these kinds of checks make more sense, I think; e.g. from btrfs:
> 
> 	ret = -EINVAL;
> 	if (off + len > src->i_size || off + len < off)
> 		goto out_unlock;
> 
> Well, so the caller just has to be prepared for either behavior, I
> guess, but that may make it more complicated to use.
> 

I’m still rather very confused again by the comment and what it is proposing.

There are two checks to consider for the validity of the arguments

1. If the offset of the source file is beyond the end of the source file.
2. If the offset + len is beyond the end of the file.

I read that the man page is talking about #2.  This is actually what the NFSv4.2 spec required for the COPY too but we’ve been discussing that it should be a short read instead.

This patch address is to address case #1. As far as I can tell it applies to all file systems.

Are you suggesting that the checks for the validity of the arguments do not belong in VFS but instead should be in each of the underlying file systems?

Not all vfs_copy_file_range() are implemented via clone_file_range(). At least I hope that “inter” NFSv4.2 COPY will also use vfs_file_copy_range() and it would not be calling clone().


> --b.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 3, 2017, 9:32 p.m. UTC | #5
On Fri, Mar 03, 2017 at 04:08:19PM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 3, 2017, at 3:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Thu, Mar 02, 2017 at 08:22:21AM -0800, Christoph Hellwig wrote:
> >> On Thu, Mar 02, 2017 at 11:01:05AM -0500, Olga Kornievskaia wrote:
> >>> +	if (pos_in >= i_size_read(inode_in))
> >>> +		return -EINVAL;
> >> 
> >> That's not how the syscall is supposed to work, we'd rather do a
> >> short read^^^^^copy.
> > 
> > That's what I think too, but then is COPY(2) wrong?:
> > 
> > 	EINVAL Requested  range  extends beyond the end of the source
> > 	  file; or the flags argument is not 0.
> > 
> > Also, copy_file_range can be implemented by ->clone_file_range, where
> > these kinds of checks make more sense, I think; e.g. from btrfs:
> > 
> > 	ret = -EINVAL;
> > 	if (off + len > src->i_size || off + len < off)
> > 		goto out_unlock;
> > 
> > Well, so the caller just has to be prepared for either behavior, I
> > guess, but that may make it more complicated to use.
> > 
> 
> I’m still rather very confused again by the comment and what it is proposing.
> 
> There are two checks to consider for the validity of the arguments
> 
> 1. If the offset of the source file is beyond the end of the source file.
> 2. If the offset + len is beyond the end of the file.
> 
> I read that the man page is talking about #2.  This is actually what the NFSv4.2 spec required for the COPY too but we’ve been discussing that it should be a short read instead.
> 
> This patch address is to address case #1. As far as I can tell it applies to all file systems.
> 
> Are you suggesting that the checks for the validity of the arguments do not belong in VFS but instead should be in each of the underlying file systems?
> 
> Not all vfs_copy_file_range() are implemented via clone_file_range(). At least I hope that “inter” NFSv4.2 COPY will also use vfs_file_copy_range() and it would not be calling clone().

I think it'd be acceptable for copy_file_range() to just return 0 even
in your case 1.  I believe that's what ordinary read and pread does.

You probably can't perform it atomically with the copy, so it's possible
that the size will change right after you check it.

I don't see a benefit to the check.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 4, 2017, 2:10 a.m. UTC | #6
On Fri, Mar 03, 2017 at 05:46:05PM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 3, 2017, at 4:32 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Fri, Mar 03, 2017 at 04:08:19PM -0500, Olga Kornievskaia wrote:
> >> 
> >>> On Mar 3, 2017, at 3:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >>> 
> >>> On Thu, Mar 02, 2017 at 08:22:21AM -0800, Christoph Hellwig wrote:
> >>>> On Thu, Mar 02, 2017 at 11:01:05AM -0500, Olga Kornievskaia wrote:
> >>>>> +	if (pos_in >= i_size_read(inode_in))
> >>>>> +		return -EINVAL;
> >>>> 
> >>>> That's not how the syscall is supposed to work, we'd rather do a
> >>>> short read^^^^^copy.
> >>> 
> >>> That's what I think too, but then is COPY(2) wrong?:
> >>> 
> >>> 	EINVAL Requested  range  extends beyond the end of the source
> >>> 	  file; or the flags argument is not 0.
> >>> 
> >>> Also, copy_file_range can be implemented by ->clone_file_range, where
> >>> these kinds of checks make more sense, I think; e.g. from btrfs:
> >>> 
> >>> 	ret = -EINVAL;
> >>> 	if (off + len > src->i_size || off + len < off)
> >>> 		goto out_unlock;
> >>> 
> >>> Well, so the caller just has to be prepared for either behavior, I
> >>> guess, but that may make it more complicated to use.
> >>> 
> >> 
> >> I’m still rather very confused again by the comment and what it is proposing.
> >> 
> >> There are two checks to consider for the validity of the arguments
> >> 
> >> 1. If the offset of the source file is beyond the end of the source file.
> >> 2. If the offset + len is beyond the end of the file.
> >> 
> >> I read that the man page is talking about #2.  This is actually what the NFSv4.2 spec required for the COPY too but we’ve been discussing that it should be a short read instead.
> >> 
> >> This patch address is to address case #1. As far as I can tell it applies to all file systems.
> >> 
> >> Are you suggesting that the checks for the validity of the arguments do not belong in VFS but instead should be in each of the underlying file systems?
> >> 
> >> Not all vfs_copy_file_range() are implemented via clone_file_range(). At least I hope that “inter” NFSv4.2 COPY will also use vfs_file_copy_range() and it would not be calling clone().
> > 
> > I think it'd be acceptable for copy_file_range() to just return 0 even
> > in your case 1.  I believe that's what ordinary read and pread does.
> > 
> > You probably can't perform it atomically with the copy, so it's possible
> > that the size will change right after you check it.
> > 
> > I don't see a benefit to the check.
> 

> In read() you don’t specify the offset from which to read. It is read
> from the current file descriptor offset. I don’t find the comparison
> equal. 
> 
> It’s it more fair to compare it to lseek() which does return EINVAL if
> you specify position beyond the end of the file. 

Or pread(), which takes an offset.

But read() can read at an offset at the end of file, or even past that
(if the file was truncated), and I believe it just returns 0 in those
cases.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia March 6, 2017, 4:27 p.m. UTC | #7
> On Mar 3, 2017, at 9:10 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Fri, Mar 03, 2017 at 05:46:05PM -0500, Olga Kornievskaia wrote:
>> 
>>> On Mar 3, 2017, at 4:32 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> 
>>> On Fri, Mar 03, 2017 at 04:08:19PM -0500, Olga Kornievskaia wrote:
>>>> 
>>>>> On Mar 3, 2017, at 3:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>>>> 
>>>>> On Thu, Mar 02, 2017 at 08:22:21AM -0800, Christoph Hellwig wrote:
>>>>>> On Thu, Mar 02, 2017 at 11:01:05AM -0500, Olga Kornievskaia wrote:
>>>>>>> +	if (pos_in >= i_size_read(inode_in))
>>>>>>> +		return -EINVAL;
>>>>>> 
>>>>>> That's not how the syscall is supposed to work, we'd rather do a
>>>>>> short read^^^^^copy.
>>>>> 
>>>>> That's what I think too, but then is COPY(2) wrong?:
>>>>> 
>>>>> 	EINVAL Requested  range  extends beyond the end of the source
>>>>> 	  file; or the flags argument is not 0.
>>>>> 
>>>>> Also, copy_file_range can be implemented by ->clone_file_range, where
>>>>> these kinds of checks make more sense, I think; e.g. from btrfs:
>>>>> 
>>>>> 	ret = -EINVAL;
>>>>> 	if (off + len > src->i_size || off + len < off)
>>>>> 		goto out_unlock;
>>>>> 
>>>>> Well, so the caller just has to be prepared for either behavior, I
>>>>> guess, but that may make it more complicated to use.
>>>>> 
>>>> 
>>>> I’m still rather very confused again by the comment and what it is proposing.
>>>> 
>>>> There are two checks to consider for the validity of the arguments
>>>> 
>>>> 1. If the offset of the source file is beyond the end of the source file.
>>>> 2. If the offset + len is beyond the end of the file.
>>>> 
>>>> I read that the man page is talking about #2.  This is actually what the NFSv4.2 spec required for the COPY too but we’ve been discussing that it should be a short read instead.
>>>> 
>>>> This patch address is to address case #1. As far as I can tell it applies to all file systems.
>>>> 
>>>> Are you suggesting that the checks for the validity of the arguments do not belong in VFS but instead should be in each of the underlying file systems?
>>>> 
>>>> Not all vfs_copy_file_range() are implemented via clone_file_range(). At least I hope that “inter” NFSv4.2 COPY will also use vfs_file_copy_range() and it would not be calling clone().
>>> 
>>> I think it'd be acceptable for copy_file_range() to just return 0 even
>>> in your case 1.  I believe that's what ordinary read and pread does.
>>> 
>>> You probably can't perform it atomically with the copy, so it's possible
>>> that the size will change right after you check it.
>>> 
>>> I don't see a benefit to the check.
>> 
> 
>> In read() you don’t specify the offset from which to read. It is read
>> from the current file descriptor offset. I don’t find the comparison
>> equal. 
>> 
>> It’s it more fair to compare it to lseek() which does return EINVAL if
>> you specify position beyond the end of the file. 
> 
> Or pread(), which takes an offset.
> 
> But read() can read at an offset at the end of file, or even past that
> (if the file was truncated), and I believe it just returns 0 in those
> cases.
> 

I don’t see copy_file_range() specifying that 0 means end of the file. Are you arguing to add that meaning to the function call? I guess in that case we’d need to take extra care to never return 0bytes to the client as a “partial” copy (say due to server rebooting).

Unless changed, NFS4.2 mandates the two checks that I’ve specified. I can add the checks in the NFS implementation itself. However, we thought at least this check belonged in the VFS layer. I’m really not super attached to getting into VFS. Actually the 2nd check is something that copy_file_range() man pages say should return EINVAL but the VFS code doesn’t enforce it. 





--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 6, 2017, 7:09 p.m. UTC | #8
On Mon, Mar 06, 2017 at 11:27:23AM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 3, 2017, at 9:10 PM, J. Bruce Fields <bfields@fieldses.org>
> > wrote:
> > 
> > On Fri, Mar 03, 2017 at 05:46:05PM -0500, Olga Kornievskaia wrote:
> >> 
> >>> On Mar 3, 2017, at 4:32 PM, J. Bruce Fields <bfields@fieldses.org>
> >>> wrote:
> >>> 
> >>> On Fri, Mar 03, 2017 at 04:08:19PM -0500, Olga Kornievskaia wrote:
> >>>> 
> >>>>> On Mar 3, 2017, at 3:47 PM, J. Bruce Fields
> >>>>> <bfields@fieldses.org> wrote:
> >>>>> 
> >>>>> On Thu, Mar 02, 2017 at 08:22:21AM -0800, Christoph Hellwig
> >>>>> wrote:
> >>>>>> On Thu, Mar 02, 2017 at 11:01:05AM -0500, Olga Kornievskaia
> >>>>>> wrote:
> >>>>>>> +	if (pos_in >= i_size_read(inode_in)) +		return
> >>>>>>> -EINVAL;
> >>>>>> 
> >>>>>> That's not how the syscall is supposed to work, we'd rather do
> >>>>>> a short read^^^^^copy.
> >>>>> 
> >>>>> That's what I think too, but then is COPY(2) wrong?:
> >>>>> 
> >>>>> 	EINVAL Requested  range  extends beyond the end of the
> >>>>> 	source file; or the flags argument is not 0.
> >>>>> 
> >>>>> Also, copy_file_range can be implemented by ->clone_file_range,
> >>>>> where these kinds of checks make more sense, I think; e.g. from
> >>>>> btrfs:
> >>>>> 
> >>>>> 	ret = -EINVAL; if (off + len > src->i_size || off + len
> >>>>> 	< off) goto out_unlock;
> >>>>> 
> >>>>> Well, so the caller just has to be prepared for either behavior,
> >>>>> I guess, but that may make it more complicated to use.
> >>>>> 
> >>>> 
> >>>> I’m still rather very confused again by the comment and what it
> >>>> is proposing.
> >>>> 
> >>>> There are two checks to consider for the validity of the
> >>>> arguments
> >>>> 
> >>>> 1. If the offset of the source file is beyond the end of the
> >>>> source file.  2. If the offset + len is beyond the end of the
> >>>> file.
> >>>> 
> >>>> I read that the man page is talking about #2.  This is actually
> >>>> what the NFSv4.2 spec required for the COPY too but we’ve been
> >>>> discussing that it should be a short read instead.
> >>>> 
> >>>> This patch address is to address case #1. As far as I can tell it
> >>>> applies to all file systems.
> >>>> 
> >>>> Are you suggesting that the checks for the validity of the
> >>>> arguments do not belong in VFS but instead should be in each of
> >>>> the underlying file systems?
> >>>> 
> >>>> Not all vfs_copy_file_range() are implemented via
> >>>> clone_file_range(). At least I hope that “inter” NFSv4.2 COPY
> >>>> will also use vfs_file_copy_range() and it would not be calling
> >>>> clone().
> >>> 
> >>> I think it'd be acceptable for copy_file_range() to just return 0
> >>> even in your case 1.  I believe that's what ordinary read and
> >>> pread does.
> >>> 
> >>> You probably can't perform it atomically with the copy, so it's
> >>> possible that the size will change right after you check it.
> >>> 
> >>> I don't see a benefit to the check.
> >> 
> > 
> >> In read() you don’t specify the offset from which to read. It is
> >> read from the current file descriptor offset. I don’t find the
> >> comparison equal. 
> >> 
> >> It’s it more fair to compare it to lseek() which does return EINVAL
> >> if you specify position beyond the end of the file. 
> > 
> > Or pread(), which takes an offset.
> > 
> > But read() can read at an offset at the end of file, or even past
> > that (if the file was truncated), and I believe it just returns 0 in
> > those cases.
> > 
> 
> I don’t see copy_file_range() specifying that 0 means end of the file.
> Are you arguing to add that meaning to the function call?

Yes.

> I guess in
> that case we’d need to take extra care to never return 0bytes to the
> client as a “partial” copy (say due to server rebooting).

Right.

> Unless changed, NFS4.2 mandates the two checks that I’ve specified. I
> can add the checks in the NFS implementation itself. However, we
> thought at least this check belonged in the VFS layer. I’m really not
> super attached to getting into VFS. Actually the 2nd check is
> something that copy_file_range() man pages say should return EINVAL
> but the VFS code doesn’t enforce it. 

Please leave those checks out.  Let's try to fix the spec.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 6, 2017, 7:23 p.m. UTC | #9
On Mon, Mar 06, 2017 at 11:27:23AM -0500, Olga Kornievskaia wrote:
> I don’t see copy_file_range() specifying that 0 means end of the file.

Is there a reason that copy_file_range shouldn't be like read?  (From
read(2): "On  success,  the number of bytes read is returned (zero
indicates end of file)".

I haven't checked, but suspect that's already true of the
implementations we have.

Also, from copy_file_range():

	EINVAL Requested  range  extends beyond the end of the source
		file; or the flags argument is not 0.

Some filesystems do this, some don't; I think the man page should make
it clear that this behavior is not required.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia March 7, 2017, 2:18 p.m. UTC | #10
On Mon, Mar 6, 2017 at 2:23 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Mon, Mar 06, 2017 at 11:27:23AM -0500, Olga Kornievskaia wrote:
>> I don’t see copy_file_range() specifying that 0 means end of the file.
>
> Is there a reason that copy_file_range shouldn't be like read?  (From
> read(2): "On  success,  the number of bytes read is returned (zero
> indicates end of file)".

The only thing I can think of is the fact that copy_file_range() does
write too. And return 0 from the write is not expected/allowed (except
when input count was 0)?

> I haven't checked, but suspect that's already true of the
> implementations we have.
>
> Also, from copy_file_range():
>
>         EINVAL Requested  range  extends beyond the end of the source
>                 file; or the flags argument is not 0.
>
> Some filesystems do this, some don't; I think the man page should make
> it clear that this behavior is not required.
>
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig March 7, 2017, 11:40 p.m. UTC | #11
On Fri, Mar 03, 2017 at 03:47:47PM -0500, J. Bruce Fields wrote:
> That's what I think too, but then is COPY(2) wrong?:

Either the current kernel code or the man page is wrong..

> Also, copy_file_range can be implemented by ->clone_file_range, where
> these kinds of checks make more sense, I think; e.g. from btrfs:
> 
> 	ret = -EINVAL;
> 	if (off + len > src->i_size || off + len < off)
> 		goto out_unlock;
> 
> Well, so the caller just has to be prepared for either behavior, I
> guess, but that may make it more complicated to use.

We'll need to stick to one behavior.  So between the man page and
the clone implementation I guess this patch is fine after all.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 8, 2017, 5:05 p.m. UTC | #12
On Tue, Mar 07, 2017 at 03:40:51PM -0800, Christoph Hellwig wrote:
> On Fri, Mar 03, 2017 at 03:47:47PM -0500, J. Bruce Fields wrote:
> > That's what I think too, but then is COPY(2) wrong?:
> 
> Either the current kernel code or the man page is wrong..
> 
> > Also, copy_file_range can be implemented by ->clone_file_range, where
> > these kinds of checks make more sense, I think; e.g. from btrfs:
> > 
> > 	ret = -EINVAL;
> > 	if (off + len > src->i_size || off + len < off)
> > 		goto out_unlock;
> > 
> > Well, so the caller just has to be prepared for either behavior, I
> > guess, but that may make it more complicated to use.
> 
> We'll need to stick to one behavior.  So between the man page and
> the clone implementation I guess this patch is fine after all.

Ugh, please, let's not.

Since copy isn't atomic that check is never going to be reliable.

As long as we allow copy to have either copy-like or clone-like
implementations, callers have to be prepared to handle either behavior
when the range is past end of file, either a short copy or an EINVAL.

The difference is that if we add this check, then the "short copy"
behavior becomes something that only happens when the timing is just
right.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig March 8, 2017, 5:25 p.m. UTC | #13
On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
> Since copy isn't atomic that check is never going to be reliable.

That's true for everything that COPY does.  By that logic we should
not implement it at all (a logic that I'd fully support)
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia March 8, 2017, 5:32 p.m. UTC | #14
> On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
>> Since copy isn't atomic that check is never going to be reliable.
> 
> That's true for everything that COPY does.  By that logic we should
> not implement it at all (a logic that I'd fully support)


If you were to only keep CLONE then you’d lose a huge performance gain you get from server-to-server COPY. --
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 8, 2017, 7:53 p.m. UTC | #15
On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.org>
> > wrote:
> > 
> > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
> >> Since copy isn't atomic that check is never going to be reliable.
> > 
> > That's true for everything that COPY does.  By that logic we should
> > not implement it at all (a logic that I'd fully support)
> 
> If you were to only keep CLONE then you’d lose a huge performance gain
> you get from server-to-server COPY. 

Yes.  Also, I think copy-like copy implementations have reasonable
semantics that are basically the same as read:

	- copy can return successfully with less copied than requested.
	- it's fine for the copied range to start and/or end past end of
	  file, it'll just return a short read.
	- A copy of more than 0 bytes returning 0 means you're at end of
	  file.

The particular problem here is that that doesn't fit how clone works at
all.

It feels like what happened is that copy_file_range() was made mainly
for the clone case, with the idea that copy might be reluctantly
accepted as a second-class implementation.

But the performance gain of copy offload is too big to just ignore, and
in fact it's what copy_file_range does on every filesystem but btrfs and
ocfs2 (and maybe cifs?), so I don't think we can just ignore it.

If we had separate copy_file_range and clone_file_range, I *think* it
could all be made sensible.  Am I missing something?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia March 8, 2017, 8 p.m. UTC | #16
> On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
>> 
>>> On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.org>
>>> wrote:
>>> 
>>> On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
>>>> Since copy isn't atomic that check is never going to be reliable.
>>> 
>>> That's true for everything that COPY does.  By that logic we should
>>> not implement it at all (a logic that I'd fully support)
>> 
>> If you were to only keep CLONE then you’d lose a huge performance gain
>> you get from server-to-server COPY. 
> 
> Yes.  Also, I think copy-like copy implementations have reasonable
> semantics that are basically the same as read:
> 
> 	- copy can return successfully with less copied than requested.
> 	- it's fine for the copied range to start and/or end past end of
> 	  file, it'll just return a short read.
> 	- A copy of more than 0 bytes returning 0 means you're at end of
> 	  file.
> 
> The particular problem here is that that doesn't fit how clone works at
> all.
> 
> It feels like what happened is that copy_file_range() was made mainly
> for the clone case, with the idea that copy might be reluctantly
> accepted as a second-class implementation.
> 
> But the performance gain of copy offload is too big to just ignore, and
> in fact it's what copy_file_range does on every filesystem but btrfs and
> ocfs2 (and maybe cifs?), so I don't think we can just ignore it.
> 
> If we had separate copy_file_range and clone_file_range, I *think* it
> could all be made sensible.  Am I missing something?
> 

How would the application (cp) know when to call the clone_file_range and when to call copy_file_range?
 

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 8, 2017, 8:18 p.m. UTC | #17
On Wed, Mar 08, 2017 at 03:00:52PM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> >> 
> >>> On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.org>
> >>> wrote:
> >>> 
> >>> On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
> >>>> Since copy isn't atomic that check is never going to be reliable.
> >>> 
> >>> That's true for everything that COPY does.  By that logic we should
> >>> not implement it at all (a logic that I'd fully support)
> >> 
> >> If you were to only keep CLONE then you’d lose a huge performance gain
> >> you get from server-to-server COPY. 
> > 
> > Yes.  Also, I think copy-like copy implementations have reasonable
> > semantics that are basically the same as read:
> > 
> > 	- copy can return successfully with less copied than requested.
> > 	- it's fine for the copied range to start and/or end past end of
> > 	  file, it'll just return a short read.
> > 	- A copy of more than 0 bytes returning 0 means you're at end of
> > 	  file.
> > 
> > The particular problem here is that that doesn't fit how clone works at
> > all.
> > 
> > It feels like what happened is that copy_file_range() was made mainly
> > for the clone case, with the idea that copy might be reluctantly
> > accepted as a second-class implementation.
> > 
> > But the performance gain of copy offload is too big to just ignore, and
> > in fact it's what copy_file_range does on every filesystem but btrfs and
> > ocfs2 (and maybe cifs?), so I don't think we can just ignore it.
> > 
> > If we had separate copy_file_range and clone_file_range, I *think* it
> > could all be made sensible.  Am I missing something?
> 
> How would the application (cp) know when to call the clone_file_range and when to call copy_file_range?

Try clone and then fall back on copy if that's not available?

Which is the same thing vfs_copy_file_range() is doing now, but it'd
seem less confusing if that logic was in the application.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust March 8, 2017, 8:18 p.m. UTC | #18
On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.org>

> > wrote:

> > 

> > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:

> > > 

> > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.o

> > > > rg>

> > > > wrote:

> > > > 

> > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields

> > > > wrote:

> > > > > Since copy isn't atomic that check is never going to be

> > > > > reliable.

> > > > 

> > > > That's true for everything that COPY does.  By that logic we

> > > > should

> > > > not implement it at all (a logic that I'd fully support)

> > > 

> > > If you were to only keep CLONE then you’d lose a huge performance

> > > gain

> > > you get from server-to-server COPY. 

> > 

> > Yes.  Also, I think copy-like copy implementations have reasonable

> > semantics that are basically the same as read:

> > 

> > 	- copy can return successfully with less copied than requested.

> > 	- it's fine for the copied range to start and/or end past end

> > of

> > 	  file, it'll just return a short read.

> > 	- A copy of more than 0 bytes returning 0 means you're at end

> > of

> > 	  file.

> > 

> > The particular problem here is that that doesn't fit how clone

> > works at

> > all.

> > 

> > It feels like what happened is that copy_file_range() was made

> > mainly

> > for the clone case, with the idea that copy might be reluctantly

> > accepted as a second-class implementation.


Historically? No... Christoph added clone as a valid implementation of
copy_file_range() almost a year after Zach and Anna defined the
semantics of vfs_copy_file_range(). git blame is your friend...

> > 

> > But the performance gain of copy offload is too big to just ignore,

> > and

> > in fact it's what copy_file_range does on every filesystem but

> > btrfs and

> > ocfs2 (and maybe cifs?), so I don't think we can just ignore it.

> > 

> > If we had separate copy_file_range and clone_file_range, I *think*

> > it

> > could all be made sensible.  Am I missing something?

> > 

> 

> How would the application (cp) know when to call the clone_file_range

> and when to call copy_file_range?


cp can probably call copy_file_range(), but any application that needs
atomic semantics (i.e. a binary operation success/fail) must call
clone_file_range().

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
J. Bruce Fields March 8, 2017, 8:32 p.m. UTC | #19
On Wed, Mar 08, 2017 at 08:18:31PM +0000, Trond Myklebust wrote:
> On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.org>
> > > wrote:
> > > 
> > > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> > > > 
> > > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.o
> > > > > rg>
> > > > > wrote:
> > > > > 
> > > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields
> > > > > wrote:
> > > > > > Since copy isn't atomic that check is never going to be
> > > > > > reliable.
> > > > > 
> > > > > That's true for everything that COPY does.  By that logic we
> > > > > should
> > > > > not implement it at all (a logic that I'd fully support)
> > > > 
> > > > If you were to only keep CLONE then you’d lose a huge performance
> > > > gain
> > > > you get from server-to-server COPY. 
> > > 
> > > Yes.  Also, I think copy-like copy implementations have reasonable
> > > semantics that are basically the same as read:
> > > 
> > > 	- copy can return successfully with less copied than requested.
> > > 	- it's fine for the copied range to start and/or end past end
> > > of
> > > 	  file, it'll just return a short read.
> > > 	- A copy of more than 0 bytes returning 0 means you're at end
> > > of
> > > 	  file.
> > > 
> > > The particular problem here is that that doesn't fit how clone
> > > works at
> > > all.
> > > 
> > > It feels like what happened is that copy_file_range() was made
> > > mainly
> > > for the clone case, with the idea that copy might be reluctantly
> > > accepted as a second-class implementation.
> 
> Historically? No... Christoph added clone as a valid implementation of
> copy_file_range() almost a year after Zach and Anna defined the
> semantics of vfs_copy_file_range(). git blame is your friend...

Yeah, I know.  It still feels to me like the interface was originally
designed with clone in mind, but that's my vague impression from the man
pages and half-remembered conversations.

Though the lack of a "just copy the whole file regardless of size" case
is weird for clone.  All you can do is stat the file and then hope it
doesn't change before you issue the copy_file_range.  But I'd think it'd
be easy for an atomic clone implementation to handle, say, getting a
snapshot of a log file while it's getting continuously appended to.

> > > But the performance gain of copy offload is too big to just ignore,
> > > and
> > > in fact it's what copy_file_range does on every filesystem but
> > > btrfs and
> > > ocfs2 (and maybe cifs?), so I don't think we can just ignore it.
> > > 
> > > If we had separate copy_file_range and clone_file_range, I *think*
> > > it
> > > could all be made sensible.  Am I missing something?
> > > 
> > 
> > How would the application (cp) know when to call the clone_file_range
> > and when to call copy_file_range?
> 
> cp can probably call copy_file_range(), but any application that needs
> atomic semantics (i.e. a binary operation success/fail) must call
> clone_file_range().

I don't believe there's a clone_file_range().  I see the vfs interface,
but no system call.

And implementing a simple cp is harder than it should be when you don't
know whether it's implemented as copy or clone.  You have to stat for
the file size first, retry if you got it wrong, and also retry if you
get a short read.  The example in the clone_file_range() man page is
incomplete.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust March 8, 2017, 8:49 p.m. UTC | #20
On Wed, 2017-03-08 at 15:32 -0500, bfields@fieldses.org wrote:
> On Wed, Mar 08, 2017 at 08:18:31PM +0000, Trond Myklebust wrote:

> > On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:

> > > > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.o

> > > > rg>

> > > > wrote:

> > > > 

> > > > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia

> > > > wrote:

> > > > > 

> > > > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infrade

> > > > > > ad.o

> > > > > > rg>

> > > > > > wrote:

> > > > > > 

> > > > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields

> > > > > > wrote:

> > > > > > > Since copy isn't atomic that check is never going to be

> > > > > > > reliable.

> > > > > > 

> > > > > > That's true for everything that COPY does.  By that logic

> > > > > > we

> > > > > > should

> > > > > > not implement it at all (a logic that I'd fully support)

> > > > > 

> > > > > If you were to only keep CLONE then you’d lose a huge

> > > > > performance

> > > > > gain

> > > > > you get from server-to-server COPY. 

> > > > 

> > > > Yes.  Also, I think copy-like copy implementations have

> > > > reasonable

> > > > semantics that are basically the same as read:

> > > > 

> > > > 	- copy can return successfully with less copied than

> > > > requested.

> > > > 	- it's fine for the copied range to start and/or end

> > > > past end

> > > > of

> > > > 	  file, it'll just return a short read.

> > > > 	- A copy of more than 0 bytes returning 0 means you're

> > > > at end

> > > > of

> > > > 	  file.

> > > > 

> > > > The particular problem here is that that doesn't fit how clone

> > > > works at

> > > > all.

> > > > 

> > > > It feels like what happened is that copy_file_range() was made

> > > > mainly

> > > > for the clone case, with the idea that copy might be

> > > > reluctantly

> > > > accepted as a second-class implementation.

> > 

> > Historically? No... Christoph added clone as a valid implementation

> > of

> > copy_file_range() almost a year after Zach and Anna defined the

> > semantics of vfs_copy_file_range(). git blame is your friend...

> 

> Yeah, I know.  It still feels to me like the interface was originally

> designed with clone in mind, but that's my vague impression from the

> man

> pages and half-remembered conversations.

> 

> Though the lack of a "just copy the whole file regardless of size"

> case

> is weird for clone.  All you can do is stat the file and then hope it

> doesn't change before you issue the copy_file_range.  But I'd think

> it'd

> be easy for an atomic clone implementation to handle, say, getting a

> snapshot of a log file while it's getting continuously appended to.


It really isn't that interesting in the continuously appended case
(what difference does it make if you only get data from just a few
moments ago), but I can see it being an issue in the case of random
writes where the file size is being extended.

The thing is that in both those cases, the copy_file_range() semantics
are worse, since they don't even guarantee a time-consistent copy.

> > > > But the performance gain of copy offload is too big to just

> > > > ignore,

> > > > and

> > > > in fact it's what copy_file_range does on every filesystem but

> > > > btrfs and

> > > > ocfs2 (and maybe cifs?), so I don't think we can just ignore

> > > > it.

> > > > 

> > > > If we had separate copy_file_range and clone_file_range, I

> > > > *think*

> > > > it

> > > > could all be made sensible.  Am I missing something?

> > > > 

> > > 

> > > How would the application (cp) know when to call the

> > > clone_file_range

> > > and when to call copy_file_range?

> > 

> > cp can probably call copy_file_range(), but any application that

> > needs

> > atomic semantics (i.e. a binary operation success/fail) must call

> > clone_file_range().

> 

> I don't believe there's a clone_file_range().  I see the vfs

> interface,

> but no system call.


There is a standard FICLONERANGE ioctl() that can be used on all
filesystems that support the vfs interface.

> And implementing a simple cp is harder than it should be when you

> don't

> know whether it's implemented as copy or clone.  You have to stat for

> the file size first, retry if you got it wrong, and also retry if you

> get a short read.  The example in the clone_file_range() man page is

> incomplete.


As I said, you shouldn't be using copy_file_range() either in the case
where the file is being modified.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
J. Bruce Fields March 9, 2017, 3:29 p.m. UTC | #21
On Wed, Mar 08, 2017 at 08:49:56PM +0000, Trond Myklebust wrote:
> On Wed, 2017-03-08 at 15:32 -0500, bfields@fieldses.org wrote:
> > On Wed, Mar 08, 2017 at 08:18:31PM +0000, Trond Myklebust wrote:
> > > On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > > > > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.o
> > > > > rg>
> > > > > wrote:
> > > > > 
> > > > > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia
> > > > > wrote:
> > > > > > 
> > > > > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infrade
> > > > > > > ad.o
> > > > > > > rg>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields
> > > > > > > wrote:
> > > > > > > > Since copy isn't atomic that check is never going to be
> > > > > > > > reliable.
> > > > > > > 
> > > > > > > That's true for everything that COPY does.  By that logic
> > > > > > > we
> > > > > > > should
> > > > > > > not implement it at all (a logic that I'd fully support)
> > > > > > 
> > > > > > If you were to only keep CLONE then you’d lose a huge
> > > > > > performance
> > > > > > gain
> > > > > > you get from server-to-server COPY. 
> > > > > 
> > > > > Yes.  Also, I think copy-like copy implementations have
> > > > > reasonable
> > > > > semantics that are basically the same as read:
> > > > > 
> > > > > 	- copy can return successfully with less copied than
> > > > > requested.
> > > > > 	- it's fine for the copied range to start and/or end
> > > > > past end
> > > > > of
> > > > > 	  file, it'll just return a short read.
> > > > > 	- A copy of more than 0 bytes returning 0 means you're
> > > > > at end
> > > > > of
> > > > > 	  file.
> > > > > 
> > > > > The particular problem here is that that doesn't fit how clone
> > > > > works at
> > > > > all.
> > > > > 
> > > > > It feels like what happened is that copy_file_range() was made
> > > > > mainly
> > > > > for the clone case, with the idea that copy might be
> > > > > reluctantly
> > > > > accepted as a second-class implementation.
> > > 
> > > Historically? No... Christoph added clone as a valid implementation
> > > of
> > > copy_file_range() almost a year after Zach and Anna defined the
> > > semantics of vfs_copy_file_range(). git blame is your friend...
> > 
> > Yeah, I know.  It still feels to me like the interface was originally
> > designed with clone in mind, but that's my vague impression from the
> > man
> > pages and half-remembered conversations.
> > 
> > Though the lack of a "just copy the whole file regardless of size"
> > case
> > is weird for clone.  All you can do is stat the file and then hope it
> > doesn't change before you issue the copy_file_range.  But I'd think
> > it'd
> > be easy for an atomic clone implementation to handle, say, getting a
> > snapshot of a log file while it's getting continuously appended to.
> 
> It really isn't that interesting in the continuously appended case
> (what difference does it make if you only get data from just a few
> moments ago), but I can see it being an issue in the case of random
> writes where the file size is being extended.

Bah, yes, apologies for the bad example.

> The thing is that in both those cases, the copy_file_range() semantics
> are worse, since they don't even guarantee a time-consistent copy.
>
> > > > > But the performance gain of copy offload is too big to just
> > > > > ignore,
> > > > > and
> > > > > in fact it's what copy_file_range does on every filesystem but
> > > > > btrfs and
> > > > > ocfs2 (and maybe cifs?), so I don't think we can just ignore
> > > > > it.
> > > > > 
> > > > > If we had separate copy_file_range and clone_file_range, I
> > > > > *think*
> > > > > it
> > > > > could all be made sensible.  Am I missing something?
> > > > > 
> > > > 
> > > > How would the application (cp) know when to call the
> > > > clone_file_range
> > > > and when to call copy_file_range?
> > > 
> > > cp can probably call copy_file_range(), but any application that
> > > needs
> > > atomic semantics (i.e. a binary operation success/fail) must call
> > > clone_file_range().
> > 
> > I don't believe there's a clone_file_range().  I see the vfs
> > interface,
> > but no system call.
> 
> There is a standard FICLONERANGE ioctl() that can be used on all
> filesystems that support the vfs interface.

Oh, thanks, I forgot about that.

So I don't understand why it needed to be added to copy_file_range().
The copy and clone semantics are different enough that I think callers
want to know which they're getting.

> > And implementing a simple cp is harder than it should be when you
> > don't
> > know whether it's implemented as copy or clone.  You have to stat for
> > the file size first, retry if you got it wrong, and also retry if you
> > get a short read.  The example in the clone_file_range() man page is
> > incomplete.
> 
> As I said, you shouldn't be using copy_file_range() either in the case
> where the file is being modified.

Don't we want it to have more or less the same behavior as a read-write
loop?  People are probably running backup programs that depend on just
simple copies, and maybe the results are good enough for their purposes,
or maybe they're actually corrupting parts of their backups and don't
know, but we can't suddenly start aborting their backups with errors and
tell users it's for their own good.  So copy_file_range() callers will
need to handle EINVAL on changing files somehow.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig March 9, 2017, 3:35 p.m. UTC | #22
[please trim your replies instead of this giant unreadable garbage,
 thanks!]

On Thu, Mar 09, 2017 at 10:29:48AM -0500, bfields@fieldses.org wrote:
> So I don't understand why it needed to be added to copy_file_range().
> The copy and clone semantics are different enough that I think callers
> want to know which they're getting.

Because if a file systems implements clone is literally is always better
than doing a copy loop, so using it is an absolute non-brainer.

> Don't we want it to have more or less the same behavior as a read-write
> loop?  People are probably running backup programs that depend on just
> simple copies, and maybe the results are good enough for their purposes,
> or maybe they're actually corrupting parts of their backups and don't
> know, but we can't suddenly start aborting their backups with errors and
> tell users it's for their own good.  So copy_file_range() callers will
> need to handle EINVAL on changing files somehow.

They do, and the system call has been in the tree for almost a year and
a half, so we can't simply change it.  Fortunately we do have a flags
argument that can be used to implement your preferred semantics if you
care deeply enough about them.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 9, 2017, 4:16 p.m. UTC | #23
On Thu, Mar 09, 2017 at 07:35:59AM -0800, hch@infradead.org wrote:
> On Thu, Mar 09, 2017 at 10:29:48AM -0500, bfields@fieldses.org wrote:
> > So I don't understand why it needed to be added to copy_file_range().
> > The copy and clone semantics are different enough that I think callers
> > want to know which they're getting.
> 
> Because if a file systems implements clone is literally is always better
> than doing a copy loop, so using it is an absolute non-brainer.

I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
seems more annoying and error-prone to be prepared for both, as opposed
to trying clone and then explicitly falling back to copy if that doesn't
work.  Maybe it's not that big a deal.

> They do, and the system call has been in the tree for almost a year and
> a half, so we can't simply change it.  Fortunately we do have a flags
> argument that can be used to implement your preferred semantics if you
> care deeply enough about them.

Yeah.

There are also some other offset, length combinations that currently
return -EINVAL, I wonder if any of those could be repurposed e.g. for a
"keep copying to end of file" call.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig March 9, 2017, 4:17 p.m. UTC | #24
On Thu, Mar 09, 2017 at 11:16:01AM -0500, bfields@fieldses.org wrote:
> I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
> seems more annoying and error-prone to be prepared for both, as opposed
> to trying clone and then explicitly falling back to copy if that doesn't
> work.  Maybe it's not that big a deal.

We can do short copies^H^H^H^H^Hclones for clone just as easily,
at least for local filesystems (NFS would require some tweaks due to the
protocol).
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia March 9, 2017, 5:28 p.m. UTC | #25
> On Mar 9, 2017, at 11:17 AM, hch@infradead.org wrote:
> 
> On Thu, Mar 09, 2017 at 11:16:01AM -0500, bfields@fieldses.org wrote:
>> I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
>> seems more annoying and error-prone to be prepared for both, as opposed
>> to trying clone and then explicitly falling back to copy if that doesn't
>> work.  Maybe it's not that big a deal.
> 
> We can do short copies^H^H^H^H^Hclones for clone just as easily,
> at least for local filesystems (NFS would require some tweaks due to the
> protocol).

I’m confused by the wording of “we can do … easily” . Is “can” = in the future? Currently, testing copy_file_range() on a btfs with argument of offset+len beyond the end of the file fail with EINVAL. Is NFS tweaking = revert the “MUST” in the spec for the check? 



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia March 9, 2017, 5:35 p.m. UTC | #26
> On Mar 9, 2017, at 11:16 AM, bfields@fieldses.org wrote:
> 
> On Thu, Mar 09, 2017 at 07:35:59AM -0800, hch@infradead.org wrote:
>> On Thu, Mar 09, 2017 at 10:29:48AM -0500, bfields@fieldses.org wrote:
>>> So I don't understand why it needed to be added to copy_file_range().
>>> The copy and clone semantics are different enough that I think callers
>>> want to know which they're getting.
>> 
>> Because if a file systems implements clone is literally is always better
>> than doing a copy loop, so using it is an absolute non-brainer.
> 
> I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
> seems more annoying and error-prone to be prepared for both, as opposed
> to trying clone and then explicitly falling back to copy if that doesn't
> work.  Maybe it's not that big a deal.

I’m confused at which layer are we calling clone() and if we get EINVAL we call copy()? In VFS? Or are we talking about the case where there are two different system calls clone_file_range() and copy_file_range() that are provided to the “cp” and it calls one and then falls back onto another?

> 
>> They do, and the system call has been in the tree for almost a year and
>> a half, so we can't simply change it.  Fortunately we do have a flags
>> argument that can be used to implement your preferred semantics if you
>> care deeply enough about them.
> 
> Yeah.
> 
> There are also some other offset, length combinations that currently
> return -EINVAL, I wonder if any of those could be repurposed e.g. for a
> "keep copying to end of file" call.
> 
> --b.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields March 9, 2017, 6:40 p.m. UTC | #27
On Thu, Mar 09, 2017 at 12:28:03PM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 9, 2017, at 11:17 AM, hch@infradead.org wrote:
> > 
> > On Thu, Mar 09, 2017 at 11:16:01AM -0500, bfields@fieldses.org
> > wrote:
> >> I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
> >> seems more annoying and error-prone to be prepared for both, as
> >> opposed to trying clone and then explicitly falling back to copy if
> >> that doesn't work.  Maybe it's not that big a deal.
> > 
> > We can do short copies^H^H^H^H^Hclones for clone just as easily, at
> > least for local filesystems (NFS would require some tweaks due to
> > the protocol).
> 
> I’m confused by the wording of “we can do … easily” . Is “can” = in
> the future? Currently, testing copy_file_range() on a btfs with
> argument of offset+len beyond the end of the file fail with EINVAL. Is
> NFS tweaking = revert the “MUST” in the spec for the check? 

Checking https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-41
... looks like the CLONE result doesn't even have a length.

Most users are probably cloning the whole file, so I guess it only
matters in the case the file's changing (so that the size returned from
stat might not be correct by the time you do the clone).

Allowing short copies would be one way to handle that--I think it'd work
then to just always request a clone to the maximum length.

Alternatively if the linux interface just had a way to say "clone to end
of file", that'd solve most of the problem and be a nice fit for the
existing NFS protocol (which special-cases length 0 to mean "to end of
file").

Maybe we could special-case the maximum size_t to mean that.  (What is
that, 2^63-1?.)

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig March 9, 2017, 9:55 p.m. UTC | #28
On Thu, Mar 09, 2017 at 12:28:03PM -0500, Olga Kornievskaia wrote:
> I’m confused by the wording of “we can do … easily” . Is “can” = in the future? Currently, testing copy_file_range() on a btfs with argument of offset+len beyond the end of the file fail with EINVAL. Is NFS tweaking = revert the “MUST” in the spec for the check? 

The current clone and copy documents (which are our specs) require
this behavior, so that is expected.  But if we decided we want a variant
of copy / clone that doesn't do this it would be very easy to implement
in the local file systems.  Only for NFS we'd get a failure back and
would have to retry the clone based on the updated file size.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index 5816d4c..1d9e305 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1526,6 +1526,9 @@  ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (unlikely(ret))
 		return ret;
 
+	if (pos_in >= i_size_read(inode_in))
+		return -EINVAL;
+
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND))