diff mbox series

[v1,01/11] fs: Don't copy beyond the end of the file

Message ID 20181019152932.32462-2-olga.kornievskaia@gmail.com (mailing list archive)
State New, archived
Headers show
Series client-side support for "inter" SSC copy | expand

Commit Message

Olga Kornievskaia Oct. 19, 2018, 3:29 p.m. UTC
From: Anna Schumaker <Anna.Schumaker@Netapp.com>

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
---
 fs/read_write.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Trond Myklebust Oct. 19, 2018, 4:13 p.m. UTC | #1
On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> 
> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> ---
>  fs/read_write.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 39b4a21..c60790f 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file
> *file_in, loff_t pos_in,
>  	if (unlikely(ret))
>  		return ret;
>  
> +	if (pos_in >= i_size_read(inode_in))
> +		return -EINVAL;
> +
>  	if (!(file_in->f_mode & FMODE_READ) ||
>  	    !(file_out->f_mode & FMODE_WRITE) ||
>  	    (file_out->f_flags & O_APPEND))

This patch requires an ACK from the VFS maintainers if I'm to push it
upstream.
Cc: Al and Linux-fsdevel.
Jeff Layton Oct. 21, 2018, 2:29 p.m. UTC | #2
On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> 
> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> ---
>  fs/read_write.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 39b4a21..c60790f 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>  	if (unlikely(ret))
>  		return ret;
>  
> +	if (pos_in >= i_size_read(inode_in))
> +		return -EINVAL;
> +
>  	if (!(file_in->f_mode & FMODE_READ) ||
>  	    !(file_out->f_mode & FMODE_WRITE) ||
>  	    (file_out->f_flags & O_APPEND))

The patch description could use a bit more fleshing-out. The
copy_file_range manpage says:

       EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.

So I guess this is intended to satisfy that requirement? If so,
i_size_read is just going to return whatever is in inode->isize. 

Could a copy_file_range call end up getting issued to copy from a file
that is already open on a range that it doesn't know about yet? i.e.
where the inode cache has not yet been updated.

It seems like that could on network filesystems (like NFS). Would this
be better handled in ->copy_file_range instead, where the driver could
make a better determination of the file size?
Olga Kornievskaia Oct. 22, 2018, 6:32 p.m. UTC | #3
On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> >
> > Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > ---
> >  fs/read_write.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 39b4a21..c60790f 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> >       if (unlikely(ret))
> >               return ret;
> >
> > +     if (pos_in >= i_size_read(inode_in))
> > +             return -EINVAL;
> > +
> >       if (!(file_in->f_mode & FMODE_READ) ||
> >           !(file_out->f_mode & FMODE_WRITE) ||
> >           (file_out->f_flags & O_APPEND))
>
> The patch description could use a bit more fleshing-out. The
> copy_file_range manpage says:
>
>        EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
>
> So I guess this is intended to satisfy that requirement?

I agree the description of the patch is poor. It sort of falls under
the the man page's description of range beyond the end of the source
file. But in NFSv4.2, there is an explicit wording for the validity of
the input parameters and having an input source offset that's beyond
the end of the file is what this patch is attempting to check.

> If so,
> i_size_read is just going to return whatever is in inode->isize.

> Could a copy_file_range call end up getting issued to copy from a file
> that is already open on a range that it doesn't know about yet? i.e.
> where the inode cache has not yet been updated.

I thought that with NFSv4 cache consistency, the inode->isize is
accurate. If previous open had a read delegation, any modification on
a server would trigger a CB_RECALL and the open for the copy offload
would retrieve the latest size. In case of no delegations, the open
retrieves the latest size and the call to copy_file_range() would have
an update size.

> It seems like that could on network filesystems (like NFS). Would this
> be better handled in ->copy_file_range instead, where the driver could
> make a better determination of the file size?

I'm not opposed to moving the size check into the NFS's copy_file_size
(again in my opinion NFS attribute cache has the same file size as the
inode's size). I think the thought was that such check should be done
at the VFS layer as oppose to doing it by each of the file systems.

> --
> Jeff Layton <jlayton@kernel.org>
>
Jeff Layton Oct. 22, 2018, 11:23 p.m. UTC | #4
On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > 
> > > Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > ---
> > >  fs/read_write.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index 39b4a21..c60790f 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > >       if (unlikely(ret))
> > >               return ret;
> > > 
> > > +     if (pos_in >= i_size_read(inode_in))
> > > +             return -EINVAL;
> > > +
> > >       if (!(file_in->f_mode & FMODE_READ) ||
> > >           !(file_out->f_mode & FMODE_WRITE) ||
> > >           (file_out->f_flags & O_APPEND))
> > 
> > The patch description could use a bit more fleshing-out. The
> > copy_file_range manpage says:
> > 
> >        EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > 
> > So I guess this is intended to satisfy that requirement?
> 
> I agree the description of the patch is poor. It sort of falls under
> the the man page's description of range beyond the end of the source
> file. But in NFSv4.2, there is an explicit wording for the validity of
> the input parameters and having an input source offset that's beyond
> the end of the file is what this patch is attempting to check.
> 

Side note:

One has to wonder why they decided to make that an -EINVAL condition?
The system call returns ssize_t. Why not just return a fewer number of
bytes in that situation?

In fact, the RETURN VALUE section of the manpage says:

       Upon successful completion, copy_file_range() will return the number of
       bytes copied between files.  This could be less than the length  origi‐
       nally requested.

Under what conditions would that occur that do not include the file
being shorter than the range you wanted to copy?

I wonder if we ought to lobby to get that changed.

> > If so,
> > i_size_read is just going to return whatever is in inode->isize.
> > Could a copy_file_range call end up getting issued to copy from a file
> > that is already open on a range that it doesn't know about yet? i.e.
> > where the inode cache has not yet been updated.
> 
> I thought that with NFSv4 cache consistency, the inode->isize is
> accurate. If previous open had a read delegation, any modification on
> a server would trigger a CB_RECALL and the open for the copy offload
> would retrieve the latest size. In case of no delegations, the open
> retrieves the latest size and the call to copy_file_range() would have
> an update size.
> 

> It seems like that could on network filesystems (like NFS). Would this
> > be better handled in ->copy_file_range instead, where the driver could
> > make a better determination of the file size?
> 
> I'm not opposed to moving the size check into the NFS's copy_file_size
> (again in my opinion NFS attribute cache has the same file size as the
> inode's size). I think the thought was that such check should be done
> at the VFS layer as oppose to doing it by each of the file systems.
> 
> 

The attribute cache is not revalidated before the i_size is fetched with
i_size_read. You're just reading what happens to be in the in-memory
inode structure. So both clients have the file open already, and neither
has a delegation:

client 1: fetches attributes from file and sees a size of 1000
client 2: writes 20 bytes at offset 1000
client 1: calls copy file range to copy 1020 bytes starting at offset 0

If client1 didn't get an attribute update before the copy_file_range
call came in, then it would still think the size was 1000 and fail the
operation. It may even be many seconds before client1 sees the updated
size.

You could argue that we're not using locking here so you're just subject
to normal open-to-close cache coherency behavior, but that's rather "not
nice".

I think we probably ought to also push this check down into the
filesystem operations as well, and have copy_file_range ensure that the
attribute cache is updated. We're dealing with copy offload here so
doing an extra GETATTR beforehand shouldn't be terribly costly.
Olga Kornievskaia Oct. 23, 2018, 4:50 p.m. UTC | #5
On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > >
> > > > Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > ---
> > > >  fs/read_write.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > index 39b4a21..c60790f 100644
> > > > --- a/fs/read_write.c
> > > > +++ b/fs/read_write.c
> > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > >       if (unlikely(ret))
> > > >               return ret;
> > > >
> > > > +     if (pos_in >= i_size_read(inode_in))
> > > > +             return -EINVAL;
> > > > +
> > > >       if (!(file_in->f_mode & FMODE_READ) ||
> > > >           !(file_out->f_mode & FMODE_WRITE) ||
> > > >           (file_out->f_flags & O_APPEND))
> > >
> > > The patch description could use a bit more fleshing-out. The
> > > copy_file_range manpage says:
> > >
> > >        EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > >
> > > So I guess this is intended to satisfy that requirement?
> >
> > I agree the description of the patch is poor. It sort of falls under
> > the the man page's description of range beyond the end of the source
> > file. But in NFSv4.2, there is an explicit wording for the validity of
> > the input parameters and having an input source offset that's beyond
> > the end of the file is what this patch is attempting to check.
> >
>
> Side note:
>
> One has to wonder why they decided to make that an -EINVAL condition?

To me that sounds like a correct use of -EINVAL error to check for
invalid arguments.

You can argue that since there were no bytes to copy then returning 0
would be an appropriate "size" of the copy. However, I would argue how
would a caller distinguish this 0 size which really means don't try to
copy more vs another case when copy_file_range() returned less byte
(0) and so that caller should loop to get the rest.

> The system call returns ssize_t. Why not just return a fewer number of
> bytes in that situation?
>
> In fact, the RETURN VALUE section of the manpage says:
>
>        Upon successful completion, copy_file_range() will return the number of
>        bytes copied between files.  This could be less than the length  origi‐
>        nally requested.
>
> Under what conditions would that occur that do not include the file
> being shorter than the range you wanted to copy?
>
> I wonder if we ought to lobby to get that changed.

Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
"written in stone".

>
> > > If so,
> > > i_size_read is just going to return whatever is in inode->isize.
> > > Could a copy_file_range call end up getting issued to copy from a file
> > > that is already open on a range that it doesn't know about yet? i.e.
> > > where the inode cache has not yet been updated.
> >
> > I thought that with NFSv4 cache consistency, the inode->isize is
> > accurate. If previous open had a read delegation, any modification on
> > a server would trigger a CB_RECALL and the open for the copy offload
> > would retrieve the latest size. In case of no delegations, the open
> > retrieves the latest size and the call to copy_file_range() would have
> > an update size.
> >
>
> > It seems like that could on network filesystems (like NFS). Would this
> > > be better handled in ->copy_file_range instead, where the driver could
> > > make a better determination of the file size?
> >
> > I'm not opposed to moving the size check into the NFS's copy_file_size
> > (again in my opinion NFS attribute cache has the same file size as the
> > inode's size). I think the thought was that such check should be done
> > at the VFS layer as oppose to doing it by each of the file systems.
> >
> >
>
> The attribute cache is not revalidated before the i_size is fetched with
> i_size_read. You're just reading what happens to be in the in-memory
> inode structure. So both clients have the file open already, and neither
> has a delegation:
>
> client 1: fetches attributes from file and sees a size of 1000
> client 2: writes 20 bytes at offset 1000
> client 1: calls copy file range to copy 1020 bytes starting at offset 0
>
> If client1 didn't get an attribute update before the copy_file_range
> call came in, then it would still think the size was 1000 and fail the
> operation. It may even be many seconds before client1 sees the updated
> size.
>
> You could argue that we're not using locking here so you're just subject
> to normal open-to-close cache coherency behavior, but that's rather "not
> nice".

Yes I would argue that locking needs to be used. In case your
describing it is impossible to get an accurate file size. Even if the
client issues a GETATTR there is nothing prevents another client from
changing it right after that GETATTR was already replied to.

What nfscopy application does is it opens then file and then locks it
(as suggested by the spec), then calls the copy_file_range() system
call. You can argue (if we didn't get a delegation) that a file might
have been changed between the reply to the OPEN and the LOCK operation
and therefore, we should send another GETATTR just in case. To guard
against that, I can add a getattr call though I think it's an overkill
specially since linux server always grants us a read delegation.

> I think we probably ought to also push this check down into the
> filesystem operations as well, and have copy_file_range ensure that the
> attribute cache is updated. We're dealing with copy offload here so
> doing an extra GETATTR beforehand shouldn't be terribly costly.
> --
> Jeff Layton <jlayton@kernel.org>
>
Jeff Layton Oct. 24, 2018, 11:09 a.m. UTC | #6
On Tue, 2018-10-23 at 12:50 -0400, Olga Kornievskaia wrote:
> On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > 
> > > > > Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > ---
> > > > >  fs/read_write.c | 3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > > 
> > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > index 39b4a21..c60790f 100644
> > > > > --- a/fs/read_write.c
> > > > > +++ b/fs/read_write.c
> > > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > >       if (unlikely(ret))
> > > > >               return ret;
> > > > > 
> > > > > +     if (pos_in >= i_size_read(inode_in))
> > > > > +             return -EINVAL;
> > > > > +
> > > > >       if (!(file_in->f_mode & FMODE_READ) ||
> > > > >           !(file_out->f_mode & FMODE_WRITE) ||
> > > > >           (file_out->f_flags & O_APPEND))
> > > > 
> > > > The patch description could use a bit more fleshing-out. The
> > > > copy_file_range manpage says:
> > > > 
> > > >        EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > > > 
> > > > So I guess this is intended to satisfy that requirement?
> > > 
> > > I agree the description of the patch is poor. It sort of falls under
> > > the the man page's description of range beyond the end of the source
> > > file. But in NFSv4.2, there is an explicit wording for the validity of
> > > the input parameters and having an input source offset that's beyond
> > > the end of the file is what this patch is attempting to check.
> > > 
> > 
> > Side note:
> > 
> > One has to wonder why they decided to make that an -EINVAL condition?
> 
> To me that sounds like a correct use of -EINVAL error to check for
> invalid arguments.
> 
> You can argue that since there were no bytes to copy then returning 0
> would be an appropriate "size" of the copy. However, I would argue how
> would a caller distinguish this 0 size which really means don't try to
> copy more vs another case when copy_file_range() returned less byte
> (0) and so that caller should loop to get the rest.
> 

I don't know -- it seems to run contrary to how read(2) and write(2)
work with an EOF condition. I don't see why the implementation wouldn't
want to just copy what you can up to the EOF of the source file. Maybe I
need to go back and review the discussion from when the syscall was
merged...

> > The system call returns ssize_t. Why not just return a fewer number of
> > bytes in that situation?
> > 
> > In fact, the RETURN VALUE section of the manpage says:
> > 
> >        Upon successful completion, copy_file_range() will return the number of
> >        bytes copied between files.  This could be less than the length  origi‐
> >        nally requested.
> > 
> > Under what conditions would that occur that do not include the file
> > being shorter than the range you wanted to copy?
> > 
> > I wonder if we ought to lobby to get that changed.
> 
> Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
> "written in stone".
> 

No, I meant the copy_file_range spec (such as it is). I guess the v4.2
spec has this though:

   If the source offset or the source offset plus count is greater than
   the size of the source file, the operation MUST fail with
   NFS4ERR_INVAL.

I wonder if you'd be better off not trying to enforce this on the client
and simply let the server return NFS4ERR_INVAL if we're beyond EOF on
the source? That way it doesn't matter whether the client's attr cache
is up to date or not.

> > 
> > > > If so,
> > > > i_size_read is just going to return whatever is in inode->isize.
> > > > Could a copy_file_range call end up getting issued to copy from a file
> > > > that is already open on a range that it doesn't know about yet? i.e.
> > > > where the inode cache has not yet been updated.
> > > 
> > > I thought that with NFSv4 cache consistency, the inode->isize is
> > > accurate. If previous open had a read delegation, any modification on
> > > a server would trigger a CB_RECALL and the open for the copy offload
> > > would retrieve the latest size. In case of no delegations, the open
> > > retrieves the latest size and the call to copy_file_range() would have
> > > an update size.
> > > 
> > > It seems like that could on network filesystems (like NFS). Would this
> > > > be better handled in ->copy_file_range instead, where the driver could
> > > > make a better determination of the file size?
> > > 
> > > I'm not opposed to moving the size check into the NFS's copy_file_size
> > > (again in my opinion NFS attribute cache has the same file size as the
> > > inode's size). I think the thought was that such check should be done
> > > at the VFS layer as oppose to doing it by each of the file systems.
> > > 
> > > 
> > 
> > The attribute cache is not revalidated before the i_size is fetched with
> > i_size_read. You're just reading what happens to be in the in-memory
> > inode structure. So both clients have the file open already, and neither
> > has a delegation:
> > 
> > client 1: fetches attributes from file and sees a size of 1000
> > client 2: writes 20 bytes at offset 1000
> > client 1: calls copy file range to copy 1020 bytes starting at offset 0
> > 
> > If client1 didn't get an attribute update before the copy_file_range
> > call came in, then it would still think the size was 1000 and fail the
> > operation. It may even be many seconds before client1 sees the updated
> > size.
> > 
> > You could argue that we're not using locking here so you're just subject
> > to normal open-to-close cache coherency behavior, but that's rather "not
> > nice".
> 
> Yes I would argue that locking needs to be used. In case your
> describing it is impossible to get an accurate file size. Even if the
> client issues a GETATTR there is nothing prevents another client from
> changing it right after that GETATTR was already replied to.
> 
> What nfscopy application does is it opens then file and then locks it
> (as suggested by the spec), then calls the copy_file_range() system
> call. You can argue (if we didn't get a delegation) that a file might
> have been changed between the reply to the OPEN and the LOCK operation
> and therefore, we should send another GETATTR just in case. To guard
> against that, I can add a getattr call though I think it's an overkill
> specially since linux server always grants us a read delegation.
> 

The spec does say you might need to lock the files but they don't say
you SHOULD, only that servers need to be able to deal with lock
stateids.  hat said, you have a good point. We're not violating anything
by not revalidating the cache beforehand. Maybe we don't need to do this
after all.

Personally, I think this would be best enforced by the NFS server. Just
fire off the COPY request as-is and let the server vet the lengths.

Side note: a delegation is never guaranteed. knfsd won't grant one if
the inode falls afoul of the bloom filter, for instance, and that can
easily happen if (for example) there is a hash collision between
different inodes.

> > I think we probably ought to also push this check down into the
> > filesystem operations as well, and have copy_file_range ensure that the
> > attribute cache is updated. We're dealing with copy offload here so
> > doing an extra GETATTR beforehand shouldn't be terribly costly.
> > --
> > Jeff Layton <jlayton@kernel.org>
Olga Kornievskaia Oct. 24, 2018, 3:59 p.m. UTC | #7
On Wed, Oct 24, 2018 at 7:09 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2018-10-23 at 12:50 -0400, Olga Kornievskaia wrote:
> > On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > > > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > >
> > > > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > > > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > >
> > > > > > Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > > ---
> > > > > >  fs/read_write.c | 3 +++
> > > > > >  1 file changed, 3 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > index 39b4a21..c60790f 100644
> > > > > > --- a/fs/read_write.c
> > > > > > +++ b/fs/read_write.c
> > > > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > > >       if (unlikely(ret))
> > > > > >               return ret;
> > > > > >
> > > > > > +     if (pos_in >= i_size_read(inode_in))
> > > > > > +             return -EINVAL;
> > > > > > +
> > > > > >       if (!(file_in->f_mode & FMODE_READ) ||
> > > > > >           !(file_out->f_mode & FMODE_WRITE) ||
> > > > > >           (file_out->f_flags & O_APPEND))
> > > > >
> > > > > The patch description could use a bit more fleshing-out. The
> > > > > copy_file_range manpage says:
> > > > >
> > > > >        EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > > > >
> > > > > So I guess this is intended to satisfy that requirement?
> > > >
> > > > I agree the description of the patch is poor. It sort of falls under
> > > > the the man page's description of range beyond the end of the source
> > > > file. But in NFSv4.2, there is an explicit wording for the validity of
> > > > the input parameters and having an input source offset that's beyond
> > > > the end of the file is what this patch is attempting to check.
> > > >
> > >
> > > Side note:
> > >
> > > One has to wonder why they decided to make that an -EINVAL condition?
> >
> > To me that sounds like a correct use of -EINVAL error to check for
> > invalid arguments.
> >
> > You can argue that since there were no bytes to copy then returning 0
> > would be an appropriate "size" of the copy. However, I would argue how
> > would a caller distinguish this 0 size which really means don't try to
> > copy more vs another case when copy_file_range() returned less byte
> > (0) and so that caller should loop to get the rest.
> >
>
> I don't know -- it seems to run contrary to how read(2) and write(2)

That's because copy_file_range is not just read/write but also lseek.
I think of it as doing lseek(to source offset)->read()->lseek(to dst
offset)->write. and lseek() does return EINVAL when position is beyond
the end of the file.

> work with an EOF condition. I don't see why the implementation wouldn't
> want to just copy what you can up to the EOF of the source file. Maybe I
> need to go back and review the discussion from when the syscall was
> merged...

>
> > > The system call returns ssize_t. Why not just return a fewer number of
> > > bytes in that situation?
> > >
> > > In fact, the RETURN VALUE section of the manpage says:
> > >
> > >        Upon successful completion, copy_file_range() will return the number of
> > >        bytes copied between files.  This could be less than the length  origi‐
> > >        nally requested.
> > >
> > > Under what conditions would that occur that do not include the file
> > > being shorter than the range you wanted to copy?
> > >
> > > I wonder if we ought to lobby to get that changed.
> >
> > Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
> > "written in stone".
> >
>
> No, I meant the copy_file_range spec (such as it is). I guess the v4.2
> spec has this though:
>
>    If the source offset or the source offset plus count is greater than
>    the size of the source file, the operation MUST fail with
>    NFS4ERR_INVAL.
>
> I wonder if you'd be better off not trying to enforce this on the client
> and simply let the server return NFS4ERR_INVAL if we're beyond EOF on
> the source? That way it doesn't matter whether the client's attr cache
> is up to date or not.
>
> > >
> > > > > If so,
> > > > > i_size_read is just going to return whatever is in inode->isize.
> > > > > Could a copy_file_range call end up getting issued to copy from a file
> > > > > that is already open on a range that it doesn't know about yet? i.e.
> > > > > where the inode cache has not yet been updated.
> > > >
> > > > I thought that with NFSv4 cache consistency, the inode->isize is
> > > > accurate. If previous open had a read delegation, any modification on
> > > > a server would trigger a CB_RECALL and the open for the copy offload
> > > > would retrieve the latest size. In case of no delegations, the open
> > > > retrieves the latest size and the call to copy_file_range() would have
> > > > an update size.
> > > >
> > > > It seems like that could on network filesystems (like NFS). Would this
> > > > > be better handled in ->copy_file_range instead, where the driver could
> > > > > make a better determination of the file size?
> > > >
> > > > I'm not opposed to moving the size check into the NFS's copy_file_size
> > > > (again in my opinion NFS attribute cache has the same file size as the
> > > > inode's size). I think the thought was that such check should be done
> > > > at the VFS layer as oppose to doing it by each of the file systems.
> > > >
> > > >
> > >
> > > The attribute cache is not revalidated before the i_size is fetched with
> > > i_size_read. You're just reading what happens to be in the in-memory
> > > inode structure. So both clients have the file open already, and neither
> > > has a delegation:
> > >
> > > client 1: fetches attributes from file and sees a size of 1000
> > > client 2: writes 20 bytes at offset 1000
> > > client 1: calls copy file range to copy 1020 bytes starting at offset 0
> > >
> > > If client1 didn't get an attribute update before the copy_file_range
> > > call came in, then it would still think the size was 1000 and fail the
> > > operation. It may even be many seconds before client1 sees the updated
> > > size.
> > >
> > > You could argue that we're not using locking here so you're just subject
> > > to normal open-to-close cache coherency behavior, but that's rather "not
> > > nice".
> >
> > Yes I would argue that locking needs to be used. In case your
> > describing it is impossible to get an accurate file size. Even if the
> > client issues a GETATTR there is nothing prevents another client from
> > changing it right after that GETATTR was already replied to.
> >
> > What nfscopy application does is it opens then file and then locks it
> > (as suggested by the spec), then calls the copy_file_range() system
> > call. You can argue (if we didn't get a delegation) that a file might
> > have been changed between the reply to the OPEN and the LOCK operation
> > and therefore, we should send another GETATTR just in case. To guard
> > against that, I can add a getattr call though I think it's an overkill
> > specially since linux server always grants us a read delegation.
> >
>
> The spec does say you might need to lock the files but they don't say
> you SHOULD, only that servers need to be able to deal with lock
> stateids.  hat said, you have a good point. We're not violating anything
> by not revalidating the cache beforehand. Maybe we don't need to do this
> after all.
>
> Personally, I think this would be best enforced by the NFS server. Just
> fire off the COPY request as-is and let the server vet the lengths.
>
> Side note: a delegation is never guaranteed. knfsd won't grant one if
> the inode falls afoul of the bloom filter, for instance, and that can
> easily happen if (for example) there is a hash collision between
> different inodes.

I could push the checking validity of the arguments to the server (but
to me it sounds lame: a failure will require a network roundtrip). But
one can argue the server needs to check the validity of the arguments
anyway (as it can't rely on the client to play nice).

I still think the check should be done in both places (client and
server). And I feel like VFS is the right place to do so (as this
check implements what the man page states). But I don't feel strongly
about dropping the patch all together (I'll add a patch to the
server).

> > > I think we probably ought to also push this check down into the
> > > filesystem operations as well, and have copy_file_range ensure that the
> > > attribute cache is updated. We're dealing with copy offload here so
> > > doing an extra GETATTR beforehand shouldn't be terribly costly.
> > > --
> > > Jeff Layton <jlayton@kernel.org>
>
> --
> Jeff Layton <jlayton@kernel.org>
>
Olga Kornievskaia Oct. 24, 2018, 6:53 p.m. UTC | #8
On Wed, Oct 24, 2018 at 11:59 AM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Wed, Oct 24, 2018 at 7:09 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Tue, 2018-10-23 at 12:50 -0400, Olga Kornievskaia wrote:
> > > On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > >
> > > > On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > > > > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > >
> > > > > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > > > > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > > >
> > > > > > > Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > > > ---
> > > > > > >  fs/read_write.c | 3 +++
> > > > > > >  1 file changed, 3 insertions(+)
> > > > > > >
> > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > > index 39b4a21..c60790f 100644
> > > > > > > --- a/fs/read_write.c
> > > > > > > +++ b/fs/read_write.c
> > > > > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > > > >       if (unlikely(ret))
> > > > > > >               return ret;
> > > > > > >
> > > > > > > +     if (pos_in >= i_size_read(inode_in))
> > > > > > > +             return -EINVAL;
> > > > > > > +
> > > > > > >       if (!(file_in->f_mode & FMODE_READ) ||
> > > > > > >           !(file_out->f_mode & FMODE_WRITE) ||
> > > > > > >           (file_out->f_flags & O_APPEND))
> > > > > >
> > > > > > The patch description could use a bit more fleshing-out. The
> > > > > > copy_file_range manpage says:
> > > > > >
> > > > > >        EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > > > > >
> > > > > > So I guess this is intended to satisfy that requirement?
> > > > >
> > > > > I agree the description of the patch is poor. It sort of falls under
> > > > > the the man page's description of range beyond the end of the source
> > > > > file. But in NFSv4.2, there is an explicit wording for the validity of
> > > > > the input parameters and having an input source offset that's beyond
> > > > > the end of the file is what this patch is attempting to check.
> > > > >
> > > >
> > > > Side note:
> > > >
> > > > One has to wonder why they decided to make that an -EINVAL condition?
> > >
> > > To me that sounds like a correct use of -EINVAL error to check for
> > > invalid arguments.
> > >
> > > You can argue that since there were no bytes to copy then returning 0
> > > would be an appropriate "size" of the copy. However, I would argue how
> > > would a caller distinguish this 0 size which really means don't try to
> > > copy more vs another case when copy_file_range() returned less byte
> > > (0) and so that caller should loop to get the rest.
> > >
> >
> > I don't know -- it seems to run contrary to how read(2) and write(2)
>
> That's because copy_file_range is not just read/write but also lseek.
> I think of it as doing lseek(to source offset)->read()->lseek(to dst
> offset)->write. and lseek() does return EINVAL when position is beyond
> the end of the file.
>
> > work with an EOF condition. I don't see why the implementation wouldn't
> > want to just copy what you can up to the EOF of the source file. Maybe I
> > need to go back and review the discussion from when the syscall was
> > merged...
>
> >
> > > > The system call returns ssize_t. Why not just return a fewer number of
> > > > bytes in that situation?
> > > >
> > > > In fact, the RETURN VALUE section of the manpage says:
> > > >
> > > >        Upon successful completion, copy_file_range() will return the number of
> > > >        bytes copied between files.  This could be less than the length  origi‐
> > > >        nally requested.
> > > >
> > > > Under what conditions would that occur that do not include the file
> > > > being shorter than the range you wanted to copy?
> > > >
> > > > I wonder if we ought to lobby to get that changed.
> > >
> > > Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
> > > "written in stone".
> > >
> >
> > No, I meant the copy_file_range spec (such as it is). I guess the v4.2
> > spec has this though:
> >
> >    If the source offset or the source offset plus count is greater than
> >    the size of the source file, the operation MUST fail with
> >    NFS4ERR_INVAL.
> >
> > I wonder if you'd be better off not trying to enforce this on the client
> > and simply let the server return NFS4ERR_INVAL if we're beyond EOF on
> > the source? That way it doesn't matter whether the client's attr cache
> > is up to date or not.
> >
> > > >
> > > > > > If so,
> > > > > > i_size_read is just going to return whatever is in inode->isize.
> > > > > > Could a copy_file_range call end up getting issued to copy from a file
> > > > > > that is already open on a range that it doesn't know about yet? i.e.
> > > > > > where the inode cache has not yet been updated.
> > > > >
> > > > > I thought that with NFSv4 cache consistency, the inode->isize is
> > > > > accurate. If previous open had a read delegation, any modification on
> > > > > a server would trigger a CB_RECALL and the open for the copy offload
> > > > > would retrieve the latest size. In case of no delegations, the open
> > > > > retrieves the latest size and the call to copy_file_range() would have
> > > > > an update size.
> > > > >
> > > > > It seems like that could on network filesystems (like NFS). Would this
> > > > > > be better handled in ->copy_file_range instead, where the driver could
> > > > > > make a better determination of the file size?
> > > > >
> > > > > I'm not opposed to moving the size check into the NFS's copy_file_size
> > > > > (again in my opinion NFS attribute cache has the same file size as the
> > > > > inode's size). I think the thought was that such check should be done
> > > > > at the VFS layer as oppose to doing it by each of the file systems.
> > > > >
> > > > >
> > > >
> > > > The attribute cache is not revalidated before the i_size is fetched with
> > > > i_size_read. You're just reading what happens to be in the in-memory
> > > > inode structure. So both clients have the file open already, and neither
> > > > has a delegation:
> > > >
> > > > client 1: fetches attributes from file and sees a size of 1000
> > > > client 2: writes 20 bytes at offset 1000
> > > > client 1: calls copy file range to copy 1020 bytes starting at offset 0
> > > >
> > > > If client1 didn't get an attribute update before the copy_file_range
> > > > call came in, then it would still think the size was 1000 and fail the
> > > > operation. It may even be many seconds before client1 sees the updated
> > > > size.
> > > >
> > > > You could argue that we're not using locking here so you're just subject
> > > > to normal open-to-close cache coherency behavior, but that's rather "not
> > > > nice".
> > >
> > > Yes I would argue that locking needs to be used. In case your
> > > describing it is impossible to get an accurate file size. Even if the
> > > client issues a GETATTR there is nothing prevents another client from
> > > changing it right after that GETATTR was already replied to.
> > >
> > > What nfscopy application does is it opens then file and then locks it
> > > (as suggested by the spec), then calls the copy_file_range() system
> > > call. You can argue (if we didn't get a delegation) that a file might
> > > have been changed between the reply to the OPEN and the LOCK operation
> > > and therefore, we should send another GETATTR just in case. To guard
> > > against that, I can add a getattr call though I think it's an overkill
> > > specially since linux server always grants us a read delegation.
> > >
> >
> > The spec does say you might need to lock the files but they don't say
> > you SHOULD, only that servers need to be able to deal with lock
> > stateids.  hat said, you have a good point. We're not violating anything
> > by not revalidating the cache beforehand. Maybe we don't need to do this
> > after all.
> >
> > Personally, I think this would be best enforced by the NFS server. Just
> > fire off the COPY request as-is and let the server vet the lengths.
> >
> > Side note: a delegation is never guaranteed. knfsd won't grant one if
> > the inode falls afoul of the bloom filter, for instance, and that can
> > easily happen if (for example) there is a hash collision between
> > different inodes.
>
> I could push the checking validity of the arguments to the server (but
> to me it sounds lame: a failure will require a network roundtrip). But
> one can argue the server needs to check the validity of the arguments
> anyway (as it can't rely on the client to play nice).
>
> I still think the check should be done in both places (client and
> server). And I feel like VFS is the right place to do so (as this
> check implements what the man page states). But I don't feel strongly
> about dropping the patch all together (I'll add a patch to the
> server).

To add to my case, would it be acceptable to add a check *same as*
it's done for the vfs_clone_file_range()
pos_in+len > i_size_read() return EINVAL.

knfsd just calls in to the vfs to do the copy_file_range when it
receives it, so these checks should be a part of the VFS.

> > > > I think we probably ought to also push this check down into the
> > > > filesystem operations as well, and have copy_file_range ensure that the
> > > > attribute cache is updated. We're dealing with copy offload here so
> > > > doing an extra GETATTR beforehand shouldn't be terribly costly.
> > > > --
> > > > Jeff Layton <jlayton@kernel.org>
> >
> > --
> > Jeff Layton <jlayton@kernel.org>
> >
Olga Kornievskaia Oct. 24, 2018, 7:21 p.m. UTC | #9
On Wed, Oct 24, 2018 at 2:53 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Wed, Oct 24, 2018 at 11:59 AM Olga Kornievskaia
> <olga.kornievskaia@gmail.com> wrote:
> >
> > On Wed, Oct 24, 2018 at 7:09 AM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Tue, 2018-10-23 at 12:50 -0400, Olga Kornievskaia wrote:
> > > > On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > >
> > > > > On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > > > > > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > > >
> > > > > > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > > > > > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > > > >
> > > > > > > > Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > > > > ---
> > > > > > > >  fs/read_write.c | 3 +++
> > > > > > > >  1 file changed, 3 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > > > index 39b4a21..c60790f 100644
> > > > > > > > --- a/fs/read_write.c
> > > > > > > > +++ b/fs/read_write.c
> > > > > > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > > > > >       if (unlikely(ret))
> > > > > > > >               return ret;
> > > > > > > >
> > > > > > > > +     if (pos_in >= i_size_read(inode_in))
> > > > > > > > +             return -EINVAL;
> > > > > > > > +
> > > > > > > >       if (!(file_in->f_mode & FMODE_READ) ||
> > > > > > > >           !(file_out->f_mode & FMODE_WRITE) ||
> > > > > > > >           (file_out->f_flags & O_APPEND))
> > > > > > >
> > > > > > > The patch description could use a bit more fleshing-out. The
> > > > > > > copy_file_range manpage says:
> > > > > > >
> > > > > > >        EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > > > > > >
> > > > > > > So I guess this is intended to satisfy that requirement?
> > > > > >
> > > > > > I agree the description of the patch is poor. It sort of falls under
> > > > > > the the man page's description of range beyond the end of the source
> > > > > > file. But in NFSv4.2, there is an explicit wording for the validity of
> > > > > > the input parameters and having an input source offset that's beyond
> > > > > > the end of the file is what this patch is attempting to check.
> > > > > >
> > > > >
> > > > > Side note:
> > > > >
> > > > > One has to wonder why they decided to make that an -EINVAL condition?
> > > >
> > > > To me that sounds like a correct use of -EINVAL error to check for
> > > > invalid arguments.
> > > >
> > > > You can argue that since there were no bytes to copy then returning 0
> > > > would be an appropriate "size" of the copy. However, I would argue how
> > > > would a caller distinguish this 0 size which really means don't try to
> > > > copy more vs another case when copy_file_range() returned less byte
> > > > (0) and so that caller should loop to get the rest.
> > > >
> > >
> > > I don't know -- it seems to run contrary to how read(2) and write(2)
> >
> > That's because copy_file_range is not just read/write but also lseek.
> > I think of it as doing lseek(to source offset)->read()->lseek(to dst
> > offset)->write. and lseek() does return EINVAL when position is beyond
> > the end of the file.
> >
> > > work with an EOF condition. I don't see why the implementation wouldn't
> > > want to just copy what you can up to the EOF of the source file. Maybe I
> > > need to go back and review the discussion from when the syscall was
> > > merged...
> >
> > >
> > > > > The system call returns ssize_t. Why not just return a fewer number of
> > > > > bytes in that situation?
> > > > >
> > > > > In fact, the RETURN VALUE section of the manpage says:
> > > > >
> > > > >        Upon successful completion, copy_file_range() will return the number of
> > > > >        bytes copied between files.  This could be less than the length  origi‐
> > > > >        nally requested.
> > > > >
> > > > > Under what conditions would that occur that do not include the file
> > > > > being shorter than the range you wanted to copy?
> > > > >
> > > > > I wonder if we ought to lobby to get that changed.
> > > >
> > > > Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
> > > > "written in stone".
> > > >
> > >
> > > No, I meant the copy_file_range spec (such as it is). I guess the v4.2
> > > spec has this though:
> > >
> > >    If the source offset or the source offset plus count is greater than
> > >    the size of the source file, the operation MUST fail with
> > >    NFS4ERR_INVAL.
> > >
> > > I wonder if you'd be better off not trying to enforce this on the client
> > > and simply let the server return NFS4ERR_INVAL if we're beyond EOF on
> > > the source? That way it doesn't matter whether the client's attr cache
> > > is up to date or not.
> > >
> > > > >
> > > > > > > If so,
> > > > > > > i_size_read is just going to return whatever is in inode->isize.
> > > > > > > Could a copy_file_range call end up getting issued to copy from a file
> > > > > > > that is already open on a range that it doesn't know about yet? i.e.
> > > > > > > where the inode cache has not yet been updated.
> > > > > >
> > > > > > I thought that with NFSv4 cache consistency, the inode->isize is
> > > > > > accurate. If previous open had a read delegation, any modification on
> > > > > > a server would trigger a CB_RECALL and the open for the copy offload
> > > > > > would retrieve the latest size. In case of no delegations, the open
> > > > > > retrieves the latest size and the call to copy_file_range() would have
> > > > > > an update size.
> > > > > >
> > > > > > It seems like that could on network filesystems (like NFS). Would this
> > > > > > > be better handled in ->copy_file_range instead, where the driver could
> > > > > > > make a better determination of the file size?
> > > > > >
> > > > > > I'm not opposed to moving the size check into the NFS's copy_file_size
> > > > > > (again in my opinion NFS attribute cache has the same file size as the
> > > > > > inode's size). I think the thought was that such check should be done
> > > > > > at the VFS layer as oppose to doing it by each of the file systems.
> > > > > >
> > > > > >
> > > > >
> > > > > The attribute cache is not revalidated before the i_size is fetched with
> > > > > i_size_read. You're just reading what happens to be in the in-memory
> > > > > inode structure. So both clients have the file open already, and neither
> > > > > has a delegation:
> > > > >
> > > > > client 1: fetches attributes from file and sees a size of 1000
> > > > > client 2: writes 20 bytes at offset 1000
> > > > > client 1: calls copy file range to copy 1020 bytes starting at offset 0
> > > > >
> > > > > If client1 didn't get an attribute update before the copy_file_range
> > > > > call came in, then it would still think the size was 1000 and fail the
> > > > > operation. It may even be many seconds before client1 sees the updated
> > > > > size.
> > > > >
> > > > > You could argue that we're not using locking here so you're just subject
> > > > > to normal open-to-close cache coherency behavior, but that's rather "not
> > > > > nice".
> > > >
> > > > Yes I would argue that locking needs to be used. In case your
> > > > describing it is impossible to get an accurate file size. Even if the
> > > > client issues a GETATTR there is nothing prevents another client from
> > > > changing it right after that GETATTR was already replied to.
> > > >
> > > > What nfscopy application does is it opens then file and then locks it
> > > > (as suggested by the spec), then calls the copy_file_range() system
> > > > call. You can argue (if we didn't get a delegation) that a file might
> > > > have been changed between the reply to the OPEN and the LOCK operation
> > > > and therefore, we should send another GETATTR just in case. To guard
> > > > against that, I can add a getattr call though I think it's an overkill
> > > > specially since linux server always grants us a read delegation.
> > > >
> > >
> > > The spec does say you might need to lock the files but they don't say
> > > you SHOULD, only that servers need to be able to deal with lock
> > > stateids.  hat said, you have a good point. We're not violating anything
> > > by not revalidating the cache beforehand. Maybe we don't need to do this
> > > after all.
> > >
> > > Personally, I think this would be best enforced by the NFS server. Just
> > > fire off the COPY request as-is and let the server vet the lengths.
> > >
> > > Side note: a delegation is never guaranteed. knfsd won't grant one if
> > > the inode falls afoul of the bloom filter, for instance, and that can
> > > easily happen if (for example) there is a hash collision between
> > > different inodes.
> >
> > I could push the checking validity of the arguments to the server (but
> > to me it sounds lame: a failure will require a network roundtrip). But
> > one can argue the server needs to check the validity of the arguments
> > anyway (as it can't rely on the client to play nice).
> >
> > I still think the check should be done in both places (client and
> > server). And I feel like VFS is the right place to do so (as this
> > check implements what the man page states). But I don't feel strongly
> > about dropping the patch all together (I'll add a patch to the
> > server).
>
> To add to my case, would it be acceptable to add a check *same as*
> it's done for the vfs_clone_file_range()
> pos_in+len > i_size_read() return EINVAL.

Sigh. this is not what we want either because in previous discussion,
it was agreed that it would be beneficial to just return however much
the copy could do.

The problem is that copy doesn't spell out that copy of size 0 means
"end of copy" like an EOF. So either that needs to agreed upon and
then receiving size 0 would indicate stopping. However, if return of
size 0 copy isn't signaling an end, then we need to enforce if (pos_in
> i_size_read()) EINVAL check.

Question to the community: should the semantics of copy_file_range()
be clarified to give return of 0 a significance?

> knfsd just calls in to the vfs to do the copy_file_range when it
> receives it, so these checks should be a part of the VFS.
>
> > > > > I think we probably ought to also push this check down into the
> > > > > filesystem operations as well, and have copy_file_range ensure that the
> > > > > attribute cache is updated. We're dealing with copy offload here so
> > > > > doing an extra GETATTR beforehand shouldn't be terribly costly.
> > > > > --
> > > > > Jeff Layton <jlayton@kernel.org>
> > >
> > > --
> > > Jeff Layton <jlayton@kernel.org>
> > >
diff mbox series

Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index 39b4a21..c60790f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1570,6 +1570,9 @@  ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (unlikely(ret))
 		return ret;
 
+	if (pos_in >= i_size_read(inode_in))
+		return -EINVAL;
+
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND))