diff mbox series

[v2,01/10] VFS generic copy_file_range() support

Message ID 20181130200348.59524-2-olga.kornievskaia@gmail.com (mailing list archive)
State New, archived
Series server-side support for "inter" SSC copy

Commit Message

Olga Kornievskaia Nov. 30, 2018, 8:03 p.m. UTC
Relax the condition that the input files must be on the same
file system.

Add checks that the input parameters adhere to the expected semantics.

If no copy_file_range() support is found, then do generic
checks for unsupported page cache ranges and LFS limits,
and clear setuid/setgid if not running as root before calling
do_splice_direct(). Update atime, ctime, and mtime afterwards.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
---
 fs/read_write.c    | 66 ++++++++++++++++++++++++++++++++++++++++++------------
 include/linux/fs.h |  7 ++++++
 mm/filemap.c       |  6 ++---
 3 files changed, 61 insertions(+), 18 deletions(-)

Comments

Amir Goldstein Dec. 1, 2018, 8:11 a.m. UTC | #1
On Fri, Nov 30, 2018 at 10:04 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> Relax the condition that input files must be from the same
> file systems.
>
> Add checks that input parameters adhere semantics.
>
> If no copy_file_range() support is found, then do generic
> checks for the unsupported page cache ranges, LFS, limits,
> and clear setuid/setgid if not running as root before calling
> do_splice_direct(). Update atime,ctime,mtime afterwards.
>
> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> ---

This patch is either going to bring you down or make you stronger ;-)

This is not how it's done. Mixing a behavior change and refactoring into
one patch is wrong for several reasons. And when you relax the same-sb
check you need to restrict it inside the filesystems, like your previous
patch did.

You already had the v7 patch reviewed by 4 developers.
What made you go and change it (and post it as v2)?

Your intentions were good trying to fix the broken syscall, but
I hope you understood that Dave didn't mean that you *have* to
add the missing generic checks as part of your work. He just
pointed out how broken the current interface is in the context of
reviewing your patch.

In any case, I hear that Dave is neck deep in fixing copy_file_range()
so changes to this function should be collaborated with him. Or better
yet, wait until he posts his fixes and carry on from there.

If I were you, I would just go back to the reviewed v7 vfs patch.

Thanks,
Amir.
Olga Kornievskaia Dec. 1, 2018, 1:23 p.m. UTC | #2
On Sat, Dec 1, 2018 at 3:11 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Fri, Nov 30, 2018 at 10:04 PM Olga Kornievskaia
> <olga.kornievskaia@gmail.com> wrote:
> >
> > Relax the condition that input files must be from the same
> > file systems.
> >
> > Add checks that input parameters adhere semantics.
> >
> > If no copy_file_range() support is found, then do generic
> > checks for the unsupported page cache ranges, LFS, limits,
> > and clear setuid/setgid if not running as root before calling
> > do_splice_direct(). Update atime,ctime,mtime afterwards.
> >
> > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > ---
>
> This patch is either going to bring you down or make you stronger ;-)
>
> This is not how its done. Behavior change and refactoring mixed into
> one patch is wrong for several reasons. And when you relax same sb
> check you need to restrict it inside filesystems, like your previous patch
> did.
>
> You already had v7 patch reviewed-by 4 developers.
> What made you go and change it (and posted as v2)?
>
> Your intentions were good trying to fix the broken syscall, but
> I hope you understood that Dave didn't mean that you *have* to
> add the missing generic checks as part of your work. He just
> pointed out how broken the current interface is in the context of
> reviewing your patch.
>
> In any case, I hear that Dave is neck deep in fixing copy_file_range()
> so changes to this function should be collaborated with him. Or better
> yet, wait until he posts his fixes and carry on from there.
>
> If I were you, I would just go back to the reviewed v7 vfs patch.

This is NOT a replacement to the v7 vfs patch??? This is a new patch
on top of that one.

I assume that v7 patch has been OK-ed by everybody and is ready to go in???

As you recall, what was left was to provide the functionality to relax
the check that the superblocks are the same before calling
do_splice_direct(). This patch attempts to do this. I was under the
impression that doing so required adding extra checks, which I
added.


>
> Thanks,
> Amir.
Olga Kornievskaia Dec. 1, 2018, 1:44 p.m. UTC | #3
On Sat, Dec 1, 2018 at 8:23 AM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Sat, Dec 1, 2018 at 3:11 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Fri, Nov 30, 2018 at 10:04 PM Olga Kornievskaia
> > <olga.kornievskaia@gmail.com> wrote:
> > >
> > > Relax the condition that input files must be from the same
> > > file systems.
> > >
> > > Add checks that input parameters adhere semantics.
> > >
> > > If no copy_file_range() support is found, then do generic
> > > checks for the unsupported page cache ranges, LFS, limits,
> > > and clear setuid/setgid if not running as root before calling
> > > do_splice_direct(). Update atime,ctime,mtime afterwards.
> > >
> > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > > ---
> >
> > This patch is either going to bring you down or make you stronger ;-)
> >
> > This is not how its done. Behavior change and refactoring mixed into
> > one patch is wrong for several reasons. And when you relax same sb
> > check you need to restrict it inside filesystems, like your previous patch
> > did.
> >
> > You already had v7 patch reviewed-by 4 developers.
> > What made you go and change it (and posted as v2)?
> >
> > Your intentions were good trying to fix the broken syscall, but
> > I hope you understood that Dave didn't mean that you *have* to
> > add the missing generic checks as part of your work. He just
> > pointed out how broken the current interface is in the context of
> > reviewing your patch.
> >
> > In any case, I hear that Dave is neck deep in fixing copy_file_range()
> > so changes to this function should be collaborated with him. Or better
> > yet, wait until he posts his fixes and carry on from there.
> >
> > If I were you, I would just go back to the reviewed v7 vfs patch.
>
> This is NOT a replacement to the v7 vfs patch??? This is a new patch
> on top of that one.
>
> I assume that v7 patch has been OK-ed by everybody and is ready to go in???
>
> As you recall, what was left is to provide the functionality to relax
> the check for the superblocks to be the same before calling the
> do_splice_direct(). This patch attempt do this. I was under the
> impression that to do so extra checks were needed to be added which I
> added.
>

To clarify, previously I had a VFS patch with the client-side series
to support "server to server" copy offload. It needed the
functionality to be able to call copy_file_range with different super
blocks.

This patch series is for the server-side support for the "server to
server" copy offload. It requires the ability to call copy_file_range()
and do a copy between NFS and a local file system. Thus it needs
generic_copy_file_range().

>
> >
> > Thanks,
> > Amir.
Amir Goldstein Dec. 1, 2018, 4:59 p.m. UTC | #4
On Sat, Dec 1, 2018 at 5:57 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Sat, Dec 1, 2018 at 9:03 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> >
> >
> > On Sat, Dec 1, 2018, 3:44 PM Olga Kornievskaia <olga.kornievskaia@gmail.com wrote:
> >>
> >> On Sat, Dec 1, 2018 at 8:23 AM Olga Kornievskaia
> >> <olga.kornievskaia@gmail.com> wrote:
> >> >
> >> > On Sat, Dec 1, 2018 at 3:11 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >> > >
> >> > > On Fri, Nov 30, 2018 at 10:04 PM Olga Kornievskaia
> >> > > <olga.kornievskaia@gmail.com> wrote:
> >> > > >
> >> > > > Relax the condition that input files must be from the same
> >> > > > file systems.
> >> > > >
> >> > > > Add checks that input parameters adhere semantics.
> >> > > >
> >> > > > If no copy_file_range() support is found, then do generic
> >> > > > checks for the unsupported page cache ranges, LFS, limits,
> >> > > > and clear setuid/setgid if not running as root before calling
> >> > > > do_splice_direct(). Update atime,ctime,mtime afterwards.
> >> > > >
> >> > > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> >> > > > ---
> >> > >
> >> > > This patch is either going to bring you down or make you stronger ;-)
> >> > >
> >> > > This is not how its done. Behavior change and refactoring mixed into
> >> > > one patch is wrong for several reasons. And when you relax same sb
> >> > > check you need to restrict it inside filesystems, like your previous patch
> >> > > did.
> >> > >
> >> > > You already had v7 patch reviewed-by 4 developers.
> >> > > What made you go and change it (and posted as v2)?
> >> > >
> >> > > Your intentions were good trying to fix the broken syscall, but
> >> > > I hope you understood that Dave didn't mean that you *have* to
> >> > > add the missing generic checks as part of your work. He just
> >> > > pointed out how broken the current interface is in the context of
> >> > > reviewing your patch.
> >> > >
> >> > > In any case, I hear that Dave is neck deep in fixing copy_file_range()
> >> > > so changes to this function should be collaborated with him. Or better
> >> > > yet, wait until he posts his fixes and carry on from there.
> >> > >
> >> > > If I were you, I would just go back to the reviewed v7 vfs patch.
> >> >
> >> > This is NOT a replacement to the v7 vfs patch??? This is a new patch
> >> > on top of that one.
> >> >
> >> > I assume that v7 patch has been OK-ed by everybody and is ready to go in???
> >> >
> >> > As you recall, what was left is to provide the functionality to relax
> >> > the check for the superblocks to be the same before calling the
> >> > do_splice_direct(). This patch attempt do this. I was under the
> >> > impression that to do so extra checks were needed to be added which I
> >> > added.
> >> >
> >>
> >> To clarify, previously I had a VFS patch with the client-side series
> >> to support "server to server" copy offload. It needed the
> >> functionality to be able to call copy_file_range with different super
> >> blocks.
> >>
> >> This patch series is for the server side support for the "server to
> >> server" copy offload. It requires ability to call copy_file_range()
> >> and do a copy between NFS and a local file system. Thus it needs
> >> generic_copy_file_range.
> >
> >
> > Ah. Sorry for the confusion.
> > My comment on change of behavior and refactoring in same patch still hold.
> > My comment about coordinate your work with Dave Chinner still hold.
>
> Understood. I will email Dave directly and coordinate.
>
> > Raise that with a comment about adding test coverage to the new
> > generic cross fs copy API to xfstest.
>
> What kind of extra coverage are you envisioning? Something that
> requires two different file systems mounted and then does a fs copy?
>

Yes, if you add this functionality you should add test coverage for the
added functionality. It's not going to be trivial to add cross fs type tests
to xfstests, but adding cross fs (same type) should be relatively easy
(copy_file_range from test fs to scratch fs).

> > Am I mistaken that this change affects any cross fs copy file range
> > by userspace and not only by kernel nfsd?
>
> That's correct, any cross fs copy is what I'm going for here.
>

Forgive me for being thick. After briefly going over the patches, I still don't
understand if you *need* to add generic cross fs copy to implement
server side copy support in nfsd? Or if you are adding it as an added bonus
to the community along with your SSC patch set?

The first two patches of the series seem unrelated to the rest, but maybe
I'm just not getting the connection?

Thanks,
Amir.
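
For reference, the cross-filesystem case discussed in this comment can be
exercised from userspace. The following is only a minimal sketch, not code
from the series: copy_range_with_fallback() and write_all() are hypothetical
helper names, and the set of errno values that trigger the read/write
fallback (EXDEV, EOPNOTSUPP, ENOSYS) is an assumption about what a caller
might reasonably treat as "cannot offload". It loops because
copy_file_range() is allowed to return a short count.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all n bytes of buf to fd; returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t n)
{
	while (n > 0) {
		ssize_t w = write(fd, buf, n);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += w;
		n -= (size_t)w;
	}
	return 0;
}

/*
 * Hypothetical helper: copy up to len bytes from the current offset of
 * fd_in to the current offset of fd_out. Tries copy_file_range() first
 * and switches to a plain read/write loop if the kernel refuses to
 * handle this pair of files. Returns the number of bytes copied (which
 * is short if EOF is hit), or -1 with errno set.
 */
static ssize_t copy_range_with_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;
	bool use_syscall = true;

	while (done < len) {
		ssize_t n;

		if (use_syscall) {
			n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    len - done, 0);
			if (n < 0 && (errno == EXDEV || errno == EOPNOTSUPP ||
				      errno == ENOSYS)) {
				/* Assumed "cannot offload" cases. */
				use_syscall = false;
				continue;
			}
		} else {
			size_t want = len - done < sizeof(buf) ?
				      len - done : sizeof(buf);

			n = read(fd_in, buf, want);
			if (n > 0 && write_all(fd_out, buf, (size_t)n) < 0)
				return -1;
		}
		if (n < 0)
			return -1;
		if (n == 0)	/* EOF on the source */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

A test along the lines suggested above would open the source on one mount
and the destination on another, call the helper, and compare the contents.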
Matthew Wilcox Dec. 1, 2018, 9:18 p.m. UTC | #5
On Fri, Nov 30, 2018 at 03:03:39PM -0500, Olga Kornievskaia wrote:
> Relax the condition that input files must be from the same
> file systems.

> +	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> +			count > MAX_RW_COUNT ? MAX_RW_COUNT : count, 0);

Wasn't there a concern about splicing between filesystems with different
block sizes mentioned the last time this came up?  I can't find a citation
for that now.

> -	/* this could be relaxed once generic cross fs support is added */
> -	if (inode_in->i_sb != inode_out->i_sb) {
> -		ret = -EXDEV;
> -		goto done;
> -	}
Dave Chinner Dec. 1, 2018, 10 p.m. UTC | #6
On Sat, Dec 01, 2018 at 10:11:48AM +0200, Amir Goldstein wrote:
> On Fri, Nov 30, 2018 at 10:04 PM Olga Kornievskaia
> <olga.kornievskaia@gmail.com> wrote:
> >
> > Relax the condition that input files must be from the same
> > file systems.
> >
> > Add checks that input parameters adhere semantics.
> >
> > If no copy_file_range() support is found, then do generic
> > checks for the unsupported page cache ranges, LFS, limits,
> > and clear setuid/setgid if not running as root before calling
> > do_splice_direct(). Update atime,ctime,mtime afterwards.
> >
> > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > ---
> 
> This patch is either going to bring you down or make you stronger ;-)
> 
> This is not how its done. Behavior change and refactoring mixed into
> one patch is wrong for several reasons. And when you relax same sb
> check you need to restrict it inside filesystems, like your previous patch
> did.
.....
> In any case, I hear that Dave is neck deep in fixing copy_file_range()
> so changes to this function should be collaborated with him. Or better
> yet, wait until he posts his fixes and carry on from there.

Yeah, because I've heard nothing for a month and this is kinda
important, I have a series of 8-9 patches that make all the fixes we
need, push the cross-filesystem checks down into the filesystems,
and let filesystems handle the fallback to a splice based copy
themselves (because there are way more fallback cases than just
EOPNOTSUPP and EXDEV).

I also have a patch for the man page that documents all the missing
failure cases, and documents where things are filesystem specific or
not.

And I also have a fstests patch that exercises all the failure cases
so that all filesystems will end up behaving the same way for all
the same cases they should.

I'm still sorting out the fstests patch (it requires changes
to xfs_io's copy-range command) so I've got some confidence that the
code actually does what it says in the man page, but I should have
that sorted in a couple of days.

Cheers,

Dave.
Dave Chinner Dec. 1, 2018, 10:36 p.m. UTC | #7
On Sat, Dec 01, 2018 at 01:18:06PM -0800, Matthew Wilcox wrote:
> On Fri, Nov 30, 2018 at 03:03:39PM -0500, Olga Kornievskaia wrote:
> > Relax the condition that input files must be from the same
> > file systems.
> 
> > +	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> > +			count > MAX_RW_COUNT ? MAX_RW_COUNT : count, 0);
> 
> Wasn't there a concern about splicing between filesystems with different
> block sizes mentioned the last time this came up?  I can't find a citation
> for that now.

The filesystems should be able to handle that themselves - they are
just passed an iter that has a range of data regions in pages that
they copy the required data into/out of. The data transfer mechanism
itself is completely independent of filesystem block sizes....

There's lots of other problems with do_splice_direct, but I don't
think this is one of them. I could be wrong - this code has pretty
much zero documentation on how it is supposed to work and what it is
supposed to do - so don't take my word for it...

Cheers,

Dave.
Olga Kornievskaia Dec. 2, 2018, 3:12 a.m. UTC | #8
On Sat, Dec 1, 2018 at 5:00 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Dec 01, 2018 at 10:11:48AM +0200, Amir Goldstein wrote:
> > On Fri, Nov 30, 2018 at 10:04 PM Olga Kornievskaia
> > <olga.kornievskaia@gmail.com> wrote:
> > >
> > > Relax the condition that input files must be from the same
> > > file systems.
> > >
> > > Add checks that input parameters adhere semantics.
> > >
> > > If no copy_file_range() support is found, then do generic
> > > checks for the unsupported page cache ranges, LFS, limits,
> > > and clear setuid/setgid if not running as root before calling
> > > do_splice_direct(). Update atime,ctime,mtime afterwards.
> > >
> > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > > ---
> >
> > This patch is either going to bring you down or make you stronger ;-)
> >
> > This is not how its done. Behavior change and refactoring mixed into
> > one patch is wrong for several reasons. And when you relax same sb
> > check you need to restrict it inside filesystems, like your previous patch
> > did.
> .....
> > In any case, I hear that Dave is neck deep in fixing copy_file_range()
> > so changes to this function should be collaborated with him. Or better
> > yet, wait until he posts his fixes and carry on from there.
>
> Yeah, because I've heard nothing for a month and this is kinda
> important

Dave, I think that's unfair. It is important. NFS is actually the file
system that needed VFS support for cross fs copy_file_range and I was
working on it. If you were in doubt, you could have emailed and asked
me.

I'm now unsure what this means. I have a patch series with a VFS
patch that went through extensive review (people spent time on it)
and an NFS patch series that depends on it and is ready for the
upstream push. Are you saying that the VFS patch is no longer welcome
and thus the NFS series is no longer viable either?

> , I have a series of 8-9 patches that make all the fixes we
> need, push the cross-filesystem checks down into the filesystems,
> and let filesystems handle the fallback to a splice based copy
> themselves (because there are way more fallback cases than just
> EOPNOPSUPP and EXDEV).

Are you saying it is each individual filesystem's responsibility to
fall back on splice? Isn't that a step backwards? Each individual
filesystem is going to implement the same code of calling
do_splice_direct() to do the functionality that could and should be in
VFS?

>
> I also have a patch for the man page that document all the missing
> failure cases, and document where things are filesystem specific or
> not.
>
> And I also have a fstests patch that exercises all the failure cases
> so that all filesystems will end up behaving the same way for all
> the same cases they should.
>
> I'm still sorting out the fstests patch (it requires changes
> to xfs_io's copy-range command) so I've got some confidence that the
> code actually does what it says in the man page, but I should have
> that sorted in a couple of days.
>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
Olga Kornievskaia Dec. 2, 2018, 3:19 p.m. UTC | #9
On Sat, Dec 1, 2018 at 10:12 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Sat, Dec 1, 2018 at 5:00 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sat, Dec 01, 2018 at 10:11:48AM +0200, Amir Goldstein wrote:
> > > On Fri, Nov 30, 2018 at 10:04 PM Olga Kornievskaia
> > > <olga.kornievskaia@gmail.com> wrote:
> > > >
> > > > Relax the condition that input files must be from the same
> > > > file systems.
> > > >
> > > > Add checks that input parameters adhere semantics.
> > > >
> > > > If no copy_file_range() support is found, then do generic
> > > > checks for the unsupported page cache ranges, LFS, limits,
> > > > and clear setuid/setgid if not running as root before calling
> > > > do_splice_direct(). Update atime,ctime,mtime afterwards.
> > > >
> > > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > > > ---
> > >
> > > This patch is either going to bring you down or make you stronger ;-)
> > >
> > > This is not how its done. Behavior change and refactoring mixed into
> > > one patch is wrong for several reasons. And when you relax same sb
> > > check you need to restrict it inside filesystems, like your previous patch
> > > did.
> > .....
> > > In any case, I hear that Dave is neck deep in fixing copy_file_range()
> > > so changes to this function should be collaborated with him. Or better
> > > yet, wait until he posts his fixes and carry on from there.
> >
> > Yeah, because I've heard nothing for a month and this is kinda
> > important
>
> Dave I think that's unfair. It is important. NFS is actually the file
> system that needed VFS support for cross fs copy_file_range and I was
> working on it. If you were in doubt, you could have emailed and asked
> me.

Just to be clear: what I think was unfair in that comment was the
wording "this is kinda important". I think a lot stems from a lack of
clarity in the mailing list communications. I object to the fact
that it wasn't clear who was going to implement the functionality.
Since the work was needed by NFS I didn't want to assume that somebody
in VFS would just do it for us. At the time nobody in VFS stood up and
said they would do the work and thus I tried to do my best.

I'm grateful, and would have been in the first place, that somebody
did support generic cross-filesystem functionality. Thus I'm by no
means speaking against Dave's work.

> I'm unsure now what does this mean. I have a patch series with a VFS
> patch that went thru the extensive review (people spend time on it)
> and an NFS patch series that depends on it that is ready for the
> upstream push. Are you saying that the VFS patch is no longer welcomed
> and thus NFS series is no longer viable either?

I'm unclear of the fate of the patch set that has the (v7) VFS patch
that was reviewed and approved and is thought to be pushed for 4.21.
It is unclear if the new work is on top of that or not.

> , I have a series of 8-9 patches that make all the fixes we
> > need, push the cross-filesystem checks down into the filesystems,
> > and let filesystems handle the fallback to a splice based copy
> > themselves (because there are way more fallback cases than just
> > EOPNOPSUPP and EXDEV).
>
> Are you saying it is each individual filesystem responsibility to
> fallback on splice? Isn't that a step backwards? Each individual
> filesystem is going to implement the same code of calling
> do_splice_direct() to do the functionally that could and should be in
> VFS?
>
> >
> > I also have a patch for the man page that document all the missing
> > failure cases, and document where things are filesystem specific or
> > not.
> >
> > And I also have a fstests patch that exercises all the failure cases
> > so that all filesystems will end up behaving the same way for all
> > the same cases they should.
> >
> > I'm still sorting out the fstests patch (it requires changes
> > to xfs_io's copy-range command) so I've got some confidence that the
> > code actually does what it says in the man page, but I should have
> > that sorted in a couple of days.
> >
> > Cheers,
> >
> > Dave.
> >
> > --
> > Dave Chinner
> > david@fromorbit.com
Dave Chinner Dec. 2, 2018, 8:47 p.m. UTC | #10
On Sat, Dec 01, 2018 at 10:12:05PM -0500, Olga Kornievskaia wrote:
> On Sat, Dec 1, 2018 at 5:00 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sat, Dec 01, 2018 at 10:11:48AM +0200, Amir Goldstein wrote:
> > > On Fri, Nov 30, 2018 at 10:04 PM Olga Kornievskaia
> > > <olga.kornievskaia@gmail.com> wrote:
> > > >
> > > > Relax the condition that input files must be from the same
> > > > file systems.
> > > >
> > > > Add checks that input parameters adhere semantics.
> > > >
> > > > If no copy_file_range() support is found, then do generic
> > > > checks for the unsupported page cache ranges, LFS, limits,
> > > > and clear setuid/setgid if not running as root before calling
> > > > do_splice_direct(). Update atime,ctime,mtime afterwards.
> > > >
> > > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > > > ---
> > >
> > > This patch is either going to bring you down or make you stronger ;-)
> > >
> > > This is not how its done. Behavior change and refactoring mixed into
> > > one patch is wrong for several reasons. And when you relax same sb
> > > check you need to restrict it inside filesystems, like your previous patch
> > > did.
> > .....
> > > In any case, I hear that Dave is neck deep in fixing copy_file_range()
> > > so changes to this function should be collaborated with him. Or better
> > > yet, wait until he posts his fixes and carry on from there.
> >
> > Yeah, because I've heard nothing for a month and this is kinda
> > important
> 
> Dave I think that's unfair. It is important. NFS is actually the file
> system that needed VFS support for cross fs copy_file_range and I was
> working on it. If you were in doubt, you could have emailed and asked
> me.

Last I heard from you was "this isn't my problem and I don't have
time to deal with it". You were fairly unambiguous in saying you
weren't going to spend any time on it.

> I'm unsure now what does this mean. I have a patch series with a VFS
> patch that went thru the extensive review (people spend time on it)
> and an NFS patch series that depends on it that is ready for the
> upstream push. Are you saying that the VFS patch is no longer welcomed
> and thus NFS series is no longer viable either?

No, I'm saying that this is urgent work and needs to be separated
from the NFS patch series, of which there are now two and you've
split copy_file_range() changes across both patch sets.
copy_file_range() is broken for *everyone*, not just NFS.  i.e.
fixing these problems should not be tied to some other filesystem
feature patchset.

> , I have a series of 8-9 patches that make all the fixes we
> > need, push the cross-filesystem checks down into the filesystems,
> > and let filesystems handle the fallback to a splice based copy
> > themselves (because there are way more fallback cases than just
> > EOPNOPSUPP and EXDEV).
> 
> Are you saying it is each individual filesystem responsibility to
> fallback on splice? Isn't that a step backwards? Each individual
> filesystem is going to implement the same code of calling
> do_splice_direct() to do the functionally that could and should be in
> VFS?

I've done this because one of the problems I've found is that
different filesystems *do not fall back consistently*. e.g. the NFS
client will return -EINVAL if src/dst are the same file, but -EINVAL
is not one of the errors that the vfs code falls back to a data copy
on.

This is despite the fact that the fallback path can copy to/from
the same file, we support same file copy through the
->remap_file_range offload, etc. IOWs, the behaviour of the syscall
when it comes to single file ranges is completely inconsistent
because fallbacks are implemented on a filesystem-by-filesystem
basis.

I called the fallback generic_copy_file_range(), and filesystems that
implement ->copy_file_range() are responsible for calling it
themselves if they want a fallback. That's because there may be
different error/constraint conditions at the filesystem level that
prevent offloading the copy, and we can't distinguish at the VFS
between "-EINVAL means fallback because it was a single file copy"
and "-EINVAL means fail, parameter out of range".

IOWs, if you implement ->copy_file_range() you take full
responsibility for implementing the copying function. This is
exactly what we do for all the other file methods, so this is just
making the implementation behaviour consistent with the rest of the
code.
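
As a rough illustration of that split of responsibility, here is a small
userspace model. It is a sketch under stated assumptions, not kernel code:
every toy_* name is hypothetical, the "copy" just returns len, and a counter
stands in for the real data movement so the chosen path can be observed. The
point it tries to show is that only the filesystem method knows whether its
own -EINVAL means "fail" or "could not offload, fall back".

```c
#include <errno.h>
#include <stddef.h>

struct toy_file;

/* Hypothetical stand-in for the ->copy_file_range() file method. */
struct toy_file_ops {
	long (*copy_file_range)(struct toy_file *in, long pos_in,
				struct toy_file *out, long pos_out, long len);
};

struct toy_file {
	const struct toy_file_ops *ops;	/* NULL: filesystem has no method */
	int fs_id;			/* stands in for the superblock */
};

/* Counts how often the generic fallback path was taken. */
static int toy_fallback_calls;

/* Stand-in for the generic, splice-style data copy. */
static long toy_generic_copy(struct toy_file *in, long pos_in,
			     struct toy_file *out, long pos_out, long len)
{
	toy_fallback_calls++;
	return len;
}

/*
 * A filesystem method that owns the whole decision: genuine parameter
 * errors fail, cases it cannot offload (same file, different
 * filesystem) fall back, everything else is "offloaded".
 */
static long toy_fs_copy(struct toy_file *in, long pos_in,
			struct toy_file *out, long pos_out, long len)
{
	if (pos_in < 0 || pos_out < 0 || len < 0)
		return -EINVAL;		/* fail: parameter out of range */
	if (in == out || in->fs_id != out->fs_id)
		return toy_generic_copy(in, pos_in, out, pos_out, len);
	return len;			/* pretend the offload succeeded */
}

static const struct toy_file_ops toy_fs_ops = {
	.copy_file_range = toy_fs_copy,
};

/* VFS-level dispatch: no error-code guessing, just method or generic. */
static long toy_vfs_copy(struct toy_file *in, long pos_in,
			 struct toy_file *out, long pos_out, long len)
{
	if (out->ops && out->ops->copy_file_range)
		return out->ops->copy_file_range(in, pos_in, out,
						 pos_out, len);
	return toy_generic_copy(in, pos_in, out, pos_out, len);
}
```

With this shape the dispatcher never has to interpret an error code; a
filesystem that wants different fallback rules changes only its own method.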

FWIW, this also points out a problem with the copy_file_range()
definition - it does not say WTF should happen if the copy ranges
/overlap/ in the same file. clone is clear on that - support is
determined by the filesystem (i.e. "EINVAL [...] XFS and Btrfs do
not support overlapping reflink ranges in the same file."). For
copying, the fallback code can't copy the file data correctly if the
ranges overlap, so I've added checks to make this illegal and added
that overlapping ranges are not supported to the man page.....

These are the sort of API definition problems that I'm fixing
right now, and I'm writing tests to make sure that all filesystems
will behave the same way for given copy scenarios.

i.e. I'm not doing this so I can get a NFS feature patchset merged,
I'm doing this to make the copy_file_range API well defined and
robust and allow implementations to be verified against the
specification the man page lays out.

Cheers,

Dave.
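
The range rules mentioned in this comment and in the patch below (offsets
must not wrap, the source range must lie within EOF, and overlapping ranges
in the same file are rejected) can be sketched in plain C. This is an
illustration only: toy_check_copy_ranges() is a hypothetical name, the
within-EOF rule is taken from the posted patch rather than from any final
kernel behaviour, len == 0 is assumed to be handled by the caller first, and
__builtin_add_overflow() is a GCC/Clang builtin used here because signed
overflow is undefined in standard C.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 0 if the two ranges are acceptable for
 * a data copy, -EINVAL otherwise. Ranges are half-open:
 * [pos, pos + len).
 */
static int toy_check_copy_ranges(int64_t pos_in, int64_t pos_out,
				 int64_t len, int64_t size_in,
				 bool same_file)
{
	int64_t end_in, end_out;

	if (pos_in < 0 || pos_out < 0 || len < 0)
		return -EINVAL;

	/* Ensure offsets don't wrap. */
	if (__builtin_add_overflow(pos_in, len, &end_in) ||
	    __builtin_add_overflow(pos_out, len, &end_out))
		return -EINVAL;

	/* Ensure the source range is within EOF (rule from the patch). */
	if (pos_in >= size_in || end_in > size_in)
		return -EINVAL;

	/* Reject overlapping source and destination in the same file. */
	if (same_file && pos_in < end_out && pos_out < end_in)
		return -EINVAL;

	return 0;
}
```

Adjacent ranges such as [0, 50) and [50, 100) do not overlap under this
definition and are accepted.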
Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index 7b9e59d..2d309b0 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1540,6 +1540,44 @@  static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 }
 #endif
 
+ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				loff_t len, unsigned int flags)
+{
+	ssize_t ret;
+	loff_t size_in = i_size_read(file_inode(file_in)), count;
+
+	/* perform generic checks for unsupported page cache ranges, LFS
+	 * limits. If pos exceeds the limit, returns EFBIG
+	 */
+	count = min(len, size_in - pos_in);
+	ret = generic_access_check_limits(file_in, pos_in, &count);
+	if (ret)
+		goto done;
+	ret = generic_write_check_limits(file_out, pos_out, &count);
+	if (ret)
+		goto done;
+	/* If not running as root, clear setuid/setgid bits. This keeps
+	 * people from modifying setuid and setgid binaries.
+	 */
+	if (!IS_NOSEC(file_inode(file_out))) {
+		ret = file_remove_privs(file_out);
+		if (ret)
+			goto done;
+	}
+
+	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
+			count > MAX_RW_COUNT ? MAX_RW_COUNT : count, 0);
+
+	file_accessed(file_in);
+	if (!(file_out->f_mode & FMODE_NOCMTIME))
+		file_update_time(file_out);
+
+done:
+	return ret;
+}
+EXPORT_SYMBOL(generic_copy_file_range);
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1552,6 +1590,7 @@  ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	ssize_t ret;
+	loff_t size_in;
 
 	if (flags != 0)
 		return -EINVAL;
@@ -1577,6 +1616,15 @@  ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (len == 0)
 		return 0;
 
+	/* Ensure offsets don't wrap. */
+	if (pos_in + len < pos_in || pos_out + len < pos_out)
+		return -EINVAL;
+
+	size_in = i_size_read(inode_in);
+	/* Ensure that source range is within EOF. */
+	if (pos_in >= size_in || pos_in + len > size_in)
+		return -EINVAL;
+
 	file_start_write(file_out);
 
 	/*
@@ -1597,22 +1645,12 @@  ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 		}
 	}
 
-	if (file_out->f_op->copy_file_range) {
+	if (file_out->f_op->copy_file_range)
 		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
 						      pos_out, len, flags);
-		if (ret != -EOPNOTSUPP)
-			goto done;
-	}
-
-	/* this could be relaxed once generic cross fs support is added */
-	if (inode_in->i_sb != inode_out->i_sb) {
-		ret = -EXDEV;
-		goto done;
-	}
-
-	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
-			len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
-
+	else
+		ret = generic_copy_file_range(file_in, pos_in, file_out,
+					      pos_out, len, flags);
 done:
 	if (ret > 0) {
 		fsnotify_access(file_in);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c95c080..c88ad09 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1874,6 +1874,9 @@  extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
+				       struct file *file_out, loff_t pos_out,
+				       loff_t len, unsigned int flags);
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 loff_t *count,
@@ -3016,6 +3019,10 @@  static inline void remove_inode_hash(struct inode *inode)
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
+extern int generic_access_check_limits(struct file *file, loff_t pos,
+				       loff_t *count);
+extern int generic_write_check_limits(struct file *file, loff_t pos,
+				      loff_t *count);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
 				loff_t *count, unsigned int remap_flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8..894f3ae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2829,8 +2829,7 @@  struct page *read_cache_page_gfp(struct address_space *mapping,
  * LFS limits.  If pos is under the limit it becomes a short access.  If it
  * exceeds the limit we return -EFBIG.
  */
-static int generic_access_check_limits(struct file *file, loff_t pos,
-				       loff_t *count)
+int generic_access_check_limits(struct file *file, loff_t pos, loff_t *count)
 {
 	struct inode *inode = file->f_mapping->host;
 	loff_t max_size = inode->i_sb->s_maxbytes;
@@ -2844,8 +2843,7 @@  static int generic_access_check_limits(struct file *file, loff_t pos,
 	return 0;
 }
 
-static int generic_write_check_limits(struct file *file, loff_t pos,
-				      loff_t *count)
+int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
 {
 	loff_t limit = rlimit(RLIMIT_FSIZE);