
[v2,0/8] Fixes for major copy_file_range() issues

Message ID 20190526061100.21761-1-amir73il@gmail.com


Amir Goldstein May 26, 2019, 6:10 a.m. UTC
Hi Darrick,

Following is a re-work of Dave Chinner's patches from December [4].
I have updated the kernel patches [1], xfstests [2] and man-page [3]
according to the feedback on v1.

NOTE that this work changes user visible behavior of copy_file_range(2)!
It introduces new errors for cases that were not checked before and it
allows cross-device copy by default. After this work, cifs copy offload
should be possible between two shares on the same server, but I did not
check this functionality.

The major difference from v1 is to conform to short read(2) semantics
that are already implemented for copy_file_range(2) instead of the
documented EINVAL, as suggested by Christoph.
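
To illustrate the short read semantics, here is a minimal sketch of the
length clamping. It is not the kernel code; clamp_copy_len() and its
parameter names are hypothetical and only model the rule that a source
range reaching past EOF is shortened rather than rejected.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): shorten a requested copy
 * length to what the source file can supply, as read(2) would.
 *
 * pos_in  - starting offset in the source file
 * len     - number of bytes the caller asked for
 * size_in - current size of the source file
 */
static uint64_t clamp_copy_len(uint64_t pos_in, uint64_t len, uint64_t size_in)
{
	if (pos_in >= size_in)
		return 0;			/* at or past EOF: copy nothing */
	if (len > size_in - pos_in)
		return size_in - pos_in;	/* short copy up to EOF */
	return len;				/* range fully inside the file */
}
```

With this rule, a caller asking for 100 bytes from offset 0 of a 50 byte
file gets a 50 byte copy, and a caller starting at or past EOF gets 0.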

My tests of this work included testing various filesystems for the
fallback default copy_file_range implementation, both filesystems that
support copy_file_range and filesystems that do not. My tests did not
include actual copy offload with nfs/cifs/ceph, so any such tests by
said filesystem developers would be much appreciated.
Special thanks to Olga Kornievskaia and Luis Henriques for helping me
test this work on nfs and ceph.

Darrick, seeing that you and Dave invested most in this work and
previous similar fixes and tests of the remap_file_range series, I
thought it would be best if you carried these patches through the xfs
tree or coordinated their merge with the vfs tree.

Please note that the patch [8/8] is not related to copy_file_range,
but I included it in the series because it belongs in the context.

The man page update patch (again, mostly Dave's work) is appended
to the series.


Changes since v1:
- Short read instead of EINVAL (Christoph)
- generic_file_rw_checks() helper (Darrick)
- generic_copy_file_range_prep() helper (Christoph)
- Not calling ->remap_file_range() with different sb
- Not calling ->copy_file_range() with different fs type
- Remove changes to overlayfs
- Extra fix to clone/dedupe checks

Amir Goldstein (5):
  vfs: introduce generic_file_rw_checks()
  vfs: add missing checks to copy_file_range
  vfs: copy_file_range needs to strip setuid bits
  vfs: allow copy_file_range to copy across devices
  vfs: remove redundant checks from generic_remap_checks()

Dave Chinner (3):
  vfs: introduce generic_copy_file_range()
  vfs: no fallback for ->copy_file_range
  vfs: copy_file_range should update file timestamps

 fs/ceph/file.c     |  32 +++++++++-
 fs/cifs/cifsfs.c   |  13 +++-
 fs/fuse/file.c     |  28 ++++++++-
 fs/nfs/nfs42proc.c |   8 ++-
 fs/nfs/nfs4file.c  |  23 ++++++-
 fs/read_write.c    | 146 +++++++++++++++++++++++++++++++++------------
 include/linux/fs.h |   9 +++
 mm/filemap.c       |  81 +++++++++++++++++++++++--
 8 files changed, 283 insertions(+), 57 deletions(-)

[1] https://github.com/amir73il/linux/commits/copy_file_range-v2
[2] https://github.com/amir73il/xfstests/commits/copy_file_range-v2
[3] https://github.com/amir73il/man-pages/commits/copy_file_range-v2
[4] https://lore.kernel.org/linux-fsdevel/20181203083416.28978-1-david@fromorbit.com/

Original cover letter by Dave:

Hi folks,

As most of you already know, we really suck at introducing new
functionality. The recent problems we found with clone/dedupe file
range interfaces also plague the copy_file_range() API and
implementation. Not only doesn't it do exactly what the man page
says, the man page doesn't document everything the syscall does.

There's a few problems:
	- can overwrite setuid files
	- can read from and overwrite active swap files
	- can overwrite immutable files
	- doesn't update timestamps
	- doesn't obey resource limits
	- doesn't catch overlapping copy ranges to the same file
	- doesn't consistently implement fallback strategies
	- doesn't error out when the source range extends past EOF like
	  the man page says it should
	- isn't consistent with clone file range behaviour
	- inconsistent behaviour between filesystems
	- inconsistent fallback implementations

And so on. There's so much wrong, and I haven't even got to the
problems that the generic fallback code (i.e. do_splice_direct())
has. That's for another day.

So, what this series attempts to do is clean up the code, implement
all the missing checks, provide an infrastructure layout that allows
for consistent behaviour across filesystems and allows filesystems
to control fallback mechanisms and cross-device copies.

I'll repeat that so it's clear: the series also enables cross-device
copies once all the problems are sorted out.

To that end, the current fallback code is moved to
generic_copy_file_range(), and that is called only if the filesystem
does not provide a ->copy_file_range implementation. If the
filesystem provides such a method, it must implement the page cache
copy fallback itself by calling generic_copy_file_range() when
appropriate. I did this because different filesystems have different
copy-offload capabilities and so need to fall back in different
situations. It's easier to have them call generic_copy_file_range()
to do that copy when necessary than it is to have them try to
communicate back up to vfs_copy_file_range() that it should run a
fallback copy.
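
The dispatch rule above can be modelled in userspace roughly as
follows. This is a toy sketch, not kernel code: struct toy_fs, the
toy_* functions and the can_offload flag are all made up for
illustration, and the "copy" just returns the requested length.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Records which path handled the last copy, so the model is observable. */
enum toy_path { TOY_NONE, TOY_OFFLOAD, TOY_GENERIC };
static enum toy_path toy_last_path = TOY_NONE;

/* Toy filesystem: the method pointer is NULL when the fs has no offload. */
struct toy_fs {
	bool can_offload;
	ssize_t (*copy_file_range)(const struct toy_fs *fs, size_t len);
};

/* Stands in for generic_copy_file_range(): the plain data copy. */
static ssize_t toy_generic_copy(size_t len)
{
	toy_last_path = TOY_GENERIC;
	return (ssize_t)len;
}

/* A filesystem method: it decides for itself when to fall back. */
static ssize_t toy_fs_copy(const struct toy_fs *fs, size_t len)
{
	if (!fs->can_offload)
		return toy_generic_copy(len);	/* fs-chosen fallback */
	toy_last_path = TOY_OFFLOAD;
	return (ssize_t)len;
}

/* Stands in for vfs_copy_file_range(): no fallback after the fs method. */
static ssize_t toy_vfs_copy(const struct toy_fs *fs, size_t len)
{
	if (fs->copy_file_range)
		return fs->copy_file_range(fs, len);
	return toy_generic_copy(len);	/* only when no method exists */
}
```

The point of the layout is that the caller never has to interpret a
"please run the fallback" return value from the filesystem method.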

To make all the implementations perform the same validity checks,
I've created a generic_copy_file_checks() which is similar to the
checks we do for clone/dedupe. It's not quite the same, but the core
is very similar. This strips setuid, updates timestamps, checks and
enforces filesystem and resource limits, bounds checks the copy
ranges, etc.
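
As a rough illustration of the bounds checking part only, a userspace
sketch might look like the following. check_copy_ranges() is a
hypothetical name, the same_file flag stands in for comparing the two
inodes, and the error codes chosen here are assumptions rather than a
statement of what the patches return.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical range validation, loosely modelled on the checks
 * described above. Returns 0 if the ranges are acceptable, or a
 * negative errno value otherwise.
 */
static int check_copy_ranges(bool same_file, int64_t pos_in,
			     int64_t pos_out, int64_t len)
{
	if (pos_in < 0 || pos_out < 0 || len < 0)
		return -EINVAL;

	/* offset + length must not wrap past the largest file offset */
	if (pos_in > INT64_MAX - len || pos_out > INT64_MAX - len)
		return -EOVERFLOW;

	/* within one file, source and destination must not overlap */
	if (same_file && pos_in + len > pos_out && pos_out + len > pos_in)
		return -EINVAL;

	return 0;
}
```

Note that adjacent ranges in the same file (e.g. [0, 100) copied to
[100, 200)) pass the overlap test, while [0, 100) copied to [50, 150)
does not.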

This needs to be run before we call ->remap_file_range() so that we
end up with consistent behaviour across copy_file_range() calls.
e.g. we want an XFS filesystem with reflink=1 (i.e. supports
->remap_file_range()) to behave the same as an XFS filesystem with
reflink=0. Hence we need to check all the parameters up front so we
don't end up with calls to ->remap_file_range() resulting in
different behaviour.

It also means that ->copy_file_range implementations only need to
bounds check the input against filesystem internal constraints,
not everything. This makes the filesystem implementations simpler,
and means they can call the fallback generic_copy_file_range()
implementation without having to care about further bounds checking.

I have not changed the fallback behaviour of the CIFS, Ceph or NFS
client implementations. They still reject copy_file_range() to the
same file with EINVAL, even though it is supported by the fallback
and filesystems that implement ->remap_file_range(). I'll leave it
for the maintainers to decide if they want to implement the manual
data copy fallback or not. My personal opinion is that they should
implement the fallback wherever they can, but userspace has to be
prepared for copy_file_range() to fail and so implementing the
fallback is an optional feature.
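
For userspace, "being prepared" could look roughly like the sketch
below: loop on short copies and fall back to read(2)/write(2) when the
syscall refuses the request. The helper names, the 1 MiB request size
and the set of errno values treated as "not supported here" are
assumptions made for illustration, not requirements from this series.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

enum { COPY_CHUNK = 1 << 20 };	/* arbitrary 1 MiB request size */

/* Fallback: copy the rest of fd_in to fd_out with read(2)/write(2). */
static ssize_t copy_rest_rw(int fd_in, int fd_out)
{
	char buf[65536];
	ssize_t total = 0;

	for (;;) {
		ssize_t n = read(fd_in, buf, sizeof(buf));
		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0)
			return -1;
		if (n == 0)
			return total;		/* EOF on the source */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));
			if (w < 0 && errno == EINTR)
				continue;
			if (w < 0)
				return -1;
			off += w;
		}
		total += n;
	}
}

/* Copy fd_in to fd_out until EOF; returns bytes copied, or -1 on error. */
static ssize_t copy_fd(int fd_in, int fd_out)
{
	ssize_t total = 0;

	for (;;) {
		/* Raw syscall, so the sketch does not need the libc wrapper. */
		long n = syscall(SYS_copy_file_range, fd_in, NULL, fd_out,
				 NULL, (size_t)COPY_CHUNK, 0u);
		if (n > 0) {
			total += n;		/* may be short: ask again */
			continue;
		}
		if (n == 0)
			return total;		/* source exhausted */
		if (errno == EINTR)
			continue;
		if (errno == ENOSYS || errno == EXDEV ||
		    errno == EINVAL || errno == EOPNOTSUPP) {
			/* assumed "cannot offload" cases: copy by hand */
			ssize_t rest = copy_rest_rw(fd_in, fd_out);
			return rest < 0 ? -1 : total + rest;
		}
		return -1;
	}
}

/* Convenience wrapper: copy the file at src to dst (created/truncated). */
static ssize_t copy_path(const char *src, const char *dst)
{
	int in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	ssize_t n = copy_fd(in, out);
	close(in);
	if (close(out) < 0)
		n = -1;
	return n;
}
```

A caller would simply use copy_path("a", "b") and treat -1 as failure
with errno set; whether the kernel offloaded the copy is invisible to it.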

In terms of testing, Darrick and I have been beating the hell out of
copy_file_range with fsx on XFS to sort out all the data corruption
problems it has exposed (we're still working on that). Patches have
been posted to enhance fsx and fsstress in fstests to exercise
clone/dedupe/copy_file_range. Thread here:


I've also written a bounds/behaviour exercising test:


I don't know whether I've got all the permission tests right in this
patchset. There's absolutely nothing in the documentation or the code
telling us when we should use file_permission, inode_permission, etc,
so I just added the things that made the tests do the things I think
are the right things to be doing.

To run the tests, you'll also need modifications to xfs_io to allow
it to modify state appropriately. This is something we have
overlooked in the past, and so a lot of xfs_io based behaviour
checking is not actually testing the syscall we thought it was
testing but is instead testing the permission checking of the open()
syscall. Those patches are here:


These changes really need to go in before we merge any more
copy_file_range() features - we need to get the basics right and get
test coverage over it before we unleash things like NFS server-side
copies on unsuspecting users with filesystems that have busted
copy_file_range() implementations.

I'll be appending a man page patch to this series that documents all
the errors this syscall can throw, the expected behaviours, etc. The
test and the man page were written together first, and the
implementation changes were done second. So if you don't agree with
the behaviour, discuss what the man page patch should say and define,
then I'll change the test to reflect that and I'll go from there.