[RFC,v5,0/4] copy_file_range in cephfs kernel client

Message ID: 20181009104806.6821-1-lhenriques@suse.com

Message

Luis Henriques Oct. 9, 2018, 10:48 a.m. UTC
Hi,

finally, here's a new iteration of my copy_file_range patchset.  I've an
extra patch (not included in this RFC) that adds tracepoints to this
syscall.  I wanted to know if that's something we would like to start
including in the kernel cephfs client.  I can prepare a new rev that
includes this extra patch.

And here's the changelog since v4:

- Complete rewrite of the ceph_copy_file_range function.  Now the copy
  loop includes only the remote object copies, while do_splice_direct is
  invoked (at most) twice -- before and after the object copy loop.

- Check sizes after every put/get caps cycle

- Added new checks to ensure the files being copied have the same
  layouts

Changes since v3:

- release/get caps before doing the do_splice_direct calls

- if an error occurs after data has already been copied, return the
  number of bytes already copied instead of the error

- fix oid init/destroy (was broken since a pre-v1 version of the patch)

- always mark Fw as dirty, without FILE_BUFFER

- call file_update_time on destination file

- added an extra mount option (nocopyfrom) which allows an admin to force
  a fallback to VFS copy_file_range

- added some more debug messages

Changes since v2:

- File size checks are now done after we have all the required caps

Here are the main changes since v1, after Zheng's review:

1. ceph_osdc_copy_from() now receives source and destination snapids
   instead of ceph_vino structs

2. Also get FILE_RD capabilities in ceph_copy_file_range() for source
   file as other clients may have dirty data in their cache.

3. Fall back to the VFS copy_file_range default implementation if we're
   copying beyond the source file's EOF

Note that item 2 required an extra patch modifying ceph_try_get_caps()
so that it could perform a non-blocking attempt at getting
CEPH_CAP_FILE_RD capabilities.
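
To illustrate, here's a minimal sketch of how that non-blocking attempt
could look on the caller side (the exact ceph_try_get_caps() signature
and error conventions are reproduced from memory, so treat them as an
approximation):

	int got = 0;
	int ret;

	/* nonblock=true: don't wait for other clients to release caps */
	ret = ceph_try_get_caps(src_ci, CEPH_CAP_FILE_RD, 0,
				true /* nonblock */, &got);
	if (ret <= 0)
		return ret ? ret : -EAGAIN; /* caps not available right now */

	/* ... do the copy while holding the cap references ... */

	ceph_put_cap_refs(src_ci, got);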

And here's the original (v1) RFC cover letter just for reference:

This series is my initial attempt at getting a copy_file_range syscall
implementation in the kernel cephfs client using the 'copy-from' RADOS
operation.

The idea of implementing this came from Greg -- or, at least, he
created a feature in the tracker [1].  I just decided to give it a try
as the feature wasn't assigned to anyone ;-)

I've had this patchset sitting on my laptop for a while already,
waiting for me to revisit it and review some of its TODOs... but I
finally decided to send it out as-is instead, to get some early
feedback.

The first patch implements the copy-from operation in the libceph
module.  Unfortunately, the documentation for this operation is
nonexistent and I had to do a lot of digging to figure out the details
(and I probably missed something!).  For example, initially I was
hoping that this operation could be used to copy more than one object
at a time.  Doing an OSD request per object copy is not ideal, but
unfortunately it seems to be the only way.  Anyway, my expectation is
that this new operation will be useful for other features in the future.
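
For reference, the new helper's prototype looks roughly like this (a
sketch reproduced from memory -- the exact parameter list in the series
may differ):

	/*
	 * One OSD request copies exactly one object from src to dst,
	 * so callers have to loop over objects themselves.
	 */
	int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
				u64 src_snapid, u64 src_version,
				struct ceph_object_id *src_oid,
				struct ceph_object_locator *src_oloc,
				u32 src_fadvise_flags,
				struct ceph_object_id *dst_oid,
				struct ceph_object_locator *dst_oloc,
				u32 dst_fadvise_flags);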

The 2nd patch is where copy_file_range is implemented and could
probably be optimised, but I didn't bother with that for now.  The
important bit is that we may still need to do some manual copies if the
offsets aren't object aligned or if the length is smaller than the
object size.  I'm using do_splice_direct() for the manual copies as it
was the easiest way to get a PoC running, but maybe there are better
ways.
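
To make the overall flow concrete, here's a hedged outline of the
resulting function (error handling and offset/length bookkeeping are
omitted, and everything except do_splice_direct() -- object_size,
head_len -- is made up for illustration, assumed computed from the
file layout):

	static ssize_t ceph_copy_file_range(struct file *src, loff_t src_off,
					    struct file *dst, loff_t dst_off,
					    size_t len, unsigned int flags)
	{
		ssize_t copied = 0;

		/* 1. head: splice manually up to the next object boundary */
		if (src_off & (object_size - 1))
			copied += do_splice_direct(src, &src_off, dst,
						   &dst_off, head_len, flags);

		/* 2. middle: one remote 'copy-from' request per full object */
		while (len >= object_size) {
			/* ceph_osdc_copy_from(...), as sketched above */
			len -= object_size;
			copied += object_size;
		}

		/* 3. tail: splice whatever is left (< object_size) */
		if (len)
			copied += do_splice_direct(src, &src_off, dst,
						   &dst_off, len, flags);

		return copied;
	}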

I've done some functional testing on this PoC, and it also passes the
generic xfstests suite, in particular the copy_file_range-specific
tests (430-434).  But I haven't done any benchmarks to measure
performance changes from using this syscall.

Any feedback is welcome, especially regarding the TODOs in the code.

[1] https://tracker.ceph.com/issues/21944

Luis Henriques (4):
  ceph: add non-blocking parameter to ceph_try_get_caps()
  ceph: support the RADOS copy-from operation
  ceph: support copy_file_range file operation
  ceph: new mount option to disable usage of RADOS 'copy-from' op

 Documentation/filesystems/ceph.txt |   5 +
 fs/ceph/addr.c                     |   2 +-
 fs/ceph/caps.c                     |   7 +-
 fs/ceph/file.c                     | 302 ++++++++++++++++++++++++++++-
 fs/ceph/super.c                    |  13 ++
 fs/ceph/super.h                    |   3 +-
 include/linux/ceph/osd_client.h    |  17 ++
 include/linux/ceph/rados.h         |  19 ++
 net/ceph/osd_client.c              |  72 +++++++
 9 files changed, 434 insertions(+), 6 deletions(-)

Comments

Ilya Dryomov Oct. 9, 2018, 3:36 p.m. UTC | #1
On Tue, Oct 9, 2018 at 12:47 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> Hi,
>
> finally, here's a new iteration of my copy_file_range patchset.  I've an
> extra patch (not included in this RFC) that adds tracepoints to this
> syscall.  I wanted to know if that's something we would like to start
> including in the kernel cephfs client.  I can prepare a new rev that
> includes this extra patch.

No, if we start introducing tracepoints, it'll have to be throughout
libceph, rbd and ceph, replacing some of the douts.  Tracepoints in
some places and douts in other places is a no-go.  On top of that there
is the whole tracepoint ABI stability mess, although it's less of an
issue for individual filesystems...

In any case it doesn't belong in this series.

>
> And here's the changelog since v4:
>
> - Complete rewrite of the ceph_copy_file_range function.  Now the copy
>   loop includes only the remote object copies, while do_splice_direct is
>   invoked (at most) twice -- before and after the object copy loop.
>
> - Check sizes after every put/get caps cycle
>
> - Added new checks to ensure the files being copied have the same
>   layouts
>
> Changes since v3:
>
> - release/get caps before doing the do_splice_direct calls
>
> - if an error occurs after data has already been copied, return the
>   number of bytes already copied instead of the error
>
> - fix oid init/destroy (was broken since a pre-v1 version of the patch)

CephFS object names are bounded, so it uses a non-allocating version,
ceph_oid_printf().  Calling ceph_oid_destroy() is not necessary.
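
For context, a minimal sketch of the pattern being described here (the
format string and variables are illustrative):

	struct ceph_oid oid;

	ceph_oid_init(&oid);
	/*
	 * The name fits in the fixed inline buffer, so this never
	 * allocates and there is nothing to free afterwards.
	 */
	ceph_oid_printf(&oid, "%llx.%08llx", ino, object_index);
	/* ... use oid in the OSD request; no ceph_oid_destroy() ... */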

Thanks,

                Ilya
Luis Henriques Oct. 9, 2018, 5:35 p.m. UTC | #2
Ilya Dryomov <idryomov@gmail.com> writes:

> On Tue, Oct 9, 2018 at 12:47 PM Luis Henriques <lhenriques@suse.com> wrote:
>>
>> Hi,
>>
>> finally, here's a new iteration of my copy_file_range patchset.  I've an
>> extra patch (not included in this RFC) that adds tracepoints to this
>> syscall.  I wanted to know if that's something we would like to start
>> including in the kernel cephfs client.  I can prepare a new rev that
>> includes this extra patch.
>
> No, if we start introducing tracepoints, it'll have to be throughout
> libceph, rbd and ceph, replacing some of the douts.  Tracepoints in
> some places and douts in other places is a no-go.  On top of that there
> is the whole tracepoint ABI stability mess, although it's less of an
> issue for individual filesystems...
>
> In any case it doesn't belong in this series.
>

First of all, thanks a lot for your time reviewing this patchset.  I've
skimmed through your comments and I believe they all make perfect
sense.  I'll go through all of them and prepare a new revision in the
next few days.

[ I still need to revisit those Op flags as I don't understand why I
  assumed they would both be using CEPH_OSD_OP_FLAG_FADVISE_*.  I
  thought I saw a similar usage in the ceph code, but a quick grep
  didn't show anything. ]

Regarding tracepoints, I agree that having both dynamic debug (dout) and
tracepoints isn't a great idea.  My preference would be to move to
tracepoints but it would take a while to visit all the douts and come up
with a set of patches that would convert the relevant ones to
tracepoints (I'm sure there would be a lot of douts that could actually
be dropped).
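
[ Purely illustrative: a dout-to-tracepoint conversion could look
  something like the sketch below -- event name and fields invented,
  and the usual CREATE_TRACE_POINTS/include boilerplate omitted. ]

	TRACE_EVENT(ceph_copy_file_range,
		TP_PROTO(u64 ino, loff_t off, size_t len),
		TP_ARGS(ino, off, len),
		TP_STRUCT__entry(
			__field(u64, ino)
			__field(loff_t, off)
			__field(size_t, len)
		),
		TP_fast_assign(
			__entry->ino = ino;
			__entry->off = off;
			__entry->len = len;
		),
		TP_printk("ino=0x%llx off=%lld len=%zu", __entry->ino,
			  (long long)__entry->off, __entry->len)
	);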

Anyway, do you think it's worth opening a feature request so that some
day this could be done?  Or would you rather continue using dynamic
debugging only?

Cheers,
Ilya Dryomov Oct. 10, 2018, 11:29 a.m. UTC | #3
On Tue, Oct 9, 2018 at 7:34 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> Ilya Dryomov <idryomov@gmail.com> writes:
>
> > On Tue, Oct 9, 2018 at 12:47 PM Luis Henriques <lhenriques@suse.com> wrote:
> >>
> >> Hi,
> >>
> >> finally, here's a new iteration of my copy_file_range patchset.  I've an
> >> extra patch (not included in this RFC) that adds tracepoints to this
> >> syscall.  I wanted to know if that's something we would like to start
> >> including in the kernel cephfs client.  I can prepare a new rev that
> >> includes this extra patch.
> >
> > No, if we start introducing tracepoints, it'll have to be throughout
> > libceph, rbd and ceph, replacing some of the douts.  Tracepoints in
> > some places and douts in other places is a no-go.  On top of that there
> > is the whole tracepoint ABI stability mess, although it's less of an
> > issue for individual filesystems...
> >
> > In any case it doesn't belong in this series.
> >
>
> First of all, thanks a lot for your time reviewing this patchset.  I've
> skimmed through your comments and I believe they all make perfect
> sense.  I'll go through all of them and prepare a new revision in the
> next few days.

I only looked at the net/ceph part.

>
> [ I still need to revisit those Op flags as I don't understand why I
>   assumed they would both be using CEPH_OSD_OP_FLAG_FADVISE_*.  I
>   thought I saw a similar usage in the ceph code, but a quick grep
>   didn't show anything. ]
>
> Regarding tracepoints, I agree that having both dynamic debug (dout) and
> tracepoints isn't a great idea.  My preference would be to move to
> tracepoints but it would take a while to visit all the douts and come up
> with a set of patches that would convert the relevant ones to
> tracepoints (I'm sure there would be a lot of douts that could actually
> be dropped).
>
> Anyway, do you think it's worth opening a feature request so that some
> day this could be done?  Or would you rather continue using dynamic
> debugging only?

There is an old ticket for this - http://tracker.ceph.com/issues/2374.

With the addition of eBPF, tracepoints became a lot more useful, so
I'm all for a carefully designed initial set of tracepoints.

Thanks,

                Ilya
Luis Henriques Oct. 11, 2018, 8:40 a.m. UTC | #4
Ilya Dryomov <idryomov@gmail.com> writes:

> On Tue, Oct 9, 2018 at 7:34 PM Luis Henriques <lhenriques@suse.com> wrote:
>>
>> Ilya Dryomov <idryomov@gmail.com> writes:
>>
>> > On Tue, Oct 9, 2018 at 12:47 PM Luis Henriques <lhenriques@suse.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> finally, here's a new iteration of my copy_file_range patchset.  I've an
>> >> extra patch (not included in this RFC) that adds tracepoints to this
>> >> syscall.  I wanted to know if that's something we would like to start
>> >> including in the kernel cephfs client.  I can prepare a new rev that
>> >> includes this extra patch.
>> >
>> > No, if we start introducing tracepoints, it'll have to be throughout
>> > libceph, rbd and ceph, replacing some of the douts.  Tracepoints in
>> > some places and douts in other places is a no-go.  On top of that there
>> > is the whole tracepoint ABI stability mess, although it's less of an
>> > issue for individual filesystems...
>> >
>> > In any case it doesn't belong in this series.
>> >
>>
>> First of all, thanks a lot for your time reviewing this patchset.  I've
>> skimmed through your comments and I believe they all make perfect
>> sense.  I'll go through all of them and prepare a new revision in the
>> next few days.
>
> I only looked at the net/ceph part.
>
>>
>> [ I still need to revisit those Op flags as I don't understand why I
>>   assumed they would both be using CEPH_OSD_OP_FLAG_FADVISE_*.  I
>>   thought I saw a similar usage in the ceph code, but a quick grep
>>   didn't show anything. ]
>>
>> Regarding tracepoints, I agree that having both dynamic debug (dout) and
>> tracepoints isn't a great idea.  My preference would be to move to
>> tracepoints but it would take a while to visit all the douts and come up
>> with a set of patches that would convert the relevant ones to
>> tracepoints (I'm sure there would be a lot of douts that could actually
>> be dropped).
>>
>> Anyway, do you think it's worth opening a feature request so that some
>> day this could be done?  Or would you rather continue using dynamic
>> debugging only?
>
> There is an old ticket for this - http://tracker.ceph.com/issues/2374.
>
> With the addition of eBPF, tracepoints became a lot more useful, so
> I'm all for a carefully designed initial set of tracepoints.

Ah, awesome!  I've added that ticket to my "watch" list ;-)

Cheers,
Luis Henriques Oct. 12, 2018, 2 p.m. UTC | #5
Luis Henriques <lhenriques@suse.com> writes:

> Ilya Dryomov <idryomov@gmail.com> writes:

<snip>

> [ I still need to revisit those Op flags as I don't understand why I
>   assumed they would both be using CEPH_OSD_OP_FLAG_FADVISE_*.  I
>   thought I saw a similar usage in the ceph code, but a quick grep
>   didn't show anything. ]

Ok, I've taken another look at this and I believe do_copy() (in
src/tools/rados/rados.cc) is what really confused me about these flags.
That code seems to be using the src/dst fadvise flags for the operation
too.
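
[ For context, a hedged illustration of the distinction on the kernel
  side: the source-side fadvise flags travel inside the copy-from op
  itself, while the destination-side flags are the generic per-op
  flags.  Field names below are an approximation of what the series
  uses, not a verified excerpt. ]

	/* src side: carried in the copy-from op payload */
	op->copy_from.src_fadvise_flags = src_fadvise_flags;
	/* dst side: the plain per-op flags on the request */
	op->flags = dst_fadvise_flags;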

I've created a PR to fix it:

https://github.com/ceph/ceph/pull/24561

Cheers,