[v3,0/3] copy_file_range in cephfs kernel client

Message ID 20180906160620.16277-1-lhenriques@suse.com (mailing list archive)

Message

Luis Henriques Sept. 6, 2018, 4:06 p.m. UTC
Changes since v2:

- File size checks are now done after we have all the required caps

Here are the main changes since v1, after Zheng's review:

1. ceph_osdc_copy_from() now receives source and destination snapids
   instead of ceph_vino structs

2. Also get FILE_RD capabilities in ceph_copy_file_range() for source
   file as other clients may have dirty data in their cache.

3. Fall back to the default VFS copy_file_range implementation if we're
   copying beyond the source file's EOF

Note that 2. required an extra patch modifying ceph_try_get_caps() so
that it could perform a non-blocking attempt at getting CEPH_CAP_FILE_RD
capabilities.

And here's the original (v1) RFC cover letter just for reference:

This series is my initial attempt at getting a copy_file_range syscall
implementation in the kernel cephfs client using the 'copy-from' RADOS
operation.

The idea of getting this implemented was from Greg -- or, at least, he
created a feature in the tracker [1].  I just decided to give it a try
as the feature wasn't assigned to anyone ;-)

I have had this patchset sitting on my laptop for a while now, waiting
for me to revisit it and review some of its TODOs, but I finally decided
to send it out as-is to get some early feedback.

The first patch implements the copy-from operation in the libceph
module.  Unfortunately, the documentation for this operation is
nonexistent and I had to do a lot of digging to figure out the details
(and I probably missed something!).  For example, initially I was
hoping that this operation could be used to copy more than one object at
a time.  Doing an OSD request per object copy is not ideal, but
unfortunately it seems to be the only way.  Anyway, my expectations are
that this new operation will be useful for other features in the future.

The 2nd patch is where copy_file_range itself is implemented; it could
probably be optimised, but I didn't bother with that for now.  The
important bit is that we still may need to do some manual copies if the
offsets aren't object aligned or if the length is smaller than the
object size.  I'm using do_splice_direct() for the manual copies as it
was the easiest way to get a PoC running, but maybe there are better
ways.

I've done some functional testing on this PoC, and it also passes the
generic xfstests suite, in particular the copy_file_range-specific tests
(430-434).  But I haven't done any benchmarks to measure any performance
changes in using this syscall.

Any feedback is welcome, especially regarding the TODOs in the code.

[1] https://tracker.ceph.com/issues/21944


Luis Henriques (3):
  ceph: add non-blocking parameter to ceph_try_get_caps()
  ceph: support the RADOS copy-from operation
  ceph: support copy_file_range file operation

 fs/ceph/addr.c                  |   2 +-
 fs/ceph/caps.c                  |   7 +-
 fs/ceph/file.c                  | 221 ++++++++++++++++++++++++++++++++
 fs/ceph/super.h                 |   2 +-
 include/linux/ceph/osd_client.h |  17 +++
 include/linux/ceph/rados.h      |  19 +++
 net/ceph/osd_client.c           |  72 +++++++++++
 7 files changed, 335 insertions(+), 5 deletions(-)

Comments

Gregory Farnum Sept. 6, 2018, 8:36 p.m. UTC | #1
I don't have much useful to say here (unless Zheng wants me to look
carefully at the use of copy-get), but I'm excited to see this getting
done! :)

One thing I will note is that it might be a good idea if at least one
of the system admin or the Ceph cluster admin can disable this
behavior, just in case bugs turn up with the copy-from op. I don't
expect any as this is a pretty friendly use-case for it (protected by
FS caps, hurray!) but it is the first non-cache-tiering user that will
turn up in the wild I'm aware of.
-Greg

Luis Henriques Sept. 7, 2018, 9:11 a.m. UTC | #2
Gregory Farnum <gfarnum@redhat.com> writes:

> I don't have much useful to say here (unless Zheng wants me to look
> carefully at the use of copy-get), but I'm excited to see this getting
> done! :)
>
> One thing I will note is that it might be a good idea if at least one
> of the system admin or the Ceph cluster admin can disable this
> behavior, just in case bugs turn up with the copy-from op. I don't
> expect any as this is a pretty friendly use-case for it (protected by
> FS caps, hurray!) but it is the first non-cache-tiering user that will
> turn up in the wild I'm aware of.

If I understand you correctly, you're talking about adding a knob so
that the OSDs could simply return an error when requested to perform a
'copy-from' op, right?  That shouldn't be too difficult, I believe.

Returning an error in PrimaryLogPG::do_osd_ops if that knob is set
should do the job, but I really don't feel comfortable touching the OSD
code.  If you really think this is something useful, I can try to spend
some time looking into that, but it will take me some time :)

Cheers,
Gregory Farnum Sept. 7, 2018, 6:40 p.m. UTC | #3
On Fri, Sep 7, 2018 at 2:11 AM, Luis Henriques <lhenriques@suse.com> wrote:
> Gregory Farnum <gfarnum@redhat.com> writes:
>
>> I don't have much useful to say here (unless Zheng wants me to look
>> carefully at the use of copy-get), but I'm excited to see this getting
>> done! :)
>>
>> One thing I will note is that it might be a good idea if at least one
>> of the system admin or the Ceph cluster admin can disable this
>> behavior, just in case bugs turn up with the copy-from op. I don't
>> expect any as this is a pretty friendly use-case for it (protected by
>> FS caps, hurray!) but it is the first non-cache-tiering user that will
>> turn up in the wild I'm aware of.
>
> If I understand you correctly, you're talking about adding a knob so
> that the OSDs could simply return an error when requested to perform a
> 'copy-from' op, right?  That shouldn't be too difficult, I believe.

Oh heavens no. I mean a switch on the client side to not invoke the
copy-from op at all. Probably that would just mean doing a full read
of the source and write of the destination, but perhaps you could also
just unhook the copy_file_range implementation from the VFS?

I would assume this is pretty trivial but if it's complicated then
don't worry about it — I bring it up mostly because Josh thought there
were some outstanding issues with this RADOS op but on deeper
inspection he was wrong and conflating some other things. :)
-Greg

Luis Henriques Sept. 10, 2018, 9:30 a.m. UTC | #4
Gregory Farnum <gfarnum@redhat.com> writes:

> Oh heavens no. I mean a switch on the client side to not invoke the
> copy-from op at all. Probably that would just mean doing a full read
> of the source and write of the destination, but perhaps you could also
> just unhook the copy_file_range implementation from the VFS?
>
> I would assume this is pretty trivial but if it's complicated then
> don't worry about it — I bring it up mostly because Josh thought there
> were some outstanding issues with this RADOS op but on deeper
> inspection he was wrong and conflating some other things. :)

Ah, sure!  That could be done.  I guess that a new mount option
('nocopyfrom'?) could be added, which would cause the ceph
copy_file_range implementation to simply fall back to the default VFS
implementation (i.e. do a do_splice_direct).  I'll suggest something in
an extra patch in v4 of this series.  Thanks!

Cheers,