[RFC,v2,0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
mbox series

Message ID 20191114105736.8636-1-lhenriques@suse.com
Headers show
Series
  • ceph: safely use 'copy-from' Op on Octopus OSDs
Related show

Message

Luis Henriques Nov. 14, 2019, 10:57 a.m. UTC
Hi!

So, after the feedback I got from v1 [1] I've sent out a pull-request
for the OSDs [2] which encodes require_osd_release into the OSDMap
client data.  This allows the client to figure out which ceph release
the OSDs cluster is running and decide whether or not it's safe to use
the copy-from Op for copy_file_range.

This new patchset I'm sending simply adds enough functionality to the
kernel client so that it can take advantage of this OSD patch:

0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
       required functionality for enabling SERVER_NAUTILUS in the
       client.  I hope I got the new format right, as I couldn't figure
       out what the hard-coded values (see comments) really mean.

0002 - allows the client to retrieve the new require_osd_release field
       from the OSDMap if available.  This patch also adds SERVER_MIMIC,
       SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
       which TBH I'm not sure if that's a safe thing to do -- the only
       issue I've seen was that Nautilus requires the ability to decode
       TYPE_MSGR2 address, but I may have missed others.

0003 - debug code to add require_osd_release to the osdmap debugfs file.

0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
       if the OSDs are >= Octopus.

Also note that, as suggested by Ilya, I've dropped the patch that would
change the default mount options to 'copyfrom'.

These patches have been tested with the xfstests generic test suite, and
with a couple of other (local) tests that exercise the cephfs
copy_file_range syscall.  I didn't saw any issues, but as I said above,
I'm not really sure if adding the SERVER_* flags to the supported
features have other side effects.

[1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
[2] https://github.com/ceph/ceph/pull/31611

Cheers,
--
Luis

Luis Henriques (4):
  ceph: add support for TYPE_MSGR2 address decode
  ceph: get the require_osd_release field from the osdmap
  ceph: add require_osd_release field to osdmap debugfs
  ceph: add support for sending truncate_{seq,size} in 'copy-from' Op

 fs/ceph/file.c                     | 10 +++++++-
 include/linux/ceph/ceph_features.h | 10 ++++++--
 include/linux/ceph/decode.h        |  3 ++-
 include/linux/ceph/osd_client.h    |  1 +
 include/linux/ceph/osdmap.h        |  1 +
 include/linux/ceph/rados.h         | 23 ++++++++++++++++++
 net/ceph/ceph_strings.c            | 38 ++++++++++++++++++++++++++++++
 net/ceph/debugfs.c                 |  2 ++
 net/ceph/decode.c                  | 33 ++++++++++++++++++++++++--
 net/ceph/osd_client.c              |  7 +++++-
 net/ceph/osdmap.c                  | 21 +++++++++++++++++
 11 files changed, 142 insertions(+), 7 deletions(-)

Comments

Jeff Layton Nov. 14, 2019, 1:15 p.m. UTC | #1
On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> Hi!
> 
> So, after the feedback I got from v1 [1] I've sent out a pull-request
> for the OSDs [2] which encodes require_osd_release into the OSDMap
> client data.  This allows the client to figure out which ceph release
> the OSDs cluster is running and decide whether or not it's safe to use
> the copy-from Op for copy_file_range.
> 
> This new patchset I'm sending simply adds enough functionality to the
> kernel client so that it can take advantage of this OSD patch:
> 
> 0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
>        required functionality for enabling SERVER_NAUTILUS in the
>        client.  I hope I got the new format right, as I couldn't figure
>        out what the hard-coded values (see comments) really mean.
> 

nit: the first 3 patch subject lines should probably be prefixed with
"libceph:"

> 0002 - allows the client to retrieve the new require_osd_release field
>        from the OSDMap if available.  This patch also adds SERVER_MIMIC,
>        SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
>        which TBH I'm not sure if that's a safe thing to do -- the only
>        issue I've seen was that Nautilus requires the ability to decode
>        TYPE_MSGR2 address, but I may have missed others.
> 

Yes, this needs to be done with care. We have to ensure that the server
side isn't assuming that the client supports something that it doesn't.
I think that means just trawling through the code and verifying whether
this is safe.

> 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> 
> 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
>        if the OSDs are >= Octopus.
> 
> Also note that, as suggested by Ilya, I've dropped the patch that would
> change the default mount options to 'copyfrom'.
> 
> These patches have been tested with the xfstests generic test suite, and
> with a couple of other (local) tests that exercise the cephfs
> copy_file_range syscall.  I didn't saw any issues, but as I said above,
> I'm not really sure if adding the SERVER_* flags to the supported
> features have other side effects.
> 
> [1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
> [2] https://github.com/ceph/ceph/pull/31611
> 

I'm just getting caught up on the discussion here, but why was it
decided to do it this way instead of just adding a new OSD
"copy-from-no-truncseq" operation? Once you tried it once and an OSD
didn't support it, you could just give up on using it any longer? That
seems a lot simpler than trying to monkey with feature bits.
Sage Weil Nov. 14, 2019, 1:28 p.m. UTC | #2
On Thu, 14 Nov 2019, Jeff Layton wrote:
> On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > Hi!
> > 
> > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > client data.  This allows the client to figure out which ceph release
> > the OSDs cluster is running and decide whether or not it's safe to use
> > the copy-from Op for copy_file_range.
> > 
> > This new patchset I'm sending simply adds enough functionality to the
> > kernel client so that it can take advantage of this OSD patch:
> > 
> > 0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
> >        required functionality for enabling SERVER_NAUTILUS in the
> >        client.  I hope I got the new format right, as I couldn't figure
> >        out what the hard-coded values (see comments) really mean.
> > 
> 
> nit: the first 3 patch subject lines should probably be prefixed with
> "libceph:"
> 
> > 0002 - allows the client to retrieve the new require_osd_release field
> >        from the OSDMap if available.  This patch also adds SERVER_MIMIC,
> >        SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> >        which TBH I'm not sure if that's a safe thing to do -- the only
> >        issue I've seen was that Nautilus requires the ability to decode
> >        TYPE_MSGR2 address, but I may have missed others.
> > 
> 
> Yes, this needs to be done with care. We have to ensure that the server
> side isn't assuming that the client supports something that it doesn't.
> I think that means just trawling through the code and verifying whether
> this is safe.
> 
> > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> > 
> > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> >        if the OSDs are >= Octopus.
> > 
> > Also note that, as suggested by Ilya, I've dropped the patch that would
> > change the default mount options to 'copyfrom'.
> > 
> > These patches have been tested with the xfstests generic test suite, and
> > with a couple of other (local) tests that exercise the cephfs
> > copy_file_range syscall.  I didn't saw any issues, but as I said above,
> > I'm not really sure if adding the SERVER_* flags to the supported
> > features have other side effects.
> > 
> > [1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
> > [2] https://github.com/ceph/ceph/pull/31611
> > 
> 
> I'm just getting caught up on the discussion here, but why was it
> decided to do it this way instead of just adding a new OSD
> "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> didn't support it, you could just give up on using it any longer? That
> seems a lot simpler than trying to monkey with feature bits.

I don't remember the original discussion either, but in retrospect that 
does seem much simpler--especially since hte client is conditioning 
sending this based on the the require_osd_release.  It seems like passing 
a flag to the copy-from op would be more reasonable instead of conditional 
feature-based behavior.

Greg, do you remember why we ended up here?

sage
Ilya Dryomov Nov. 14, 2019, 2:13 p.m. UTC | #3
On Thu, Nov 14, 2019 at 2:28 PM Sage Weil <sweil@redhat.com> wrote:
>
> On Thu, 14 Nov 2019, Jeff Layton wrote:
> > On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > > Hi!
> > >
> > > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > > client data.  This allows the client to figure out which ceph release
> > > the OSDs cluster is running and decide whether or not it's safe to use
> > > the copy-from Op for copy_file_range.
> > >
> > > This new patchset I'm sending simply adds enough functionality to the
> > > kernel client so that it can take advantage of this OSD patch:
> > >
> > > 0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
> > >        required functionality for enabling SERVER_NAUTILUS in the
> > >        client.  I hope I got the new format right, as I couldn't figure
> > >        out what the hard-coded values (see comments) really mean.
> > >
> >
> > nit: the first 3 patch subject lines should probably be prefixed with
> > "libceph:"
> >
> > > 0002 - allows the client to retrieve the new require_osd_release field
> > >        from the OSDMap if available.  This patch also adds SERVER_MIMIC,
> > >        SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> > >        which TBH I'm not sure if that's a safe thing to do -- the only
> > >        issue I've seen was that Nautilus requires the ability to decode
> > >        TYPE_MSGR2 address, but I may have missed others.
> > >
> >
> > Yes, this needs to be done with care. We have to ensure that the server
> > side isn't assuming that the client supports something that it doesn't.
> > I think that means just trawling through the code and verifying whether
> > this is safe.
> >
> > > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> > >
> > > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> > >        if the OSDs are >= Octopus.
> > >
> > > Also note that, as suggested by Ilya, I've dropped the patch that would
> > > change the default mount options to 'copyfrom'.
> > >
> > > These patches have been tested with the xfstests generic test suite, and
> > > with a couple of other (local) tests that exercise the cephfs
> > > copy_file_range syscall.  I didn't saw any issues, but as I said above,
> > > I'm not really sure if adding the SERVER_* flags to the supported
> > > features have other side effects.
> > >
> > > [1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
> > > [2] https://github.com/ceph/ceph/pull/31611
> > >
> >
> > I'm just getting caught up on the discussion here, but why was it
> > decided to do it this way instead of just adding a new OSD
> > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > didn't support it, you could just give up on using it any longer? That
> > seems a lot simpler than trying to monkey with feature bits.
>
> I don't remember the original discussion either, but in retrospect that
> does seem much simpler--especially since hte client is conditioning
> sending this based on the the require_osd_release.  It seems like passing
> a flag to the copy-from op would be more reasonable instead of conditional
> feature-based behavior.

Yeah, I suggested adding require_osd_release to the client portion just
because we are running into it more and more: Objecter relies on it for
RESEND_ON_SPLIT for example.  It needs to be accessible so that patches
like that can be carried over to the kernel client without workarounds.

copy-from in its existing form is another example.  AFAIU the problem
is that copy-from op doesn't reject unknown flags.  Luis added a flag
in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
nautilus and older releases, potentially leading to data corruption.

Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
copy-from2 op that would behave just like copy-from, but reject unknown
flags to avoid similar compatibility issues in the future is probably
the best thing we can do from the client perspective.

Thanks,

                Ilya
Sage Weil Nov. 14, 2019, 2:17 p.m. UTC | #4
On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > I'm just getting caught up on the discussion here, but why was it
> > > decided to do it this way instead of just adding a new OSD
> > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > didn't support it, you could just give up on using it any longer? That
> > > seems a lot simpler than trying to monkey with feature bits.
> >
> > I don't remember the original discussion either, but in retrospect that
> > does seem much simpler--especially since hte client is conditioning
> > sending this based on the the require_osd_release.  It seems like passing
> > a flag to the copy-from op would be more reasonable instead of conditional
> > feature-based behavior.
> 
> Yeah, I suggested adding require_osd_release to the client portion just
> because we are running into it more and more: Objecter relies on it for
> RESEND_ON_SPLIT for example.  It needs to be accessible so that patches
> like that can be carried over to the kernel client without workarounds.
> 
> copy-from in its existing form is another example.  AFAIU the problem
> is that copy-from op doesn't reject unknown flags.  Luis added a flag
> in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> nautilus and older releases, potentially leading to data corruption.
> 
> Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> copy-from2 op that would behave just like copy-from, but reject unknown
> flags to avoid similar compatibility issues in the future is probably
> the best thing we can do from the client perspective.

Yeah, I think copy-from2 is the best path.  I think that means we should 
revert what we merged to ceph.git a few weeks back, Luis!  Sorry we didn't 
put all the pieces together before...

sage
Luis Henriques Nov. 14, 2019, 3:24 p.m. UTC | #5
On Thu, Nov 14, 2019 at 02:17:44PM +0000, Sage Weil wrote:
> On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > > I'm just getting caught up on the discussion here, but why was it
> > > > decided to do it this way instead of just adding a new OSD
> > > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > > didn't support it, you could just give up on using it any longer? That
> > > > seems a lot simpler than trying to monkey with feature bits.
> > >
> > > I don't remember the original discussion either, but in retrospect that
> > > does seem much simpler--especially since hte client is conditioning
> > > sending this based on the the require_osd_release.  It seems like passing
> > > a flag to the copy-from op would be more reasonable instead of conditional
> > > feature-based behavior.
> > 
> > Yeah, I suggested adding require_osd_release to the client portion just
> > because we are running into it more and more: Objecter relies on it for
> > RESEND_ON_SPLIT for example.  It needs to be accessible so that patches
> > like that can be carried over to the kernel client without workarounds.
> > 
> > copy-from in its existing form is another example.  AFAIU the problem
> > is that copy-from op doesn't reject unknown flags.  Luis added a flag
> > in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> > nautilus and older releases, potentially leading to data corruption.
> > 
> > Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> > copy-from2 op that would behave just like copy-from, but reject unknown
> > flags to avoid similar compatibility issues in the future is probably
> > the best thing we can do from the client perspective.
> 
> Yeah, I think copy-from2 is the best path.  I think that means we should 
> revert what we merged to ceph.git a few weeks back, Luis!  Sorry we didn't 
> put all the pieces together before...

Well, that's an unexpected turn.  I'm not disagreeing with that decision
but since my initial pull request was done one year ago (almost to the
day!), it's a bit disappointing to see that in the end I'm back to
square one :-)

I guess that the PR I mentioned in the cover letter can also be dropped,
as it's not really usable by the kernel client (at least not until it
fully supports all the features up to SERVER_OCTOPUS).

Anyway, thanks everyone for the review.

Cheers,
--
Luís
Ilya Dryomov Nov. 14, 2019, 3:47 p.m. UTC | #6
On Thu, Nov 14, 2019 at 4:24 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> On Thu, Nov 14, 2019 at 02:17:44PM +0000, Sage Weil wrote:
> > On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > > > I'm just getting caught up on the discussion here, but why was it
> > > > > decided to do it this way instead of just adding a new OSD
> > > > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > > > didn't support it, you could just give up on using it any longer? That
> > > > > seems a lot simpler than trying to monkey with feature bits.
> > > >
> > > > I don't remember the original discussion either, but in retrospect that
> > > > does seem much simpler--especially since hte client is conditioning
> > > > sending this based on the the require_osd_release.  It seems like passing
> > > > a flag to the copy-from op would be more reasonable instead of conditional
> > > > feature-based behavior.
> > >
> > > Yeah, I suggested adding require_osd_release to the client portion just
> > > because we are running into it more and more: Objecter relies on it for
> > > RESEND_ON_SPLIT for example.  It needs to be accessible so that patches
> > > like that can be carried over to the kernel client without workarounds.
> > >
> > > copy-from in its existing form is another example.  AFAIU the problem
> > > is that copy-from op doesn't reject unknown flags.  Luis added a flag
> > > in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> > > nautilus and older releases, potentially leading to data corruption.
> > >
> > > Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> > > copy-from2 op that would behave just like copy-from, but reject unknown
> > > flags to avoid similar compatibility issues in the future is probably
> > > the best thing we can do from the client perspective.
> >
> > Yeah, I think copy-from2 is the best path.  I think that means we should
> > revert what we merged to ceph.git a few weeks back, Luis!  Sorry we didn't
> > put all the pieces together before...
>
> Well, that's an unexpected turn.  I'm not disagreeing with that decision
> but since my initial pull request was done one year ago (almost to the
> day!), it's a bit disappointing to see that in the end I'm back to
> square one :-)

Well, I think literally every line from that PR will still go in, just
wrapped in a new OSD op.  Backwards compatibility is hard...

>
> I guess that the PR I mentioned in the cover letter can also be dropped,
> as it's not really usable by the kernel client (at least not until it
> fully supports all the features up to SERVER_OCTOPUS).

No, some form of https://github.com/ceph/ceph/pull/31611 should go in.
I'm pretty certain it will come up at some point in the future, even if
the new field isn't immediately usable today.  Someone porting a change
to the kernel client a couple years from now will thank you for it ;)

Thanks,

                Ilya
Gregory Farnum Nov. 14, 2019, 6:28 p.m. UTC | #7
On Thu, Nov 14, 2019 at 5:28 AM Sage Weil <sweil@redhat.com> wrote:
>
> On Thu, 14 Nov 2019, Jeff Layton wrote:
> > On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > > Hi!
> > >
> > > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > > client data.  This allows the client to figure out which ceph release
> > > the OSDs cluster is running and decide whether or not it's safe to use
> > > the copy-from Op for copy_file_range.
> > >
> > > This new patchset I'm sending simply adds enough functionality to the
> > > kernel client so that it can take advantage of this OSD patch:
> > >
> > > 0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
> > >        required functionality for enabling SERVER_NAUTILUS in the
> > >        client.  I hope I got the new format right, as I couldn't figure
> > >        out what the hard-coded values (see comments) really mean.
> > >
> >
> > nit: the first 3 patch subject lines should probably be prefixed with
> > "libceph:"
> >
> > > 0002 - allows the client to retrieve the new require_osd_release field
> > >        from the OSDMap if available.  This patch also adds SERVER_MIMIC,
> > >        SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> > >        which TBH I'm not sure if that's a safe thing to do -- the only
> > >        issue I've seen was that Nautilus requires the ability to decode
> > >        TYPE_MSGR2 address, but I may have missed others.
> > >
> >
> > Yes, this needs to be done with care. We have to ensure that the server
> > side isn't assuming that the client supports something that it doesn't.
> > I think that means just trawling through the code and verifying whether
> > this is safe.
> >
> > > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> > >
> > > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> > >        if the OSDs are >= Octopus.
> > >
> > > Also note that, as suggested by Ilya, I've dropped the patch that would
> > > change the default mount options to 'copyfrom'.
> > >
> > > These patches have been tested with the xfstests generic test suite, and
> > > with a couple of other (local) tests that exercise the cephfs
> > > copy_file_range syscall.  I didn't saw any issues, but as I said above,
> > > I'm not really sure if adding the SERVER_* flags to the supported
> > > features have other side effects.
> > >
> > > [1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
> > > [2] https://github.com/ceph/ceph/pull/31611
> > >
> >
> > I'm just getting caught up on the discussion here, but why was it
> > decided to do it this way instead of just adding a new OSD
> > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > didn't support it, you could just give up on using it any longer? That
> > seems a lot simpler than trying to monkey with feature bits.
>
> I don't remember the original discussion either, but in retrospect that
> does seem much simpler--especially since hte client is conditioning
> sending this based on the the require_osd_release.  It seems like passing
> a flag to the copy-from op would be more reasonable instead of conditional
> feature-based behavior.
>
> Greg, do you remember why we ended up here?

Well, I can look it up. We discussed it being a different op in
February ("OSD 'copy-from' operation and truncate_seq value") and...it
looks like that conversation ended with that being the plan?

I can't see why it changed in the making though, and everyone seems to
have forgotten about it at the next pass.
-Greg

>
> sage
>