
RADOS EC: is it okay to reduce the number of commits required for reply to client?

Message ID CAPHfcngzug4HDzHtZcb80xdf-NZFNjc2Q7r3vJWSHJufjcgKKQ@mail.gmail.com (mailing list archive)
State New, archived
Series RADOS EC: is it okay to reduce the number of commits required for reply to client?

Commit Message

Alex Xu Sept. 19, 2019, 7:05 a.m. UTC
Hi Cephers,

We are testing the write performance of Ceph EC (Luminous, 8 + 4) and
noticed that the tail latency is extremely high. Say, the avgtime of the
10th commit is 40ms, which is acceptable for an all-HDD cluster; the 11th
is 80ms, doubled; the 12th is 160ms, doubled again, which is not so good.
So we made a small modification and tested again, and got a much better
result. The patch is quite simple (for testing only, of course):

--- a/src/osd/ECBackend.cc
+++ b/src/osd/ECBackend.cc
@@ -1188,7 +1188,7 @@ void ECBackend::handle_sub_write_reply(
     i->second.on_all_applied = 0;
     i->second.trace.event("ec write all applied");
   }
-  if (i->second.pending_commit.empty() && i->second.on_all_commit) {
+  if (i->second.pending_commit.size() == 2 && i->second.on_all_commit) {  // 8 + 4 - 10 = 2
     dout(10) << __func__ << " Calling on_all_commit on " << i->second << dendl;
     i->second.on_all_commit->complete(0);
     i->second.on_all_commit = 0;
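
To make the "8 + 4 - 10 = 2" comment concrete, here is a minimal
standalone sketch of what the changed condition means (not Ceph code;
plain ints stand in for the shard ids): the stock code acks the client
only once pending_commit is empty, i.e. all 12 sub-writes have
committed, while the patched code acks as soon as 10 of the 12 have
committed.

#include <cassert>
#include <set>

int main() {
  // 12 shards for an 8 + 4 pool; pending_commit starts with every shard.
  std::set<int> pending_commit = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};

  // Suppose sub-write replies have arrived from 10 of the 12 shards.
  for (int shard = 0; shard < 10; ++shard)
    pending_commit.erase(shard);

  assert(!pending_commit.empty());     // stock condition: would not ack yet
  assert(pending_commit.size() == 2);  // patched condition: acks now
  return 0;
}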

As far as I can see, everything still works well (maybe because of the
rwlock in the primary OSD? not sure though), but I'm afraid it might
break data consistency in ways I'm not aware of. So I'm writing to ask
if someone could kindly provide expert comments on this, or maybe share
any known drawbacks. Thank you!

PS: The OSDs are backed by FileStore, not BlueStore, if that matters.

Regards,
Alex

Comments

Gregory Farnum Sept. 25, 2019, 9 p.m. UTC | #1
On Thu, Sep 19, 2019 at 12:06 AM Alex Xu <alexu4993@gmail.com> wrote:
>
> Hi Cephers,
>
> We are testing the write performance of Ceph EC (Luminous, 8 + 4) and
> noticed that the tail latency is extremely high. Say, the avgtime of the
> 10th commit is 40ms, which is acceptable for an all-HDD cluster; the 11th
> is 80ms, doubled; the 12th is 160ms, doubled again, which is not so good.
> So we made a small modification and tested again, and got a much better
> result. The patch is quite simple (for testing only, of course):
>
> --- a/src/osd/ECBackend.cc
> +++ b/src/osd/ECBackend.cc
> @@ -1188,7 +1188,7 @@ void ECBackend::handle_sub_write_reply(
>      i->second.on_all_applied = 0;
>      i->second.trace.event("ec write all applied");
>    }
> -  if (i->second.pending_commit.empty() && i->second.on_all_commit) {
> +  if (i->second.pending_commit.size() == 2 && i->second.on_all_commit) {  // 8 + 4 - 10 = 2
>      dout(10) << __func__ << " Calling on_all_commit on " << i->second << dendl;
>      i->second.on_all_commit->complete(0);
>      i->second.on_all_commit = 0;
>
> As far as I can see, everything still works well (maybe because of the
> rwlock in the primary OSD? not sure though), but I'm afraid it might
> break data consistency in ways I'm not aware of. So I'm writing to ask
> if someone could kindly provide expert comments on this, or maybe share
> any known drawbacks. Thank you!

Unfortunately this is one of those things that will work okay in
everyday use but fail catastrophically if something else goes wrong.

Ceph assumes throughout its codebase that if a write was committed to
disk, it can be retrieved as long as k OSDs are available from the
PG's acting set when that write was committed. But by letting writes
"commit" on fewer than the full acting set, you might lose some of the
OSDs which *did* ack the write and no longer be able to find out what
it was, even though you have >k OSDs available!
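
As a minimal numeric sketch of that failure mode (the counts here are
hypothetical, assuming an 8 + 4 pool where the patched primary replies
after 10 commits and 3 of those 10 OSDs are lost afterwards):

#include <iostream>

int main() {
  const int k = 8, m = 4;            // 8 data + 4 coding shards
  const int acting = k + m;          // 12 OSDs in the acting set
  const int acked_after = 10;        // patched primary replied after 10 commits
  const int failed = 3;              // hypothetical: 3 of those 10 OSDs die later

  int osds_up     = acting - failed;       // 9 >= k, so recovery looks possible
  int shards_left = acked_after - failed;  // only 7 < k shards of the acked write survive

  std::cout << "OSDs still up: " << osds_up << " (>= k)\n"
            << "Shards of the acked write: " << shards_left << " (< k)\n";
  return 0;
}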

This will result in a very, very confused Ceph recovery algorithm,
lots of badness, and unhappy cluster users. :(
-Greg

>
> PS: The OSDs are backed by FileStore, not BlueStore, if that matters.
>
> Regards,
> Alex

Patch

--- a/src/osd/ECBackend.cc
+++ b/src/osd/ECBackend.cc
@@ -1188,7 +1188,7 @@ void ECBackend::handle_sub_write_reply(
     i->second.on_all_applied = 0;
     i->second.trace.event("ec write all applied");
   }
-  if (i->second.pending_commit.empty() && i->second.on_all_commit) {
+  if (i->second.pending_commit.size() == 2 && i->second.on_all_commit) {  // 8 + 4 - 10 = 2
     dout(10) << __func__ << " Calling on_all_commit on " << i->second << dendl;
     i->second.on_all_commit->complete(0);
     i->second.on_all_commit = 0;