
bug #10915 client: hangs on umount if it had an MDS session evicted

Message ID CA+jMu4S5waxnQLOjbHesO64hUU2n=GknqLE2uXdzDGhwq+w6rA@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Rishabh Dave March 21, 2018, 11:09 a.m. UTC
Hi,

I am trying to fix the bug - http://tracker.ceph.com/issues/10915.
Patrick helped me get started with it. I was able to reproduce it
locally on a vstart cluster, and I am currently trying to fix it by
getting the client unmounted on eviction. Once I can do this, I will
(as Patrick suggested) add a new option like
"client_unmount_on_blacklist" and modify my code accordingly.

The information about the blacklist seems to be available to the
client code [1] but, as far as I can see, that line (i.e. the if-block
containing it) never gets executed. The MDS blacklists the client and
evicts the session, but the client fails to notice that. I suppose this
mismatch is what causes it to hang.

The reason the client fails to notice is that it never actually
looks at the blacklist after the session is evicted --
handle_osd_map() never gets called after MDSRank::evict_session() is
called. I did write a patch that makes the client check for its
address in the blacklist by calling a (new) function from
ms_handle_reset(), but it did not help. It looks like not only does
the client not check the blacklist, but even if it did, it would find
an outdated copy.

To verify this, I wrote some debug code that iterates over and prints
the blacklist both towards the end of MDSRank::evict_session() and
right after it. The blacklist turned out to be empty in both
locations. Shouldn't the blacklist be updated at least in, or right
after, MDSRank::evict_session()? I think that before fixing the
client, I need some sort of fix somewhere here [2].

And how can I get a stacktrace for commands like "bin/ceph tell mds.a
client evict id=xxxx"?

Also, I have attached a patch containing the modifications I have used so far.

Thanks,
Rishabh

[1] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L2420
[2] https://github.com/ceph/ceph/blob/master/src/mds/MDSRank.cc#L2737

Comments

John Spray March 22, 2018, 12:10 p.m. UTC | #1
On Wed, Mar 21, 2018 at 7:09 PM, Rishabh Dave <ridave@redhat.com> wrote:
> Hi,
>
> I am trying to fix the bug - http://tracker.ceph.com/issues/10915.
> Patrick helped me get started with it. I was able to reproduce it
> locally on a vstart cluster, and I am currently trying to fix it by
> getting the client unmounted on eviction. Once I can do this, I will
> (as Patrick suggested) add a new option like
> "client_unmount_on_blacklist" and modify my code accordingly.
>
> The information about the blacklist seems to be available to the
> client code [1] but, as far as I can see, that line (i.e. the if-block
> containing it) never gets executed. The MDS blacklists the client and
> evicts the session, but the client fails to notice that. I suppose this
> mismatch is what causes it to hang.
>
> The reason the client fails to notice is that it never actually
> looks at the blacklist after the session is evicted --
> handle_osd_map() never gets called after MDSRank::evict_session() is
> called. I did write a patch that makes the client check for its
> address in the blacklist by calling a (new) function from
> ms_handle_reset(), but it did not help. It looks like not only does
> the client not check the blacklist, but even if it did, it would find
> an outdated copy.
>
> To verify this, I wrote some debug code that iterates over and prints
> the blacklist both towards the end of MDSRank::evict_session() and
> right after it. The blacklist turned out to be empty in both
> locations. Shouldn't the blacklist be updated at least in, or right
> after, MDSRank::evict_session()? I think that before fixing the
> client, I need some sort of fix somewhere here [2].

The client only gets osdmap updates when it tries to communicate with
an OSD, and the OSD tells it that its current map epoch is too old.

In the case that the client isn't doing any data operations (i.e. no
osd ops), then the client doesn't find out that it's blacklisted.  But
that's okay, because the client's awareness of its own
blacklisted-ness should only be needed in the case that there is some
dirty data that needs to be thrown away in the special if(blacklisted)
paths.
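
(For illustration, a minimal sketch of how the client could consult its
*cached* OSDMap for its own address -- the helper name is hypothetical
and this is not upstream code. The point is that it only sees whatever
map epoch the client already has, which is exactly why the check can be
stale after an eviction:)

  // Hypothetical helper, not upstream code: look up the client's own
  // address in the OSDMap epoch the Objecter currently holds.  If no
  // newer map has been received since the eviction, this still reports
  // "not blacklisted".
  bool Client::is_addr_blacklisted_in_cached_map()
  {
    const entity_addr_t myaddr = messenger->get_myaddr();
    return objecter->with_osdmap([&](const OSDMap& o) {
      return o.is_blacklisted(myaddr);
    });
  }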

So if it's not hanging on any OSD operations (those operations would
have resulted in an updated osdmap), the question is what is it
hanging on?  Is it trying to open a new session with the MDS?

John

> And how can I get a stacktrace for commands like "bin/ceph tell mds.a
> client evict id=xxxx"?
>
> Also, I have attached a patch containing the modifications I have used so far.
>
> Thanks,
> Rishabh
>
> [1] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L2420
> [2] https://github.com/ceph/ceph/blob/master/src/mds/MDSRank.cc#L2737
Rishabh Dave March 27, 2018, 1:24 p.m. UTC | #2
On 22 March 2018 at 17:40, John Spray <jspray@redhat.com> wrote:
> The client only gets osdmap updates when it tries to communicate with
> an OSD, and the OSD tells it that its current map epoch is too old.
>
> In the case that the client isn't doing any data operations (i.e. no
> osd ops), then the client doesn't find out that it's blacklisted.  But
> that's okay, because the client's awareness of its own
> blacklisted-ness should only be needed in the case that there is some
> dirty data that needs to be thrown away in the special if(blacklisted)
> paths.
>
> So if it's not hanging on any OSD operations (those operations would
> have resulted in an updated osdmap), the question is what is it
> hanging on?  Is it trying to open a new session with the MDS?

It looks like the client still "thinks" that it has a session open
(since the condition at [1] was false when I checked it myself) and
then waits for a reply [2]. This is exactly where it hangs. I have
written a fix and raised a PR for it [3]. Basically, it replaces
caller_cond.Wait(client_lock) with
caller_cond.WaitInterval(client_lock, utime_t(10, 0)).
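
For context, a rough sketch of what that change amounts to. The
surrounding loop is simplified here; the real wait loop in
Client::make_request() checks more conditions:

  // Simplified sketch of the wait in Client::make_request(); the real
  // loop also checks for forwards and kicks.
  while (!request->reply) {
    // Old behaviour: block until explicitly signalled.
    //   caller_cond.Wait(client_lock);
    // Proposed behaviour: wake up every 10 seconds so the client gets a
    // chance to notice that its session has been evicted/blacklisted.
    caller_cond.WaitInterval(client_lock, utime_t(10, 0));
  }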

By the way, would it be correct for the client to realize that it is
blacklisted here [4]? When I checked, that wasn't the case -- the
(copy of the) blacklist didn't have the client's address.

Also, I think it would be better if an evicted/blacklisted client
received some sort of reply from the MDS on making a request (like
here [2] in this bug's case) conveying that it can't access CephFS
anymore. I don't know if this would be appropriate, but it would keep
the client from waiting infinitely. Though this might be risky,
considering that the client could then flood the MDS with requests; in
that case, maybe the MDS should send such a reply to the client only
once.

[1] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L1689
[2] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L1719
[3] https://github.com/ceph/ceph/pull/21065
[4] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L1658
[5] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L2410
John Spray March 27, 2018, 1:43 p.m. UTC | #3
On Tue, Mar 27, 2018 at 2:24 PM, Rishabh Dave <ridave@redhat.com> wrote:
> On 22 March 2018 at 17:40, John Spray <jspray@redhat.com> wrote:
>> The client only gets osdmap updates when it tries to communicate with
>> an OSD, and the OSD tells it that its current map epoch is too old.
>>
>> In the case that the client isn't doing any data operations (i.e. no
>> osd ops), then the client doesn't find out that it's blacklisted.  But
>> that's okay, because the client's awareness of its own
>> blacklisted-ness should only be needed in the case that there is some
>> dirty data that needs to be thrown away in the special if(blacklisted)
>> paths.
>>
>> So if it's not hanging on any OSD operations (those operations would
>> have resulted in an updated osdmap), the question is what is it
>> hanging on?  Is it trying to open a new session with the MDS?
>
> It looks like the client still "thinks" that it has a session open
> (since the condition at [1] was false when I checked it myself) and
> then waits for a reply [2]. This is exactly where it hangs. I have
> written a fix and raised a PR for it [3]. Basically, it replaces
> caller_cond.Wait(client_lock) with
> caller_cond.WaitInterval(client_lock, utime_t(10, 0)).

So that patch is now auto-unmounting a client if any request takes
longer than 10 seconds, which is not quite right. A request can take
that long for other reasons, such as an ongoing MDS failover, or just
a very slow/busy MDS, and we wouldn't want to start aborting requests
because of that.

Instead, you could do a check in Client::tick that looks at
mds_requests.begin() (doesn't matter which request we check), and if
it has been stuck for a long time then call into
objecter->get_latest_version to ensure we have the latest OSDMap (and
blacklist) even if there's no data IO going on.
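
(A minimal sketch of that idea, under assumptions: the timeout constant
and the request timestamp field below are placeholders rather than
upstream names, and the exact Objecter call may end up being the
wait_for_latest_osdmap() wrapper around the get_latest_version() path
mentioned above:)

  // Sketch only -- STUCK_REQUEST_TIMEOUT and op_stamp are assumed names.
  void Client::tick()
  {
    // ... existing periodic work ...

    if (!mds_requests.empty()) {
      // Any outstanding request will do; look at the oldest one.
      MetaRequest *req = mds_requests.begin()->second;
      utime_t now = ceph_clock_now();
      if (req->op_stamp + utime_t(STUCK_REQUEST_TIMEOUT, 0) < now) {
        // Request the newest OSDMap from the monitors so the blacklist
        // becomes visible even though the client is doing no data I/O.
        objecter->wait_for_latest_osdmap(new FunctionContext([](int r) {
          // Nothing to do here; receiving the newer map is what matters.
        }));
      }
    }
  }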

I don't think the client_unmount_on_blacklist behaviour is going to be
a good idea in practice -- if someone has a workload writing files out
to a mounted /mnt/cephfs, and it gets auto-unmounted, they'll start
writing data out to their root filesystem instead!  We need to leave
the actual unmounting to the administrator so that they can stop
whatever workloads were using the mount point as well.

John



>
> By the way, would it be correct for the client to realize that it is
> blacklisted here [4]? When I checked, that wasn't the case -- the
> (copy of the) blacklist didn't have the client's address.
>
> Also, I think it would be better if an evicted/blacklisted client
> received some sort of reply from the MDS on making a request (like
> here [2] in this bug's case) conveying that it can't access CephFS
> anymore. I don't know if this would be appropriate, but it would keep
> the client from waiting infinitely. Though this might be risky,
> considering that the client could then flood the MDS with requests; in
> that case, maybe the MDS should send such a reply to the client only
> once.



>
> [1] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L1689
> [2] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L1719
> [3] https://github.com/ceph/ceph/pull/21065
> [4] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L1658
> [5] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L2410
Rishabh Dave March 27, 2018, 1:54 p.m. UTC | #4
On 27 March 2018 at 19:13, John Spray <jspray@redhat.com> wrote:
> So that patch is now auto-unmounting a client if any request takes
> longer than 10 seconds, which is not quite right. A request can take
> that long for other reasons, such as an ongoing MDS failover, or just
> a very slow/busy MDS, and we wouldn't want to start aborting requests
> because of that.

I was going to ask about the very situation you mentioned here on the PR page.

> Instead, you could do a check in Client::tick that looks at
> mds_requests.begin() (doesn't matter which request we check), and if
> it has been stuck for a long time then call into
> objecter->get_latest_version to ensure we have the latest OSDMap (and
> blacklist) even if there's no data IO going on.

Ok. I'll make the changes accordingly.

> I don't think the client_unmount_on_blacklist behaviour is going to be
> a good idea in practice -- if someone has a workload writing files out
> to a mounted /mnt/cephfs, and it gets auto-unmounted, they'll start
> writing data out to their root filesystem instead!  We need to leave
> the actual unmounting to the administrator so that they can stop
> whatever workloads were using the mount point as well.

So, would it be preferable to just get rid of it, or to set it to false by default?
Patrick Donnelly March 29, 2018, 6:19 p.m. UTC | #5
On Tue, Mar 27, 2018 at 6:54 AM, Rishabh Dave <ridave@redhat.com> wrote:
> On 27 March 2018 at 19:13, John Spray <jspray@redhat.com> wrote:
>> I don't think the client_unmount_on_blacklist behaviour is going to be
>> a good idea in practice -- if someone has a workload writing files out
>> to a mounted /mnt/cephfs, and it gets auto-unmounted, they'll start
>> writing data out to their root filesystem instead!  We need to leave
>> the actual unmounting to the administrator so that they can stop
>> whatever workloads were using the mount point as well.
>
> So, would it be preferable to just get rid of it, or to set it to false by default?

John raises a good point. I think that, instead, we should just have
the CephFS Client clean up its state such that umount(2) succeeds.

Patch

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 6c464d5a36..4e1f0b442c 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -2489,7 +2489,6 @@  void Client::handle_osd_map(MOSDMap *m)
 // ------------------------
 // incoming messages
 
-
 bool Client::ms_dispatch(Message *m)
 {
   Mutex::Locker l(client_lock);
@@ -13489,9 +13488,22 @@  void Client::ms_handle_connect(Connection *con)
   ldout(cct, 10) << __func__ << " on " << con->get_peer_addr() << dendl;
 }
 
+void Client::unmount_if_blacklisted()
+{
+  std::set<entity_addr_t> new_blacklists;
+  objecter->consume_blacklist_events(&new_blacklists);
+
+  const auto myaddr = messenger->get_myaddr();
+  if (new_blacklists.count(myaddr)) {
+    cout << "UNMOUNTING!!!" << std::endl;
+    this->unmount();
+  }
+}
+
 bool Client::ms_handle_reset(Connection *con)
 {
   ldout(cct, 0) << __func__ << " on " << con->get_peer_addr() << dendl;
+  this->unmount_if_blacklisted();
   return false;
 }
 
diff --git a/src/client/Client.h b/src/client/Client.h
index ae5b188538..fd7b1f50da 100644
--- a/src/client/Client.h
+++ b/src/client/Client.h
@@ -558,6 +558,7 @@  protected:
 
   // friends
   friend class SyntheticClient;
+  void unmount_if_blacklisted();
   bool ms_dispatch(Message *m) override;
 
   void ms_handle_connect(Connection *con) override;
diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index d36d680d57..3ec16791e1 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -2717,6 +2717,9 @@  bool MDSRank::evict_client(int64_t session_id,
     bool wait, bool blacklist, std::stringstream& err_ss,
     Context *on_killed)
 {
+  FILE *fq = fopen("time", "a+");
+  fprintf(fq, "mds: MDSRank::evict_client()\n");
+
   assert(mds_lock.is_locked_by_me());
 
   // Mutually exclusive args
@@ -2823,6 +2826,17 @@  bool MDSRank::evict_client(int64_t session_id,
     }
   }
 
+  fprintf(fq, "will print the blacklist -\n");
+  std::set<entity_addr_t> blacklist2;
+  objecter->consume_blacklist_events(&blacklist2);
+  int j = 0;
+  for (std::set<entity_addr_t>::iterator i = blacklist2.begin();
+        i != blacklist2.end(); ++i, ++j) {
+    stringstream ss; ss <<  *i;
+    fprintf(fq, "blacklist[%d] = %s\n", j, ss.str().c_str());
+  }
+  fprintf(fq, "blacklist ends\n");
+  fclose(fq);
   return true;
 }
 
@@ -2900,6 +2914,20 @@  bool MDSRankDispatcher::handle_command(
     evict_clients(filter, m);
 
     *need_reply = false;
+  FILE *fq = fopen("time", "a+");
+  fprintf(fq, "mds: MDSRank::ms_dispatch\n");
+  fprintf(fq, "mds: will print the blacklist -\n");
+  std::set<entity_addr_t> blacklist2;
+  objecter->consume_blacklist_events(&blacklist2);
+  int j = 0;
+  for (std::set<entity_addr_t>::iterator i = blacklist2.begin();
+        i != blacklist2.end(); ++i, ++j) {
+    stringstream ss; ss <<  *i;
+    fprintf(fq, "blacklist[%d] = %s\n", j, ss.str().c_str());
+  }
+  fprintf(fq, "mds: blacklist ends\n");
+  fclose(fq);
+
     return true;
   } else if (prefix == "damage ls") {
     Formatter *f = new JSONFormatter(true);