Message ID | 20190828094855.49918-1-chenerqi@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ceph: reconnect connection if session hang in opening state | expand |
On Wed, 2019-08-28 at 17:48 +0800, chenerqi@gmail.com wrote: > From: Erqi Chen <chenerqi@gmail.com> > > If client mds session is evicted in CEPH_MDS_SESSION_OPENING state, > mds won't send session msg to client, and delayed_work skip > CEPH_MDS_SESSION_OPENING state session, the session hang forever. > ceph_con_keepalive reconnct connection for CEPH_MDS_SESSION_OPENING > session to avoid session hang. > > Fixes: https://tracker.ceph.com/issues/41551 > Signed-off-by: Erqi Chen chenerqi@gmail.com > --- > fs/ceph/mds_client.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c > index 920e9f0..eee4b63 100644 > --- a/fs/ceph/mds_client.c > +++ b/fs/ceph/mds_client.c > @@ -4044,7 +4044,7 @@ static void delayed_work(struct work_struct *work) > pr_info("mds%d hung\n", s->s_mds); > } > } > - if (s->s_state < CEPH_MDS_SESSION_OPEN) { > + if (s->s_state < CEPH_MDS_SESSION_OPENING) { > /* this mds is failed or recovering, just wait */ > ceph_put_mds_session(s); > continue; Just for my own edification: OPENING == we've sent (or are sending) the session open request OPEN == we've gotten the reply from the MDS and it was successful So in this case, the client got blacklisted after sending the request but before the reply? Ok. So this should make it send a keepalive (or cap) message, at which point the client discovers the connection is closed and then goes to reconnect the session. This sounds sane to me, but I wonder if this would be better expressed as: if (s->s_state == CEPH_MDS_SESSION_NEW) It always seems odd to me that we rely on the numerical values in this enum. That said, we do that all over the code, so I'm inclined to just merge this as-is (assuming Zheng concurs).
On Wed, Aug 28, 2019 at 8:05 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Wed, 2019-08-28 at 17:48 +0800, chenerqi@gmail.com wrote: > > From: Erqi Chen <chenerqi@gmail.com> > > > > If client mds session is evicted in CEPH_MDS_SESSION_OPENING state, > > mds won't send session msg to client, and delayed_work skip > > CEPH_MDS_SESSION_OPENING state session, the session hang forever. > > ceph_con_keepalive reconnct connection for CEPH_MDS_SESSION_OPENING > > session to avoid session hang. > > > > Fixes: https://tracker.ceph.com/issues/41551 > > Signed-off-by: Erqi Chen chenerqi@gmail.com > > --- > > fs/ceph/mds_client.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c > > index 920e9f0..eee4b63 100644 > > --- a/fs/ceph/mds_client.c > > +++ b/fs/ceph/mds_client.c > > @@ -4044,7 +4044,7 @@ static void delayed_work(struct work_struct *work) > > pr_info("mds%d hung\n", s->s_mds); > > } > > } > > - if (s->s_state < CEPH_MDS_SESSION_OPEN) { > > + if (s->s_state < CEPH_MDS_SESSION_OPENING) { > > /* this mds is failed or recovering, just wait */ > > ceph_put_mds_session(s); > > continue; > > Just for my own edification: > > OPENING == we've sent (or are sending) the session open request > OPEN == we've gotten the reply from the MDS and it was successful > > So in this case, the client got blacklisted after sending the request > but before the reply? Ok. > > So this should make it send a keepalive (or cap) message, at which point > the client discovers the connection is closed and then goes to reconnect > the session. This sounds sane to me, but I wonder if this would be > better expressed as: > > if (s->s_state == CEPH_MDS_SESSION_NEW) > should also avoid keepalive for CEPH_MDS_SESSION_RESTARTING and CEPH_MDS_SESSION_REJECTED > It always seems odd to me that we rely on the numerical values in this > enum. That said, we do that all over the code, so I'm inclined to just > merge this as-is (assuming Zheng concurs). > > -- > Jeff Layton <jlayton@kernel.org> >
So in this case, the client got blacklisted after sending the request
but before the reply? Ok.
>>
Blacklist is disabled in my env, client fails to response caps revoke,
mds tick evicts it, and before evict closed the session, it cost much
time to remove client caps which is running in finish thread, the next
mds tick comes, I suspect if the client with STATE_KILLING session is
put to to_evict list again, it happens to kill client's session when
the client opening new session.
I would like to propose a new patch as follow change:
- if (s->s_state < CEPH_MDS_SESSION_OPEN) {
+ if (s->s_state == CEPH_MDS_SESSION_NEW ||
+ s->s_state == CEPH_MDS_SESSION_RESTARTING ||
+ s->s_state == CEPH_MDS_SESSION_REJECTED)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 920e9f0..eee4b63 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -4044,7 +4044,7 @@ static void delayed_work(struct work_struct *work) pr_info("mds%d hung\n", s->s_mds); } } - if (s->s_state < CEPH_MDS_SESSION_OPEN) { + if (s->s_state < CEPH_MDS_SESSION_OPENING) { /* this mds is failed or recovering, just wait */ ceph_put_mds_session(s); continue;