
Fix local messages and lost wake-up in SimpleMessenger

Message ID or7gt2ypox.fsf@livre.localdomain (mailing list archive)
State New, archived

Commit Message

Alexandre Oliva Aug. 14, 2012, 4:55 a.m. UTC
Don't wait for a signal if the dispatch queue is non-empty already,
and support a non-NULL pipe for local responses.  If we were to
require pipe to be NULL for that, the response would be lost.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
---
 src/msg/SimpleMessenger.cc |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

Comments

Sage Weil Aug. 14, 2012, 4 p.m. UTC | #1
A huge pile of SimpleMessenger refactors, cleanups, and fixes (wip-msgr) 
was just merged into master, and this code was changed around quite a bit.  
Did you have a specific test case that was triggering the problem?  If so, 
do you mind running it on the current master?

Most of the cleanups/fixes were motivated by testing with socket failure 
injection enabled (e.g., 'ms inject socket failures = 500' to simulate a 
failure on every ~500th socket read or write).  This is being added to the 
rados regression test suite.
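
(For anyone who wants to reproduce that kind of testing, a sketch of how the
injection knob might be set in ceph.conf; placing it under [global] is an
assumption here, and any daemon section carrying messenger options should
work as well.)

  [global]
          # fail roughly every ~500th socket read or write at random
          ms inject socket failures = 500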

Thanks!
sage


On Tue, 14 Aug 2012, Alexandre Oliva wrote:

> Don't wait for a signal if the dispatch queue is non-empty already,
> and support a non-NULL pipe for local responses.  If we were to
> require pipe to be NULL for that, the response would be lost.
> 
> Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
> ---
>  src/msg/SimpleMessenger.cc |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/src/msg/SimpleMessenger.cc b/src/msg/SimpleMessenger.cc
> index 53e36cd..973219f 100644
> --- a/src/msg/SimpleMessenger.cc
> +++ b/src/msg/SimpleMessenger.cc
> @@ -373,7 +373,7 @@ void SimpleMessenger::DispatchQueue::entry()
>        }
>        lock.Lock();
>      }
> -    if (!stop)
> +    if (queued_pipes.empty() && !stop)
>        cond.Wait(lock); //wait for something to be put on queue
>    }
>    lock.Unlock();
> @@ -2477,7 +2477,7 @@ void SimpleMessenger::submit_message(Message *m, Connection *con, const entity_a
>    }
>  
>    // local?
> -  if (!pipe && my_inst.addr == dest_addr) {
> +  if (my_inst.addr == dest_addr) {
>      if (!destination_stopped) {
>        // local
>        ldout(cct,20) << "submit_message " << *m << " local" << dendl;
> -- 
> 1.7.7.6
> 
> -- 
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free! -- http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist      Red Hat Brazil Compiler Engineer
> --
Alexandre Oliva Aug. 14, 2012, 6:50 p.m. UTC | #2
On Aug 14, 2012, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> Don't wait for a signal if the dispatch queue is non-empty already,
> and support a non-NULL pipe for local responses.  If we were to
> require pipe to be NULL for that, the response would be lost.

Although this patch applies to 0.50, I developed it on 0.48argonaut,
which I failed to mention.  It occurred to me that I haven't even tried
to duplicate on 0.50 the problem I'd observed consistently on 0.48,
namely, that accessing (stat) hardlinks that were not in the MDS cache
would hard-block the client, because the MDS would issue an anchortable
request to itself, and the response wouldn't find its way back.

At first, I thought the response had been queued but not picked up
because it would be signalled before the dispatcher returned and
Wait()ed for the signal, so it could be a lost wake-up race condition.
Fixing that turned out not to solve the problem, but I figured I'd
leave the hunk in, for it appeared to make sense on its own (the
spelling of the empty() call differs between 0.50 and 0.48, FWIW).

> -    if (!stop)
> +    if (queued_pipes.empty() && !stop)
>        cond.Wait(lock); //wait for something to be put on queue
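
For reference, a stand-alone sketch of the wait loop that hunk adjusts,
written with plain C++11 primitives rather than Ceph's Mutex/Cond wrappers;
queued_pipes and deliver() are stand-ins for the real queue and dispatch
path.  The point is to re-check the queue before sleeping, so a Signal()
delivered while the dispatcher was still busy is not lost:

  #include <condition_variable>
  #include <deque>
  #include <mutex>

  std::mutex lock;
  std::condition_variable cond;
  std::deque<int> queued_pipes;        // stand-in for the real pipe queue
  bool stop = false;

  void deliver(int /*pipe*/) { /* hand the pipe's messages to the dispatchers */ }

  void dispatch_entry() {
    std::unique_lock<std::mutex> l(lock);
    while (!stop) {
      while (!queued_pipes.empty()) {
        int pipe = queued_pipes.front();
        queued_pipes.pop_front();
        l.unlock();                      // don't hold the lock while dispatching
        deliver(pipe);
        l.lock();
      }
      if (queued_pipes.empty() && !stop) // re-check: only sleep on an empty queue
        cond.wait(l);                    // wait for something to be put on queue
    }
  }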


Debugging further, I noticed that the anchortable request was issued the
first time (and never re-issued, for the request remained in the pending
queue), but the response didn't make it back: it was handled as a remote
request, rather than a local one, because we find a pipe at the time of
posting the response.  However, since there aren't threads for such
local pipes, the response is AFAICT queued but never sent.  Once I
enabled the destination for the response to be recognized as local in
spite of the existence of the pipe, the response found its way back to
the (local) requester and the client didn't hang any more.  At that
point, I stopped digging.

>    // local?
> -  if (!pipe && my_inst.addr == dest_addr) {
> +  if (my_inst.addr == dest_addr) {
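
To make the failure mode concrete, here is a compilable sketch of that
routing decision; the types and helper names below are stand-ins for
illustration, not the real SimpleMessenger API:

  #include <iostream>
  #include <string>

  struct Message { std::string payload; };
  struct Pipe    { /* connection state; remote pipes have a writer thread */ };

  // Stand-in helpers: local_delivery() hands the message to our own dispatch
  // queue; queue_on_pipe() queues it for the pipe's writer thread to send.
  void local_delivery(Message &m)        { std::cout << "local: "  << m.payload << "\n"; }
  void queue_on_pipe(Pipe &, Message &m) { std::cout << "remote: " << m.payload << "\n"; }

  void submit(Message &m, Pipe *pipe,
              const std::string &my_addr, const std::string &dest_addr) {
    // The unpatched check was (!pipe && my_addr == dest_addr): with a Pipe to
    // ourselves already present, the message fell through to the remote path
    // and sat on a queue that nothing would ever drain.
    if (my_addr == dest_addr) {
      // Self-addressed: deliver locally even if a Pipe to ourselves exists,
      // because such a Pipe has no writer thread to send what it queues.
      local_delivery(m);
    } else if (pipe) {
      queue_on_pipe(*pipe, m);
    }
    // else: look up or create a Pipe to dest_addr (omitted)
  }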


I ran a 0.48 with this (or rather the equivalent) change for a bit, then
upgraded to 0.50, ported the patches, tested them for a bit, and posted
the patches.  The client hang was 100% reproducible with the unpatched
0.48, as long as the hardlinked inode was out of the MDS cache, and it
could only be temporarily averted by accessing the hardlinked file by
another name that, I suppose, didn't require a self-request of
anchortable lookup.  After the patch, the problem is gone for good.
Sage Weil Aug. 14, 2012, 10:49 p.m. UTC | #3
On Tue, 14 Aug 2012, Alexandre Oliva wrote:
> On Aug 14, 2012, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:
> 
> > Don't wait for a signal if the dispatch queue is non-empty already,
> > and support a non-NULL pipe for local responses.  If we were to
> > require pipe to be NULL for that, the response would be lost.
> 
> Although this patch applies to 0.50, I developed it on 0.48argonaut,
> which I failed to mention.  It occurred to me that I haven't even tried
> to duplicate on 0.50 the problem I'd observed consistently on 0.48,
> namely, that accessing (stat) hardlinks that were not in the MDS cache
> would hard-block the client, because the MDS would issue an anchortable
> request to itself, and the response wouldn't find its way back.
> 
> At first, I thought the response had been queued but not picked up
> because it would be signalled before the dispatcher returned and
> Wait()ed for the signal, so it could be a lost wake-up race condition.
> Fixing that turned out not to solve the problem, but I figured I'd
> leave the hunk in, for it appeared to make sense on its own (the
> spelling of the empty() call differs between 0.50 and 0.48, FWIW).
> 
> > -    if (!stop)
> > +    if (queued_pipes.empty() && !stop)
> >        cond.Wait(lock); //wait for something to be put on queue
> 
> 
> Debugging further, I noticed that the anchortable request was issued the
> first time (and never re-issued, for the request remained in the pending
> queue), but the response didn't make it back: it was handled as a remote
> request, rather than a local one, because we find a pipe at the time of
> posting the response.  However, since there aren't threads for such
> local pipes, the response is AFAICT queued but never sent.  Once I
> enabled the destination for the response to be recognized as local in
> spite of the existence of the pipe, the response found its way back to
> the (local) requester and the client didn't hang any more.  At that
> point, I stopped digging.

That's a nice bit of debugging!
 
> >    // local?
> > -  if (!pipe && my_inst.addr == dest_addr) {
> > +  if (my_inst.addr == dest_addr) {
> 
> 
> I ran a 0.48 with this (or rather the equivalent) change for a bit, then
> upgraded to 0.50, ported the patches, tested them for a bit, and posted
> the patches.  The client hang was 100% reproducible with the unpatched
> 0.48, as long as the hardlinked inode was out of the MDS cache, and it
> could only be temporarily averted by accessing the hardlinked file by
> another name that, I suppose, didn't require a self-request of
> anchortable lookup.  After the patch, the problem is gone for good.

Do you mind trying to reproduce this on the current master?  A quick 
inspection of the code makes me think this isn't present after the 
refactor, but I suspect it'll be quicker for you to reproduce than for me.  

BTW, if you have a script or anything you were using to test, that'd be 
great to add to our test suite.  Something like

  #!/bin/sh

  mkdir sub1
  touch sub1/a

  # put hardlink in separate dir so b lookup doesn't load a's inode
  mkdir sub2
  ln sub1/a sub2/b
 
  # push it out of the mds cache
  for dir in `seq 1 100` ; do
    mkdir $dir
    for file in `seq 1 100` ; do
      touch $dir/$file
    done
  done

  stat sub2/b   # the hardlinked inode should be out of the mds cache by now

?

Thanks again!
sage
Alexandre Oliva Aug. 15, 2012, 2:28 a.m. UTC | #4
On Aug 14, 2012, Sage Weil <sage@inktank.com> wrote:

> Do you mind trying to reproduce this on the current master?

Tried, but couldn't.  It looks like it is indeed fixed in master, but
not in 0.50.

> BTW, if you have a script or anything you were using to test, that'd be 
> great to add to our test suite.

It occurred occasionally while rsyncing (with -aH) trees containing
hardlinks into the ceph filesystem, but it became a real problem when I
decided to stop using snapshots and use hardlink farms instead.
Creating a hardlink farm of a tree containing hardlinks (like
/usr/share/zoneinfo) presumably compounds the problem, but I've hit it
most often after creating linkfarms of a few backups of my phone's root
directory, some old SHR GNU/Linux release, probably just because those
have enough directories and files to get stuff kicked out of the mds
cache.

You could probably trigger it with something like this:

rsync -aH /usr/share/zoneinfo/ /mnt/ceph/sub1/
rsync -aH /usr/share/zoneinfo/ /mnt/ceph/sub2/ --link-dest=/mnt/ceph/sub1/

umount /mnt/ceph
# ceph-fuse survives the umount, so wait
# till it flushes everything to the mds:
while killall -0 ceph-fuse; do sleep 1; done 
restart mds # maybe not needed
mount /mnt/ceph

# kick all of sub1/ and sub2/ out of the cache somehow, and then:

ls -lR /mnt/ceph/sub2 # freezes

# restarting mds at this point may get it stuck at rejoin

Patch

diff --git a/src/msg/SimpleMessenger.cc b/src/msg/SimpleMessenger.cc
index 53e36cd..973219f 100644
--- a/src/msg/SimpleMessenger.cc
+++ b/src/msg/SimpleMessenger.cc
@@ -373,7 +373,7 @@  void SimpleMessenger::DispatchQueue::entry()
       }
       lock.Lock();
     }
-    if (!stop)
+    if (queued_pipes.empty() && !stop)
       cond.Wait(lock); //wait for something to be put on queue
   }
   lock.Unlock();
@@ -2477,7 +2477,7 @@  void SimpleMessenger::submit_message(Message *m, Connection *con, const entity_a
   }
 
   // local?
-  if (!pipe && my_inst.addr == dest_addr) {
+  if (my_inst.addr == dest_addr) {
     if (!destination_stopped) {
       // local
       ldout(cct,20) << "submit_message " << *m << " local" << dendl;