radosgw crash within libfcgi

Message ID 445856759.19469749.1435182024677.JavaMail.zimbra@redhat.com
State New

Commit Message

Yehuda Sadeh June 24, 2015, 9:40 p.m. UTC
Also, looking at the code, I see an extra call to FCGX_Finish_r() (removed in the patch below).

Maybe this is a problem with the specific libfcgi version that you're using?

----- Original Message -----
> From: "Yehuda Sadeh-Weinraub" <yehuda@redhat.com>
> To: "GuangYang" <yguang11@outlook.com>
> Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> Sent: Wednesday, June 24, 2015 2:21:04 PM
> Subject: Re: radosgw crash within libfcgi
> 
> 
> 
> ----- Original Message -----
> > From: "GuangYang" <yguang11@outlook.com>
> > To: "Yehuda Sadeh-Weinraub" <yehuda@redhat.com>
> > Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> > Sent: Wednesday, June 24, 2015 2:12:23 PM
> > Subject: RE: radosgw crash within libfcgi
> > 
> > ----------------------------------------
> > > Date: Wed, 24 Jun 2015 17:04:05 -0400
> > > From: yehuda@redhat.com
> > > To: yguang11@outlook.com
> > > CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
> > > Subject: Re: radosgw crash within libfcgi
> > >
> > >
> > >
> > > ----- Original Message -----
> > >> From: "GuangYang" <yguang11@outlook.com>
> > >> To: "Yehuda Sadeh-Weinraub" <yehuda@redhat.com>
> > >> Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> > >> Sent: Wednesday, June 24, 2015 1:53:20 PM
> > >> Subject: RE: radosgw crash within libfcgi
> > >>
> > >> Thanks Yehuda for the response.
> > >>
> > >> We already patched libfcgi to use poll instead of select to overcome the
> > >> limitation.
> > >>
> > >> Thanks,
> > >> Guang
> > >>
> > >>
> > >> ----------------------------------------
> > >>> Date: Wed, 24 Jun 2015 14:40:25 -0400
> > >>> From: yehuda@redhat.com
> > >>> To: yguang11@outlook.com
> > >>> CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
> > >>> Subject: Re: radosgw crash within libfcgi
> > >>>
> > >>>
> > >>>
> > >>> ----- Original Message -----
> > >>>> From: "GuangYang" <yguang11@outlook.com>
> > >>>> To: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com,
> > >>>> yehuda@redhat.com
> > >>>> Sent: Wednesday, June 24, 2015 10:09:58 AM
> > >>>> Subject: radosgw crash within libfcgi
> > >>>>
> > >>>> Hello Cephers,
> > >>>> Recently we have had several radosgw daemon crashes, all with the
> > >>>> following kernel log:
> > >>>>
> > >>>> Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
> > >>>> 00007ffa069996f2 sp 00007ff55c432710 error 6 in
> > >
> > > error 6 is sigabrt, right? With invalid pointer I'd expect to get
> > > segfault.
> > > Is the pointer actually invalid?
> > Computing (ip - {load address of the shared library}) gives the instruction
> > that caused the crash. The objdump shows the crash happened at instruction
> > 46f2 (see below), which assigns -1 to FCGX_Request::ipcFd, but I don't quite
> > understand how or why it could crash there.
> > 
> > 0000000000004690 <FCGX_Free>:
> >     4690:       48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
> >     4695:       48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
> >     469a:       48 83 ec 18             sub    $0x18,%rsp
> >     469e:       48 85 ff                test   %rdi,%rdi
> >     46a1:       48 89 fb                mov    %rdi,%rbx
> >     46a4:       89 f5                   mov    %esi,%ebp
> >     46a6:       74 28                   je     46d0 <FCGX_Free+0x40>
> >     46a8:       48 8d 7f 08             lea    0x8(%rdi),%rdi
> >     46ac:       e8 67 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
> >     46b1:       48 8d 7b 10             lea    0x10(%rbx),%rdi
> >     46b5:       e8 5e e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
> >     46ba:       48 8d 7b 18             lea    0x18(%rbx),%rdi
> >     46be:       e8 55 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
> >     46c3:       48 8d 7b 28             lea    0x28(%rbx),%rdi
> >     46c7:       e8 d4 f4 ff ff          callq  3ba0 <FCGX_PutS+0x40>
> >     46cc:       85 ed                   test   %ebp,%ebp
> >     46ce:       75 10                   jne    46e0 <FCGX_Free+0x50>
> >     46d0:       48 8b 5c 24 08          mov    0x8(%rsp),%rbx
> >     46d5:       48 8b 6c 24 10          mov    0x10(%rsp),%rbp
> >     46da:       48 83 c4 18             add    $0x18,%rsp
> >     46de:       c3                      retq
> >     46df:       90                      nop
> >     46e0:       31 f6                   xor    %esi,%esi
> >     46e2:       83 7b 4c 00             cmpl   $0x0,0x4c(%rbx)
> >     46e6:       8b 7b 30                mov    0x30(%rbx),%edi
> >     46e9:       40 0f 94 c6             sete   %sil
> >     46ed:       e8 86 e6 ff ff          callq  2d78 <OS_IpcClose@plt>
> >     46f2:       c7 43 30 ff ff ff ff    movl   $0xffffffff,0x30(%rbx)
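
The address arithmetic described above can be checked mechanically. The following is a minimal sketch, not something posted in the thread: it takes the ip, load base, fault address and error code from the quoted kernel log line, and the 0x30 displacement from the movl at 46f2. The function and struct names are invented for this illustration. The bit meanings are those of the x86 page-fault error code that the kernel prints after "error" (bit 0 protection violation, bit 1 write, bit 2 user mode); that number is not a signal number.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of an instruction inside a mapped library: the ip from the
 * kernel log minus the load base printed in brackets after the name. */
static uint64_t offset_in_library(uint64_t ip, uint64_t load_base)
{
    assert(ip >= load_base);
    return ip - load_base;
}

/* For an access of the form disp(%reg), recover the register value
 * from the faulting address reported after "segfault at". */
static uint64_t base_from_fault(uint64_t fault_addr, uint64_t disp)
{
    assert(fault_addr >= disp);
    return fault_addr - disp;
}

/* Low three bits of the x86 page-fault error code. */
struct fault_info {
    int protection_violation; /* bit 0: 0 means the page was not present */
    int write;                /* bit 1: 1 means a write access */
    int user_mode;            /* bit 2: 1 means the fault came from user mode */
};

static struct fault_info decode_fault(unsigned code)
{
    struct fault_info f;
    f.protection_violation = (code >> 0) & 1;
    f.write = (code >> 1) & 1;
    f.user_mode = (code >> 2) & 1;
    return f;
}
```

With the values from the log, offset_in_library(0x7ffa069996f2, 0x7ffa06995000) is 0x46f2, and decode_fault(6) reads as a user-mode write to a page that is not present. If the store at 46f2 really is the faulting instruction, base_from_fault(0xf0, 0x30) would put %rbx at 0xc0, which does not look like a valid FCGX_Request pointer.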
> 
> info registers?
> 
> Not too familiar with the specific message, but it could be that
> OS_IpcClose() aborts (not highly unlikely) and it only dumps the return
> address of the current function (shouldn't be referenced as ip though).
> 
> What's rbx? Is the memory at %rbx + 0x30 valid?
> 
> Also, did you by any chance upgrade the binaries while the code was running?
> Is the code running over NFS?
> 
> Yehuda
> 
> > >
> > > Yehuda
> > >
> > >
> > >>>> libfcgi.so.0.0.0[7ffa06995000+a000] in
> > >>>> libfcgi.so.0.0.0[7ffa06995000+a000]
> > >>>>
> > >>>> Looking at the assembly, it seems to be crashing at this point -
> > >>>> http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035,
> > >>>> which confused me. I tried to find any other reference holding the
> > >>>> FCGX_Request that might release the handle, without any luck.
> > >>>>
> > >>>> There are also other observations:
> > >>>> 1> Several radosgw daemons across different hosts crashed around the
> > >>>> same time.
> > >>>> 2> Apache's error log has some fcgi errors complaining about
> > >>>> ##idle timeout## around that time.
> > >>>>
> > >>>> Has anyone experienced a similar issue?
> > >>>>
> > >>>
> > >>> In the past we've had issues with libfcgi that were related to the number
> > >>> of open fds on the process (> 1024). The issue was a buggy libfcgi that
> > >>> was using select() instead of poll(), so this might be the issue you're
> > >>> noticing.
> > >>>
> > >>> Yehuda
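
The select() versus poll() point above comes down to fd_set being a fixed-size bitmap: FD_SET() on a descriptor numbered FD_SETSIZE or higher (commonly 1024) is undefined, while poll() takes the descriptor number directly. The following is a minimal sketch of the two ways to wait for one descriptor to become readable. It is an illustration only, not libfcgi's code or the patch mentioned in this thread, and the function names are made up.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* select() variant: refuses descriptors that do not fit in an fd_set
 * instead of silently writing outside the bitmap. */
static int wait_readable_select(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    assert(timeout_ms >= 0);
    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &rfds, NULL, NULL, &tv);
}

/* poll() variant: no upper bound on the descriptor number. */
static int wait_readable_poll(int fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeout_ms);
}

/* Small self-check: write one byte into a pipe and ask both variants
 * whether the read end is ready. Returns 1 if both agree it is. */
static int pipe_reported_readable(void)
{
    int p[2];
    int ok;

    if (pipe(p) != 0)
        return -1;
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    ok = wait_readable_select(p[0], 100) == 1 &&
         wait_readable_poll(p[0], 100) == 1;
    close(p[0]);
    close(p[1]);
    return ok;
}
```

A process that holds more than FD_SETSIZE descriptors open can only use the poll() variant safely for the high-numbered ones, which is why the open fd count matters here.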
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>> in
> > >>> the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > 
> 

Patch

diff --git a/src/rgw/rgw_main.cc b/src/rgw/rgw_main.cc
index 9a8aa5f..0aa7ded 100644
--- a/src/rgw/rgw_main.cc
+++ b/src/rgw/rgw_main.cc
@@ -669,8 +669,6 @@  void RGWFCGXProcess::handle_request(RGWRequest *r)
     dout(20) << "process_request() returned " << ret << dendl;
   }
 
-  FCGX_Finish_r(fcgx);
-
   delete req;
 }
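
The patch drops a FCGX_Finish_r() call that the message describes as "extra", which suggests the request is also finished somewhere else. The thread does not establish that this is what caused the crash. As a general illustration of why releasing the same descriptor twice is risky in a long-running daemon, here is a minimal sketch using a hypothetical toy_request type (it is not libfcgi's FCGX_Request, and none of these helpers exist in libfcgi or radosgw). It relies only on the POSIX rule that pipe() and dup() return the lowest free descriptor number.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical stand-in for a request that owns one descriptor. */
struct toy_request {
    int fd; /* -1 once released */
};

/* Idempotent release: the guard makes a second call a no-op. */
static void toy_finish(struct toy_request *r)
{
    assert(r != NULL);
    if (r->fd >= 0) {
        close(r->fd);
        r->fd = -1;
    }
}

/* Returns 1 if finishing twice leaves the request cleanly released. */
static int finish_twice_is_harmless(void)
{
    int p[2];
    struct toy_request r;

    if (pipe(p) != 0)
        return -1;
    r.fd = p[0];
    toy_finish(&r);
    toy_finish(&r); /* no second close() is issued */
    close(p[1]);
    return r.fd == -1;
}

/* Shows the hazard an unguarded second close would hit: after close(),
 * the same number is handed out again, so a stale close(stale) would
 * tear down an unrelated descriptor. Returns 1 if the number was reused. */
static int stale_fd_is_reused(void)
{
    int p[2];
    int stale, fresh, reused;

    if (pipe(p) != 0)
        return -1;
    stale = p[0];
    close(p[0]);
    fresh = dup(p[1]); /* lowest free number, i.e. the one just closed */
    reused = (fresh == stale);
    close(fresh);
    close(p[1]);
    return reused;
}
```

In a multithreaded process the reuse can happen on another thread between the two releases, so the usual remedies are either to release in exactly one place, as the patch does, or to make the release idempotent as toy_finish() does.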