diff mbox

operate one file in multi clients with libceph

Message ID BANLkTin+D66OuXUbiRkDBNiLx--8eu2hkw@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Simon Tian May 19, 2011, 6:58 a.m. UTC
None

Comments

Sage Weil May 19, 2011, 5:11 p.m. UTC | #1
On Thu, 19 May 2011, Simon Tian wrote:
> Hi Sage,
> 
>     I've test this patch.
> With 8 times test, only 1 (the first time) got a thread assert fail.As
> back trace 1 showed:

Do you have a simple reproducer for this?

sage


> ==========================  back trace 1   ==================================
> (gdb) bt
> #0  0x000000367fa30265 in raise () from /lib64/libc.so.6
> #1  0x000000367fa31d10 in abort () from /lib64/libc.so.6
> #2  0x0000003682ebec44 in __gnu_cxx::__verbose_terminate_handler() ()
> from /usr/lib64/libstdc++.so.6
> #3  0x0000003682ebcdb6 in ?? () from /usr/lib64/libstdc++.so.6
> #4  0x0000003682ebcde3 in std::terminate() () from /usr/lib64/libstdc++.so.6
> #5  0x0000003682ebceca in __cxa_throw () from /usr/lib64/libstdc++.so.6
> #6  0x00007ffff7c51a54 in ceph::__ceph_assert_fail
> (assertion=0x7ffff7ce31f1 "in->cap_refs[(32 << 8)] == 1",
>     file=0x7ffff7ce1431 "client/Client.cc", line=2190,
> func=0x7ffff7ce5fc0 "void Client::_flush(Inode*, Context*)")
>     at common/assert.cc:86
> #7  0x00007ffff7b39bb0 in Client::_flush (this=0x629090, in=0x630f50,
> onfinish=0x7fffe8000b60) at client/Client.cc:2190
> #8  0x00007ffff7b52426 in Client::handle_cap_grant (this=0x629090,
> in=0x630f50, mds=0, cap=0x62ee50, m=0x631fc0)
>     at client/Client.cc:2930
> #9  0x00007ffff7b52d2a in Client::handle_caps (this=0x629090,
> m=0x631fc0) at client/Client.cc:2711
> #10 0x00007ffff7b560a0 in Client::ms_dispatch (this=0x629090,
> m=0x631fc0) at client/Client.cc:1444
> #11 0x00007ffff7bcaff9 in Messenger::ms_deliver_dispatch
> (this=0x628350, m=0x631fc0) at msg/Messenger.h:98
> #12 0x00007ffff7bb2607 in SimpleMessenger::dispatch_entry
> (this=0x628350) at msg/SimpleMessenger.cc:352
> #13 0x00007ffff7b22641 in SimpleMessenger::DispatchThread::entry
> (this=0x6287d8) at msg/SimpleMessenger.h:533
> #14 0x00007ffff7b5aa04 in Thread::_entry_func (arg=0x6287d8) at
> ./common/Thread.h:41
> #15 0x00000036802064a7 in start_thread () from /lib64/libpthread.so.0
> #16 0x000000367fad3c2d in clone () from /lib64/libc.so.6
> ============================================================
> 
> In addition, when writing data for a long time, the write thread will
> hang on a pthread_cond_wait, as back trace 2 showed:
> ==========================  back trace 2  ==================================
> Thread 10 (Thread 0x43f87940 (LWP 19986)):
> #0  0x000000312be0ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x00007f147a8e763b in Cond::Wait (this=0x43f86b00,
> mutex=@0x66d790) at ./common/Cond.h:46
> #2  0x00007f147a86630d in Client::wait_on_list (this=0x66d430,
> ls=@0x674df0) at client/Client.cc:2140
> #3  0x00007f147a8771ff in Client::get_caps (this=0x66d430,
> in=0x6749d0, need=4096, want=8192, got=0x43f86e8c, endoff=235470848)
>     at client/Client.cc:1827
> #4  0x00007f147a886003 in Client::_write (this=0x66d430, f=0x671510,
> offset=235470336, size=512,
>     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> times>...) at client/Client.cc:5055
> #5  0x00007f147a886d73 in Client::write (this=0x66d430, fd=10,
>     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> times>..., size=512, offset=235470336) at client/Client.cc:5007
> #6  0x00007f147a85822d in ceph_write (fd=10, buf=0x6452b0 'ÿÿ' <repeats
> 33 times>, ".001", 'ÿÿ' <repeats 165 times>..., size=512,
>     offset=235470336) at libceph.cc:322
> #7  0x000000000042fa27 in TdcCephImpl::AsyncWrite (this=0x647c30,
> offset=235470336, length=512,
>     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> times>..., queue=@0x644b60, io=0x6456a0)
>     at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:191
> #8  0x000000000042ffc5 in TdcCephImpl::AsyncProcess (this=0x647c30,
> io=0x6456a0, queue=@0x644b60, netfd=5)
>     at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:268
> #9  0x00000000004192be in SubmitRequest_Intra (arg=0x644b00) at
> /opt/tsk/tdc-tapdisk/td-connector2.0/td_connector_server.cpp:390
> #10 0x000000312be064a7 in start_thread () from /lib64/libpthread.so.0
> #11 0x000000312b6d3c2d in clone () from /lib64/libc.so.6
> ============================================================
> 
> Seem not very safe with the patch.
> 
> So I applied this patch:
> diff --git a/src/client/Client.cc b/src/client/Client.cc
> index 7f7fb08..bf0997a 100644
> --- a/src/client/Client.cc
> +++ b/src/client/Client.cc
> @@ -4934,7 +4934,6 @@ public:
> 
>  void Client::sync_write_commit(Inode *in)
>  {
> -  client_lock.Lock();
> + int r = client_lock.Lock();
>   assert(unsafe_sync_write > 0);
>   unsafe_sync_write--;
> 
> @@ -4947,8 +4946,6 @@ void Client::sync_write_commit(Inode *in)
>   }
> 
>   put_inode(in);
> + if (r)
>   client_lock.Unlock();
>  }
> 
>  int Client::write(int fd, const char *buf, loff_t size, loff_t offset)
> 
> 
> After some test, back trace 1 didn't appear, but back trace 2 still appear.
> Hmm,, not safe also.
> 
> Thx!
> Simon
> 
> 
> 
> 2011/5/19 Sage Weil <sage@newdream.net>:
> > Hi Simon,
> >
> > On Wed, 18 May 2011, Simon Tian wrote:
> >
> >> Hi,
> >>
> >>     Could any one give me an answer?
> >>
> >>    My application need to open one file with two clients in different
> >> hosts. One for write/read and one for read only. The client could be
> >> developed based on libceph or librbd.
> >>    I tried librbd, exception appear too.
> >
> > I opened a bug for this, http://tracker.newdream.net/issues/1097.
> >
> > Can you try the patch below?  I this may just be something missed in a
> > locking rewrite way back when in a rare code path:
> >
> > Thanks!
> > sage
> >
> >
> >
> > diff --git a/src/client/Client.cc b/src/client/Client.cc
> > index 7f7fb08..bf0997a 100644
> > --- a/src/client/Client.cc
> > +++ b/src/client/Client.cc
> > @@ -4934,7 +4934,6 @@ public:
> >
> >  void Client::sync_write_commit(Inode *in)
> >  {
> > -  client_lock.Lock();
> >   assert(unsafe_sync_write > 0);
> >   unsafe_sync_write--;
> >
> > @@ -4947,8 +4946,6 @@ void Client::sync_write_commit(Inode *in)
> >   }
> >
> >   put_inode(in);
> > -
> > -  client_lock.Unlock();
> >  }
> >
> >  int Client::write(int fd, const char *buf, loff_t size, loff_t offset)
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
>
Sage Weil May 19, 2011, 5:20 p.m. UTC | #2
On Thu, 19 May 2011, Sage Weil wrote:
> On Thu, 19 May 2011, Simon Tian wrote:
> > Hi Sage,
> > 
> >     I've test this patch.
> > With 8 times test, only 1 (the first time) got a thread assert fail.As
> > back trace 1 showed:
> 
> Do you have a simple reproducer for this?

Nevermind, I'm easily triggering this with two cfuse mounts.

sage


> 
> sage
> 
> 
> > ==========================  back trace 1   ==================================
> > (gdb) bt
> > #0  0x000000367fa30265 in raise () from /lib64/libc.so.6
> > #1  0x000000367fa31d10 in abort () from /lib64/libc.so.6
> > #2  0x0000003682ebec44 in __gnu_cxx::__verbose_terminate_handler() ()
> > from /usr/lib64/libstdc++.so.6
> > #3  0x0000003682ebcdb6 in ?? () from /usr/lib64/libstdc++.so.6
> > #4  0x0000003682ebcde3 in std::terminate() () from /usr/lib64/libstdc++.so.6
> > #5  0x0000003682ebceca in __cxa_throw () from /usr/lib64/libstdc++.so.6
> > #6  0x00007ffff7c51a54 in ceph::__ceph_assert_fail
> > (assertion=0x7ffff7ce31f1 "in->cap_refs[(32 << 8)] == 1",
> >     file=0x7ffff7ce1431 "client/Client.cc", line=2190,
> > func=0x7ffff7ce5fc0 "void Client::_flush(Inode*, Context*)")
> >     at common/assert.cc:86
> > #7  0x00007ffff7b39bb0 in Client::_flush (this=0x629090, in=0x630f50,
> > onfinish=0x7fffe8000b60) at client/Client.cc:2190
> > #8  0x00007ffff7b52426 in Client::handle_cap_grant (this=0x629090,
> > in=0x630f50, mds=0, cap=0x62ee50, m=0x631fc0)
> >     at client/Client.cc:2930
> > #9  0x00007ffff7b52d2a in Client::handle_caps (this=0x629090,
> > m=0x631fc0) at client/Client.cc:2711
> > #10 0x00007ffff7b560a0 in Client::ms_dispatch (this=0x629090,
> > m=0x631fc0) at client/Client.cc:1444
> > #11 0x00007ffff7bcaff9 in Messenger::ms_deliver_dispatch
> > (this=0x628350, m=0x631fc0) at msg/Messenger.h:98
> > #12 0x00007ffff7bb2607 in SimpleMessenger::dispatch_entry
> > (this=0x628350) at msg/SimpleMessenger.cc:352
> > #13 0x00007ffff7b22641 in SimpleMessenger::DispatchThread::entry
> > (this=0x6287d8) at msg/SimpleMessenger.h:533
> > #14 0x00007ffff7b5aa04 in Thread::_entry_func (arg=0x6287d8) at
> > ./common/Thread.h:41
> > #15 0x00000036802064a7 in start_thread () from /lib64/libpthread.so.0
> > #16 0x000000367fad3c2d in clone () from /lib64/libc.so.6
> > ============================================================
> > 
> > In addition, when writing data for a long time, the write thread will
> > hang on a pthread_cond_wait, as back trace 2 showed:
> > ==========================  back trace 2  ==================================
> > Thread 10 (Thread 0x43f87940 (LWP 19986)):
> > #0  0x000000312be0ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > /lib64/libpthread.so.0
> > #1  0x00007f147a8e763b in Cond::Wait (this=0x43f86b00,
> > mutex=@0x66d790) at ./common/Cond.h:46
> > #2  0x00007f147a86630d in Client::wait_on_list (this=0x66d430,
> > ls=@0x674df0) at client/Client.cc:2140
> > #3  0x00007f147a8771ff in Client::get_caps (this=0x66d430,
> > in=0x6749d0, need=4096, want=8192, got=0x43f86e8c, endoff=235470848)
> >     at client/Client.cc:1827
> > #4  0x00007f147a886003 in Client::_write (this=0x66d430, f=0x671510,
> > offset=235470336, size=512,
> >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> > times>...) at client/Client.cc:5055
> > #5  0x00007f147a886d73 in Client::write (this=0x66d430, fd=10,
> >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> > times>..., size=512, offset=235470336) at client/Client.cc:5007
> > #6  0x00007f147a85822d in ceph_write (fd=10, buf=0x6452b0 'ÿÿ' <repeats
> > 33 times>, ".001", 'ÿÿ' <repeats 165 times>..., size=512,
> >     offset=235470336) at libceph.cc:322
> > #7  0x000000000042fa27 in TdcCephImpl::AsyncWrite (this=0x647c30,
> > offset=235470336, length=512,
> >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> > times>..., queue=@0x644b60, io=0x6456a0)
> >     at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:191
> > #8  0x000000000042ffc5 in TdcCephImpl::AsyncProcess (this=0x647c30,
> > io=0x6456a0, queue=@0x644b60, netfd=5)
> >     at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:268
> > #9  0x00000000004192be in SubmitRequest_Intra (arg=0x644b00) at
> > /opt/tsk/tdc-tapdisk/td-connector2.0/td_connector_server.cpp:390
> > #10 0x000000312be064a7 in start_thread () from /lib64/libpthread.so.0
> > #11 0x000000312b6d3c2d in clone () from /lib64/libc.so.6
> > ============================================================
> > 
> > Seem not very safe with the patch.
> > 
> > So I applied this patch:
> > diff --git a/src/client/Client.cc b/src/client/Client.cc
> > index 7f7fb08..bf0997a 100644
> > --- a/src/client/Client.cc
> > +++ b/src/client/Client.cc
> > @@ -4934,7 +4934,6 @@ public:
> > 
> >  void Client::sync_write_commit(Inode *in)
> >  {
> > -  client_lock.Lock();
> > + int r = client_lock.Lock();
> >   assert(unsafe_sync_write > 0);
> >   unsafe_sync_write--;
> > 
> > @@ -4947,8 +4946,6 @@ void Client::sync_write_commit(Inode *in)
> >   }
> > 
> >   put_inode(in);
> > + if (r)
> >   client_lock.Unlock();
> >  }
> > 
> >  int Client::write(int fd, const char *buf, loff_t size, loff_t offset)
> > 
> > 
> > After some test, back trace 1 didn't appear, but back trace 2 still appear.
> > Hmm,, not safe also.
> > 
> > Thx!
> > Simon
> > 
> > 
> > 
> > 2011/5/19 Sage Weil <sage@newdream.net>:
> > > Hi Simon,
> > >
> > > On Wed, 18 May 2011, Simon Tian wrote:
> > >
> > >> Hi,
> > >>
> > >>     Could any one give me an answer?
> > >>
> > >>    My application need to open one file with two clients in different
> > >> hosts. One for write/read and one for read only. The client could be
> > >> developed based on libceph or librbd.
> > >>    I tried librbd, exception appear too.
> > >
> > > I opened a bug for this, http://tracker.newdream.net/issues/1097.
> > >
> > > Can you try the patch below?  I this may just be something missed in a
> > > locking rewrite way back when in a rare code path:
> > >
> > > Thanks!
> > > sage
> > >
> > >
> > >
> > > diff --git a/src/client/Client.cc b/src/client/Client.cc
> > > index 7f7fb08..bf0997a 100644
> > > --- a/src/client/Client.cc
> > > +++ b/src/client/Client.cc
> > > @@ -4934,7 +4934,6 @@ public:
> > >
> > >  void Client::sync_write_commit(Inode *in)
> > >  {
> > > -  client_lock.Lock();
> > >   assert(unsafe_sync_write > 0);
> > >   unsafe_sync_write--;
> > >
> > > @@ -4947,8 +4946,6 @@ void Client::sync_write_commit(Inode *in)
> > >   }
> > >
> > >   put_inode(in);
> > > -
> > > -  client_lock.Unlock();
> > >  }
> > >
> > >  int Client::write(int fd, const char *buf, loff_t size, loff_t offset)
> > >
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> >
Sage Weil May 19, 2011, 9:45 p.m. UTC | #3
On Thu, 19 May 2011, Sage Weil wrote:
> On Thu, 19 May 2011, Sage Weil wrote:
> > On Thu, 19 May 2011, Simon Tian wrote:
> > > Hi Sage,
> > > 
> > >     I've test this patch.
> > > With 8 times test, only 1 (the first time) got a thread assert fail.As
> > > back trace 1 showed:
> > 
> > Do you have a simple reproducer for this?
> 
> Nevermind, I'm easily triggering this with two cfuse mounts.

Hi Simon,

I just pushed a series of fixes to this code to the master branch that 
fixes both of these crashes in my testing.  Can you try it out?

Thanks!
sage



> 
> sage
> 
> 
> > 
> > sage
> > 
> > 
> > > ==========================  back trace 1   ==================================
> > > (gdb) bt
> > > #0  0x000000367fa30265 in raise () from /lib64/libc.so.6
> > > #1  0x000000367fa31d10 in abort () from /lib64/libc.so.6
> > > #2  0x0000003682ebec44 in __gnu_cxx::__verbose_terminate_handler() ()
> > > from /usr/lib64/libstdc++.so.6
> > > #3  0x0000003682ebcdb6 in ?? () from /usr/lib64/libstdc++.so.6
> > > #4  0x0000003682ebcde3 in std::terminate() () from /usr/lib64/libstdc++.so.6
> > > #5  0x0000003682ebceca in __cxa_throw () from /usr/lib64/libstdc++.so.6
> > > #6  0x00007ffff7c51a54 in ceph::__ceph_assert_fail
> > > (assertion=0x7ffff7ce31f1 "in->cap_refs[(32 << 8)] == 1",
> > >     file=0x7ffff7ce1431 "client/Client.cc", line=2190,
> > > func=0x7ffff7ce5fc0 "void Client::_flush(Inode*, Context*)")
> > >     at common/assert.cc:86
> > > #7  0x00007ffff7b39bb0 in Client::_flush (this=0x629090, in=0x630f50,
> > > onfinish=0x7fffe8000b60) at client/Client.cc:2190
> > > #8  0x00007ffff7b52426 in Client::handle_cap_grant (this=0x629090,
> > > in=0x630f50, mds=0, cap=0x62ee50, m=0x631fc0)
> > >     at client/Client.cc:2930
> > > #9  0x00007ffff7b52d2a in Client::handle_caps (this=0x629090,
> > > m=0x631fc0) at client/Client.cc:2711
> > > #10 0x00007ffff7b560a0 in Client::ms_dispatch (this=0x629090,
> > > m=0x631fc0) at client/Client.cc:1444
> > > #11 0x00007ffff7bcaff9 in Messenger::ms_deliver_dispatch
> > > (this=0x628350, m=0x631fc0) at msg/Messenger.h:98
> > > #12 0x00007ffff7bb2607 in SimpleMessenger::dispatch_entry
> > > (this=0x628350) at msg/SimpleMessenger.cc:352
> > > #13 0x00007ffff7b22641 in SimpleMessenger::DispatchThread::entry
> > > (this=0x6287d8) at msg/SimpleMessenger.h:533
> > > #14 0x00007ffff7b5aa04 in Thread::_entry_func (arg=0x6287d8) at
> > > ./common/Thread.h:41
> > > #15 0x00000036802064a7 in start_thread () from /lib64/libpthread.so.0
> > > #16 0x000000367fad3c2d in clone () from /lib64/libc.so.6
> > > ============================================================
> > > 
> > > In addition, when writing data for a long time, the write thread will
> > > hang on a pthread_cond_wait, as back trace 2 showed:
> > > ==========================  back trace 2  ==================================
> > > Thread 10 (Thread 0x43f87940 (LWP 19986)):
> > > #0  0x000000312be0ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > > /lib64/libpthread.so.0
> > > #1  0x00007f147a8e763b in Cond::Wait (this=0x43f86b00,
> > > mutex=@0x66d790) at ./common/Cond.h:46
> > > #2  0x00007f147a86630d in Client::wait_on_list (this=0x66d430,
> > > ls=@0x674df0) at client/Client.cc:2140
> > > #3  0x00007f147a8771ff in Client::get_caps (this=0x66d430,
> > > in=0x6749d0, need=4096, want=8192, got=0x43f86e8c, endoff=235470848)
> > >     at client/Client.cc:1827
> > > #4  0x00007f147a886003 in Client::_write (this=0x66d430, f=0x671510,
> > > offset=235470336, size=512,
> > >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> > > times>...) at client/Client.cc:5055
> > > #5  0x00007f147a886d73 in Client::write (this=0x66d430, fd=10,
> > >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> > > times>..., size=512, offset=235470336) at client/Client.cc:5007
> > > #6  0x00007f147a85822d in ceph_write (fd=10, buf=0x6452b0 'ÿÿ' <repeats
> > > 33 times>, ".001", 'ÿÿ' <repeats 165 times>..., size=512,
> > >     offset=235470336) at libceph.cc:322
> > > #7  0x000000000042fa27 in TdcCephImpl::AsyncWrite (this=0x647c30,
> > > offset=235470336, length=512,
> > >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
> > > times>..., queue=@0x644b60, io=0x6456a0)
> > >     at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:191
> > > #8  0x000000000042ffc5 in TdcCephImpl::AsyncProcess (this=0x647c30,
> > > io=0x6456a0, queue=@0x644b60, netfd=5)
> > >     at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:268
> > > #9  0x00000000004192be in SubmitRequest_Intra (arg=0x644b00) at
> > > /opt/tsk/tdc-tapdisk/td-connector2.0/td_connector_server.cpp:390
> > > #10 0x000000312be064a7 in start_thread () from /lib64/libpthread.so.0
> > > #11 0x000000312b6d3c2d in clone () from /lib64/libc.so.6
> > > ============================================================
> > > 
> > > Seem not very safe with the patch.
> > > 
> > > So I applied this patch:
> > > diff --git a/src/client/Client.cc b/src/client/Client.cc
> > > index 7f7fb08..bf0997a 100644
> > > --- a/src/client/Client.cc
> > > +++ b/src/client/Client.cc
> > > @@ -4934,7 +4934,6 @@ public:
> > > 
> > >  void Client::sync_write_commit(Inode *in)
> > >  {
> > > -  client_lock.Lock();
> > > + int r = client_lock.Lock();
> > >   assert(unsafe_sync_write > 0);
> > >   unsafe_sync_write--;
> > > 
> > > @@ -4947,8 +4946,6 @@ void Client::sync_write_commit(Inode *in)
> > >   }
> > > 
> > >   put_inode(in);
> > > + if (r)
> > >   client_lock.Unlock();
> > >  }
> > > 
> > >  int Client::write(int fd, const char *buf, loff_t size, loff_t offset)
> > > 
> > > 
> > > After some test, back trace 1 didn't appear, but back trace 2 still appear.
> > > Hmm,, not safe also.
> > > 
> > > Thx!
> > > Simon
> > > 
> > > 
> > > 
> > > 2011/5/19 Sage Weil <sage@newdream.net>:
> > > > Hi Simon,
> > > >
> > > > On Wed, 18 May 2011, Simon Tian wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >>     Could any one give me an answer?
> > > >>
> > > >>    My application need to open one file with two clients in different
> > > >> hosts. One for write/read and one for read only. The client could be
> > > >> developed based on libceph or librbd.
> > > >>    I tried librbd, exception appear too.
> > > >
> > > > I opened a bug for this, http://tracker.newdream.net/issues/1097.
> > > >
> > > > Can you try the patch below?  I this may just be something missed in a
> > > > locking rewrite way back when in a rare code path:
> > > >
> > > > Thanks!
> > > > sage
> > > >
> > > >
> > > >
> > > > diff --git a/src/client/Client.cc b/src/client/Client.cc
> > > > index 7f7fb08..bf0997a 100644
> > > > --- a/src/client/Client.cc
> > > > +++ b/src/client/Client.cc
> > > > @@ -4934,7 +4934,6 @@ public:
> > > >
> > > >  void Client::sync_write_commit(Inode *in)
> > > >  {
> > > > -  client_lock.Lock();
> > > >   assert(unsafe_sync_write > 0);
> > > >   unsafe_sync_write--;
> > > >
> > > > @@ -4947,8 +4946,6 @@ void Client::sync_write_commit(Inode *in)
> > > >   }
> > > >
> > > >   put_inode(in);
> > > > -
> > > > -  client_lock.Unlock();
> > > >  }
> > > >
> > > >  int Client::write(int fd, const char *buf, loff_t size, loff_t offset)
> > > >
> > > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > >
Simon Tian May 20, 2011, 8:56 a.m. UTC | #4
OK,  I'll test the branch and give the result these days.

Thanks very much!

Simon

2011/5/20 Sage Weil <sage@newdream.net>:
> On Thu, 19 May 2011, Sage Weil wrote:
>> On Thu, 19 May 2011, Sage Weil wrote:
>> > On Thu, 19 May 2011, Simon Tian wrote:
>> > > Hi Sage,
>> > >
>> > >     I've test this patch.
>> > > With 8 times test, only 1 (the first time) got a thread assert fail.As
>> > > back trace 1 showed:
>> >
>> > Do you have a simple reproducer for this?
>>
>> Nevermind, I'm easily triggering this with two cfuse mounts.
>
> Hi Simon,
>
> I just pushed a series of fixes to this code to the master branch that
> fixes both of these crashes in my testing.  Can you try it out?
>
> Thanks!
> sage
>
>
>
>>
>> sage
>>
>>
>> >
>> > sage
>> >
>> >
>> > > ==========================  back trace 1   ==================================
>> > > (gdb) bt
>> > > #0  0x000000367fa30265 in raise () from /lib64/libc.so.6
>> > > #1  0x000000367fa31d10 in abort () from /lib64/libc.so.6
>> > > #2  0x0000003682ebec44 in __gnu_cxx::__verbose_terminate_handler() ()
>> > > from /usr/lib64/libstdc++.so.6
>> > > #3  0x0000003682ebcdb6 in ?? () from /usr/lib64/libstdc++.so.6
>> > > #4  0x0000003682ebcde3 in std::terminate() () from /usr/lib64/libstdc++.so.6
>> > > #5  0x0000003682ebceca in __cxa_throw () from /usr/lib64/libstdc++.so.6
>> > > #6  0x00007ffff7c51a54 in ceph::__ceph_assert_fail
>> > > (assertion=0x7ffff7ce31f1 "in->cap_refs[(32 << 8)] == 1",
>> > >     file=0x7ffff7ce1431 "client/Client.cc", line=2190,
>> > > func=0x7ffff7ce5fc0 "void Client::_flush(Inode*, Context*)")
>> > >     at common/assert.cc:86
>> > > #7  0x00007ffff7b39bb0 in Client::_flush (this=0x629090, in=0x630f50,
>> > > onfinish=0x7fffe8000b60) at client/Client.cc:2190
>> > > #8  0x00007ffff7b52426 in Client::handle_cap_grant (this=0x629090,
>> > > in=0x630f50, mds=0, cap=0x62ee50, m=0x631fc0)
>> > >     at client/Client.cc:2930
>> > > #9  0x00007ffff7b52d2a in Client::handle_caps (this=0x629090,
>> > > m=0x631fc0) at client/Client.cc:2711
>> > > #10 0x00007ffff7b560a0 in Client::ms_dispatch (this=0x629090,
>> > > m=0x631fc0) at client/Client.cc:1444
>> > > #11 0x00007ffff7bcaff9 in Messenger::ms_deliver_dispatch
>> > > (this=0x628350, m=0x631fc0) at msg/Messenger.h:98
>> > > #12 0x00007ffff7bb2607 in SimpleMessenger::dispatch_entry
>> > > (this=0x628350) at msg/SimpleMessenger.cc:352
>> > > #13 0x00007ffff7b22641 in SimpleMessenger::DispatchThread::entry
>> > > (this=0x6287d8) at msg/SimpleMessenger.h:533
>> > > #14 0x00007ffff7b5aa04 in Thread::_entry_func (arg=0x6287d8) at
>> > > ./common/Thread.h:41
>> > > #15 0x00000036802064a7 in start_thread () from /lib64/libpthread.so.0
>> > > #16 0x000000367fad3c2d in clone () from /lib64/libc.so.6
>> > > ============================================================
>> > >
>> > > In addition, when writing data for a long time, the write thread will
>> > > hang on a pthread_cond_wait, as back trace 2 showed:
>> > > ==========================  back trace 2  ==================================
>> > > Thread 10 (Thread 0x43f87940 (LWP 19986)):
>> > > #0  0x000000312be0ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> > > /lib64/libpthread.so.0
>> > > #1  0x00007f147a8e763b in Cond::Wait (this=0x43f86b00,
>> > > mutex=@0x66d790) at ./common/Cond.h:46
>> > > #2  0x00007f147a86630d in Client::wait_on_list (this=0x66d430,
>> > > ls=@0x674df0) at client/Client.cc:2140
>> > > #3  0x00007f147a8771ff in Client::get_caps (this=0x66d430,
>> > > in=0x6749d0, need=4096, want=8192, got=0x43f86e8c, endoff=235470848)
>> > >     at client/Client.cc:1827
>> > > #4  0x00007f147a886003 in Client::_write (this=0x66d430, f=0x671510,
>> > > offset=235470336, size=512,
>> > >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
>> > > times>...) at client/Client.cc:5055
>> > > #5  0x00007f147a886d73 in Client::write (this=0x66d430, fd=10,
>> > >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
>> > > times>..., size=512, offset=235470336) at client/Client.cc:5007
>> > > #6  0x00007f147a85822d in ceph_write (fd=10, buf=0x6452b0 'ÿÿ' <repeats
>> > > 33 times>, ".001", 'ÿÿ' <repeats 165 times>..., size=512,
>> > >     offset=235470336) at libceph.cc:322
>> > > #7  0x000000000042fa27 in TdcCephImpl::AsyncWrite (this=0x647c30,
>> > > offset=235470336, length=512,
>> > >     buf=0x6452b0 'ÿÿ' <repeats 33 times>, ".001", 'ÿÿ' <repeats 165
>> > > times>..., queue=@0x644b60, io=0x6456a0)
>> > >     at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:191
>> > > #8  0x000000000042ffc5 in TdcCephImpl::AsyncProcess (this=0x647c30,
>> > > io=0x6456a0, queue=@0x644b60, netfd=5)
>> > >     at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:268
>> > > #9  0x00000000004192be in SubmitRequest_Intra (arg=0x644b00) at
>> > > /opt/tsk/tdc-tapdisk/td-connector2.0/td_connector_server.cpp:390
>> > > #10 0x000000312be064a7 in start_thread () from /lib64/libpthread.so.0
>> > > #11 0x000000312b6d3c2d in clone () from /lib64/libc.so.6
>> > > ============================================================
>> > >
>> > > Seem not very safe with the patch.
>> > >
>> > > So I applied this patch:
>> > > diff --git a/src/client/Client.cc b/src/client/Client.cc
>> > > index 7f7fb08..bf0997a 100644
>> > > --- a/src/client/Client.cc
>> > > +++ b/src/client/Client.cc
>> > > @@ -4934,7 +4934,6 @@ public:
>> > >
>> > >  void Client::sync_write_commit(Inode *in)
>> > >  {
>> > > -  client_lock.Lock();
>> > > + int r = client_lock.Lock();
>> > >   assert(unsafe_sync_write > 0);
>> > >   unsafe_sync_write--;
>> > >
>> > > @@ -4947,8 +4946,6 @@ void Client::sync_write_commit(Inode *in)
>> > >   }
>> > >
>> > >   put_inode(in);
>> > > + if (r)
>> > >   client_lock.Unlock();
>> > >  }
>> > >
>> > >  int Client::write(int fd, const char *buf, loff_t size, loff_t offset)
>> > >
>> > >
>> > > After some test, back trace 1 didn't appear, but back trace 2 still appear.
>> > > Hmm,, not safe also.
>> > >
>> > > Thx!
>> > > Simon
>> > >
>> > >
>> > >
>> > > 2011/5/19 Sage Weil <sage@newdream.net>:
>> > > > Hi Simon,
>> > > >
>> > > > On Wed, 18 May 2011, Simon Tian wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >>     Could any one give me an answer?
>> > > >>
>> > > >>    My application need to open one file with two clients in different
>> > > >> hosts. One for write/read and one for read only. The client could be
>> > > >> developed based on libceph or librbd.
>> > > >>    I tried librbd, exception appear too.
>> > > >
>> > > > I opened a bug for this, http://tracker.newdream.net/issues/1097.
>> > > >
>> > > > Can you try the patch below?  I this may just be something missed in a
>> > > > locking rewrite way back when in a rare code path:
>> > > >
>> > > > Thanks!
>> > > > sage
>> > > >
>> > > >
>> > > >
>> > > > diff --git a/src/client/Client.cc b/src/client/Client.cc
>> > > > index 7f7fb08..bf0997a 100644
>> > > > --- a/src/client/Client.cc
>> > > > +++ b/src/client/Client.cc
>> > > > @@ -4934,7 +4934,6 @@ public:
>> > > >
>> > > >  void Client::sync_write_commit(Inode *in)
>> > > >  {
>> > > > -  client_lock.Lock();
>> > > >   assert(unsafe_sync_write > 0);
>> > > >   unsafe_sync_write--;
>> > > >
>> > > > @@ -4947,8 +4946,6 @@ void Client::sync_write_commit(Inode *in)
>> > > >   }
>> > > >
>> > > >   put_inode(in);
>> > > > -
>> > > > -  client_lock.Unlock();
>> > > >  }
>> > > >
>> > > >  int Client::write(int fd, const char *buf, loff_t size, loff_t offset)
>> > > >
>> > > >
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > > the body of a message to majordomo@vger.kernel.org
>> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > >
>> > >
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

==========================  back trace 1   ==================================
(gdb) bt
#0  0x000000367fa30265 in raise () from /lib64/libc.so.6
#1  0x000000367fa31d10 in abort () from /lib64/libc.so.6
#2  0x0000003682ebec44 in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib64/libstdc++.so.6
#3  0x0000003682ebcdb6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003682ebcde3 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003682ebceca in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00007ffff7c51a54 in ceph::__ceph_assert_fail
(assertion=0x7ffff7ce31f1 "in->cap_refs[(32 << 8)] == 1",
    file=0x7ffff7ce1431 "client/Client.cc", line=2190,
func=0x7ffff7ce5fc0 "void Client::_flush(Inode*, Context*)")
    at common/assert.cc:86
#7  0x00007ffff7b39bb0 in Client::_flush (this=0x629090, in=0x630f50,
onfinish=0x7fffe8000b60) at client/Client.cc:2190
#8  0x00007ffff7b52426 in Client::handle_cap_grant (this=0x629090,
in=0x630f50, mds=0, cap=0x62ee50, m=0x631fc0)
    at client/Client.cc:2930
#9  0x00007ffff7b52d2a in Client::handle_caps (this=0x629090,
m=0x631fc0) at client/Client.cc:2711
#10 0x00007ffff7b560a0 in Client::ms_dispatch (this=0x629090,
m=0x631fc0) at client/Client.cc:1444
#11 0x00007ffff7bcaff9 in Messenger::ms_deliver_dispatch
(this=0x628350, m=0x631fc0) at msg/Messenger.h:98
#12 0x00007ffff7bb2607 in SimpleMessenger::dispatch_entry
(this=0x628350) at msg/SimpleMessenger.cc:352
#13 0x00007ffff7b22641 in SimpleMessenger::DispatchThread::entry
(this=0x6287d8) at msg/SimpleMessenger.h:533
#14 0x00007ffff7b5aa04 in Thread::_entry_func (arg=0x6287d8) at
./common/Thread.h:41
#15 0x00000036802064a7 in start_thread () from /lib64/libpthread.so.0
#16 0x000000367fad3c2d in clone () from /lib64/libc.so.6
============================================================

In addition, when writing data for a long time, the write thread will
hang on a pthread_cond_wait, as back trace 2 showed:
==========================  back trace 2  ==================================
Thread 10 (Thread 0x43f87940 (LWP 19986)):
#0  0x000000312be0ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007f147a8e763b in Cond::Wait (this=0x43f86b00,
mutex=@0x66d790) at ./common/Cond.h:46
#2  0x00007f147a86630d in Client::wait_on_list (this=0x66d430,
ls=@0x674df0) at client/Client.cc:2140
#3  0x00007f147a8771ff in Client::get_caps (this=0x66d430,
in=0x6749d0, need=4096, want=8192, got=0x43f86e8c, endoff=235470848)
    at client/Client.cc:1827
#4  0x00007f147a886003 in Client::_write (this=0x66d430, f=0x671510,
offset=235470336, size=512,
    buf=0x6452b0 '?' <repeats 33 times>, ".001", '?' <repeats 165
times>...) at client/Client.cc:5055
#5  0x00007f147a886d73 in Client::write (this=0x66d430, fd=10,
    buf=0x6452b0 '?' <repeats 33 times>, ".001", '?' <repeats 165
times>..., size=512, offset=235470336) at client/Client.cc:5007
#6  0x00007f147a85822d in ceph_write (fd=10, buf=0x6452b0 '?' <repeats
33 times>, ".001", '?' <repeats 165 times>..., size=512,
    offset=235470336) at libceph.cc:322
#7  0x000000000042fa27 in TdcCephImpl::AsyncWrite (this=0x647c30,
offset=235470336, length=512,
    buf=0x6452b0 '?' <repeats 33 times>, ".001", '?' <repeats 165
times>..., queue=@0x644b60, io=0x6456a0)
    at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:191
#8  0x000000000042ffc5 in TdcCephImpl::AsyncProcess (this=0x647c30,
io=0x6456a0, queue=@0x644b60, netfd=5)
    at /opt/tsk/tdc-tapdisk/td-connector2.0/common/tdc_ceph_impl.cpp:268
#9  0x00000000004192be in SubmitRequest_Intra (arg=0x644b00) at
/opt/tsk/tdc-tapdisk/td-connector2.0/td_connector_server.cpp:390
#10 0x000000312be064a7 in start_thread () from /lib64/libpthread.so.0
#11 0x000000312b6d3c2d in clone () from /lib64/libc.so.6
============================================================

Seem not very safe with the patch.

So I applied this patch:
diff --git a/src/client/Client.cc b/src/client/Client.cc
index 7f7fb08..bf0997a 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -4934,7 +4934,6 @@  public:

 void Client::sync_write_commit(Inode *in)
 {
-  client_lock.Lock();
+ int r = client_lock.Lock();
  assert(unsafe_sync_write > 0);
  unsafe_sync_write--;

@@ -4947,8 +4946,6 @@  void Client::sync_write_commit(Inode *in)
  }

  put_inode(in);
+ if (r)
  client_lock.Unlock();
 }

 int Client::write(int fd, const char *buf, loff_t size, loff_t offset)