ceph: fix writeback thread waits on itself

Message ID 20180517032944.13230-1-zyan@redhat.com
State New

Commit Message

Yan, Zheng May 17, 2018, 3:29 a.m. UTC
In the case of -ENOSPC, the writeback thread may wait on itself. The call
stack looks like:

  inode_wait_for_writeback+0x26/0x40
  evict+0xb5/0x1a0
  iput+0x1d2/0x220
  ceph_put_wrbuffer_cap_refs+0xe0/0x2c0 [ceph]
  writepages_finish+0x2d3/0x410 [ceph]
  __complete_request+0x26/0x60 [libceph]
  complete_request+0x2e/0x70 [libceph]
  __submit_request+0x256/0x330 [libceph]
  submit_request+0x2b/0x30 [libceph]
  ceph_osdc_start_request+0x25/0x40 [libceph]
  ceph_writepages_start+0xdfe/0x1320 [ceph]
  do_writepages+0x1f/0x70
  __writeback_single_inode+0x45/0x330
  writeback_sb_inodes+0x26a/0x600
  __writeback_inodes_wb+0x92/0xc0
  wb_writeback+0x274/0x330
  wb_workfn+0x2d5/0x3b0

The fix is to make writepages_finish() not drop the inode's last reference.

Link: http://tracker.ceph.com/issues/23978
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c  | 21 +++++++++++++++++++++
 fs/ceph/inode.c | 12 ++++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

Comments

Ilya Dryomov May 17, 2018, 9:32 a.m. UTC | #1
On Thu, May 17, 2018 at 5:29 AM, Yan, Zheng <zyan@redhat.com> wrote:
> In the case of -ENOSPC, writeback thread may wait on itself. The call
> stack looks like:
>
>   inode_wait_for_writeback+0x26/0x40
>   evict+0xb5/0x1a0
>   iput+0x1d2/0x220
>   ceph_put_wrbuffer_cap_refs+0xe0/0x2c0 [ceph]
>   writepages_finish+0x2d3/0x410 [ceph]
>   __complete_request+0x26/0x60 [libceph]
>   complete_request+0x2e/0x70 [libceph]
>   __submit_request+0x256/0x330 [libceph]
>   submit_request+0x2b/0x30 [libceph]
>   ceph_osdc_start_request+0x25/0x40 [libceph]
>   ceph_writepages_start+0xdfe/0x1320 [ceph]
>   do_writepages+0x1f/0x70
>   __writeback_single_inode+0x45/0x330
>   writeback_sb_inodes+0x26a/0x600
>   __writeback_inodes_wb+0x92/0xc0
>   wb_writeback+0x274/0x330
>   wb_workfn+0x2d5/0x3b0

This is exactly what I was worried about when Jeff introduced the
possibility of complete_request() on the submit thread.  Do you think
this is the only such case or there may be others?

Another related issue is that normally ->r_callback is invoked
without any libceph locks held -- handle_reply() drops both osd->lock
and osdc->lock before calling __complete_request().  In this case it
is called with both of these locks held.

Given that umount -f will use the same mechanism, could you please
double check all fs/ceph callbacks?  I wonder if we should maybe do
something different in libceph...

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton May 17, 2018, 11:40 a.m. UTC | #2
On Thu, 2018-05-17 at 11:32 +0200, Ilya Dryomov wrote:
> On Thu, May 17, 2018 at 5:29 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > In the case of -ENOSPC, writeback thread may wait on itself. The call
> > stack looks like:
> > 
> >   inode_wait_for_writeback+0x26/0x40
> >   evict+0xb5/0x1a0
> >   iput+0x1d2/0x220
> >   ceph_put_wrbuffer_cap_refs+0xe0/0x2c0 [ceph]
> >   writepages_finish+0x2d3/0x410 [ceph]
> >   __complete_request+0x26/0x60 [libceph]
> >   complete_request+0x2e/0x70 [libceph]
> >   __submit_request+0x256/0x330 [libceph]
> >   submit_request+0x2b/0x30 [libceph]
> >   ceph_osdc_start_request+0x25/0x40 [libceph]
> >   ceph_writepages_start+0xdfe/0x1320 [ceph]
> >   do_writepages+0x1f/0x70
> >   __writeback_single_inode+0x45/0x330
> >   writeback_sb_inodes+0x26a/0x600
> >   __writeback_inodes_wb+0x92/0xc0
> >   wb_writeback+0x274/0x330
> >   wb_workfn+0x2d5/0x3b0
> 
> This is exactly what I was worried about when Jeff introduced the
> possibility of complete_request() on the submit thread.  Do you think
> this is the only such case or there may be others?
> 
> Another related issue is that normally ->r_callback is invoked
> without any libceph locks held -- handle_reply() drops both osd->lock
> and osdc->lock before calling __complete_request().  In this case it
> is called with both of these locks held.
> 

Not in the "fail_request" case. The lack of clear locking rules with
these callbacks makes it really difficult to suss out these problems.

> Given that umount -f will use the same mechanism, could you please
> double check all fs/ceph callbacks?  I wonder if we should maybe do
> something different in libceph...


Might a simpler fix be to just have __submit_request queue the
complete_request callback to a workqueue in the ENOSPC case? That should
be a rare thing in most cases.
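
The deferral suggested above can be sketched in userspace C. This is only a
single-threaded model, not libceph code: model_request, model_finisher,
model_submit() and model_finisher_drain() are hypothetical names invented for
illustration, and the queue stands in for a kernel workqueue. It shows the one
property under discussion: on an immediate error the submit path queues the
completion and returns, and the callback runs later from a different context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PENDING 8

/* Hypothetical stand-in for an OSD request; not the libceph structure. */
struct model_request {
	int result;                                  /* e.g. -28 for ENOSPC */
	void (*callback)(struct model_request *req); /* models ->r_callback */
	bool completed;
};

/* Hypothetical deferral queue: completions wait here until drained. */
struct model_finisher {
	struct model_request *pending[MODEL_MAX_PENDING];
	size_t head, tail;
};

/* Append a request to the queue; returns false if the queue is full. */
static bool model_finisher_queue(struct model_finisher *f,
				 struct model_request *req)
{
	if (f->tail - f->head == MODEL_MAX_PENDING)
		return false;
	f->pending[f->tail++ % MODEL_MAX_PENDING] = req;
	return true;
}

/*
 * Submit path: on an immediate error, record the result and defer the
 * callback instead of invoking it inline in the submitter's context.
 */
static bool model_submit(struct model_finisher *f,
			 struct model_request *req, int err)
{
	if (err) {
		req->result = err;
		return model_finisher_queue(f, req);
	}
	/* Normal path: the request is sent; completion arrives later. */
	return true;
}

/*
 * Would run in a worker context in a real system. Invokes every queued
 * callback and returns how many were run.
 */
static size_t model_finisher_drain(struct model_finisher *f)
{
	size_t n = 0;

	while (f->head != f->tail) {
		struct model_request *req =
			f->pending[f->head++ % MODEL_MAX_PENDING];

		req->callback(req);
		n++;
	}
	return n;
}

/* Example callback: just records that completion happened. */
static void model_mark_completed(struct model_request *req)
{
	req->completed = true;
}
```

With this shape the callback is free to block or to drop a last reference,
because it no longer runs on the stack of the thread that submitted the
request. The cost is an extra context switch on the error path.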
Ilya Dryomov May 17, 2018, 12:27 p.m. UTC | #3
On Thu, May 17, 2018 at 1:40 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 2018-05-17 at 11:32 +0200, Ilya Dryomov wrote:
>> On Thu, May 17, 2018 at 5:29 AM, Yan, Zheng <zyan@redhat.com> wrote:
>> > In the case of -ENOSPC, writeback thread may wait on itself. The call
>> > stack looks like:
>> >
>> >   inode_wait_for_writeback+0x26/0x40
>> >   evict+0xb5/0x1a0
>> >   iput+0x1d2/0x220
>> >   ceph_put_wrbuffer_cap_refs+0xe0/0x2c0 [ceph]
>> >   writepages_finish+0x2d3/0x410 [ceph]
>> >   __complete_request+0x26/0x60 [libceph]
>> >   complete_request+0x2e/0x70 [libceph]
>> >   __submit_request+0x256/0x330 [libceph]
>> >   submit_request+0x2b/0x30 [libceph]
>> >   ceph_osdc_start_request+0x25/0x40 [libceph]
>> >   ceph_writepages_start+0xdfe/0x1320 [ceph]
>> >   do_writepages+0x1f/0x70
>> >   __writeback_single_inode+0x45/0x330
>> >   writeback_sb_inodes+0x26a/0x600
>> >   __writeback_inodes_wb+0x92/0xc0
>> >   wb_writeback+0x274/0x330
>> >   wb_workfn+0x2d5/0x3b0
>>
>> This is exactly what I was worried about when Jeff introduced the
>> possibility of complete_request() on the submit thread.  Do you think
>> this is the only such case or there may be others?
>>
>> Another related issue is that normally ->r_callback is invoked
>> without any libceph locks held -- handle_reply() drops both osd->lock
>> and osdc->lock before calling __complete_request().  In this case it
>> is called with both of these locks held.
>>
>
> Not in the "fail_request" case. The lack of clear locking rules with
> these callbacks makes it really difficult to suss out these problems.

Yeah, it was (is?) pretty much the same with Objecter in userspace.
The locking issue is old and I guess we have learned to be careful
there.  Calling the callback from the submit thread is new.

>
>> Given that umount -f will use the same mechanism, could you please
>> double check all fs/ceph callbacks?  I wonder if we should maybe do
>> something different in libceph...
>
> Might a simpler fix be to just have __submit_request queue the
> complete_request callback to a workqueue in the ENOSPC case? That should
> be a rare thing in most cases.

That was my thought as well, but it needs to be justified and this
stack trace is actually a bad example.  In the common case the callback
is invoked by the messenger, so blocking is undesirable.  Blocking on
writeback is particularly so -- unless I'm misunderstanding something,
that can deadlock even under normal conditions.

Thanks,

                Ilya
Yan, Zheng May 17, 2018, 2:07 p.m. UTC | #4
> On May 17, 2018, at 20:27, Ilya Dryomov <idryomov@gmail.com> wrote:
> 
> On Thu, May 17, 2018 at 1:40 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> On Thu, 2018-05-17 at 11:32 +0200, Ilya Dryomov wrote:
>>> On Thu, May 17, 2018 at 5:29 AM, Yan, Zheng <zyan@redhat.com> wrote:
>>>> In the case of -ENOSPC, writeback thread may wait on itself. The call
>>>> stack looks like:
>>>> 
>>>> inode_wait_for_writeback+0x26/0x40
>>>> evict+0xb5/0x1a0
>>>> iput+0x1d2/0x220
>>>> ceph_put_wrbuffer_cap_refs+0xe0/0x2c0 [ceph]
>>>> writepages_finish+0x2d3/0x410 [ceph]
>>>> __complete_request+0x26/0x60 [libceph]
>>>> complete_request+0x2e/0x70 [libceph]
>>>> __submit_request+0x256/0x330 [libceph]
>>>> submit_request+0x2b/0x30 [libceph]
>>>> ceph_osdc_start_request+0x25/0x40 [libceph]
>>>> ceph_writepages_start+0xdfe/0x1320 [ceph]
>>>> do_writepages+0x1f/0x70
>>>> __writeback_single_inode+0x45/0x330
>>>> writeback_sb_inodes+0x26a/0x600
>>>> __writeback_inodes_wb+0x92/0xc0
>>>> wb_writeback+0x274/0x330
>>>> wb_workfn+0x2d5/0x3b0
>>> 
>>> This is exactly what I was worried about when Jeff introduced the
>>> possibility of complete_request() on the submit thread.  Do you think
>>> this is the only such case or there may be others?
>>> 
>>> Another related issue is that normally ->r_callback is invoked
>>> without any libceph locks held -- handle_reply() drops both osd->lock
>>> and osdc->lock before calling __complete_request().  In this case it
>>> is called with both of these locks held.
>>> 
>> 
>> Not in the "fail_request" case. The lack of clear locking rules with
>> these callbacks makes it really difficult to suss out these problems.
> 
> Yeah, it was (is?) pretty much the same with Objecter in userspace.
> The locking issue is old and I guess we have learned to be careful
> there.  Calling the callback from the submit thread is new.
> 
>> 
>>> Given that umount -f will use the same mechanism, could you please
>>> double check all fs/ceph callbacks?  I wonder if we should maybe do
>>> something different in libceph...
>> 
>> Might a simpler fix be to just have __submit_request queue the
>> complete_request callback to a workqueue in the ENOSPC case? That should
>> be a rare thing in most cases.
> 
> That was my thought as well, but it needs to be justified and this
> stack trace is actually a bad example.  In the common case the callback
> is invoked by the messenger, so blocking is undesirable.  Blocking on
> writeback is particularly so -- unless I'm misunderstanding something,
> that can deadlock even under normal conditions.

It can’t happen under normal conditions. writepages_finish() drops the inode’s last reference only when there are no more dirty/writeback pages.  Writeback should already be done or be about to finish.

Regards
Yan, Zheng

> 
> Thanks,
> 
>              Ilya

Ilya Dryomov May 18, 2018, 5:05 p.m. UTC | #5
On Thu, May 17, 2018 at 4:07 PM, Yan, Zheng <zyan@redhat.com> wrote:
>
>
>> On May 17, 2018, at 20:27, Ilya Dryomov <idryomov@gmail.com> wrote:
>>
>> On Thu, May 17, 2018 at 1:40 PM, Jeff Layton <jlayton@redhat.com> wrote:
>>> On Thu, 2018-05-17 at 11:32 +0200, Ilya Dryomov wrote:
>>>> On Thu, May 17, 2018 at 5:29 AM, Yan, Zheng <zyan@redhat.com> wrote:
>>>>> In the case of -ENOSPC, writeback thread may wait on itself. The call
>>>>> stack looks like:
>>>>>
>>>>> inode_wait_for_writeback+0x26/0x40
>>>>> evict+0xb5/0x1a0
>>>>> iput+0x1d2/0x220
>>>>> ceph_put_wrbuffer_cap_refs+0xe0/0x2c0 [ceph]
>>>>> writepages_finish+0x2d3/0x410 [ceph]
>>>>> __complete_request+0x26/0x60 [libceph]
>>>>> complete_request+0x2e/0x70 [libceph]
>>>>> __submit_request+0x256/0x330 [libceph]
>>>>> submit_request+0x2b/0x30 [libceph]
>>>>> ceph_osdc_start_request+0x25/0x40 [libceph]
>>>>> ceph_writepages_start+0xdfe/0x1320 [ceph]
>>>>> do_writepages+0x1f/0x70
>>>>> __writeback_single_inode+0x45/0x330
>>>>> writeback_sb_inodes+0x26a/0x600
>>>>> __writeback_inodes_wb+0x92/0xc0
>>>>> wb_writeback+0x274/0x330
>>>>> wb_workfn+0x2d5/0x3b0
>>>>
>>>> This is exactly what I was worried about when Jeff introduced the
>>>> possibility of complete_request() on the submit thread.  Do you think
>>>> this is the only such case or there may be others?
>>>>
>>>> Another related issue is that normally ->r_callback is invoked
>>>> without any libceph locks held -- handle_reply() drops both osd->lock
>>>> and osdc->lock before calling __complete_request().  In this case it
>>>> is called with both of these locks held.
>>>>
>>>
>>> Not in the "fail_request" case. The lack of clear locking rules with
>>> these callbacks makes it really difficult to suss out these problems.
>>
>> Yeah, it was (is?) pretty much the same with Objecter in userspace.
>> The locking issue is old and I guess we have learned to be careful
>> there.  Calling the callback from the submit thread is new.
>>
>>>
>>>> Given that umount -f will use the same mechanism, could you please
>>>> double check all fs/ceph callbacks?  I wonder if we should maybe do
>>>> something different in libceph...
>>>
>>> Might a simpler fix be to just have __submit_request queue the
>>> complete_request callback to a workqueue in the ENOSPC case? That should
>>> be a rare thing in most cases.
>>
>> That was my thought as well, but it needs to be justified and this
>> stack trace is actually a bad example.  In the common case the callback
>> is invoked by the messenger, so blocking is undesirable.  Blocking on
>> writeback is particularly so -- unless I'm misunderstanding something,
>> that can deadlock even under normal conditions.
>
> It can’t happen under normal conditions. writepages_finish() drops the inode’s last reference only when there are no more dirty/writeback pages.  Writeback should already be done or be about to finish.

I see, so at most it will wait for the writeback thread to get to
inode_sync_complete()?

While this patch isn't too ugly, I'm leaning towards adding a finisher
for all complete_request() special cases.  I'm not convinced this is the
only problematic site in fs/ceph and there is a patch pending that makes
blocking optional in rbd, so the space is about to grow.

I have put this on my list for next week.

Thanks,

                Ilya

Patch

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 5f7ad3d0df2e..9db2f4108951 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -772,6 +772,17 @@  static void writepages_finish(struct ceph_osd_request *req)
 		ceph_release_pages(osd_data->pages, num_pages);
 	}
 
+	if (rc < 0 && total_pages) {
+		/*
+		 * In the case of error, this function may directly get
+		 * called by the thread that does writeback. The writeback
+		 * thread should not drop inode's last reference. Otherwise
+		 * iput_final() may call inode_wait_for_writeback(), which
+		 * waits on writeback.
+		 */
+		ihold(inode);
+	}
+
 	ceph_put_wrbuffer_cap_refs(ci, total_pages, snapc);
 
 	osd_data = osd_req_op_extent_osd_data(req, 0);
@@ -781,6 +792,16 @@  static void writepages_finish(struct ceph_osd_request *req)
 	else
 		kfree(osd_data->pages);
 	ceph_osdc_put_request(req);
+
+	if (rc < 0 && total_pages) {
+		for (;;) {
+			if (atomic_add_unless(&inode->i_count, -1, 1))
+				break;
+			/* let writeback work drop the last reference */
+			if (queue_work(fsc->wb_wq, &ci->i_wb_work))
+				break;
+		}
+	}
 }
 
 /*
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index df3875fdfa41..aa7c5a4ff137 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1752,9 +1752,17 @@  static void ceph_writeback_work(struct work_struct *work)
 	struct ceph_inode_info *ci = container_of(work, struct ceph_inode_info,
 						  i_wb_work);
 	struct inode *inode = &ci->vfs_inode;
+	int wrbuffer_refs;
+
+	spin_lock(&ci->i_ceph_lock);
+	wrbuffer_refs = ci->i_wrbuffer_ref;
+	spin_unlock(&ci->i_ceph_lock);
+
+	if (wrbuffer_refs) {
+		dout("writeback %p\n", inode);
+		filemap_fdatawrite(&inode->i_data);
+	}
 
-	dout("writeback %p\n", inode);
-	filemap_fdatawrite(&inode->i_data);
 	iput(inode);
 }
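
The loop added to writepages_finish() in the patch above relies on
atomic_add_unless() to drop a reference only when it is not the last one, and
falls back to queueing work otherwise. The following is a minimal userspace
model of that pattern using C11 atomics. Everything prefixed with model_ is a
hypothetical stand-in written for illustration, not a kernel API, and the model
is single-threaded, so it does not exercise the race that the retry loop is
there to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace stand-in for the kernel's atomic_add_unless(): add `a` to *v
 * unless *v == u. Returns true if the add was performed.
 */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int cur = atomic_load(v);

	while (cur != u) {
		/* On failure, cur is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

/* Hypothetical inode model: a refcount plus a "work pending" flag. */
struct model_inode {
	atomic_int i_count;
	bool wb_work_queued;
};

/* Models queue_work(): returns false if the work was already pending. */
static bool model_queue_work(struct model_inode *inode)
{
	if (inode->wb_work_queued)
		return false;
	inode->wb_work_queued = true;
	return true;
}

/*
 * Drop one reference without ever being the caller that drops the last
 * one. Returns true if the reference was dropped here, false if the put
 * was handed off to the queued work item.
 *
 * In this single-threaded model, do not call this with i_count == 1
 * while the work is already queued: nothing else can change the state,
 * so the loop would spin.
 */
static bool model_put_not_last(struct model_inode *inode)
{
	for (;;) {
		if (model_add_unless(&inode->i_count, -1, 1))
			return true;
		if (model_queue_work(inode))
			return false;
	}
}
```

When the count is 2 the reference is dropped directly. When the count is 1 the
count is left untouched and the work item is queued, so the final put happens
in the worker's context rather than in the caller's.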