[RFC] cifs: Fix possible deadlock with cifs and work queues

Message ID 20140319151252.16ed3ac6@gandalf.local.home (mailing list archive)
State New, archived

Commit Message

Steven Rostedt March 19, 2014, 7:12 p.m. UTC
We just had a customer report a deadlock in their 3.8-rt kernel.
Looking into this, it is very possible to have the same deadlock in
mainline in Linus's tree as it stands today.

It is much easier to deadlock in the -rt kernel because reader locks
are serialized there, so a down_read() can block another down_read(). But
even in mainline, because rwsems are fair locks, a new reader will block
if a writer is waiting. This means that any deadlock that is possible
between two readers can also be triggered by a writer: if a writer comes
along and blocks on a current reader, it prevents new readers from
running, and if the reader that owns the lock is waiting to be woken by
one of those new readers, you have your deadlock.
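
In the abstract, the inversion looks like this (an illustrative sketch
using generic kernel primitives, not actual cifs code):

	static DECLARE_RWSEM(sem);
	static DECLARE_COMPLETION(event);

	static void task_a(void)	/* the reader that owns the lock */
	{
		down_read(&sem);
		wait_for_completion(&event);	/* waits for task_c */
		up_read(&sem);
	}

	static void task_b(void)	/* any writer */
	{
		down_write(&sem);	/* blocks on task_a's read lock */
		up_write(&sem);
	}

	static void task_c(void)	/* would wake task_a */
	{
		down_read(&sem);	/* fairness: queued behind task_b */
		complete(&event);	/* never reached: deadlock */
		up_read(&sem);
	}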

Here's the situation with CIFS and workqueues:

The cifs system has several workqueues used in file.c and other places.
One of them is used for completion of a read and to release the
page_lock which wakes up the reader. There are several other workqueues
that do various other tasks.

A holder of the reader lock can sleep on lock_page() and expect the
reader workqueue to wake it up (unlock_page()). The reader workqueue
takes no locks, so this does not seem to be a problem (but it is).

The other work items can take the rwsem for read or for write. The
issue we tripped over was one grabbing it for read (remember, in -rt
readers are serialized). But this can also happen in mainline if a
separate writer is waiting on the lock, as that would cause a reader to
block on another reader too.

All the workqueue callbacks are executed on the same workqueue:

	queue_work(cifsiod_wq, &rdata->work);
	[...]
	queue_work(cifsiod_wq, &cfile->oplock_break);

Now if the read completion callback is queued after one of these work
items that can take the rwsem, we can hit a deadlock. The workqueue code
looks like it should be able to prevent deadlocks of this kind, but I do
not fully understand how it manages scheduled work; perhaps if the
kworker thread blocks hard, it won't move the pending works to another
worker.

Here's what we see:

	rdata->work is scheduled after cfile->oplock_break

	CPU0						CPU1
	----						----

  do_sync_read()
    cifs_strict_readv()
      down_read(cinode->lock_sem);
      generic_file_aio_read()
        __lock_page_killable()
          __wait_on_bit_lock()

       * BLOCKED *

						process_one_work()
						  cifs_oplock_break()
						    cifs_has_mand_locks()
						      down_read(cinode->lock_sem);

						   * BLOCKED *

					      [ note, cifs_oplock_break() can
					        also call cifs_push_locks which takes
					        the lock with down_write() ]


We noticed that rdata->work was queued to run under the same workqueue
task, and that work is what wakes up the owner of the semaphore. But
because the workqueue task is blocked waiting on that lock, the owner
will never be woken up.

My question to Tejun is, if we create another workqueue, to add the
rdata->work to, would that prevent the above problem? Or what other
fixes can we do?

This is only compile tested, we have not given it to our customer yet.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


Comments

Peter Zijlstra March 19, 2014, 7:34 p.m. UTC | #1
On Wed, Mar 19, 2014 at 03:12:52PM -0400, Steven Rostedt wrote:
> My question to Tejun is, if we create another workqueue, to add the
> rdata->work to, would that prevent the above problem? Or what other
> fixes can we do?

The way I understand workqueues is that we cannot guarantee concurrency
like this. It tries, but there's no guarantee.

WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
given 511 other blocked works, the described problem will always happen.

Creating another workqueue doesn't actually create more threads.

There is the kthread_work stuff for if you want a guaranteed worker
thread.
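
A minimal sketch of that API (names as in the 3.x-era
include/linux/kthread.h; treat it as illustrative, the helpers have been
renamed in later kernels):

	#include <linux/kthread.h>

	static struct kthread_worker worker;
	static struct kthread_work work;

	static void work_fn(struct kthread_work *w)
	{
		/* always runs in the dedicated thread created below,
		 * never on a shared kworker */
	}

	static int example_init(void)
	{
		struct task_struct *task;

		init_kthread_worker(&worker);
		init_kthread_work(&work, work_fn);

		/* dedicated thread that services only this worker */
		task = kthread_run(kthread_worker_fn, &worker, "example");
		if (IS_ERR(task))
			return PTR_ERR(task);

		queue_kthread_work(&worker, &work);
		return 0;
	}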
Steven Rostedt March 19, 2014, 7:43 p.m. UTC | #2
On Wed, 19 Mar 2014 20:34:07 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Mar 19, 2014 at 03:12:52PM -0400, Steven Rostedt wrote:
> > My question to Tejun is, if we create another workqueue, to add the
> > rdata->work to, would that prevent the above problem? Or what other
> > fixes can we do?
> 
> The way I understand workqueues is that we cannot guarantee concurrency
> like this. It tries, but there's no guarantee.
> 
> WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
> given 511 other blocked works, the described problem will always happen.
> 
> Creating another workqueue doesn't actually create more threads.

But I noticed this:

 Before patch:

# ps aux |grep cifs
root      3119  0.0  0.0      0     0 ?        S<   14:17   0:00 [cifsiod]

 After patch:

# ps aux |grep cifs
root      1109  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiod]
root      1111  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiord]

It looks to me that it does create new threads.

-- Steve


> 
> There is the kthread_work stuff for if you want a guaranteed worker
> thread.

Steven Rostedt March 19, 2014, 7:46 p.m. UTC | #3
On Wed, 19 Mar 2014 15:43:39 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 19 Mar 2014 20:34:07 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Mar 19, 2014 at 03:12:52PM -0400, Steven Rostedt wrote:
> > > My question to Tejun is, if we create another workqueue, to add the
> > > rdata->work to, would that prevent the above problem? Or what other
> > > fixes can we do?
> > 
> > The way I understand workqueues is that we cannot guarantee concurrency
> > like this. It tries, but there's no guarantee.
> > 
> > WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
> > given 511 other blocked works, the described problem will always happen.
> > 
> > Creating another workqueue doesn't actually create more threads.
> 
> But I noticed this:
> 
>  Before patch:
> 
> # ps aux |grep cifs
> root      3119  0.0  0.0      0     0 ?        S<   14:17   0:00 [cifsiod]
> 
>  After patch:
> 
> # ps aux |grep cifs
> root      1109  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiod]
> root      1111  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiord]
> 
> It looks to me that it does create new threads.
> 

Or is that just the rescuer thread? I can rewrite the patch to use
kthread_work instead too.

-- Steve
Peter Zijlstra March 19, 2014, 7:47 p.m. UTC | #4
On Wed, Mar 19, 2014 at 03:43:39PM -0400, Steven Rostedt wrote:
> On Wed, 19 Mar 2014 20:34:07 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Mar 19, 2014 at 03:12:52PM -0400, Steven Rostedt wrote:
> > > My question to Tejun is, if we create another workqueue, to add the
> > > rdata->work to, would that prevent the above problem? Or what other
> > > fixes can we do?
> > 
> > The way I understand workqueues is that we cannot guarantee concurrency
> > like this. It tries, but there's no guarantee.
> > 
> > WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
> > given 511 other blocked works, the described problem will always happen.
> > 
> > Creating another workqueue doesn't actually create more threads.
> 
> But I noticed this:
> 
>  Before patch:
> 
> # ps aux |grep cifs
> root      3119  0.0  0.0      0     0 ?        S<   14:17   0:00 [cifsiod]
> 
>  After patch:
> 
> # ps aux |grep cifs
> root      1109  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiod]
> root      1111  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiord]
> 
> It looks to me that it does create new threads.

Ah, I think that's because of the MEM_RECLAIM; I'm not sure if that
thread will normally participate in running works. It's been a long time
since I looked at any of that.

Let's wait for TJ to wake up.
Tejun Heo March 19, 2014, 8:28 p.m. UTC | #5
Hello, Steven, Peter.

On Wed, Mar 19, 2014 at 08:34:07PM +0100, Peter Zijlstra wrote:
> The way I understand workqueues is that we cannot guarantee concurrency
> like this. It tries, but there's no guarantee.

So, the guarantee is that if a workqueue has WQ_MEM_RECLAIM, it'll
always have at least one worker thread working on it, so workqueues
which may be depended upon during memory reclaim should have the flag
set and must not require more than a single level of concurrency to
make forward progress.  Workqueues w/o WQ_MEM_RECLAIM set depend on the
fact that memory will eventually be reclaimed and that enough workers
to make forward progress will become available.

> WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
> given 511 other blocked works, the described problem will always happen.

That is actually a per-workqueue limit, and the workqueue core will try
to create as many workers as needed to satisfy the demanded
concurrency.  i.e. having two workqueues with the same max_active means
that the total number of workers may reach 2 * max_active; however,
this is not a guarantee.  If the system is under memory pressure and
the workqueues don't have WQ_MEM_RECLAIM set, they may not get any
concurrency until more memory is made available.

> Creating another workqueue doesn't actually create more threads.

It looks like the issue Steven is describing is caused by having a
dependency chain longer than 1 through an rwsem in a MEM_RECLAIM
workqueue.  Moving the read completion work items to a separate
workqueue breaks the r-w-r chain and ensures that forward progress can
be made with a single level of concurrency on both workqueues, so,
yeah, it looks like the correct fix to me.  It is scarily subtle tho,
and quite likely to be present in other code paths too. :(
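
Put concretely (a sketch of that guarantee, using the names from the
patch below):

	/*
	 * A single WQ_MEM_RECLAIM workqueue guarantees only one
	 * execution context.  With everything on cifsiod_wq the chain
	 * can become:
	 *
	 *   [oplock_break work]  blocks in down_read(lock_sem)
	 *   [rdata->work]        never runs, though it would have
	 *                        unblocked the reader holding lock_sem
	 *
	 * With two WQ_MEM_RECLAIM workqueues, each side is guaranteed
	 * its own forward progress:
	 */
	cifsiod_wq  = alloc_workqueue("cifsiod",
				      WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
	cifsiord_wq = alloc_workqueue("cifsiord",
				      WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);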

Thanks.
Jeff Layton March 20, 2014, 7:28 p.m. UTC | #6
On Wed, 19 Mar 2014 15:12:52 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> We just had a customer report a deadlock in their 3.8-rt kernel.
> Looking into this, it is very possible to have the same deadlock in
> mainline in Linus's tree as it stands today.
> 
> It is much easier to deadlock in the -rt kernel because reader locks
> are serialized there, so a down_read() can block another down_read(). But
> even in mainline, because rwsems are fair locks, a new reader will block
> if a writer is waiting. This means that any deadlock that is possible
> between two readers can also be triggered by a writer: if a writer comes
> along and blocks on a current reader, it prevents new readers from
> running, and if the reader that owns the lock is waiting to be woken by
> one of those new readers, you have your deadlock.
> 
> Here's the situation with CIFS and workqueues:
> 
> The cifs system has several workqueues used in file.c and other
> places. One of them is used for completion of a read and to release
> the page_lock which wakes up the reader. There are several other
> workqueues that do various other tasks.
> 
> A holder of the reader lock can sleep on lock_page() and expect the
> reader workqueue to wake it up (unlock_page()). The reader workqueue
> takes no locks, so this does not seem to be a problem (but it is).
> 
> The other work items can take the rwsem for read or for write. The
> issue we tripped over was one grabbing it for read (remember, in -rt
> readers are serialized). But this can also happen in mainline if a
> separate writer is waiting on the lock, as that would cause a reader to
> block on another reader too.
> 
> All the workqueue callbacks are executed on the same workqueue:
> 
> 	queue_work(cifsiod_wq, &rdata->work);
> 	[...]
> 	queue_work(cifsiod_wq, &cfile->oplock_break);
> 
> Now if the read completion callback is queued after one of these work
> items that can take the rwsem, we can hit a deadlock. The workqueue code
> looks like it should be able to prevent deadlocks of this kind, but I do
> not fully understand how it manages scheduled work; perhaps if the
> kworker thread blocks hard, it won't move the pending works to another
> worker.
> 
> Here's what we see:
> 
> 	rdata->work is scheduled after cfile->oplock_break
> 
> 	CPU0						CPU1
> 	----						----
> 
>   do_sync_read()
>     cifs_strict_readv()
>       down_read(cinode->lock_sem);
>       generic_file_aio_read()
>         __lock_page_killable()
>           __wait_on_bit_lock()
> 
>        * BLOCKED *
> 
> 						process_one_work()
> 						  cifs_oplock_break()
> 						    cifs_has_mand_locks()
> 						      down_read(cinode->lock_sem);
> 
> 						   * BLOCKED *
> 
> 					      [ note, cifs_oplock_break() can
> 					        also call cifs_push_locks which takes
> 					        the lock with down_write() ]
> 
> 
> We noticed that rdata->work was queued to run under the same workqueue
> task, and that work is what wakes up the owner of the semaphore. But
> because the workqueue task is blocked waiting on that lock, the owner
> will never be woken up.
> 
> My question to Tejun is, if we create another workqueue, to add the
> rdata->work to, would that prevent the above problem? Or what other
> fixes can we do?
> 
> This is only compile tested, we have not given it to our customer
> yet.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 

Cc'ing Pavel since he wrote most of the lock pushing code...

Nice analysis! I think eventually we'll need to overhaul this code not
to use rw semaphores, but that's going to take some redesign. (Wonder
if we could change it to use seqlocks or something?)

Out of curiosity, does this eventually time out and unwedge itself?
Usually when the server doesn't get a response to an oplock break in
around a minute or so it gives up and allows the thing that caused the
oplock break to proceed anyway. Not great for performance, but it ought
to eventually make progress due to that.

In any case, this looks like a reasonable fix for now, but I suspect you
can hit similar problems in the write codepath too. What may be best is
to turn this around and queue the oplock break to the new workqueue
instead of the read completion job.
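
Something along these lines (an untested sketch; "cifsoplockd" is a
made-up name, and the queue_work() call stands in for wherever
&cfile->oplock_break gets queued today):

	/* a queue dedicated to oplock break processing */
	struct workqueue_struct *cifsoplockd_wq;

	cifsoplockd_wq = alloc_workqueue("cifsoplockd",
					 WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);

	/* ...and at the point where the break is dispatched: */
	queue_work(cifsoplockd_wq, &cfile->oplock_break);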

> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 849f613..6656058 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -86,6 +86,7 @@ extern mempool_t *cifs_req_poolp;
>  extern mempool_t *cifs_mid_poolp;
>  
>  struct workqueue_struct	*cifsiod_wq;
> +struct workqueue_struct	*cifsiord_wq;
>  
>  #ifdef CONFIG_CIFS_SMB2
>  __u8 cifs_client_guid[SMB2_CLIENT_GUID_SIZE];
> @@ -1199,9 +1200,15 @@ init_cifs(void)
>  		goto out_clean_proc;
>  	}
>  
> +	cifsiord_wq = alloc_workqueue("cifsiord",
> WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
> +	if (!cifsiord_wq) {
> +		rc = -ENOMEM;
> +		goto out_destroy_wq;
> +	}
> +
>  	rc = cifs_fscache_register();
>  	if (rc)
> -		goto out_destroy_wq;
> +		goto out_destroy_rwq;
>  
>  	rc = cifs_init_inodecache();
>  	if (rc)
> @@ -1249,6 +1256,8 @@ out_destroy_inodecache:
>  	cifs_destroy_inodecache();
>  out_unreg_fscache:
>  	cifs_fscache_unregister();
> +out_destroy_rwq:
> +	destroy_workqueue(cifsiord_wq);
>  out_destroy_wq:
>  	destroy_workqueue(cifsiod_wq);
>  out_clean_proc:
> @@ -1273,6 +1282,7 @@ exit_cifs(void)
>  	cifs_destroy_inodecache();
>  	cifs_fscache_unregister();
>  	destroy_workqueue(cifsiod_wq);
> +	destroy_workqueue(cifsiord_wq);
>  	cifs_proc_clean();
>  }
>  
> diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
> index c0f3718..75d1941 100644
> --- a/fs/cifs/cifsglob.h
> +++ b/fs/cifs/cifsglob.h
> @@ -1561,6 +1561,7 @@ void cifs_oplock_break(struct work_struct
> *work); 
>  extern const struct slow_work_ops cifs_oplock_break_ops;
>  extern struct workqueue_struct *cifsiod_wq;
> +extern struct workqueue_struct *cifsiord_wq;
>  
>  extern mempool_t *cifs_mid_poolp;
>  
> diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
> index f3264bd..ca04a2e 100644
> --- a/fs/cifs/cifssmb.c
> +++ b/fs/cifs/cifssmb.c
> @@ -1571,7 +1571,7 @@ cifs_readv_callback(struct mid_q_entry *mid)
>  		rdata->result = -EIO;
>  	}
>  
> -	queue_work(cifsiod_wq, &rdata->work);
> +	queue_work(cifsiord_wq, &rdata->work);
>  	DeleteMidQEntry(mid);
>  	add_credits(server, 1, 0);
>  }
> diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> index 8603447..b74bf61 100644
> --- a/fs/cifs/smb2pdu.c
> +++ b/fs/cifs/smb2pdu.c
> @@ -1742,7 +1742,7 @@ smb2_readv_callback(struct mid_q_entry *mid)
>  	if (rdata->result)
>  		cifs_stats_fail_inc(tcon, SMB2_READ_HE);
>  
> -	queue_work(cifsiod_wq, &rdata->work);
> +	queue_work(cifsiord_wq, &rdata->work);
>  	DeleteMidQEntry(mid);
>  	add_credits(server, credits_received, 0);
>  }
Steven Rostedt March 20, 2014, 8:57 p.m. UTC | #7
On Thu, 20 Mar 2014 15:28:33 -0400
Jeffrey Layton <jlayton@redhat.com> wrote:

 
> Nice analysis! I think eventually we'll need to overhaul this code not

Note, Ulrich Obergfell helped a bit in the initial analysis. He found
from a customer core dump that the kworker thread was blocked on the
cinode->lock_sem, and the reader was blocked as well. That was enough
for me to find where the problem lay.

> to use rw semaphores, but that's going to take some redesign. (Wonder
> if we could change it to use seqlocks or something?)
> 
> Out of curiosity, does this eventually time out and unwedge itself?
> Usually when the server doesn't get a response to an oplock break in
> around a minute or so it gives up and allows the thing that caused the
> oplock break to proceed anyway. Not great for performance, but it ought
> to eventually make progress due to that.

No, I believe it's hard locked. Nothing is going to wake up the oplock
break if it is blocked on a down_read(). Only the release of the rwsem
will do that. It's a subtlety of the way the kworker threads operate.

> 
> In any case, this looks like a reasonable fix for now, but I suspect you
> can hit similar problems in the write codepath too. What may be best is
> to turn this around and queue the oplock break to the new workqueue
> instead of the read completion job.

Or perhaps give both the read and write their own workqueues? We have
to look at all the work queue handlers, and be careful about any users
that take the lock_sem, and separate them out.

-- Steve


Jeff Layton March 20, 2014, 9:02 p.m. UTC | #8
On Thu, 20 Mar 2014 16:57:03 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 20 Mar 2014 15:28:33 -0400
> Jeffrey Layton <jlayton@redhat.com> wrote:
> 
>  
> > Nice analysis! I think eventually we'll need to overhaul this code not
> 
> Note, Ulrich Obergfell helped a bit in the initial analysis. He found
> from a customer core dump that the kworker thread was blocked on the
> cinode->lock_sem, and the reader was blocked as well. That was enough
> for me to find where the problem lay.
> 

Kudos to Uli, then ;)

> > to use rw semaphores, but that's going to take some redesign. (Wonder
> > if we could change it to use seqlocks or something?)
> > 
> > Out of curiosity, does this eventually time out and unwedge itself?
> > Usually when the server doesn't get a response to an oplock break in
> > around a minute or so it gives up and allows the thing that caused the
> > oplock break to proceed anyway. Not great for performance, but it ought
> > to eventually make progress due to that.
> 
> No, I believe it's hard locked. Nothing is going to wake up the oplock
> break if it is blocked on a down_read(). Only the release of the rwsem
> will do that. It's a subtlety of the way the kworker threads operate.
> 

Eventually the server should just allow the read to complete even if
the client doesn't respond to the oplock break. It has to since clients
can suddenly drop off the net while holding an oplock. That should
allow everything to unwedge eventually (though it may take a while).

If that's not happening then I'd be curious as to why...

> > 
> > In any case, this looks like a reasonable fix for now, but I suspect you
> > can hit similar problems in the write codepath too. What may be best is
> > to turn this around and queue the oplock break to the new workqueue
> > instead of the read completion job.
> 
> Or perhaps give both the read and write their own workqueues? We have
> to look at all the work queue handlers, and be careful about any users
> that take the lock_sem, and separate them out.
> 

Yeah, I haven't looked closely yet but I'm fairly sure that you could
hit the same situation in the write codepath as well. Whether adding
more workqueues will really help, I'm not sure of yet...
Jeff Layton March 20, 2014, 11:53 p.m. UTC | #9
On Wed, 19 Mar 2014 15:12:52 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> We just had a customer report a deadlock in their 3.8-rt kernel.
> Looking into this, it is very possible to have the same deadlock in
> mainline in Linus's tree as it stands today.
> 
> It is much easier to deadlock in the -rt kernel because reader locks
> are serialized there, so a down_read() can block another down_read(). But
> even in mainline, because rwsems are fair locks, a new reader will block
> if a writer is waiting. This means that any deadlock that is possible
> between two readers can also be triggered by a writer: if a writer comes
> along and blocks on a current reader, it prevents new readers from
> running, and if the reader that owns the lock is waiting to be woken by
> one of those new readers, you have your deadlock.
> 
> Here's the situation with CIFS and workqueues:
> 
> The cifs system has several workqueues used in file.c and other places.
> One of them is used for completion of a read and to release the
> page_lock which wakes up the reader. There are several other workqueues
> that do various other tasks.
> 
> A holder of the reader lock can sleep on lock_page() and expect the
> reader workqueue to wake it up (unlock_page()). The reader workqueue
> takes no locks, so this does not seem to be a problem (but it is).
> 
> The other work items can take the rwsem for read or for write. The
> issue we tripped over was one grabbing it for read (remember, in -rt
> readers are serialized). But this can also happen in mainline if a
> separate writer is waiting on the lock, as that would cause a reader to
> block on another reader too.
> 
> All the workqueue callbacks are executed on the same workqueue:
> 
> 	queue_work(cifsiod_wq, &rdata->work);
> 	[...]
> 	queue_work(cifsiod_wq, &cfile->oplock_break);
> 
> Now if the read completion callback is queued after one of these work
> items that can take the rwsem, we can hit a deadlock. The workqueue code
> looks like it should be able to prevent deadlocks of this kind, but I do
> not fully understand how it manages scheduled work; perhaps if the
> kworker thread blocks hard, it won't move the pending works to another
> worker.
> 
> Here's what we see:
> 
> 	rdata->work is scheduled after cfile->oplock_break
> 
> 	CPU0						CPU1
> 	----						----
> 
>   do_sync_read()
>     cifs_strict_readv()
>       down_read(cinode->lock_sem);
>       generic_file_aio_read()
>         __lock_page_killable()
>           __wait_on_bit_lock()
> 
>        * BLOCKED *
> 
> 						process_one_work()
> 						  cifs_oplock_break()
> 						    cifs_has_mand_locks()
> 						      down_read(cinode->lock_sem);
> 
> 						   * BLOCKED *
> 
> 					      [ note, cifs_oplock_break() can
> 					        also call cifs_push_locks which takes
> 					        the lock with down_write() ]
> 
> 

Wait...why does the work running on CPU1 end up blocked on down_read()?
Is it really getting stuck on the down_write you mention?
Steven Rostedt March 21, 2014, 2:19 a.m. UTC | #10
On Thu, 20 Mar 2014 19:53:46 -0400
Jeff Layton <jlayton@redhat.com> wrote:
 
> Wait...why does the work running on CPU1 end up blocked on down_read()?
> Is it really getting stuck on the down_write you mention?
> 

rwsems are fair locks. Readers will not block on a reader lock unless
there's a writer waiting. That's the key. As soon as a writer blocks on
a lock that is held by a reader (or multiple readers), new readers
coming in will also block to let the writer get a chance. Otherwise, the
lock would be unfair and the readers could starve the writer.

But people tend to forget that a waiting writer causes readers to block
on each other, and if the readers can deadlock each other, they will
deadlock whenever a writer is waiting.

-- Steve

Steven Rostedt March 21, 2014, 2:23 a.m. UTC | #11
On Thu, 20 Mar 2014 17:02:39 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> Eventually the server should just allow the read to complete even if
> the client doesn't respond to the oplock break. It has to since clients
> can suddenly drop off the net while holding an oplock. That should
> allow everything to unwedge eventually (though it may take a while).
> 
> If that's not happening then I'd be curious as to why...

The problem is that the data is being filled in the page and the reader
is waiting for the page lock to be released. The kworker for the reader
will issue the complete() and unlock the page to wake up the reader.

But because the oplock break callback calls down_read(), and there
can be a down_write() waiting for the reader to finish, that
down_read() will block on the lock as well (rwsems are fair locks).
This keeps the read completion callback behind it from issuing the
complete() and unlock_page() that would wake up the reader holding the
rwsem with down_read().

DEADLOCK.


-- Steve
Pavel Shilovsky March 21, 2014, 8:32 a.m. UTC | #12
2014-03-21 6:23 GMT+04:00 Steven Rostedt <rostedt@goodmis.org>:
> On Thu, 20 Mar 2014 17:02:39 -0400
> Jeff Layton <jlayton@redhat.com> wrote:
>
>> Eventually the server should just allow the read to complete even if
>> the client doesn't respond to the oplock break. It has to since clients
>> can suddenly drop off the net while holding an oplock. That should
>> allow everything to unwedge eventually (though it may take a while).
>>
>> If that's not happening then I'd be curious as to why...
>
> The problem is that the data is being filled in the page and the reader
> is waiting for the page lock to be released. The kworker for the reader
> will issue the complete() and unlock the page to wake up the reader.
>
> But because the oplock break callback calls down_read(), and there
> can be a down_write() waiting for the reader to finish, that
> down_read() will block on the lock as well (rwsems are fair locks).
> This keeps the read completion callback behind it from issuing the
> complete() and unlock_page() that would wake up the reader holding the
> rwsem with down_read().
>
> DEADLOCK.

Thank you for reporting and clarifying the issue!

The read and write codepaths both obtain lock_sem for read, then wait
for their work on cifsiod_wq to complete, and then release lock_sem.
They don't do any lock_sem operations inside the work tasks they queue
to cifsiod_wq. But the oplock code can obtain/release lock_sem in its
work task. So, that's why I agree with Jeff and suggest moving the
oplock code to a different workqueue (cifsioopd_wq?) while leaving the
read and write codepaths on cifsiod_wq.
Jeff Layton March 21, 2014, 11:59 a.m. UTC | #13
On Fri, 21 Mar 2014 12:32:12 +0400
Pavel Shilovsky <piastry@etersoft.ru> wrote:

> 2014-03-21 6:23 GMT+04:00 Steven Rostedt <rostedt@goodmis.org>:
> > On Thu, 20 Mar 2014 17:02:39 -0400
> > Jeff Layton <jlayton@redhat.com> wrote:
> >
> >> Eventually the server should just allow the read to complete even if
> >> the client doesn't respond to the oplock break. It has to since clients
> >> can suddenly drop off the net while holding an oplock. That should
> >> allow everything to unwedge eventually (though it may take a while).
> >>
> >> If that's not happening then I'd be curious as to why...
> >
> > The problem is that the data is being filled in the page and the reader
> > is waiting for the page lock to be released. The kworker for the reader
> > will issue the complete() and unlock the page to wake up the reader.
> >
> > But because the oplock break callback calls down_read(), and there
> > can be a down_write() waiting for the reader to finish, that
> > down_read() will block on the lock as well (rwsems are fair locks).
> > This keeps the read completion callback behind it from issuing the
> > complete() and unlock_page() that would wake up the reader holding the
> > rwsem with down_read().
> >
> > DEADLOCK.
> 
> Thank you for reporting and clarifying the issue!
> 
> The read and write codepaths both obtain lock_sem for read, then wait
> for their work on cifsiod_wq to complete, and then release lock_sem.
> They don't do any lock_sem operations inside the work tasks they queue
> to cifsiod_wq. But the oplock code can obtain/release lock_sem in its
> work task. So, that's why I agree with Jeff and suggest moving the
> oplock code to a different workqueue (cifsioopd_wq?) while leaving the
> read and write codepaths on cifsiod_wq.
> 

Yes, thanks for the clarification. I missed the fact that a 3rd task
was blocked on a down_write. I think that fix will help prevent the
full-blown deadlock, but it'll still be susceptible to long stalls in
some situations.

In Steven's example the task on CPU0 is blocked on an SMB_READ. That
read may not be completing because the server has issued an oplock
break to the client and is waiting on the response. That oplock break
response is not being sent because the work handling it is blocked
behind the down_write.

In that situation, what breaks the deadlock is that the server
eventually gives up waiting on the oplock break response, but that can
take on the order of a minute.

So yeah, I think we should add a dedicated workqueue for oplock breaks
as an interim fix, but I also think we need to consider revamping the
lock pushing code such that oplock breaks are not subject to being
blocked like that.
Steven Rostedt March 21, 2014, 12:17 p.m. UTC | #14
On Fri, 21 Mar 2014 12:32:12 +0400
Pavel Shilovsky <piastry@etersoft.ru> wrote:


> The read and write codepaths both obtain lock_sem for read, then wait
> for their work on cifsiod_wq to complete, and then release lock_sem.
> They don't do any lock_sem operations inside the work tasks they queue
> to cifsiod_wq. But the oplock code can obtain/release lock_sem in its
> work task. So, that's why I agree with Jeff and suggest moving the
> oplock code to a different workqueue (cifsioopd_wq?) while leaving the
> read and write codepaths on cifsiod_wq.

OK, how about I submit a second patch that moves the reader and writer
to their own "safe" workqueue?

-- Steve

Jeff Layton March 21, 2014, 12:41 p.m. UTC | #15
On Fri, 21 Mar 2014 08:17:06 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 21 Mar 2014 12:32:12 +0400
> Pavel Shilovsky <piastry@etersoft.ru> wrote:
> 
> 
> > The read and write codepaths both obtain lock_sem for read, then wait
> > for their work on cifsiod_wq to complete, and then release lock_sem.
> > They don't do any lock_sem operations inside the work tasks they queue
> > to cifsiod_wq. But the oplock code can obtain/release lock_sem in its
> > work task. So, that's why I agree with Jeff and suggest moving the
> > oplock code to a different workqueue (cifsioopd_wq?) while leaving the
> > read and write codepaths on cifsiod_wq.
> 
> OK, how about I submit a second patch that moves the reader and writer
> to their own "safe" workqueue?
> 
> -- Steve
> 

That'd probably work fine too. The main point is to make sure oplock
breaks run on a different workqueue from where read or write completion
jobs run since they are operating on the lock_sem. The other jobs that
get queued to cifsiod_wq don't touch the lock_sem and shouldn't be a
problem.
Steven Rostedt March 21, 2014, 12:54 p.m. UTC | #16
On Fri, 21 Mar 2014 08:41:28 -0400
Jeff Layton <jlayton@redhat.com> wrote:


> That'd probably work fine too. The main point is to make sure oplock
> breaks run on a different workqueue from where read or write completion
> jobs run since they are operating on the lock_sem. The other jobs that
> get queued to cifsiod_wq don't touch the lock_sem and shouldn't be a
> problem.
> 

OK, I'll take a look at them, and maybe I'll just move the oplock break
to its own workqueue. I think you are correct and it may be best to move
the one that takes the lock. Keeping that one separate will keep the
others from being blocked by it.

Thanks, I'll write something up in a bit.

-- Steve


Patch

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..6656058 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -86,6 +86,7 @@  extern mempool_t *cifs_req_poolp;
 extern mempool_t *cifs_mid_poolp;
 
 struct workqueue_struct	*cifsiod_wq;
+struct workqueue_struct	*cifsiord_wq;
 
 #ifdef CONFIG_CIFS_SMB2
 __u8 cifs_client_guid[SMB2_CLIENT_GUID_SIZE];
@@ -1199,9 +1200,15 @@  init_cifs(void)
 		goto out_clean_proc;
 	}
 
+	cifsiord_wq = alloc_workqueue("cifsiord", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
+	if (!cifsiord_wq) {
+		rc = -ENOMEM;
+		goto out_destroy_wq;
+	}
+
 	rc = cifs_fscache_register();
 	if (rc)
-		goto out_destroy_wq;
+		goto out_destroy_rwq;
 
 	rc = cifs_init_inodecache();
 	if (rc)
@@ -1249,6 +1256,8 @@  out_destroy_inodecache:
 	cifs_destroy_inodecache();
 out_unreg_fscache:
 	cifs_fscache_unregister();
+out_destroy_rwq:
+	destroy_workqueue(cifsiord_wq);
 out_destroy_wq:
 	destroy_workqueue(cifsiod_wq);
 out_clean_proc:
@@ -1273,6 +1282,7 @@  exit_cifs(void)
 	cifs_destroy_inodecache();
 	cifs_fscache_unregister();
 	destroy_workqueue(cifsiod_wq);
+	destroy_workqueue(cifsiord_wq);
 	cifs_proc_clean();
 }
 
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index c0f3718..75d1941 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1561,6 +1561,7 @@  void cifs_oplock_break(struct work_struct *work);
 
 extern const struct slow_work_ops cifs_oplock_break_ops;
 extern struct workqueue_struct *cifsiod_wq;
+extern struct workqueue_struct *cifsiord_wq;
 
 extern mempool_t *cifs_mid_poolp;
 
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index f3264bd..ca04a2e 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1571,7 +1571,7 @@  cifs_readv_callback(struct mid_q_entry *mid)
 		rdata->result = -EIO;
 	}
 
-	queue_work(cifsiod_wq, &rdata->work);
+	queue_work(cifsiord_wq, &rdata->work);
 	DeleteMidQEntry(mid);
 	add_credits(server, 1, 0);
 }
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 8603447..b74bf61 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -1742,7 +1742,7 @@  smb2_readv_callback(struct mid_q_entry *mid)
 	if (rdata->result)
 		cifs_stats_fail_inc(tcon, SMB2_READ_HE);
 
-	queue_work(cifsiod_wq, &rdata->work);
+	queue_work(cifsiord_wq, &rdata->work);
 	DeleteMidQEntry(mid);
 	add_credits(server, credits_received, 0);
 }