diff mbox series

[3/3] ib_srp: Fix a deadlock

Message ID 20220215182650.19839-4-bvanassche@acm.org (mailing list archive)
State Superseded
Headers show
Series Fix a deadlock in the ib_srp driver | expand

Commit Message

Bart Van Assche Feb. 15, 2022, 6:26 p.m. UTC
Wait on tl_err_work instead of flushing system_long_wq since flushing
system_long_wq is deadlock-prone.

Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: ef6c49d87c34 ("IB/srp: Eliminate state SRP_TARGET_DEAD")
Reported-by: syzbot+831661966588c802aae9@syzkaller.appspotmail.com
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

Comments

Leon Romanovsky Feb. 15, 2022, 6:51 p.m. UTC | #1
On Tue, Feb 15, 2022 at 10:26:50AM -0800, Bart Van Assche wrote:
> Wait on tl_err_work instead of flushing system_long_wq since flushing
> system_long_wq is deadlock-prone.
> 
> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Fixes: ef6c49d87c34 ("IB/srp: Eliminate state SRP_TARGET_DEAD")
> Reported-by: syzbot+831661966588c802aae9@syzkaller.appspotmail.com
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/infiniband/ulp/srp/ib_srp.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 2db7429b42e1..8e1561a6d325 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -4044,12 +4044,10 @@ static void srp_remove_one(struct ib_device *device, void *client_data)
>  		mutex_lock(&host->target_mutex);
>  		list_for_each_entry(target, &host->target_list, list)
>  			srp_queue_remove_work(target);
> +		list_for_each_entry(target, &host->target_list, list)
> +			flush_work(&target->tl_err_work);

Sorry for my silly question, but why do you do flush and not cancel
here? You anyway remove SRP device, so the result of flush is not
really important, am I right?

Thanks

>  		mutex_unlock(&host->target_mutex);
>  
> -		/*
> -		 * Wait for tl_err and target port removal tasks.
> -		 */
> -		flush_workqueue(system_long_wq);
>  		flush_workqueue(srp_remove_wq);
>  
>  		kfree(host);
Bart Van Assche Feb. 15, 2022, 8:37 p.m. UTC | #2
On 2/15/22 10:51, Leon Romanovsky wrote:
> On Tue, Feb 15, 2022 at 10:26:50AM -0800, Bart Van Assche wrote:
>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
>> index 2db7429b42e1..8e1561a6d325 100644
>> --- a/drivers/infiniband/ulp/srp/ib_srp.c
>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
>> @@ -4044,12 +4044,10 @@ static void srp_remove_one(struct ib_device *device, void *client_data)
>>   		mutex_lock(&host->target_mutex);
>>   		list_for_each_entry(target, &host->target_list, list)
>>   			srp_queue_remove_work(target);
>> +		list_for_each_entry(target, &host->target_list, list)
>> +			flush_work(&target->tl_err_work);
> 
> Sorry for my silly question, but why do you do flush and not cancel
> here? You anyway remove SRP device, so the result of flush is not
> really important, am I right?

That's a great question. It probably doesn't matter much whether 
flush_work() or cancel_work_sync() is called in this context since 
srp_queue_remove_work() indirectly cancels tl_err_work. See also the 
following code in srp_remove_target():
  	cancel_work_sync(&target->tl_err_work);

Thanks,

Bart.
Jason Gunthorpe Feb. 15, 2022, 8:41 p.m. UTC | #3
On Tue, Feb 15, 2022 at 12:37:43PM -0800, Bart Van Assche wrote:
> On 2/15/22 10:51, Leon Romanovsky wrote:
> > On Tue, Feb 15, 2022 at 10:26:50AM -0800, Bart Van Assche wrote:
> > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> > > index 2db7429b42e1..8e1561a6d325 100644
> > > +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> > > @@ -4044,12 +4044,10 @@ static void srp_remove_one(struct ib_device *device, void *client_data)
> > >   		mutex_lock(&host->target_mutex);
> > >   		list_for_each_entry(target, &host->target_list, list)
> > >   			srp_queue_remove_work(target);
> > > +		list_for_each_entry(target, &host->target_list, list)
> > > +			flush_work(&target->tl_err_work);
> > 
> > Sorry for my silly question, but why do you do flush and not cancel
> > here? You anyway remove SRP device, so the result of flush is not
> > really important, am I right?
> 
> That's a great question. It probably doesn't matter much whether
> flush_work() or cancel_work_sync() is called in this context since
> srp_queue_remove_work() indirectly cancels tl_err_work. See also the
> following code in srp_remove_target():
>  	cancel_work_sync(&target->tl_err_work);

If it is already canceled then why call flush?

Jason
Bart Van Assche Feb. 15, 2022, 8:50 p.m. UTC | #4
On 2/15/22 12:41, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 12:37:43PM -0800, Bart Van Assche wrote:
>> On 2/15/22 10:51, Leon Romanovsky wrote:
>>> On Tue, Feb 15, 2022 at 10:26:50AM -0800, Bart Van Assche wrote:
>>>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
>>>> index 2db7429b42e1..8e1561a6d325 100644
>>>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
>>>> @@ -4044,12 +4044,10 @@ static void srp_remove_one(struct ib_device *device, void *client_data)
>>>>    		mutex_lock(&host->target_mutex);
>>>>    		list_for_each_entry(target, &host->target_list, list)
>>>>    			srp_queue_remove_work(target);
>>>> +		list_for_each_entry(target, &host->target_list, list)
>>>> +			flush_work(&target->tl_err_work);
>>>
>>> Sorry for my silly question, but why do you do flush and not cancel
>>> here? You anyway remove SRP device, so the result of flush is not
>>> really important, am I right?
>>
>> That's a great question. It probably doesn't matter much whether
>> flush_work() or cancel_work_sync() is called in this context since
>> srp_queue_remove_work() indirectly cancels tl_err_work. See also the
>> following code in srp_remove_target():
>>   	cancel_work_sync(&target->tl_err_work);
> 
> If it is already canceled then why call flush?

I'll see whether I can leave the flush_work(&target->tl_err_work) calls out.

Thanks,

Bart.
diff mbox series

Patch

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 2db7429b42e1..8e1561a6d325 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -4044,12 +4044,10 @@  static void srp_remove_one(struct ib_device *device, void *client_data)
 		mutex_lock(&host->target_mutex);
 		list_for_each_entry(target, &host->target_list, list)
 			srp_queue_remove_work(target);
+		list_for_each_entry(target, &host->target_list, list)
+			flush_work(&target->tl_err_work);
 		mutex_unlock(&host->target_mutex);
 
-		/*
-		 * Wait for tl_err and target port removal tasks.
-		 */
-		flush_workqueue(system_long_wq);
 		flush_workqueue(srp_remove_wq);
 
 		kfree(host);