diff mbox

[RFC] scsi: reduce protection of scan_mutex in scsi_remove_device

Message ID 20170421211302.2667649-1-songliubraving@fb.com (mailing list archive)
State Changes Requested, archived
Headers show

Commit Message

Song Liu April 21, 2017, 9:13 p.m. UTC
When a device is deleted through sysfs handle "delete", the code
locks shost->scan_mutex. If multiple devices are deleted at the
same time, these deletes will be handled in series.

On the other hand, some devices do long latency IO during deletion,
for example, sd_shutdown() may do sync cache and/or start_stop.
It is not necessary for these commands to run in series.

To reduce latency of parallel "delete" requests, this patch reduces
the protection of scan_mutex. The only function with Scsi_Host
called in __scsi_remove_device() is the optional slave_destroy().
Therefore, the protection of scan_mutex is only necessary for this
function.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/scsi/scsi_sysfs.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Comments

Bart Van Assche April 21, 2017, 9:17 p.m. UTC | #1
On Fri, 2017-04-21 at 14:13 -0700, Song Liu wrote:
> When a device is deleted through sysfs handle "delete", [ ... ]

If I try to use that sysfs attribute then I encounter a deadlock (see
also the report below). How is it possible that you have not hit that
deadlock in your tests?

Bart.

======================================================
[ INFO: possible circular locking dependency detected ]
4.11.0-rc6-dbg+ #3 Tainted: G          I    
-------------------------------------------------------
bash/7858 is trying to acquire lock:
 (&shost->scan_mutex){+.+.+.}, at: [<ffffffff814de090>] scsi_remove_device+0x20/0x40

but task is already holding lock:
 (s_active#326){++++.+}, at: [<ffffffff81293e20>] kernfs_remove_self+0xe0/0x140

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (s_active#326){++++.+}:
       lock_acquire+0xd5/0x1c0
       __kernfs_remove+0x248/0x310
       kernfs_remove_by_name_ns+0x45/0xa0
       remove_files.isra.1+0x35/0x70
       sysfs_remove_group+0x44/0x90
       sysfs_remove_groups+0x2e/0x50
       device_remove_attrs+0x5e/0x80
       device_del+0x1fd/0x320
       __scsi_remove_device+0xe9/0x120
       scsi_forget_host+0x60/0x70
       scsi_remove_host+0x71/0x110
       0xffffffffa0703690
       process_one_work+0x20b/0x6a0
       worker_thread+0x4e/0x4a0
       kthread+0x113/0x150
       ret_from_fork+0x2e/0x40

-> #0 (&shost->scan_mutex){+.+.+.}:
       __lock_acquire+0x1109/0x1280
       lock_acquire+0xd5/0x1c0
       __mutex_lock+0x83/0x980
       mutex_lock_nested+0x1b/0x20
       scsi_remove_device+0x20/0x40
       sdev_store_delete+0x27/0x30
       dev_attr_store+0x18/0x30
       sysfs_kf_write+0x45/0x60
       kernfs_fop_write+0x13c/0x1c0
       __vfs_write+0x28/0x140
       vfs_write+0xc8/0x1e0
       SyS_write+0x49/0xa0
       entry_SYSCALL_64_fastpath+0x18/0xad

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#326);
                               lock(&shost->scan_mutex);
                               lock(s_active#326);
  lock(&shost->scan_mutex);

 *** DEADLOCK ***

3 locks held by bash/7858:
 #0:  (sb_writers#4){.+.+.+}, at: [<ffffffff81206b45>] vfs_write+0x195/0x1e0
 #1:  (&of->mutex){+.+.+.}, at: [<ffffffff81294af6>] kernfs_fop_write+0x106/0x1c0
 #2:  (s_active#326){++++.+}, at: [<ffffffff81293e20>] kernfs_remove_self+0xe0/0x140

stack backtrace:
CPU: 3 PID: 7858 Comm: bash Tainted: G          I     4.11.0-rc6-dbg+ #3
Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
Call Trace:
 dump_stack+0x68/0x93
 print_circular_bug+0x1be/0x210
 __lock_acquire+0x1109/0x1280
 lock_acquire+0xd5/0x1c0
 __mutex_lock+0x83/0x980
 mutex_lock_nested+0x1b/0x20
 scsi_remove_device+0x20/0x40
 sdev_store_delete+0x27/0x30
 dev_attr_store+0x18/0x30
 sysfs_kf_write+0x45/0x60
 kernfs_fop_write+0x13c/0x1c0
 __vfs_write+0x28/0x140
 vfs_write+0xc8/0x1e0
 SyS_write+0x49/0xa0
 entry_SYSCALL_64_fastpath+0x18/0xad
RIP: 0033:0x7fec0f748500
RSP: 002b:00007ffc1ddaec98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007fec0f748500
RDX: 0000000000000002 RSI: 0000000002012aa0 RDI: 0000000000000001
RBP: 00007ffc1ddaebf0 R08: 00007fec0fa0a740 R09: 00007fec10061100
R10: 0000000000000098 R11: 0000000000000246 R12: 00007fec10084bf0
R13: 0000000000000001 R14: 0000000000000000 R15: 00007ffc1ddaec18
Bart Van Assche April 21, 2017, 9:20 p.m. UTC | #2
On Fri, 2017-04-21 at 14:13 -0700, Song Liu wrote:
> On the other hand, some devices do long latency IO during deletion,
> for example, sd_shutdown() may do sync cache and/or start_stop.
> It is not necessary for these commands to run in series.

Have you noticed my patch series that makes sd_shutdown() submit the
SYNCHRONIZE CACHE command asynchronously? Have you tried whether that
patch series would be a good alternative?

Thanks,

Bart.
Song Liu April 21, 2017, 10:20 p.m. UTC | #3
> On Apr 21, 2017, at 2:17 PM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> 
> On Fri, 2017-04-21 at 14:13 -0700, Song Liu wrote:
>> When a device is deleted through sysfs handle "delete", [ ... ]
> 
> If I try to use that sysfs attribute then I encounter a deadlock (see
> also the report below). How is it possible that you have not hit that
> deadlock in your tests?
> 
> Bart.


Hmm... Seems my test environment (VM) doesn't call the following path

       __scsi_remove_device+0xe9/0x120
       scsi_forget_host+0x60/0x70
       scsi_remove_host+0x71/0x110

Let me fix it. 

Thanks,
Song
Song Liu April 21, 2017, 10:31 p.m. UTC | #4
> On Apr 21, 2017, at 2:20 PM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> 
> On Fri, 2017-04-21 at 14:13 -0700, Song Liu wrote:
>> On the other hand, some devices do long latency IO during deletion,
>> for example, sd_shutdown() may do sync cache and/or start_stop.
>> It is not necessary for these commands to run in series.
> 
> Have you noticed my patch series that makes sd_shutdown() submit the
> SYNCHRONIZE CACHE command asynchronously? Have you tried whether that
> patch series would be a good alternative?
> 
> Thanks,
> 
> Bart.

The asynchronous SYNCHRONIZE CACHE will not help our use case, where the 
latency comes from sd_start_stop_device(). Seems it is not easy to make 
the START STOP UNIT command async. 

Thanks,
Song
Christoph Hellwig April 24, 2017, 3:28 p.m. UTC | #5
On Fri, Apr 21, 2017 at 02:13:02PM -0700, Song Liu wrote:
> When a device is deleted through sysfs handle "delete", the code
> locks shost->scan_mutex. If multiple devices are deleted at the
> same time, these deletes will be handled in series.
> 
> On the other hand, some devices do long latency IO during deletion,
> for example, sd_shutdown() may do sync cache and/or start_stop.
> It is not necessary for these commands to run in series.
> 
> To reduce latency of parallel "delete" requests, this patch reduces
> the protection of scan_mutex. The only function with Scsi_Host
> called in __scsi_remove_device() is the optional slave_destroy().
> Therefore, the protection of scan_mutex is only necessary for this
> function.

And I don't think it makes sense for slave_destroy either.  Please
	do a quick audit of the instances and drop the lock for it, too.

> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 82dfe07..e7a9e28 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -610,7 +610,7 @@ static int scsi_sdev_check_buf_bit(const char *buf)
>  			return 1;
>  		else if (buf[0] == '0')
>  			return 0;
> -		else 
> +		else
>  			return -EINVAL;

Also the patch has a few odd whitespace changes like this.  Please
remove those before reposting.
Bart Van Assche April 25, 2017, 5:23 p.m. UTC | #6
On Fri, 2017-04-21 at 14:13 -0700, Song Liu wrote:
> When a device is deleted through sysfs handle "delete", the code
> locks shost->scan_mutex. If multiple devices are deleted at the
> same time, these deletes will be handled in series.
> 
> On the other hand, some devices do long latency IO during deletion,
> for example, sd_shutdown() may do sync cache and/or start_stop.
> It is not necessary for these commands to run in series.
> 
> To reduce latency of parallel "delete" requests, this patch reduces
> the protection of scan_mutex. The only function with Scsi_Host
> called in __scsi_remove_device() is the optional slave_destroy().
> Therefore, the protection of scan_mutex is only necessary for this
> function.

Hello Song,

What you wrote above is wrong. It is necessary to serialize SCSI
scanning against SCSI device removal. That is why scan_mutex is held
around the __scsi_remove_device() call. When adding a LUN the following
(and several other) actions are performed:
* Allocation of memory for struct scsi_device.
* Allocation of a block layer request queue.
* Initialization of the .sdev_gendev and .sdev_dev device structures.
* Association of the transport layer driver with the SCSI device.
* Association of a device handler with the SCSI device (e.g. ALUA).
* Making the .sdev_gendev and .sdev_dev devices visible in sysfs.
* Making the transport layer attributes visible in sysfs.
* Creating a bsg device node in sysfs.
* Association of an upper layer driver
(e.g. sd or sr) with the SCSI LUN.

Removal of a LUN means undoing all of the above. If adding and removing
a LUN would not be not serialized then there would be a risk that removal
and immediate reregistration of a LUN will fail due to the reregistration
process trying to add sysfs attributes that were not yet removed by the
removal step.

Bart.
Song Liu April 25, 2017, 5:42 p.m. UTC | #7
> On Apr 25, 2017, at 10:23 AM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> 
> On Fri, 2017-04-21 at 14:13 -0700, Song Liu wrote:
>> When a device is deleted through sysfs handle "delete", the code
>> locks shost->scan_mutex. If multiple devices are deleted at the
>> same time, these deletes will be handled in series.
>> 
>> On the other hand, some devices do long latency IO during deletion,
>> for example, sd_shutdown() may do sync cache and/or start_stop.
>> It is not necessary for these commands to run in series.
>> 
>> To reduce latency of parallel "delete" requests, this patch reduces
>> the protection of scan_mutex. The only function with Scsi_Host
>> called in __scsi_remove_device() is the optional slave_destroy().
>> Therefore, the protection of scan_mutex is only necessary for this
>> function.
> 
> Hello Song,
> 
> What you wrote above is wrong. It is necessary to serialize SCSI
> scanning against SCSI device removal. That is why scan_mutex is held
> around the __scsi_remove_device() call. When adding a LUN the following
> (and several other) actions are performed:
> * Allocation of memory for struct scsi_device.
> * Allocation of a block layer request queue.
> * Initialization of the .sdev_gendev and .sdev_dev device structures.
> * Association of the transport layer driver with the SCSI device.
> * Association of a device handler with the SCSI device (e.g. ALUA).
> * Making the .sdev_gendev and .sdev_dev devices visible in sysfs.
> * Making the transport layer attributes visible in sysfs.
> * Creating a bsg device node in sysfs.
> * Association of an upper layer driver
> (e.g. sd or sr) with the SCSI LUN.
> 
> Removal of a LUN means undoing all of the above. If adding and removing
> a LUN would not be not serialized then there would be a risk that removal
> and immediate reregistration of a LUN will fail due to the reregistration
> process trying to add sysfs attributes that were not yet removed by the
> removal step.
> 
> Bart.

Hello Bart, 

Thanks a lot for looking into this. 

I have been studying the code recently. I am wondering whether the following 
would work:

1. Introduce a new mutex for scsi_device to protect most operations in the 
   list you gathered above;

2. For operations like host->slave_destroy(), ensure they access scsi_host 
   data with host_lock (or another spin lock). 

   I looked into all instances of slave_destroy, only 2 of them: 
   dc395x_slave_destroy() and visorhba_slave_destroy() access scsi_host data 
   without protection of spin lock. 

3. Once 1 and 2 is ready, __scsi_remove_device() only need to hold the mutex
   for the scsi_device. scan_mutex is no longer required. 

Is this a valid path?

Thanks again,
Song
Bart Van Assche April 25, 2017, 5:52 p.m. UTC | #8
On Tue, 2017-04-25 at 17:42 +0000, Song Liu wrote:
> I have been studying the code recently. I am wondering whether the following 
> would work:
> 
> 1. Introduce a new mutex for scsi_device to protect most operations in the 
>    list you gathered above;
> 
> 2. For operations like host->slave_destroy(), ensure they access scsi_host 
>    data with host_lock (or another spin lock). 
> 
>    I looked into all instances of slave_destroy, only 2 of them: 
>    dc395x_slave_destroy() and visorhba_slave_destroy() access scsi_host data 
>    without protection of spin lock. 
> 
> 3. Once 1 and 2 is ready, __scsi_remove_device() only need to hold the mutex
>    for the scsi_device. scan_mutex is no longer required. 
> 
> Is this a valid path?

Sorry but I don't think so. Unlocking and reacquiring scan_mutex would create
the potential that LUN scanning occurs in the meantime and hence that it fails
because LUN removal is incomplete.

Bart.
Bart Van Assche April 25, 2017, 8:59 p.m. UTC | #9
On Fri, 2017-04-21 at 22:31 +0000, Song Liu wrote:
> On Apr 21, 2017, at 2:20 PM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> > On Fri, 2017-04-21 at 14:13 -0700, Song Liu wrote:
> > > On the other hand, some devices do long latency IO during deletion,
> > > for example, sd_shutdown() may do sync cache and/or start_stop.
> > > It is not necessary for these commands to run in series.
> > 
> > Have you noticed my patch series that makes sd_shutdown() submit the
> > SYNCHRONIZE CACHE command asynchronously? Have you tried whether that
> > patch series would be a good alternative?
> 
> The asynchronous SYNCHRONIZE CACHE will not help our use case, where the 
> latency comes from sd_start_stop_device(). Seems it is not easy to make 
> the START STOP UNIT command async. 

Hello Song,

It should be possible to make the START STOP UNIT command asynchronous too
by issuing it asynchronously from inside sd_sync_cache_done(). To avoid
dereferencing a stale struct scsi_disk pointer you will either have to hold
an additional reference as long as the SYNCHRONIZE CACHE command is in
progress or copy the data needed from struct scsi_disk into the SCSI request.

Bart.
Song Liu April 25, 2017, 9:17 p.m. UTC | #10
> On Apr 25, 2017, at 10:52 AM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> 
> On Tue, 2017-04-25 at 17:42 +0000, Song Liu wrote:
>> I have been studying the code recently. I am wondering whether the following 
>> would work:
>> 
>> 1. Introduce a new mutex for scsi_device to protect most operations in the 
>>   list you gathered above;
>> 
>> 2. For operations like host->slave_destroy(), ensure they access scsi_host 
>>   data with host_lock (or another spin lock). 
>> 
>>   I looked into all instances of slave_destroy, only 2 of them: 
>>   dc395x_slave_destroy() and visorhba_slave_destroy() access scsi_host data 
>>   without protection of spin lock. 
>> 
>> 3. Once 1 and 2 is ready, __scsi_remove_device() only need to hold the mutex
>>   for the scsi_device. scan_mutex is no longer required. 
>> 
>> Is this a valid path?
> 
> Sorry but I don't think so. Unlocking and reacquiring scan_mutex would create
> the potential that LUN scanning occurs in the meantime and hence that it fails
> because LUN removal is incomplete.
> 


Hello Bart, 

I am not sure I fully understand the problem here. If I understand the logic 
correctly, when a device is being removed, it will stay in scsi_host->__devices
until fully the remove routine is finished. And LUN scanning in parallel will
find the device with scsi_device_lookup_by_target(), and thus it would not 
rescan the device until the device is fully removed? Did I miss anything here?

Thanks,
Song
Song Liu April 25, 2017, 9:29 p.m. UTC | #11
> On Apr 25, 2017, at 1:59 PM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> 
> On Fri, 2017-04-21 at 22:31 +0000, Song Liu wrote:
>> On Apr 21, 2017, at 2:20 PM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
>>> On Fri, 2017-04-21 at 14:13 -0700, Song Liu wrote:
>>>> On the other hand, some devices do long latency IO during deletion,
>>>> for example, sd_shutdown() may do sync cache and/or start_stop.
>>>> It is not necessary for these commands to run in series.
>>> 
>>> Have you noticed my patch series that makes sd_shutdown() submit the
>>> SYNCHRONIZE CACHE command asynchronously? Have you tried whether that
>>> patch series would be a good alternative?
>> 
>> The asynchronous SYNCHRONIZE CACHE will not help our use case, where the 
>> latency comes from sd_start_stop_device(). Seems it is not easy to make 
>> the START STOP UNIT command async. 
> 
> Hello Song,
> 
> It should be possible to make the START STOP UNIT command asynchronous too
> by issuing it asynchronously from inside sd_sync_cache_done(). To avoid
> dereferencing a stale struct scsi_disk pointer you will either have to hold
> an additional reference as long as the SYNCHRONIZE CACHE command is in
> progress or copy the data needed from struct scsi_disk into the SCSI request.
> 

Hello Bart, 

As you stated in the patch: blk_cleanup_queue() in __scsi_remove_device() 
will wait for async SYNCHRONIZE CACHE and START STOP UNIT command to complete. 
Since __scsi_remove_device() is called with scan_mutex held, simultaneous 
START STOP UNIT requires are still serialized by the scan_mutex. 

Song
Bart Van Assche April 25, 2017, 10:17 p.m. UTC | #12
On Tue, 2017-04-25 at 21:17 +0000, Song Liu wrote:
> I am not sure I fully understand the problem here. If I understand the logic 
> correctly, when a device is being removed, it will stay in scsi_host->__devices
> until fully the remove routine is finished. And LUN scanning in parallel will
> find the device with scsi_device_lookup_by_target(), and thus it would not 
> rescan the device until the device is fully removed? Did I miss anything here?

Hello Song,

The SCSI core is already complicated enough. Please don't complicate it further
by making subtle changes to the semantics of scan_mutex. Please also note that
I have proposed an alternative, namely to make the START STOP UNIT command
asynchronous.

Bart.
diff mbox

Patch

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 82dfe07..e7a9e28 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -610,7 +610,7 @@  static int scsi_sdev_check_buf_bit(const char *buf)
 			return 1;
 		else if (buf[0] == '0')
 			return 0;
-		else 
+		else
 			return -EINVAL;
 	} else
 		return -EINVAL;
@@ -774,7 +774,7 @@  store_queue_type_field(struct device *dev, struct device_attribute *attr,
 
 	if (!sdev->tagged_supported)
 		return -EINVAL;
-		
+
 	sdev_printk(KERN_INFO, sdev,
 		    "ignoring write to deprecated queue_type attribute");
 	return count;
@@ -1302,8 +1302,11 @@  void __scsi_remove_device(struct scsi_device *sdev)
 	blk_cleanup_queue(sdev->request_queue);
 	cancel_work_sync(&sdev->requeue_work);
 
-	if (sdev->host->hostt->slave_destroy)
+	if (sdev->host->hostt->slave_destroy) {
+		mutex_lock(&sdev->host->scan_mutex);
 		sdev->host->hostt->slave_destroy(sdev);
+		mutex_unlock(&sdev->host->scan_mutex);
+	}
 	transport_destroy_device(dev);
 
 	/*
@@ -1322,11 +1325,7 @@  void __scsi_remove_device(struct scsi_device *sdev)
  **/
 void scsi_remove_device(struct scsi_device *sdev)
 {
-	struct Scsi_Host *shost = sdev->host;
-
-	mutex_lock(&shost->scan_mutex);
 	__scsi_remove_device(sdev);
-	mutex_unlock(&shost->scan_mutex);
 }
 EXPORT_SYMBOL(scsi_remove_device);