diff mbox series

[v2,1/2] scsi: target/tcm_loop: ignore already deleted scsi device

Message ID 20190820090429.1961085-1-naohiro.aota@wdc.com (mailing list archive)
State New, archived
Headers show
Series [v2,1/2] scsi: target/tcm_loop: ignore already deleted scsi device | expand

Commit Message

Naohiro Aota Aug. 20, 2019, 9:04 a.m. UTC
If there is no corresponding scsi_device for a LUN,
tcm_loop_port_unlink() complains that it "Unable to locate struct
scsi_device for " the device and keep %tl_tpg_port_count as is. However,
such situation is legal when we delete a SCSI device using
/sys/class/scsi_device/${lun}/device/delete. We can safely ignore the
missing SCSI device case here.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 drivers/target/loopback/tcm_loop.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

Comments

Bart Van Assche Aug. 20, 2019, 2:02 p.m. UTC | #1
On 8/20/19 2:04 AM, Naohiro Aota wrote:
> If there is no corresponding scsi_device for a LUN,
> tcm_loop_port_unlink() complains that it "Unable to locate struct
> scsi_device for " the device and keep %tl_tpg_port_count as is. However,
> such situation is legal when we delete a SCSI device using
> /sys/class/scsi_device/${lun}/device/delete. We can safely ignore the
> missing SCSI device case here.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   drivers/target/loopback/tcm_loop.c | 18 +++++++++---------
>   1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/target/loopback/tcm_loop.c b/drivers/target/loopback/tcm_loop.c
> index 3305b47fdf53..0942f3bd7eec 100644
> --- a/drivers/target/loopback/tcm_loop.c
> +++ b/drivers/target/loopback/tcm_loop.c
> @@ -654,16 +654,16 @@ static void tcm_loop_port_unlink(
>   
>   	sd = scsi_device_lookup(tl_hba->sh, 0, tl_tpg->tl_tpgt,
>   				se_lun->unpacked_lun);
> -	if (!sd) {
> -		pr_err("Unable to locate struct scsi_device for %d:%d:%llu\n",
> -		       0, tl_tpg->tl_tpgt, se_lun->unpacked_lun);
> -		return;
> +	if (sd) {
> +		/*
> +		 * Remove Linux/SCSI struct scsi_device by HCTL
> +		 */
> +		scsi_remove_device(sd);
> +		scsi_device_put(sd);
> +	} else {
> +		pr_debug("Unable to locate struct scsi_device for %d:%d:%llu\n",
> +			 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun);
>   	}
> -	/*
> -	 * Remove Linux/SCSI struct scsi_device by HCTL
> -	 */
> -	scsi_remove_device(sd);
> -	scsi_device_put(sd);
>   
>   	atomic_dec_mb(&tl_tpg->tl_tpg_port_count);

The above patch looks wrong to me. I think this patch does not fix the 
reference leak present in the current code. Have you considered to 
modify tcm_loop_port_link() such that it saves the pointer returned by 
scsi_add_device() and to use that pointer in tcm_loop_port_unlink()?

Bart.
Mike Christie Aug. 20, 2019, 3:27 p.m. UTC | #2
On 08/20/2019 09:02 AM, Bart Van Assche wrote:
> On 8/20/19 2:04 AM, Naohiro Aota wrote:
>> If there is no corresponding scsi_device for a LUN,
>> tcm_loop_port_unlink() complains that it "Unable to locate struct
>> scsi_device for " the device and keep %tl_tpg_port_count as is. However,
>> such situation is legal when we delete a SCSI device using
>> /sys/class/scsi_device/${lun}/device/delete. We can safely ignore the
>> missing SCSI device case here.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   drivers/target/loopback/tcm_loop.c | 18 +++++++++---------
>>   1 file changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/target/loopback/tcm_loop.c
>> b/drivers/target/loopback/tcm_loop.c
>> index 3305b47fdf53..0942f3bd7eec 100644
>> --- a/drivers/target/loopback/tcm_loop.c
>> +++ b/drivers/target/loopback/tcm_loop.c
>> @@ -654,16 +654,16 @@ static void tcm_loop_port_unlink(
>>         sd = scsi_device_lookup(tl_hba->sh, 0, tl_tpg->tl_tpgt,
>>                   se_lun->unpacked_lun);
>> -    if (!sd) {
>> -        pr_err("Unable to locate struct scsi_device for %d:%d:%llu\n",
>> -               0, tl_tpg->tl_tpgt, se_lun->unpacked_lun);
>> -        return;
>> +    if (sd) {
>> +        /*
>> +         * Remove Linux/SCSI struct scsi_device by HCTL
>> +         */
>> +        scsi_remove_device(sd);
>> +        scsi_device_put(sd);
>> +    } else {
>> +        pr_debug("Unable to locate struct scsi_device for %d:%d:%llu\n",
>> +             0, tl_tpg->tl_tpgt, se_lun->unpacked_lun);
>>       }
>> -    /*
>> -     * Remove Linux/SCSI struct scsi_device by HCTL
>> -     */
>> -    scsi_remove_device(sd);
>> -    scsi_device_put(sd);
>>         atomic_dec_mb(&tl_tpg->tl_tpg_port_count);
> 
> The above patch looks wrong to me. I think this patch does not fix the
> reference leak present in the current code. Have you considered to
> modify tcm_loop_port_link() such that it saves the pointer returned by
> scsi_add_device() and to use that pointer in tcm_loop_port_unlink()?
> 

Are you guys talking about different issues?

tcm loop does not take a reference to the scsi_device at creation/link
time then need to release at removal/unlink time. The above
scsi_device_put is for the successful scsi_device_lookup call. tcm loop
works like a scsi host driver that does its own scanning via
scsi_add_device (maybe similar to scsi drivers that are raid cards).
Like other host drivers it does not take a reference to the device when
it is added and relies on scsi-ml to handle all that for it before doing
operations like queuecommand.

The leak is if you removed the scsi_device via the scsi ml sysfs
interface then there is no way to completely unlink the lio port because
if scsi_device_lookup fails we return from the function and do not do
not release our refcount on the tl_tpg_port_count.
Bart Van Assche Aug. 20, 2019, 3:43 p.m. UTC | #3
On 8/20/19 8:27 AM, Mike Christie wrote:
> tcm loop does not take a reference to the scsi_device at creation/link
> time then need to release at removal/unlink time. The above
> scsi_device_put is for the successful scsi_device_lookup call. tcm loop
> works like a scsi host driver that does its own scanning via
> scsi_add_device (maybe similar to scsi drivers that are raid cards).
> Like other host drivers it does not take a reference to the device when
> it is added and relies on scsi-ml to handle all that for it before doing
> operations like queuecommand.
> 
> The leak is if you removed the scsi_device via the scsi ml sysfs
> interface then there is no way to completely unlink the lio port because
> if scsi_device_lookup fails we return from the function and do not do
> not release our refcount on the tl_tpg_port_count.

Hi Mike,

Does this mean that you think that this patch is the right way to 
address the reported issue?

Thanks,

Bart.
Mike Christie Aug. 20, 2019, 5:19 p.m. UTC | #4
On 08/20/2019 10:43 AM, Bart Van Assche wrote:
> On 8/20/19 8:27 AM, Mike Christie wrote:
>> tcm loop does not take a reference to the scsi_device at creation/link
>> time then need to release at removal/unlink time. The above
>> scsi_device_put is for the successful scsi_device_lookup call. tcm loop
>> works like a scsi host driver that does its own scanning via
>> scsi_add_device (maybe similar to scsi drivers that are raid cards).
>> Like other host drivers it does not take a reference to the device when
>> it is added and relies on scsi-ml to handle all that for it before doing
>> operations like queuecommand.
>>
>> The leak is if you removed the scsi_device via the scsi ml sysfs
>> interface then there is no way to completely unlink the lio port because
>> if scsi_device_lookup fails we return from the function and do not do
>> not release our refcount on the tl_tpg_port_count.
> 
> Hi Mike,
> 
> Does this mean that you think that this patch is the right way to
> address the reported issue?
> 

It fixes that very specific issue, but it leaves others. Maybe it could
be used in a patchset that builds on it?

I think we could hit issues like:

1. tcm_loop_port_unlink runs and releases tl_tpg_port_count count.
2. userspace initiated scan commands were in progress and complete.
target_fabric_port_unlink->core_dev_del_lun->core_tpg_remove_lun was
waiting for those last IOs and now completes and deletes the mapped lun
from lio.
3. scsi-ml scan completed before the unmapping and so now we have a
scsi_device but no lio lun mapping.


The problem with this is that:

1. We can hit mismatched settings like this:

A. We now have this scsi_device at LUN0, but no lio mapping. User thinks
everything got cleaned up ok, so they decide to map another lio lun.
B. tcm_loop_port_link just does a scsi_add_device which does
scsi_probe_and_add_lun with rescan=true. So the original scsi_device is
returned. It is not reprobed so whatever settings the old device has
will be used. Maybe that scsi_device was for a disk and now the user was
adding a CD.

2. And hit races like:

A. tl_tpg_port_count might now be zero so the tpg and nexus can be removed.
B. That removal can race with IO being sent to the scsi_device. If a
command has passed tcm_loop_submission_work's NULL tl_nexus check then
we will hit a NULL pointer crash later in the function.
Naohiro Aota Aug. 22, 2019, 6:51 a.m. UTC | #5
On Tue, Aug 20, 2019 at 12:19:25PM -0500, Mike Christie wrote:
>CAUTION: This email originated from outside of Western Digital. Do not click on links or open attachments unless you recognize the sender and know that the content is safe.
>
>
>On 08/20/2019 10:43 AM, Bart Van Assche wrote:
>> On 8/20/19 8:27 AM, Mike Christie wrote:
>>> tcm loop does not take a reference to the scsi_device at creation/link
>>> time then need to release at removal/unlink time. The above
>>> scsi_device_put is for the successful scsi_device_lookup call. tcm loop
>>> works like a scsi host driver that does its own scanning via
>>> scsi_add_device (maybe similar to scsi drivers that are raid cards).
>>> Like other host drivers it does not take a reference to the device when
>>> it is added and relies on scsi-ml to handle all that for it before doing
>>> operations like queuecommand.
>>>
>>> The leak is if you removed the scsi_device via the scsi ml sysfs
>>> interface then there is no way to completely unlink the lio port because
>>> if scsi_device_lookup fails we return from the function and do not do
>>> not release our refcount on the tl_tpg_port_count.
>>
>> Hi Mike,
>>
>> Does this mean that you think that this patch is the right way to
>> address the reported issue?
>>
>
>It fixes that very specific issue, but it leaves others. Maybe it could
>be used in a patchset that builds on it?
>
>I think we could hit issues like:
>
>1. tcm_loop_port_unlink runs and releases tl_tpg_port_count count.
>2. userspace initiated scan commands were in progress and complete.
>target_fabric_port_unlink->core_dev_del_lun->core_tpg_remove_lun was
>waiting for those last IOs and now completes and deletes the mapped lun
>from lio.
>3. scsi-ml scan completed before the unmapping and so now we have a
>scsi_device but no lio lun mapping.

Right, this can happen. Actually, this can happen even without my
patch if the scan can occur between scsi_remove_device() in
tcm_loop_port_unlink() and core_tpg_remove_lun(), right?

I presumed we need to use some lock around here. I digged around the
code but cannot figure out a suitable lock here. Actually, I tried
(struct Scsi_Host)->scan_mutex, but it didn't work.

Or, we should block tcm_loop_queuecommand() to proceed the INQUIRY
commads on this LUN? move/introduce new hook after
transport_clear_lun_ref(lun)?

Any idea?

>The problem with this is that:
>
>1. We can hit mismatched settings like this:
>
>A. We now have this scsi_device at LUN0, but no lio mapping. User thinks
>everything got cleaned up ok, so they decide to map another lio lun.
>B. tcm_loop_port_link just does a scsi_add_device which does
>scsi_probe_and_add_lun with rescan=true. So the original scsi_device is
>returned. It is not reprobed so whatever settings the old device has
>will be used. Maybe that scsi_device was for a disk and now the user was
>adding a CD.
>
>2. And hit races like:
>
>A. tl_tpg_port_count might now be zero so the tpg and nexus can be removed.
>B. That removal can race with IO being sent to the scsi_device. If a
>command has passed tcm_loop_submission_work's NULL tl_nexus check then
>we will hit a NULL pointer crash later in the function.
>
>
diff mbox series

Patch

diff --git a/drivers/target/loopback/tcm_loop.c b/drivers/target/loopback/tcm_loop.c
index 3305b47fdf53..0942f3bd7eec 100644
--- a/drivers/target/loopback/tcm_loop.c
+++ b/drivers/target/loopback/tcm_loop.c
@@ -654,16 +654,16 @@  static void tcm_loop_port_unlink(
 
 	sd = scsi_device_lookup(tl_hba->sh, 0, tl_tpg->tl_tpgt,
 				se_lun->unpacked_lun);
-	if (!sd) {
-		pr_err("Unable to locate struct scsi_device for %d:%d:%llu\n",
-		       0, tl_tpg->tl_tpgt, se_lun->unpacked_lun);
-		return;
+	if (sd) {
+		/*
+		 * Remove Linux/SCSI struct scsi_device by HCTL
+		 */
+		scsi_remove_device(sd);
+		scsi_device_put(sd);
+	} else {
+		pr_debug("Unable to locate struct scsi_device for %d:%d:%llu\n",
+			 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun);
 	}
-	/*
-	 * Remove Linux/SCSI struct scsi_device by HCTL
-	 */
-	scsi_remove_device(sd);
-	scsi_device_put(sd);
 
 	atomic_dec_mb(&tl_tpg->tl_tpg_port_count);