Message ID | 20190820090429.1961085-1-naohiro.aota@wdc.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2,1/2] scsi: target/tcm_loop: ignore already deleted scsi device | expand |
On 8/20/19 2:04 AM, Naohiro Aota wrote: > If there is no corresponding scsi_device for a LUN, > tcm_loop_port_unlink() complains that it "Unable to locate struct > scsi_device for " the device and keep %tl_tpg_port_count as is. However, > such situation is legal when we delete a SCSI device using > /sys/class/scsi_device/${lun}/device/delete. We can safely ignore the > missing SCSI device case here. > > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> > --- > drivers/target/loopback/tcm_loop.c | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/drivers/target/loopback/tcm_loop.c b/drivers/target/loopback/tcm_loop.c > index 3305b47fdf53..0942f3bd7eec 100644 > --- a/drivers/target/loopback/tcm_loop.c > +++ b/drivers/target/loopback/tcm_loop.c > @@ -654,16 +654,16 @@ static void tcm_loop_port_unlink( > > sd = scsi_device_lookup(tl_hba->sh, 0, tl_tpg->tl_tpgt, > se_lun->unpacked_lun); > - if (!sd) { > - pr_err("Unable to locate struct scsi_device for %d:%d:%llu\n", > - 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun); > - return; > + if (sd) { > + /* > + * Remove Linux/SCSI struct scsi_device by HCTL > + */ > + scsi_remove_device(sd); > + scsi_device_put(sd); > + } else { > + pr_debug("Unable to locate struct scsi_device for %d:%d:%llu\n", > + 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun); > } > - /* > - * Remove Linux/SCSI struct scsi_device by HCTL > - */ > - scsi_remove_device(sd); > - scsi_device_put(sd); > > atomic_dec_mb(&tl_tpg->tl_tpg_port_count); The above patch looks wrong to me. I think this patch does not fix the reference leak present in the current code. Have you considered to modify tcm_loop_port_link() such that it saves the pointer returned by scsi_add_device() and to use that pointer in tcm_loop_port_unlink()? Bart.
On 08/20/2019 09:02 AM, Bart Van Assche wrote: > On 8/20/19 2:04 AM, Naohiro Aota wrote: >> If there is no corresponding scsi_device for a LUN, >> tcm_loop_port_unlink() complains that it "Unable to locate struct >> scsi_device for " the device and keep %tl_tpg_port_count as is. However, >> such situation is legal when we delete a SCSI device using >> /sys/class/scsi_device/${lun}/device/delete. We can safely ignore the >> missing SCSI device case here. >> >> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> >> --- >> drivers/target/loopback/tcm_loop.c | 18 +++++++++--------- >> 1 file changed, 9 insertions(+), 9 deletions(-) >> >> diff --git a/drivers/target/loopback/tcm_loop.c >> b/drivers/target/loopback/tcm_loop.c >> index 3305b47fdf53..0942f3bd7eec 100644 >> --- a/drivers/target/loopback/tcm_loop.c >> +++ b/drivers/target/loopback/tcm_loop.c >> @@ -654,16 +654,16 @@ static void tcm_loop_port_unlink( >> sd = scsi_device_lookup(tl_hba->sh, 0, tl_tpg->tl_tpgt, >> se_lun->unpacked_lun); >> - if (!sd) { >> - pr_err("Unable to locate struct scsi_device for %d:%d:%llu\n", >> - 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun); >> - return; >> + if (sd) { >> + /* >> + * Remove Linux/SCSI struct scsi_device by HCTL >> + */ >> + scsi_remove_device(sd); >> + scsi_device_put(sd); >> + } else { >> + pr_debug("Unable to locate struct scsi_device for %d:%d:%llu\n", >> + 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun); >> } >> - /* >> - * Remove Linux/SCSI struct scsi_device by HCTL >> - */ >> - scsi_remove_device(sd); >> - scsi_device_put(sd); >> atomic_dec_mb(&tl_tpg->tl_tpg_port_count); > > The above patch looks wrong to me. I think this patch does not fix the > reference leak present in the current code. Have you considered to > modify tcm_loop_port_link() such that it saves the pointer returned by > scsi_add_device() and to use that pointer in tcm_loop_port_unlink()? > Are you guys talking about different issues? tcm loop does not take a reference to the scsi_device at creation/link time then need to release at removal/unlink time. The above scsi_device_put is for the successful scsi_device_lookup call. tcm loop works like a scsi host driver that does its own scanning via scsi_add_device (maybe similar to scsi drivers that are raid cards). Like other host drivers it does not take a reference to the device when it is added and relies on scsi-ml to handle all that for it before doing operations like queuecommand. The leak is if you removed the scsi_device via the scsi ml sysfs interface then there is no way to completely unlink the lio port because if scsi_device_lookup fails we return from the function and do not do not release our refcount on the tl_tpg_port_count.
On 8/20/19 8:27 AM, Mike Christie wrote: > tcm loop does not take a reference to the scsi_device at creation/link > time then need to release at removal/unlink time. The above > scsi_device_put is for the successful scsi_device_lookup call. tcm loop > works like a scsi host driver that does its own scanning via > scsi_add_device (maybe similar to scsi drivers that are raid cards). > Like other host drivers it does not take a reference to the device when > it is added and relies on scsi-ml to handle all that for it before doing > operations like queuecommand. > > The leak is if you removed the scsi_device via the scsi ml sysfs > interface then there is no way to completely unlink the lio port because > if scsi_device_lookup fails we return from the function and do not do > not release our refcount on the tl_tpg_port_count. Hi Mike, Does this mean that you think that this patch is the right way to address the reported issue? Thanks, Bart.
On 08/20/2019 10:43 AM, Bart Van Assche wrote: > On 8/20/19 8:27 AM, Mike Christie wrote: >> tcm loop does not take a reference to the scsi_device at creation/link >> time then need to release at removal/unlink time. The above >> scsi_device_put is for the successful scsi_device_lookup call. tcm loop >> works like a scsi host driver that does its own scanning via >> scsi_add_device (maybe similar to scsi drivers that are raid cards). >> Like other host drivers it does not take a reference to the device when >> it is added and relies on scsi-ml to handle all that for it before doing >> operations like queuecommand. >> >> The leak is if you removed the scsi_device via the scsi ml sysfs >> interface then there is no way to completely unlink the lio port because >> if scsi_device_lookup fails we return from the function and do not do >> not release our refcount on the tl_tpg_port_count. > > Hi Mike, > > Does this mean that you think that this patch is the right way to > address the reported issue? > It fixes that very specific issue, but it leaves others. Maybe it could be used in a patchset that builds on it? I think we could hit issues like: 1. tcm_loop_port_unlink runs and releases tl_tpg_port_count count. 2. userspace initiated scan commands were in progress and complete. target_fabric_port_unlink->core_dev_del_lun->core_tpg_remove_lun was waiting for those last IOs and now completes and deletes the mapped lun from lio. 3. scsi-ml scan completed before the unmapping and so now we have a scsi_device but no lio lun mapping. The problem with this is that: 1. We can hit mismatched settings like this: A. We now have this scsi_device at LUN0, but no lio mapping. User thinks everything got cleaned up ok, so they decide to map another lio lun. B. tcm_loop_port_link just does a scsi_add_device which does scsi_probe_and_add_lun with rescan=true. So the original scsi_device is returned. It is not reprobed so whatever settings the old device has will be used. Maybe that scsi_device was for a disk and now the user was adding a CD. 2. And hit races like: A. tl_tpg_port_count might now be zero so the tpg and nexus can be removed. B. That removal can race with IO being sent to the scsi_device. If a command has passed tcm_loop_submission_work's NULL tl_nexus check then we will hit a NULL pointer crash later in the function.
On Tue, Aug 20, 2019 at 12:19:25PM -0500, Mike Christie wrote: >CAUTION: This email originated from outside of Western Digital. Do not click on links or open attachments unless you recognize the sender and know that the content is safe. > > >On 08/20/2019 10:43 AM, Bart Van Assche wrote: >> On 8/20/19 8:27 AM, Mike Christie wrote: >>> tcm loop does not take a reference to the scsi_device at creation/link >>> time then need to release at removal/unlink time. The above >>> scsi_device_put is for the successful scsi_device_lookup call. tcm loop >>> works like a scsi host driver that does its own scanning via >>> scsi_add_device (maybe similar to scsi drivers that are raid cards). >>> Like other host drivers it does not take a reference to the device when >>> it is added and relies on scsi-ml to handle all that for it before doing >>> operations like queuecommand. >>> >>> The leak is if you removed the scsi_device via the scsi ml sysfs >>> interface then there is no way to completely unlink the lio port because >>> if scsi_device_lookup fails we return from the function and do not do >>> not release our refcount on the tl_tpg_port_count. >> >> Hi Mike, >> >> Does this mean that you think that this patch is the right way to >> address the reported issue? >> > >It fixes that very specific issue, but it leaves others. Maybe it could >be used in a patchset that builds on it? > >I think we could hit issues like: > >1. tcm_loop_port_unlink runs and releases tl_tpg_port_count count. >2. userspace initiated scan commands were in progress and complete. >target_fabric_port_unlink->core_dev_del_lun->core_tpg_remove_lun was >waiting for those last IOs and now completes and deletes the mapped lun >from lio. >3. scsi-ml scan completed before the unmapping and so now we have a >scsi_device but no lio lun mapping. Right, this can happen. Actually, this can happen even without my patch if the scan can occur between scsi_remove_device() in tcm_loop_port_unlink() and core_tpg_remove_lun(), right? I presumed we need to use some lock around here. I digged around the code but cannot figure out a suitable lock here. Actually, I tried (struct Scsi_Host)->scan_mutex, but it didn't work. Or, we should block tcm_loop_queuecommand() to proceed the INQUIRY commads on this LUN? move/introduce new hook after transport_clear_lun_ref(lun)? Any idea? >The problem with this is that: > >1. We can hit mismatched settings like this: > >A. We now have this scsi_device at LUN0, but no lio mapping. User thinks >everything got cleaned up ok, so they decide to map another lio lun. >B. tcm_loop_port_link just does a scsi_add_device which does >scsi_probe_and_add_lun with rescan=true. So the original scsi_device is >returned. It is not reprobed so whatever settings the old device has >will be used. Maybe that scsi_device was for a disk and now the user was >adding a CD. > >2. And hit races like: > >A. tl_tpg_port_count might now be zero so the tpg and nexus can be removed. >B. That removal can race with IO being sent to the scsi_device. If a >command has passed tcm_loop_submission_work's NULL tl_nexus check then >we will hit a NULL pointer crash later in the function. > >
diff --git a/drivers/target/loopback/tcm_loop.c b/drivers/target/loopback/tcm_loop.c index 3305b47fdf53..0942f3bd7eec 100644 --- a/drivers/target/loopback/tcm_loop.c +++ b/drivers/target/loopback/tcm_loop.c @@ -654,16 +654,16 @@ static void tcm_loop_port_unlink( sd = scsi_device_lookup(tl_hba->sh, 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun); - if (!sd) { - pr_err("Unable to locate struct scsi_device for %d:%d:%llu\n", - 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun); - return; + if (sd) { + /* + * Remove Linux/SCSI struct scsi_device by HCTL + */ + scsi_remove_device(sd); + scsi_device_put(sd); + } else { + pr_debug("Unable to locate struct scsi_device for %d:%d:%llu\n", + 0, tl_tpg->tl_tpgt, se_lun->unpacked_lun); } - /* - * Remove Linux/SCSI struct scsi_device by HCTL - */ - scsi_remove_device(sd); - scsi_device_put(sd); atomic_dec_mb(&tl_tpg->tl_tpg_port_count);
If there is no corresponding scsi_device for a LUN, tcm_loop_port_unlink() complains that it "Unable to locate struct scsi_device for " the device and keep %tl_tpg_port_count as is. However, such situation is legal when we delete a SCSI device using /sys/class/scsi_device/${lun}/device/delete. We can safely ignore the missing SCSI device case here. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> --- drivers/target/loopback/tcm_loop.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)