scsi: sd: Move sd_read_cpr() out of the q->limits_lock region

Message ID 20240801054234.540532-1-shinichiro.kawasaki@wdc.com (mailing list archive)
State New, archived

Commit Message

Shinichiro Kawasaki Aug. 1, 2024, 5:42 a.m. UTC
Commit 804e498e0496 ("sd: convert to the atomic queue limits API")
introduced pairs of function calls to queue_limits_start_update() and
queue_limits_commit_update(). These two functions lock and unlock
q->limits_lock. In sd_revalidate_disk(), sd_read_cpr() is called after
the queue_limits_start_update() call and before the
queue_limits_commit_update() call. sd_read_cpr() locks q->sysfs_dir_lock
and q->sysfs_lock. This created new lock dependencies between
q->limits_lock, q->sysfs_dir_lock and q->sysfs_lock, as follows:

sd_revalidate_disk
  queue_limits_start_update
    mutex_lock(&q->limits_lock)
  sd_read_cpr
    disk_set_independent_access_ranges
      mutex_lock(&q->sysfs_dir_lock)
      mutex_lock(&q->sysfs_lock)
      mutex_unlock(&q->sysfs_lock)
      mutex_unlock(&q->sysfs_dir_lock)
  queue_limits_commit_update
    mutex_unlock(&q->limits_lock)

However, the three locks already had reversed dependencies in other
places, so the new dependencies triggered the lockdep WARN "possible
circular locking dependency detected" [1]. This WARN was observed by
running the blktests test case srp/002.

To avoid the WARN, move the sd_read_cpr() call in sd_revalidate_disk()
after the queue_limits_commit_update() call. In other words, move the
sd_read_cpr() call out of the q->limits_lock region.

[1] https://lore.kernel.org/linux-scsi/vlmv53ni3ltwxplig5qnw4xsl2h6ccxijfbqzekx76vxoim5a5@dekv7q3es3tx/

Fixes: 804e498e0496 ("sd: convert to the atomic queue limits API")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 drivers/scsi/sd.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
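
The lock-ordering argument in the commit message can be illustrated with a
small user-space model. This is only a sketch, not kernel code: it records
"B was taken while A was held" edges, roughly in the spirit of lockdep, and
checks the resulting graph for a cycle. All model_* names, struct lock_graph
and revalidate_creates_cycle() are hypothetical. The reversed dependency in
model_other_path() is an assumption made for illustration, since the commit
message only says that reversed dependencies exist "in other places".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the three locks named in the commit message. */
enum lock_id { LIMITS_LOCK, SYSFS_DIR_LOCK, SYSFS_LOCK, NR_LOCKS };

struct lock_graph {
	bool edge[NR_LOCKS][NR_LOCKS];	/* edge[a][b]: b taken while a held */
	bool held[NR_LOCKS];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Record a dependency from every currently held lock to the new one. */
static void model_lock(struct lock_graph *g, enum lock_id l)
{
	for (int h = 0; h < NR_LOCKS; h++)
		if (g->held[h])
			g->edge[h][l] = true;
	g->held[l] = true;
}

static void model_unlock(struct lock_graph *g, enum lock_id l)
{
	assert(g->held[l]);
	g->held[l] = false;
}

/* Depth-first search: can "to" be reached from "from" along edges? */
static bool reaches(const struct lock_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int n = 0; n < NR_LOCKS; n++)
		if (g->edge[from][n] && !seen[n] && reaches(g, n, to, seen))
			return true;
	return false;
}

/* A cycle exists if some edge a->b has a path leading back from b to a. */
static bool graph_has_cycle(const struct lock_graph *g)
{
	for (int a = 0; a < NR_LOCKS; a++)
		for (int b = 0; b < NR_LOCKS; b++)
			if (g->edge[a][b]) {
				bool seen[NR_LOCKS] = { false };

				if (reaches(g, b, a, seen))
					return true;
			}
	return false;
}

/* Lock sequence of sd_read_cpr() as listed in the commit message. */
static void model_sd_read_cpr(struct lock_graph *g)
{
	model_lock(g, SYSFS_DIR_LOCK);
	model_lock(g, SYSFS_LOCK);
	model_unlock(g, SYSFS_LOCK);
	model_unlock(g, SYSFS_DIR_LOCK);
}

static void model_revalidate(struct lock_graph *g, bool cpr_inside_limits_lock)
{
	model_lock(g, LIMITS_LOCK);	/* queue_limits_start_update() */
	if (cpr_inside_limits_lock)
		model_sd_read_cpr(g);	/* placement before the patch */
	model_unlock(g, LIMITS_LOCK);	/* queue_limits_commit_update() */
	if (!cpr_inside_limits_lock)
		model_sd_read_cpr(g);	/* placement after the patch */
}

/* ASSUMPTION: some other path takes limits_lock under sysfs_lock. */
static void model_other_path(struct lock_graph *g)
{
	model_lock(g, SYSFS_LOCK);
	model_lock(g, LIMITS_LOCK);
	model_unlock(g, LIMITS_LOCK);
	model_unlock(g, SYSFS_LOCK);
}

static bool revalidate_creates_cycle(bool cpr_inside_limits_lock)
{
	struct lock_graph g;

	graph_init(&g);
	model_other_path(&g);
	model_revalidate(&g, cpr_inside_limits_lock);
	return graph_has_cycle(&g);
}
```

Under the assumed reversed dependency, revalidate_creates_cycle(true) finds
the cycle sysfs_lock -> limits_lock -> sysfs_lock, while
revalidate_creates_cycle(false) finds none, because no lock is taken while
limits_lock is held once the sd_read_cpr() call is moved.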

Comments

Coelho, Luciano Aug. 1, 2024, 5:49 a.m. UTC | #1
On Thu, 2024-08-01 at 14:42 +0900, Shin'ichiro Kawasaki wrote:
> Commit 804e498e0496 ("sd: convert to the atomic queue limits API")
> introduced pairs of function calls to queue_limits_start_update() and
> queue_limits_commit_update(). These two functions lock and unlock
> q->limits_lock. In sd_revalidate_disk(), sd_read_cpr() is called after
> queue_limits_start_update() call and before
> queue_limits_commit_update() call. sd_read_cpr() locks q->sysfs_dir_lock
> and &q->sysfs_lock. Then new lock dependencies were created between
> q->limits_lock, q->sysfs_dir_lock and q->sysfs_lock, as follows:
> 
> sd_revalidate_disk
>   queue_limits_start_update
>     mutex_lock(&q->limits_lock)
>   sd_read_cpr
>     disk_set_independent_access_ranges
>       mutex_lock(&q->sysfs_dir_lock)
>       mutex_lock(&q->sysfs_lock)
>       mutex_unlock(&q->sysfs_lock)
>       mutex_unlock(&q->sysfs_dir_lock)
>   queue_limits_commit_update
>     mutex_unlock(&q->limits_lock)
> 
> However, the three locks already had reversed dependencies in other
> places. Then the new dependencies triggered the lockdep WARN "possible
> circular locking dependency detected" [1]. This WARN was observed by
> running the blktests test case srp/002.
> 
> To avoid the WARN, move the sd_read_cpr() call in sd_revalidate_disk()
> after the queue_limits_commit_update() call. In other words, move the
> sd_read_cpr() call out of the q->limits_lock region.
> 
> [1] https://lore.kernel.org/linux-scsi/vlmv53ni3ltwxplig5qnw4xsl2h6ccxijfbqzekx76vxoim5a5@dekv7q3es3tx/
> 
> Fixes: 804e498e0496 ("sd: convert to the atomic queue limits API")
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  drivers/scsi/sd.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index adeaa8ab9951..08cbe3815006 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3753,7 +3753,6 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  			sd_read_block_limits_ext(sdkp);
>  			sd_read_block_characteristics(sdkp, &lim);
>  			sd_zbc_read_zones(sdkp, &lim, buffer);
> -			sd_read_cpr(sdkp);
>  		}
>  
>  		sd_print_capacity(sdkp, old_capacity);
> @@ -3808,6 +3807,14 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  	if (err)
>  		return err;
>  
> +	/*
> +	 * Query concurrent positioning ranges after
> +	 * queue_limits_commit_update() unlocked q->limits_lock to avoid
> +	 * deadlock with q->sysfs_dir_lock and q->sysfs_lock.
> +	 */
> +	if (sdkp->media_present && scsi_device_supports_vpd(sdp))
> +		sd_read_cpr(sdkp);
> +
>  	/*
>  	 * For a zoned drive, revalidating the zones can be done only once
>  	 * the gendisk capacity is set. So if this fails, set back the gendisk

This seems to do the trick! At least on our setups we're not seeing the
deadlock issue anymore.

Thanks, Shinichiro!

Tested-by: Luca Coelho <luciano.coelho@intel.com>

--
Cheers,
Luca.

Damien Le Moal Aug. 1, 2024, 7:09 a.m. UTC | #2
On 8/1/24 2:42 PM, Shin'ichiro Kawasaki wrote:
> Commit 804e498e0496 ("sd: convert to the atomic queue limits API")
> introduced pairs of function calls to queue_limits_start_update() and
> queue_limits_commit_update(). These two functions lock and unlock
> q->limits_lock. In sd_revalidate_disk(), sd_read_cpr() is called after
> queue_limits_start_update() call and before
> queue_limits_commit_update() call. sd_read_cpr() locks q->sysfs_dir_lock
> and &q->sysfs_lock. Then new lock dependencies were created between
> q->limits_lock, q->sysfs_dir_lock and q->sysfs_lock, as follows:
> 
> sd_revalidate_disk
>   queue_limits_start_update
>     mutex_lock(&q->limits_lock)
>   sd_read_cpr
>     disk_set_independent_access_ranges
>       mutex_lock(&q->sysfs_dir_lock)
>       mutex_lock(&q->sysfs_lock)
>       mutex_unlock(&q->sysfs_lock)
>       mutex_unlock(&q->sysfs_dir_lock)
>   queue_limits_commit_update
>     mutex_unlock(&q->limits_lock)
> 
> However, the three locks already had reversed dependencies in other
> places. Then the new dependencies triggered the lockdep WARN "possible
> circular locking dependency detected" [1]. This WARN was observed by
> running the blktests test case srp/002.
> 
> To avoid the WARN, move the sd_read_cpr() call in sd_revalidate_disk()
> after the queue_limits_commit_update() call. In other words, move the
> sd_read_cpr() call out of the q->limits_lock region.
> 
> [1] https://lore.kernel.org/linux-scsi/vlmv53ni3ltwxplig5qnw4xsl2h6ccxijfbqzekx76vxoim5a5@dekv7q3es3tx/
> 
> Fixes: 804e498e0496 ("sd: convert to the atomic queue limits API")
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

Given that sd_read_cpr() does not change any limit, looks good to me.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

Christoph Hellwig Aug. 1, 2024, 2:20 p.m. UTC | #3
Looks good, thanks!

Reviewed-by: Christoph Hellwig <hch@lst.de>

Bart Van Assche Aug. 1, 2024, 4:07 p.m. UTC | #4
On 7/31/24 10:42 PM, Shin'ichiro Kawasaki wrote:
> To avoid the WARN, move the sd_read_cpr() call in sd_revalidate_disk()
> after the queue_limits_commit_update() call. In other words, move the
> sd_read_cpr() call out of the q->limits_lock region.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index adeaa8ab9951..08cbe3815006 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3753,7 +3753,6 @@  static int sd_revalidate_disk(struct gendisk *disk)
 			sd_read_block_limits_ext(sdkp);
 			sd_read_block_characteristics(sdkp, &lim);
 			sd_zbc_read_zones(sdkp, &lim, buffer);
-			sd_read_cpr(sdkp);
 		}
 
 		sd_print_capacity(sdkp, old_capacity);
@@ -3808,6 +3807,14 @@  static int sd_revalidate_disk(struct gendisk *disk)
 	if (err)
 		return err;
 
+	/*
+	 * Query concurrent positioning ranges after
+	 * queue_limits_commit_update() unlocked q->limits_lock to avoid
+	 * deadlock with q->sysfs_dir_lock and q->sysfs_lock.
+	 */
+	if (sdkp->media_present && scsi_device_supports_vpd(sdp))
+		sd_read_cpr(sdkp);
+
 	/*
 	 * For a zoned drive, revalidating the zones can be done only once
 	 * the gendisk capacity is set. So if this fails, set back the gendisk