Message ID | 20211004120650.153218-3-adrian.hunter@intel.com |
---|---|
State | Changes Requested |
Series | scsi: ufs: Start to make driver synchronization easier to understand |
On 10/4/21 5:06 AM, Adrian Hunter wrote:
> To fit its new purpose as a more general purpose sleeping lock for the
> host.
>
> [ ... ]
>
> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
> index 9b1ef272fb3c..495e1c0afae3 100644
> --- a/drivers/scsi/ufs/ufshcd.h
> +++ b/drivers/scsi/ufs/ufshcd.h
> @@ -897,7 +897,7 @@ struct ufs_hba {
>  	enum bkops_status urgent_bkops_lvl;
>  	bool is_urgent_bkops_lvl_checked;
>
> -	struct rw_semaphore clk_scaling_lock;
> +	struct rw_semaphore host_rw_sem;
>  	unsigned char desc_size[QUERY_DESC_IDN_MAX];
>  	atomic_t scsi_block_reqs_cnt;

Hi Adrian,

It seems to me that this patch series goes in a different direction than
the one the JEDEC UFS committee is taking. The UFSHCI 4.0 specification
will support MCQ (Multi-Circular Queue). We will benefit most from the
v4.0 MCQ support if there is no contention between the CPUs that submit
UFS commands to different queues. I think the intention of this patch
series is to make a single synchronization object protect all submission
queues. I'm concerned that this will prevent us from fully benefiting
from multiqueue support.

Has it been considered to eliminate the clk_scaling_lock and instead use
RCU to serialize clock scaling against command processing? One possible
approach is to use blk_mq_freeze_queue() and blk_mq_unfreeze_queue()
around the clock scaling code. A disadvantage of using RCU is that
waiting for an RCU grace period takes some time - about 10 ms on my test
setup. I have not yet verified what the performance and time impact would
be of using an expedited RCU grace period instead of a regular one.

Bart.
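For illustration, a rough sketch of the freeze/unfreeze approach suggested
above (the helper names and their placement are assumptions for discussion,
not code from this series or from the reply):

#include <linux/blk-mq.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>
#include "ufshcd.h"

/*
 * Hypothetical: quiesce all SCSI request queues of the UFS host around
 * clock scaling instead of taking a host-wide lock for writing.
 * blk_mq_freeze_queue() waits for in-flight requests and blocks new
 * submissions inside blk-mq, so the submission hot path would not need
 * to take a reader lock at all.
 */
static void ufshcd_clk_scaling_freeze_queues(struct ufs_hba *hba)
{
	struct scsi_device *sdev;

	shost_for_each_device(sdev, hba->host)
		blk_mq_freeze_queue(sdev->request_queue);
}

static void ufshcd_clk_scaling_unfreeze_queues(struct ufs_hba *hba)
{
	struct scsi_device *sdev;

	shost_for_each_device(sdev, hba->host)
		blk_mq_unfreeze_queue(sdev->request_queue);
}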
On 04/10/2021 19:52, Bart Van Assche wrote:
> On 10/4/21 5:06 AM, Adrian Hunter wrote:
>> To fit its new purpose as a more general purpose sleeping lock for the
>> host.
>>
>> [ ... ]
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
>> index 9b1ef272fb3c..495e1c0afae3 100644
>> --- a/drivers/scsi/ufs/ufshcd.h
>> +++ b/drivers/scsi/ufs/ufshcd.h
>> @@ -897,7 +897,7 @@ struct ufs_hba {
>>  	enum bkops_status urgent_bkops_lvl;
>>  	bool is_urgent_bkops_lvl_checked;
>> -	struct rw_semaphore clk_scaling_lock;
>> +	struct rw_semaphore host_rw_sem;
>>  	unsigned char desc_size[QUERY_DESC_IDN_MAX];
>>  	atomic_t scsi_block_reqs_cnt;
>
> Hi Adrian,

Thanks for looking at this.

> It seems to me that this patch series goes in a different direction than
> the one the JEDEC UFS committee is taking. The UFSHCI 4.0 specification
> will support MCQ (Multi-Circular Queue). We will benefit most from the
> v4.0 MCQ support if there is no contention between the CPUs that submit
> UFS commands to different queues. I think the intention of this patch
> series is to make a single synchronization object protect all submission
> queues.

The intention is to make the locking easier to understand. We need either
shared access to the host (e.g. ufshcd_queuecommand) or exclusive
ownership of it (e.g. ufshcd_err_handler, PM, shutdown). We should be able
to do that with a single rw_semaphore.

> I'm concerned that this will prevent us from fully benefiting from
> multiqueue support.

You are talking about contention between ufshcd_queuecommand() running
simultaneously on 2 CPUs, right? In that case, down_read() should be
practically atomic, so there is no contention unless a third process is
waiting on down_write(), which never happens under normal circumstances.

> Has it been considered to eliminate the clk_scaling_lock and instead use
> RCU to serialize clock scaling against command processing? One possible
> approach is to use blk_mq_freeze_queue() and blk_mq_unfreeze_queue()
> around the clock scaling code. A disadvantage of using RCU is that
> waiting for an RCU grace period takes some time - about 10 ms on my test
> setup. I have not yet verified what the performance and time impact would
> be of using an expedited RCU grace period instead of a regular one.

It is probably worth measuring the performance impact of clk_scaling_lock
first.
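A minimal sketch of the shared vs. exclusive pattern described above, using
the wrapper helpers added by this patch (the bodies here are simplified
placeholders, not the real ufshcd code paths):

#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>
#include "ufshcd.h"

/* Command submission takes shared host access. */
static int ufshcd_queuecommand_sketch(struct Scsi_Host *shost,
				      struct scsi_cmnd *cmd)
{
	struct ufs_hba *hba = shost_priv(shost);

	ufshcd_down_read(hba);	/* shared: many submitters at once */
	/* ... compose and issue the command ... */
	ufshcd_up_read(hba);
	return 0;
}

/* Error handling, PM and shutdown take exclusive host ownership. */
static void ufshcd_err_handler_sketch(struct ufs_hba *hba)
{
	ufshcd_down_write(hba);	/* exclusive: drains all readers first */
	/* ... reset and recover the host ... */
	ufshcd_up_write(hba);
}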
On 10/4/21 10:06 PM, Adrian Hunter wrote:
> On 04/10/2021 19:52, Bart Van Assche wrote:
>> I'm concerned that this will prevent us from fully benefiting from
>> multiqueue support.
>
> You are talking about contention between ufshcd_queuecommand() running
> simultaneously on 2 CPUs, right? In that case, down_read() should be
> practically atomic, so there is no contention unless a third process is
> waiting on down_write(), which never happens under normal circumstances.
>
>> Has it been considered to eliminate the clk_scaling_lock and instead use
>> RCU to serialize clock scaling against command processing? One possible
>> approach is to use blk_mq_freeze_queue() and blk_mq_unfreeze_queue()
>> around the clock scaling code. A disadvantage of using RCU is that
>> waiting for an RCU grace period takes some time - about 10 ms on my test
>> setup. I have not yet verified what the performance and time impact
>> would be of using an expedited RCU grace period instead of a regular
>> one.
>
> It is probably worth measuring the performance impact of clk_scaling_lock
> first.

Upcoming UFS devices support several million IOPS. My experience, and that
of everyone else working with such storage devices, is that every single
atomic operation in the hot path causes a measurable performance overhead.
down_read() is a synchronization operation, and implementing
synchronization operations without using atomic loads or stores is not
possible. This is why I see clk_scaling_lock as a performance bottleneck.

Thanks,

Bart.
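A simplified model of why that matters at millions of IOPS (a conceptual
illustration only, not the actual rwsem implementation in
kernel/locking/rwsem.c): even the uncontended reader fast path is an atomic
read-modify-write on a counter shared by every submitting CPU, so the cache
line holding that counter bounces between CPUs on every command.

#include <linux/atomic.h>

/* Conceptual reader fast path: one atomic RMW per command submission. */
static inline bool reader_fastpath_model(atomic_long_t *count)
{
	/* Every CPU writes the same cache line, even with no lock waiters. */
	return atomic_long_add_return_acquire(1, count) >= 0;
}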
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 350ecfd90306..3912b74d50ae 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -9630,7 +9630,7 @@ int ufshcd_init(struct ufs_hba *hba, void __iomem *mmio_base, unsigned int irq)
 	/* Initialize mutex for exception event control */
 	mutex_init(&hba->ee_ctrl_mutex);
 
-	init_rwsem(&hba->clk_scaling_lock);
+	init_rwsem(&hba->host_rw_sem);
 
 	ufshcd_init_clk_gating(hba);
 
diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
index 9b1ef272fb3c..495e1c0afae3 100644
--- a/drivers/scsi/ufs/ufshcd.h
+++ b/drivers/scsi/ufs/ufshcd.h
@@ -897,7 +897,7 @@ struct ufs_hba {
 	enum bkops_status urgent_bkops_lvl;
 	bool is_urgent_bkops_lvl_checked;
 
-	struct rw_semaphore clk_scaling_lock;
+	struct rw_semaphore host_rw_sem;
 	unsigned char desc_size[QUERY_DESC_IDN_MAX];
 	atomic_t scsi_block_reqs_cnt;
 
@@ -1420,32 +1420,32 @@ static inline int ufshcd_rpmb_rpm_put(struct ufs_hba *hba)
 
 static inline void ufshcd_down_read(struct ufs_hba *hba)
 {
-	down_read(&hba->clk_scaling_lock);
+	down_read(&hba->host_rw_sem);
 }
 
 static inline void ufshcd_up_read(struct ufs_hba *hba)
 {
-	up_read(&hba->clk_scaling_lock);
+	up_read(&hba->host_rw_sem);
 }
 
 static inline int ufshcd_down_read_trylock(struct ufs_hba *hba)
 {
-	return down_read_trylock(&hba->clk_scaling_lock);
+	return down_read_trylock(&hba->host_rw_sem);
 }
 
 static inline void ufshcd_down_write(struct ufs_hba *hba)
 {
-	down_write(&hba->clk_scaling_lock);
+	down_write(&hba->host_rw_sem);
 }
 
 static inline void ufshcd_up_write(struct ufs_hba *hba)
 {
-	up_write(&hba->clk_scaling_lock);
+	up_write(&hba->host_rw_sem);
 }
 
 static inline void ufshcd_downgrade_write(struct ufs_hba *hba)
 {
-	downgrade_write(&hba->clk_scaling_lock);
+	downgrade_write(&hba->host_rw_sem);
 }
 
 #endif /* End of Header */
To fit its new purpose as a more general purpose sleeping lock for the
host.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 drivers/scsi/ufs/ufshcd.c |  2 +-
 drivers/scsi/ufs/ufshcd.h | 14 +++++++-------
 2 files changed, 8 insertions(+), 8 deletions(-)