Message ID | 20230509065230.32552-3-ed.tsai@mediatek.com
---|---
State | New, archived
Series | block: improve the share tag set performance
> The tags allocation is limited by the fair sharing algorithm. It hurts
> the performance for UFS devices, because the queue depth of general I/O
> is reduced by half once the UFS device sends a control command.

Ack. However, I think that decision should be up to the platform owner,
and not made in the core driver.

Thanks,
Avri

>
> Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> ---
>  drivers/ufs/core/ufshcd.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index 17d7bb875fee..e96a50265285 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -5149,6 +5149,9 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)
>  	blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
>  	if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
>  		blk_queue_update_dma_alignment(q, 4096 - 1);
> +
> +	blk_queue_flag_clear(QUEUE_FLAG_FAIR_TAG_SHARING, q);
> +
>  	/*
>  	 * Block runtime-pm until all consumers are added.
>  	 * Refer ufshcd_setup_links().
> --
> 2.18.0
On 5/9/23 01:03, Avri Altman wrote:
> However, I think that decision should be up to the platform owner,
> and not made in the core driver.

Hi Avri,

I don't see any use case in which the performance of a UFS device would be
improved by leaving QUEUE_FLAG_FAIR_TAG_SHARING enabled. Are you perhaps
aware of such a use case?

Thanks,

Bart.
> On 5/9/23 01:03, Avri Altman wrote:
> > However, I think that decision should be up to the platform owner,
> > and not made in the core driver.
>
> Hi Avri,
>
> I don't see any use case in which the performance of a UFS device would be
> improved by leaving QUEUE_FLAG_FAIR_TAG_SHARING enabled. Are you perhaps
> aware of such a use case?

Following your argument, why does fair allocation exist in the first place?

When running benchmarks I am hacking around the scheduler's "fair" tag
allocation as well. That's why I acked this change.

Since this change may affect the I/O profile as a whole, I think the platform
owners should have the flexibility not to use it, should they choose to.

Thanks,
Avri

>
> Thanks,
>
> Bart.
On 5/9/23 09:19, Avri Altman wrote:
> Following your argument, why does fair allocation exist in the first place?
Does this patch answer your question?
[PATCH v2] block: Improve shared tag set performance
https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/
Thanks,
Bart.
> On 5/9/23 09:19, Avri Altman wrote:
> > Following your argument, why does fair allocation exist in the first place?
>
> Does this patch answer your question?
>
> [PATCH v2] block: Improve shared tag set performance
> https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/

OK.

Why was it necessary to invent a new flag instead of just clearing
BLK_MQ_F_TAG_QUEUE_SHARED?

Thanks,
Avri

>
> Thanks,
>
> Bart.
On 5/9/23 22:21, Avri Altman wrote:
> Why was it necessary to invent a new flag instead of just clearing
> BLK_MQ_F_TAG_QUEUE_SHARED?

Hi Avri,

The meaning of BLK_MQ_F_TAG_QUEUE_SHARED is "the tag set is shared across
multiple request queues". Clearing BLK_MQ_F_TAG_QUEUE_SHARED while the tag
set is in fact shared would be wrong: it would break all the code outside
the tag allocation code that tests this flag.

Thanks,

Bart.
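To make the distinction in the message above concrete, here is a minimal
sketch in plain C. All type and field names are invented for illustration;
this is not the blk-mq implementation. The point is that only the tag
allocator consults the proposed per-queue opt-out, while everything else
keeps relying on the set-level "shared" flag.

#include <stdbool.h>

struct tag_set_sketch {
	bool shared;		/* tag set is used by more than one queue */
};

struct request_queue_sketch {
	bool fair_tag_sharing;	/* per-queue opt-out proposed in this thread */
};

/* Only the tag allocator asks this question. */
static bool apply_fairness_limit(const struct tag_set_sketch *set,
				 const struct request_queue_sketch *q)
{
	return set->shared && q->fair_tag_sharing;
}

/* Everything else only needs to know whether the set is shared. */
static bool tag_set_is_shared(const struct tag_set_sketch *set)
{
	return set->shared;
}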
On Tue, May 09, 2023 at 02:52:30PM +0800, Ed Tsai wrote:
> The tags allocation is limited by the fair sharing algorithm. It hurts
> the performance for UFS devices, because the queue depth of general I/O
> is reduced by half once the UFS device sends a control command.

But it is there for a reason. You completely fail to explain why you
think your change is safe, and also why you did not even try to explain
where the overhead is and how else you tried to mitigate it.
On 5/11/23 08:34, Christoph Hellwig wrote:
> On Tue, May 09, 2023 at 02:52:30PM +0800, Ed Tsai wrote:
>> The tags allocation is limited by the fair sharing algorithm. It hurts
>> the performance for UFS devices, because the queue depth of general I/O
>> is reduced by half once the UFS device sends a control command.
>
> But it is there for a reason. You completely fail to explain why you
> think your change is safe, and also why you did not even try to explain
> where the overhead is and how else you tried to mitigate it.

Hi Christoph,

For which devices is the fair sharing algorithm useful? As far as I know the
legacy block layer did not have an equivalent of the fair sharing algorithm
and I'm not aware of any complaints about the legacy block layer regarding
fairness. This is why I proposed in January to remove the fair sharing code
entirely. See also
https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/.

Thanks,

Bart.
On Thu, May 11, 2023 at 08:38:04AM -0700, Bart Van Assche wrote:
> For which devices is the fair sharing algorithm useful? As far as I know the
> legacy block layer did not have an equivalent of the fair sharing algorithm
> and I'm not aware of any complaints about the legacy block layer regarding
> fairness. This is why I proposed in January to remove the fair sharing code
> entirely. See also
> https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/.

Because the old code did not do tag allocation itself? Either way, I don't
think an "I'll opt out for a random driver" approach is the proper one when
you think it's not needed. Especially not without any data explaining why
just that driver is a special snowflake.
On 5/12/23 07:02, Christoph Hellwig wrote:
> On Thu, May 11, 2023 at 08:38:04AM -0700, Bart Van Assche wrote:
>> For which devices is the fair sharing algorithm useful? As far as I know the
>> legacy block layer did not have an equivalent of the fair sharing algorithm
>> and I'm not aware of any complaints about the legacy block layer regarding
>> fairness. This is why I proposed in January to remove the fair sharing code
>> entirely. See also
>> https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/.
>
> Because the old code did not do tag allocation itself? Either way, I don't
> think an "I'll opt out for a random driver" approach is the proper one when
> you think it's not needed. Especially not without any data explaining why
> just that driver is a special snowflake.

Hi Christoph,

I'm still wondering whether there are any drivers that benefit from the fair
tag sharing algorithm. If the number of tags is large enough (NVMe), the
number of tags exceeds the number of requests in flight and hence the fair
tag sharing algorithm is not necessary.

The fair tag sharing algorithm has a negative impact on all SCSI devices with
multiple logical units. This is because logical units are considered active
until (request timeout) seconds have elapsed after the logical unit stopped
being used (see also the blk_mq_tag_idle() call in blk_mq_timeout_work()).
UFS users are hit by this because UFS 3.0 devices have a limited queue depth
(32) and because power management commands are submitted to a logical unit
(WLUN). Hence, it happens often that the block layer "active queue" counter
is equal to 2 while only one logical unit is being used actively (a logical
unit backed by NAND flash). The performance difference between queue depths
16 and 32 for UFS devices is significant.

Is my understanding correct that in the legacy block layer implementation
blk_queue_start_tag() had to be called to assign a tag to a request? I
haven't found any code in the Linux kernel v4.20 implementation of
blk_queue_start_tag() that implements fairness in case a request tag map
(struct blk_queue_tag) is shared across request queues (one request queue
per logical unit in case of SCSI).

Do you agree with my conclusion that, from the point of view of the SCSI core
in general and the UFS driver in particular, the fair tag sharing algorithm
in the blk-mq code introduced a performance regression?

Thanks,

Bart.
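To make the queue-depth arithmetic in the message above concrete, here is a
small self-contained C sketch of the even-split limit being described. It is
a simplified model with invented names, not the actual blk-mq code: with 32
tags and an "active queue" count of 2, the data logical unit is limited to 16
tags even though the WLUN is almost idle.

#include <stdio.h>

/* Simplified model of the fair-sharing limit: divide the shared tags
 * evenly over the queues currently counted as active. */
static unsigned int fair_share_limit(unsigned int total_tags,
				     unsigned int active_queues)
{
	if (active_queues <= 1)
		return total_tags;
	return (total_tags + active_queues - 1) / active_queues;
}

int main(void)
{
	/* UFS 3.0 example: a power-management command to the WLUN keeps a
	 * second queue "active" for up to a request timeout, halving the
	 * depth available to the data logical unit. */
	printf("%u\n", fair_share_limit(32, 1));	/* 32 */
	printf("%u\n", fair_share_limit(32, 2));	/* 16 */
	return 0;
}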
Hi,

On 2023/05/13 2:12, Bart Van Assche wrote:
> The fair tag sharing algorithm has a negative impact on all SCSI devices
> with multiple logical units. This is because logical units are considered
> active until (request timeout) seconds have elapsed after the logical unit
> stopped being used (see also the blk_mq_tag_idle() call in
> blk_mq_timeout_work()). UFS users are hit by this because UFS 3.0 devices
> have a limited queue depth (32) and because power management commands are
> submitted to a logical unit (WLUN). Hence, it happens often that the block
> layer "active queue" counter is equal to 2 while only one logical unit is
> being used actively (a logical unit backed by NAND flash). The performance
> difference between queue depths 16 and 32 for UFS devices is significant.

We met a similar problem before, but I think removing fair tag sharing
might cause some problems, because getting a tag is not fair currently. For
example, if two devices share 32 tags while device a issues a large amount
of I/O concurrently and device b only issues one I/O, then, with fair tag
sharing removed, device b can see bad I/O latency.

By the way, I tried to propose a way to work around this as follows:

1) disable fair tag sharing until a tag allocation finds no tag available;
2) enable fair tag sharing again if the disk doesn't fail to get a tag for a
   period of time.

Can this approach be considered?

Thanks,
Kuai
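For readers following the thread, here is a minimal C sketch of the two-step
workaround proposed in the message above. All names, fields, and the grace
period are invented for illustration; this is not blk-mq code.

#include <stdbool.h>

/* Invented bookkeeping attached to a shared tag set. */
struct shared_tags_sketch {
	bool fair_sharing_enabled;
	unsigned long last_alloc_failure;	/* timestamp, e.g. jiffies */
};

#define FAIR_SHARING_GRACE	1000		/* arbitrary, for illustration */

/* Step 1: enable the fairness limit once an allocation finds no free tag. */
static void on_tag_alloc_failure(struct shared_tags_sketch *t, unsigned long now)
{
	t->last_alloc_failure = now;
	t->fair_sharing_enabled = true;
}

/* Step 2: drop the limit again after a quiet period without failures. */
static void maybe_disable_fair_sharing(struct shared_tags_sketch *t,
				       unsigned long now)
{
	if (t->fair_sharing_enabled &&
	    now - t->last_alloc_failure > FAIR_SHARING_GRACE)
		t->fair_sharing_enabled = false;
}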
On 5/12/23 20:09, Yu Kuai wrote:
> On 2023/05/13 2:12, Bart Van Assche wrote:
>> The fair tag sharing algorithm has a negative impact on all SCSI devices
>> with multiple logical units. This is because logical units are considered
>> active until (request timeout) seconds have elapsed after the logical unit
>> stopped being used (see also the blk_mq_tag_idle() call in
>> blk_mq_timeout_work()). UFS users are hit by this because UFS 3.0 devices
>> have a limited queue depth (32) and because power management commands are
>> submitted to a logical unit (WLUN). Hence, it happens often that the block
>> layer "active queue" counter is equal to 2 while only one logical unit is
>> being used actively (a logical unit backed by NAND flash). The performance
>> difference between queue depths 16 and 32 for UFS devices is significant.
>
> We met a similar problem before, but I think removing fair tag sharing
> might cause some problems, because getting a tag is not fair currently. For
> example, if two devices share 32 tags while device a issues a large amount
> of I/O concurrently and device b only issues one I/O, then, with fair tag
> sharing removed, device b can see bad I/O latency.
>
> By the way, I tried to propose a way to work around this as follows:
>
> 1) disable fair tag sharing until a tag allocation finds no tag available;
> 2) enable fair tag sharing again if the disk doesn't fail to get a tag for a
>    period of time.
>
> Can this approach be considered?

I'm afraid that this approach won't help for the UFS driver since it is
likely that all tags are in use by a single logical unit during an IOPS
test. Hence, fair sharing would be enabled even when we don't want it to be
enabled.

I propose that we switch to one of these two approaches:
* Either remove the fair tag sharing code entirely and rely on the fairness
  mechanism provided by the sbitmap code. I'm referring to how
  __sbitmap_queue_wake_up() uses the wake_index member variable.
* Or make the behavior of the fairness algorithm configurable from user
  space. One possible approach is to make the proportion of tags for a
  logical unit / NVMe namespace configurable via sysfs. This would make it
  possible to reduce the number of tags for the WLUN of UFS devices.

Thanks,

Bart.
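As a rough illustration of the second proposal above, a user-configurable
share could replace the even split. The sketch below is plain C with
invented names and omits any sysfs plumbing; it only shows the resulting
limit calculation under that assumption.

/* Invented per-queue setting that a sysfs attribute could expose. */
struct queue_share_sketch {
	unsigned int share_percent;	/* e.g. 5 for a UFS WLUN */
};

static unsigned int share_limit(unsigned int total_tags,
				const struct queue_share_sketch *q)
{
	unsigned int limit = total_tags * q->share_percent / 100;

	return limit ? limit : 1;	/* always leave at least one tag */
}

/*
 * Example: with 32 tags, a WLUN configured at 5% would be limited to a
 * single tag, instead of the 16-tag limit the even split gives it.
 */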
Hi,

On 2023/05/16 23:12, Bart Van Assche wrote:
> On 5/12/23 20:09, Yu Kuai wrote:
>> On 2023/05/13 2:12, Bart Van Assche wrote:
>>> The fair tag sharing algorithm has a negative impact on all SCSI devices
>>> with multiple logical units. This is because logical units are considered
>>> active until (request timeout) seconds have elapsed after the logical
>>> unit stopped being used (see also the blk_mq_tag_idle() call in
>>> blk_mq_timeout_work()). UFS users are hit by this because UFS 3.0 devices
>>> have a limited queue depth (32) and because power management commands are
>>> submitted to a logical unit (WLUN). Hence, it happens often that the
>>> block layer "active queue" counter is equal to 2 while only one logical
>>> unit is being used actively (a logical unit backed by NAND flash). The
>>> performance difference between queue depths 16 and 32 for UFS devices is
>>> significant.
>>
>> We met a similar problem before, but I think removing fair tag sharing
>> might cause some problems, because getting a tag is not fair currently.
>> For example, if two devices share 32 tags while device a issues a large
>> amount of I/O concurrently and device b only issues one I/O, then, with
>> fair tag sharing removed, device b can see bad I/O latency.
>>
>> By the way, I tried to propose a way to work around this as follows:
>>
>> 1) disable fair tag sharing until a tag allocation finds no tag available;
>> 2) enable fair tag sharing again if the disk doesn't fail to get a tag for
>>    a period of time.
>>
>> Can this approach be considered?
>
> I'm afraid that this approach won't help for the UFS driver since it is
> likely that all tags are in use by a single logical unit during an IOPS
> test. Hence, fair sharing would be enabled even when we don't want it to be
> enabled.

You're right, my original method is not flexible.

> I propose that we switch to one of these two approaches:

How about a smoothing method in which a device that issues more I/O gets a
larger share of the tags, and each device gets at least one tag?

Thanks,
Kuai

> * Either remove the fair tag sharing code entirely and rely on the fairness
>   mechanism provided by the sbitmap code. I'm referring to how
>   __sbitmap_queue_wake_up() uses the wake_index member variable.
> * Or make the behavior of the fairness algorithm configurable from user
>   space. One possible approach is to make the proportion of tags for a
>   logical unit / NVMe namespace configurable via sysfs. This would make it
>   possible to reduce the number of tags for the WLUN of UFS devices.
>
> Thanks,
>
> Bart.
On 5/17/23 00:49, Yu Kuai wrote:
> On 2023/05/16 23:12, Bart Van Assche wrote:
>> I propose that we switch to one of these two approaches:
>
> How about a smoothing method in which a device that issues more I/O gets a
> larger share of the tags, and each device gets at least one tag?

Hi Yu,

hctx_may_queue() is called from the hot path (blk_mq_get_tag()). I'm pretty
sure that adding any nontrivial code in that path will cause a performance
(IOPS) regression. So I don't think that adding a smoothing method in
hctx_may_queue() is a realistic option.

Thanks,

Bart.
Hi,

On 2023/05/18 2:23, Bart Van Assche wrote:
> On 5/17/23 00:49, Yu Kuai wrote:
>> On 2023/05/16 23:12, Bart Van Assche wrote:
>>> I propose that we switch to one of these two approaches:
>>
>> How about a smoothing method in which a device that issues more I/O gets a
>> larger share of the tags, and each device gets at least one tag?
>
> Hi Yu,
>
> hctx_may_queue() is called from the hot path (blk_mq_get_tag()). I'm pretty
> sure that adding any nontrivial code in that path will cause a performance
> (IOPS) regression. So I don't think that adding a smoothing method in
> hctx_may_queue() is a realistic option.

Currently, the fair-share check in hctx_may_queue() requires two
atomic_read() calls (active_queues and active_requests). I think this
smoothing method can be placed in the tag allocation failure path: for
example, the more often a disk fails to get a tag within a period of time,
the more tags this disk can get, and all the bookkeeping can be updated
there (perhaps by directly recording how many tags a disk may get, so that
hctx_may_queue() still only requires two atomic_read() calls).

Thanks,
Kuai
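A rough sketch of the bookkeeping split described in the message above: the
allocation failure path adjusts a per-queue budget, while the hot path only
reads it. All names and thresholds are invented for illustration; this is
not blk-mq code.

#include <stdatomic.h>
#include <stdbool.h>

struct queue_budget_sketch {
	atomic_uint allowed_tags;	/* read in the hot path */
	atomic_uint recent_failures;	/* updated only on allocation failure */
};

/* Hot path: a single extra atomic load, comparable to today's check. */
static inline bool may_queue(struct queue_budget_sketch *q,
			     unsigned int in_flight)
{
	return in_flight < atomic_load(&q->allowed_tags);
}

/* Slow path: grow the budget of a queue that keeps failing to get a tag. */
static void on_alloc_failure(struct queue_budget_sketch *q,
			     unsigned int total_tags)
{
	unsigned int failures = atomic_fetch_add(&q->recent_failures, 1) + 1;
	unsigned int budget = atomic_load(&q->allowed_tags);

	if (failures > 8 && budget < total_tags)	/* thresholds arbitrary */
		atomic_store(&q->allowed_tags, budget + 1);
}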
On 5/17/23 18:49, Yu Kuai wrote:
> Currently, the fair-share check in hctx_may_queue() requires two
> atomic_read() calls (active_queues and active_requests). I think this
> smoothing method can be placed in the tag allocation failure path: for
> example, the more often a disk fails to get a tag within a period of time,
> the more tags this disk can get, and all the bookkeeping can be updated
> there (perhaps by directly recording how many tags a disk may get, so that
> hctx_may_queue() still only requires two atomic_read() calls).

That sounds interesting to me. Do you perhaps plan to implement this
approach and to post it as a patch?

Thanks,

Bart.
Hi,

On 2023/05/18 10:23, Bart Van Assche wrote:
> On 5/17/23 18:49, Yu Kuai wrote:
>> Currently, the fair-share check in hctx_may_queue() requires two
>> atomic_read() calls (active_queues and active_requests). I think this
>> smoothing method can be placed in the tag allocation failure path: for
>> example, the more often a disk fails to get a tag within a period of time,
>> the more tags this disk can get, and all the bookkeeping can be updated
>> there (perhaps by directly recording how many tags a disk may get, so that
>> hctx_may_queue() still only requires two atomic_read() calls).
>
> That sounds interesting to me. Do you perhaps plan to implement this
> approach and to post it as a patch?

Of course, I'll try to send an RFC patch.

Thanks,
Kuai

>
> Thanks,
>
> Bart.
On 5/18/23 00:55, Yu Kuai wrote:
> On 2023/05/18 10:23, Bart Van Assche wrote:
>> On 5/17/23 18:49, Yu Kuai wrote:
>>> Currently, the fair-share check in hctx_may_queue() requires two
>>> atomic_read() calls (active_queues and active_requests). I think this
>>> smoothing method can be placed in the tag allocation failure path: for
>>> example, the more often a disk fails to get a tag within a period of
>>> time, the more tags this disk can get, and all the bookkeeping can be
>>> updated there (perhaps by directly recording how many tags a disk may
>>> get, so that hctx_may_queue() still only requires two atomic_read()
>>> calls).
>>
>> That sounds interesting to me. Do you perhaps plan to implement this
>> approach and to post it as a patch?
>
> Of course, I'll try to send an RFC patch.

Hi Kuai,

Has this RFC patch already been posted or did I perhaps miss it?

Thanks,

Bart.
Hi,

On 2023/06/13 22:07, Bart Van Assche wrote:
> On 5/18/23 00:55, Yu Kuai wrote:
>> On 2023/05/18 10:23, Bart Van Assche wrote:
>>> On 5/17/23 18:49, Yu Kuai wrote:
>>>> Currently, the fair-share check in hctx_may_queue() requires two
>>>> atomic_read() calls (active_queues and active_requests). I think this
>>>> smoothing method can be placed in the tag allocation failure path: for
>>>> example, the more often a disk fails to get a tag within a period of
>>>> time, the more tags this disk can get, and all the bookkeeping can be
>>>> updated there (perhaps by directly recording how many tags a disk may
>>>> get, so that hctx_may_queue() still only requires two atomic_read()
>>>> calls).
>>>
>>> That sounds interesting to me. Do you perhaps plan to implement this
>>> approach and to post it as a patch?
>>
>> Of course, I'll try to send an RFC patch.
>
> Hi Kuai,
>
> Has this RFC patch already been posted or did I perhaps miss it?

Sorry for the delay. I finished the version that switches from no sharing
directly to fair sharing; however, I found that it's not that simple to add
a smoothing method, and I'm stuck here for now.

I'll try to send an RFC version soon. The smoothing method may not be
flexible, but I'll make sure it can work for your simple case in which two
devices share tags and one only issues a few I/Os while the other issues
lots of I/O.

Thanks,
Kuai

>
> Thanks,
>
> Bart.
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 17d7bb875fee..e96a50265285 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5149,6 +5149,9 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)
 	blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
 	if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
 		blk_queue_update_dma_alignment(q, 4096 - 1);
+
+	blk_queue_flag_clear(QUEUE_FLAG_FAIR_TAG_SHARING, q);
+
 	/*
 	 * Block runtime-pm until all consumers are added.
 	 * Refer ufshcd_setup_links().
The tags allocation is limited by the fair sharing algorithm. It hurts
the performance for UFS devices, because the queue depth of general I/O
is reduced by half once the UFS device sends a control command.

Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 3 +++
 1 file changed, 3 insertions(+)