
[2/2] ufs: don't use fair tag sharing

Message ID 20230509065230.32552-3-ed.tsai@mediatek.com (mailing list archive)
State Changes Requested
Series block: improve the shared tag set performance

Commit Message

Ed Tsai (蔡宗軒) May 9, 2023, 6:52 a.m. UTC
Tag allocation is limited by the fair sharing algorithm. This hurts
performance for UFS devices, because the queue depth available to general
I/O is cut in half as soon as a control command is sent to the UFS device.

Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 3 +++
 1 file changed, 3 insertions(+)
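
For reference, here is a simplified sketch of the fair-sharing arithmetic that
causes the halving described above. It is an illustration only, with made-up
names, not the actual blk-mq code:

#include <stdbool.h>

/*
 * When a shared tag set has more than one "active" queue, each queue is
 * limited to roughly depth / active_queues tags. With a UFS queue depth
 * of 32 and two active queues (a data LU plus the WLUN that received a
 * control command), the data LU is limited to about 16 tags.
 */
static bool may_allocate_tag(unsigned int tagset_depth,
			     unsigned int active_queues,
			     unsigned int tags_used_by_this_queue)
{
	unsigned int limit;

	if (active_queues <= 1)
		return true;		/* no sharing: full depth available */

	limit = tagset_depth / active_queues;
	if (limit < 4)
		limit = 4;		/* keep a small minimum per queue */

	return tags_used_by_this_queue < limit;
}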

Comments

Avri Altman May 9, 2023, 8:03 a.m. UTC | #1
> Tag allocation is limited by the fair sharing algorithm. This hurts
> performance for UFS devices, because the queue depth available to general
> I/O is cut in half as soon as a control command is sent to the UFS device.
Ack.

However, I think that decision should be made by the platform owner,
and not in the core driver.

Thanks,
Avri

> 
> Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> ---
>  drivers/ufs/core/ufshcd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index 17d7bb875fee..e96a50265285 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -5149,6 +5149,9 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)
>         blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
>         if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
>                 blk_queue_update_dma_alignment(q, 4096 - 1);
> +
> +       blk_queue_flag_clear(QUEUE_FLAG_FAIR_TAG_SHARING, q);
> +
>         /*
>          * Block runtime-pm until all consumers are added.
>          * Refer ufshcd_setup_links().
> --
> 2.18.0
Bart Van Assche May 9, 2023, 2:04 p.m. UTC | #2
On 5/9/23 01:03, Avri Altman wrote:
> However, I think that decision should be made by the platform owner,
> and not in the core driver.

Hi Avri,

I don't see any use case in which performance of a UFS device would be improved
by leaving QUEUE_FLAG_FAIR_TAG_SHARING enabled. Are you perhaps aware of such a
use case?

Thanks,

Bart.
Avri Altman May 9, 2023, 4:19 p.m. UTC | #3
> On 5/9/23 01:03, Avri Altman wrote:
> > However, I think that decision should be made by the platform owner,
> > and not in the core driver.
> 
> Hi Avri,
> 
> I don't see any use case in which performance of a UFS device would be improved
> by leaving QUEUE_FLAG_FAIR_TAG_SHARING enabled. Are you perhaps aware of such a
> use case?
Following your argument, why does fair allocation exist in the first place?

When running benchmarks I am hacking the scheduler's "fair" tag allocation as well.
That's why I acked this change.

Since this change may affect the I/O profile as a whole,
I think platform owners should have the flexibility not to use it,
should they choose to.

Thanks,
Avri
> 
> Thanks,
> 
> Bart.
Bart Van Assche May 9, 2023, 4:30 p.m. UTC | #4
On 5/9/23 09:19, Avri Altman wrote:
> Following your argument, why does fair allocation exist in the first place?

Does this patch answer your question?

[PATCH v2] block: Improve shared tag set performance
https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/

Thanks,

Bart.
Avri Altman May 10, 2023, 5:21 a.m. UTC | #5
> 
> On 5/9/23 09:19, Avri Altman wrote:
> > Following your argument, why does fair allocation exist in the first place?
> 
> Does this patch answer your question?
> 
> [PATCH v2] block: Improve shared tag set performance
> https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/
OK.

Why was it needed to invent a new flag instead of just clearing BLK_MQ_F_TAG_QUEUE_SHARED?

Thanks,
Avri

> 
> Thanks,
> 
> Bart.
Bart Van Assche May 10, 2023, 3:56 p.m. UTC | #6
On 5/9/23 22:21, Avri Altman wrote:
> Why was it needed to invent a new flag instead of just clearing
> BLK_MQ_F_TAG_QUEUE_SHARED?

Hi Avri,

The meaning of BLK_MQ_F_TAG_QUEUE_SHARED is "the tag set is shared
across multiple request queues". Clearing BLK_MQ_F_TAG_QUEUE_SHARED if
the tag set is shared would be wrong: it would break all the code
outside the tag allocation code that tests this flag.
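
To illustrate the distinction, a sketch with simplified types (not the actual
kernel code): the two flags answer different questions.

#include <stdbool.h>

#define TAGSET_SHARED		(1u << 0)	/* stands in for BLK_MQ_F_TAG_QUEUE_SHARED */
#define QUEUE_FAIR_SHARING	(1u << 1)	/* stands in for QUEUE_FLAG_FAIR_TAG_SHARING */

/*
 * TAGSET_SHARED answers "is this tag set shared across request queues?"
 * and is relied upon outside of tag allocation, so it must stay set.
 * QUEUE_FAIR_SHARING only answers "should this queue's tag allocations
 * be throttled for fairness?", which is the flag the UFS patch clears.
 */
static bool apply_fair_sharing(unsigned int tagset_flags, unsigned int queue_flags)
{
	if (!(tagset_flags & TAGSET_SHARED))
		return false;	/* not shared at all: nothing to throttle */

	return queue_flags & QUEUE_FAIR_SHARING;
}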

Thanks,

Bart.
Christoph Hellwig May 11, 2023, 3:34 p.m. UTC | #7
On Tue, May 09, 2023 at 02:52:30PM +0800, Ed Tsai wrote:
> Tag allocation is limited by the fair sharing algorithm. This hurts
> performance for UFS devices, because the queue depth available to general
> I/O is cut in half as soon as a control command is sent to the UFS device.

But it is there for a reason.  You completely fail to explain why you
think your change is safe, and also why you did not try to even explain
where the overhead is and how else you tried to mitigate it.
Bart Van Assche May 11, 2023, 3:38 p.m. UTC | #8
On 5/11/23 08:34, Christoph Hellwig wrote:
> On Tue, May 09, 2023 at 02:52:30PM +0800, Ed Tsai wrote:
>> Tag allocation is limited by the fair sharing algorithm. This hurts
>> performance for UFS devices, because the queue depth available to general
>> I/O is cut in half as soon as a control command is sent to the UFS device.
> 
> But it is there for a reason.  You completely fail to explain why you
> think your change is safe, and also why you did not try to even explain
> where the overhead is and how else you tried to mitigate it.

Hi Christoph,

For which devices is the fair sharing algorithm useful? As far as I know 
the legacy block layer did not have an equivalent of the fair sharing 
algorithm and I'm not aware of any complaints about the legacy block 
layer regarding fairness. This is why I proposed in January to remove 
the fair sharing code entirely. See also 
https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/.

Thanks,

Bart.
Christoph Hellwig May 12, 2023, 2:02 p.m. UTC | #9
On Thu, May 11, 2023 at 08:38:04AM -0700, Bart Van Assche wrote:
> For which devices is the fair sharing algorithm useful? As far as I know the
> legacy block layer did not have an equivalent of the fair sharing algorithm
> and I'm not aware of any complaints about the legacy block layer regarding
> fairness. This is why I proposed in January to remove the fair sharing
> code entirely. See also https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/.

Because the old code did not do tag allocation itself?  Either way I
don't think a "I'll opt out for a random driver" is the proper approach
when you think it's not needed.  Especially not without any data
explaining why just that driver is a special snowflake.
Bart Van Assche May 12, 2023, 6:12 p.m. UTC | #10
On 5/12/23 07:02, Christoph Hellwig wrote:
> On Thu, May 11, 2023 at 08:38:04AM -0700, Bart Van Assche wrote:
>> For which devices is the fair sharing algorithm useful? As far as I know the
>> legacy block layer did not have an equivalent of the fair sharing algorithm
>> and I'm not aware of any complaints about the legacy block layer regarding
>> fairness. This is why I proposed in January to remove the fair sharing
>> code entirely. See also https://lore.kernel.org/linux-block/20230103195337.158625-1-bvanassche@acm.org/.
> 
> Because the old code did not do tag allocation itself?  Either way I
> don't think a "I'll opt out for a random driver" is the proper approach
> when you think it's not needed.  Especially not without any data
> explaining why just that driver is a special snowflake.

Hi Christoph,

I'm still wondering whether there are any drivers that benefit from the 
fair tag sharing algorithm. If the number of tags is large enough 
(NVMe), the number of tags exceeds the number of requests in flight and 
hence the fair tag sharing algorithm is not necessary.

The fair tag sharing algorithm has a negative impact on all SCSI devices 
with multiple logical units. This is because logical units are 
considered active until (request timeout) seconds have elapsed after the 
logical unit stopped being used (see also the blk_mq_tag_idle() call in 
blk_mq_timeout_work()). UFS users are hit by this because UFS 3.0 
devices have a limited queue depth (32) and because power management 
commands are submitted to a logical unit (WLUN). Hence, it happens often 
that the block layer "active queue" counter is equal to 2 while only one 
logical unit is being used actively (a logical unit backed by NAND 
flash). The performance difference between queue depths 16 and 32 for 
UFS devices is significant.
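
A rough, self-contained sketch of that accounting, with made-up names (the
real logic lives in blk-mq's tag handling); it only illustrates why a single
power-management command keeps the active-queue count at 2 for a long time:

#include <stdatomic.h>
#include <stdbool.h>

struct shared_tags {
	atomic_uint active_queues;	/* the "active queue" counter mentioned above */
};

struct lun_queue {
	struct shared_tags *tags;
	atomic_bool active;
	long last_busy;			/* time of the most recent request */
};

/* Called when a request is allocated for this logical unit. */
static void lun_mark_busy(struct lun_queue *q, long now)
{
	bool was_active = false;

	q->last_busy = now;
	if (atomic_compare_exchange_strong(&q->active, &was_active, true))
		atomic_fetch_add(&q->tags->active_queues, 1);
}

/*
 * Called from periodic timeout work rather than from request completion,
 * so a logical unit only stops counting as active roughly one request
 * timeout after its last request. A WLUN that received a single command
 * therefore keeps active_queues at 2, halving the depth seen by the
 * data LU.
 */
static void lun_maybe_idle(struct lun_queue *q, long now, long request_timeout)
{
	bool was_active = true;

	if (now - q->last_busy < request_timeout)
		return;
	if (atomic_compare_exchange_strong(&q->active, &was_active, false))
		atomic_fetch_sub(&q->tags->active_queues, 1);
}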

Is my understanding correct that in the legacy block layer 
implementation blk_queue_start_tag() had to be called to assign a tag to 
a request? I haven't found any code in the Linux kernel v4.20 
implementation of blk_queue_start_tag() that implements fairness in case 
a request tag map (struct blk_queue_tag) is shared across request queues 
(one request queue per logical unit in case of SCSI). Do you agree with 
my conclusion that from the point of view of the SCSI core in general 
and the UFS driver in particular the fair tag sharing algorithm in the 
blk-mq code introduced a performance regression?

Thanks,

Bart.
Yu Kuai May 13, 2023, 3:09 a.m. UTC | #11
Hi,

On 2023/05/13 2:12, Bart Van Assche wrote:
> The fair tag sharing algorithm has a negative impact on all SCSI devices 
> with multiple logical units. This is because logical units are 
> considered active until (request timeout) seconds have elapsed after the 
> logical unit stopped being used (see also the blk_mq_tag_idle() call in 
> blk_mq_timeout_work()). UFS users are hit by this because UFS 3.0 
> devices have a limited queue depth (32) and because power management 
> commands are submitted to a logical unit (WLUN). Hence, it happens often 
> that the block layer "active queue" counter is equal to 2 while only one 
> logical unit is being used actively (a logical unit backed by NAND 
> flash). The performance difference between queue depths 16 and 32 for 
> UFS devices is significant.

We met a similar problem before, but I think removing fair tag sharing
might cause some problems, because tag allocation is not fair currently. For
example, if 2 devices share 32 tags, device A issues a large amount of I/O
concurrently, and device B only issues one I/O, then with fair tag sharing
removed, device B can see bad I/O latency.

By the way, I tried to propose a way to work around this as follows:

1) disable fair tag sharing until a tag allocation finds that no tag is available;
2) enable fair tag sharing again if the disk doesn't fail to get a tag
for a period of time;

Can this approach be considered?
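
A rough sketch of the idea, with made-up names; it reads step 2) as dropping
the throttling again once allocations stop failing, and it is not existing
kernel code:

#include <stdatomic.h>
#include <stdbool.h>

struct shared_tags {
	atomic_bool fair_sharing;	/* apply the per-queue limit? */
	_Atomic long last_starvation;	/* time of the last failed tag allocation */
};

/* get_tag found no free tag: turn the fairness throttling on. */
static void note_tag_starvation(struct shared_tags *t, long now)
{
	atomic_store(&t->last_starvation, now);
	atomic_store(&t->fair_sharing, true);
}

/* Run periodically: no failures for a while means no contention. */
static void maybe_relax_fair_sharing(struct shared_tags *t, long now,
				     long quiet_period)
{
	if (now - atomic_load(&t->last_starvation) >= quiet_period)
		atomic_store(&t->fair_sharing, false);
}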

Thanks,
Kuai
Bart Van Assche May 16, 2023, 3:12 p.m. UTC | #12
On 5/12/23 20:09, Yu Kuai wrote:
> On 2023/05/13 2:12, Bart Van Assche wrote:
>> The fair tag sharing algorithm has a negative impact on all SCSI 
>> devices with multiple logical units. This is because logical units are 
>> considered active until (request timeout) seconds have elapsed after 
>> the logical unit stopped being used (see also the blk_mq_tag_idle() 
>> call in blk_mq_timeout_work()). UFS users are hit by this because UFS 
>> 3.0 devices have a limited queue depth (32) and because power 
>> management commands are submitted to a logical unit (WLUN). Hence, it 
>> happens often that the block layer "active queue" counter is equal to 
>> 2 while only one logical unit is being used actively (a logical unit 
>> backed by NAND flash). The performance difference between queue depths 
>> 16 and 32 for UFS devices is significant.
> 
> We met a similar problem before, but I think removing fair tag sharing
> might cause some problems, because tag allocation is not fair currently. For
> example, if 2 devices share 32 tags, device A issues a large amount of I/O
> concurrently, and device B only issues one I/O, then with fair tag sharing
> removed, device B can see bad I/O latency.
> 
> By the way, I tried to propose a way to work around this as follows:
> 
> 1) disable fair tag sharing until a tag allocation finds that no tag is available;
> 2) enable fair tag sharing again if the disk doesn't fail to get a tag
> for a period of time;
> 
> Can this approach be considered?

I'm afraid that this approach won't help for the UFS driver since it is 
likely that all tags are in use by a single logical unit during an IOPS 
test. Hence, fair sharing would be enabled even when we don't want it to 
be enabled.

I propose that we switch to one of these two approaches:
* Either remove the fair tag sharing code entirely and rely on the 
fairness mechanism provided by the sbitmap code. I'm referring to how 
__sbitmap_queue_wake_up() uses the wake_index member variable.
* Or make the behavior of the fairness algorithm configurable from user 
space. One possible approach is to make the proportion of tags for a 
logical unit / NVMe namespace configurable via sysfs. This will make it
possible to reduce the number of tags for the WLUN of UFS devices.
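
A minimal sketch of the second option, assuming a hypothetical per-queue
tag_share_percent attribute that user space could set via sysfs (the sysfs
plumbing itself is omitted):

/*
 * Instead of dividing the shared depth evenly over the active queues,
 * limit each queue to a share configured from user space, e.g. a very
 * small share for the UFS WLUN and the rest for the data LU.
 */
static unsigned int queue_tag_limit(unsigned int tagset_depth,
				    unsigned int tag_share_percent)
{
	unsigned int limit = tagset_depth * tag_share_percent / 100;

	return limit ? limit : 1;	/* always allow at least one tag */
}

With a depth of 32, a 5% share for the WLUN works out to a single tag,
leaving nearly the full depth for the data logical unit.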

Thanks,

Bart.
Yu Kuai May 17, 2023, 7:49 a.m. UTC | #13
Hi,

On 2023/05/16 23:12, Bart Van Assche wrote:
> On 5/12/23 20:09, Yu Kuai wrote:
>> On 2023/05/13 2:12, Bart Van Assche wrote:
>>> The fair tag sharing algorithm has a negative impact on all SCSI 
>>> devices with multiple logical units. This is because logical units 
>>> are considered active until (request timeout) seconds have elapsed 
>>> after the logical unit stopped being used (see also the 
>>> blk_mq_tag_idle() call in blk_mq_timeout_work()). UFS users are hit 
>>> by this because UFS 3.0 devices have a limited queue depth (32) and 
>>> because power management commands are submitted to a logical unit 
>>> (WLUN). Hence, it happens often that the block layer "active queue" 
>>> counter is equal to 2 while only one logical unit is being used 
>>> actively (a logical unit backed by NAND flash). The performance 
>>> difference between queue depths 16 and 32 for UFS devices is 
>>> significant.
>>
>> We met a similar problem before, but I think removing fair tag sharing
>> might cause some problems, because tag allocation is not fair currently. For
>> example, if 2 devices share 32 tags, device A issues a large amount of I/O
>> concurrently, and device B only issues one I/O, then with fair tag sharing
>> removed, device B can see bad I/O latency.
>>
>> By the way, I tried to propose a way to work around this as follows:
>>
>> 1) disable fair tag sharing until a tag allocation finds that no tag is available;
>> 2) enable fair tag sharing again if the disk doesn't fail to get a tag
>> for a period of time;
>>
>> Can this approach be considered?
> 
> I'm afraid that this approach won't help for the UFS driver since it is 
> likely that all tags are in use by a single logical unit during an IOPS 
> test. Hence, fair sharing would be enabled even when we don't want it to 
> be enabled.

You're right, my original method is not flexible.

> 
> I propose that we switch to one of these two approaches:

How about a smoothing method in which the device with more I/O shares
more tags, and each device gets at least one tag?

Thanks,
Kuai

> * Either remove the fair tag sharing code entirely and rely on the 
> fairness mechanism provided by the sbitmap code. I'm referring to how 
> __sbitmap_queue_wake_up() uses the wake_index member variable.
> * Or make the behavior of the fairness algorithm configurable from user 
> space. One possible approach is to make the proportion of tags for a 
> logical unit / NVMe namespace configurable via sysfs. This will allow to 
> reduce the number of tags for the WLUN of UFS devices.
> 
> Thanks,
> 
> Bart.
> 
> 
Bart Van Assche May 17, 2023, 6:23 p.m. UTC | #14
On 5/17/23 00:49, Yu Kuai wrote:
> On 2023/05/16 23:12, Bart Van Assche wrote:
>> I propose that we switch to one of these two approaches:
> 
> How about a smoothing method in which the device with more I/O shares
> more tags, and each device gets at least one tag?

Hi Yu,

hctx_may_queue() is called from the hot path (blk_mq_get_tag()). I'm 
pretty sure that adding any nontrivial code in that path will cause a 
performance (IOPS) regression. So I don't think that adding a smoothing 
method in hctx_may_queue() is a realistic option.

Thanks,

Bart.
Yu Kuai May 18, 2023, 1:49 a.m. UTC | #15
Hi,

On 2023/05/18 2:23, Bart Van Assche wrote:
> On 5/17/23 00:49, Yu Kuai wrote:
>> On 2023/05/16 23:12, Bart Van Assche wrote:
>>> I propose that we switch to one of these two approaches:
>>
>> How about a smoothing method in which the device with more I/O shares
>> more tags, and each device gets at least one tag?
> 
> Hi Yu,
> 
> hctx_may_queue() is called from the hot path (blk_mq_get_tag()). I'm 
> pretty sure that adding any nontrivial code in that path will cause a 
> performance (IOPS) regression. So I don't think that adding a smoothing 
> method in hctx_may_queue() is a realistic option.
> 

Currently, the fair-share check in hctx_may_queue() requires two
atomic_read() calls (active_queues and active_requests). I think this
smoothing method can be placed in the get_tag failure path: for example,
the more often a disk fails to get a tag during a period of time, the more
tags this disk can get, and all of the information can be updated there
(perhaps by directly recording how many tags a disk may get, so that
hctx_may_queue() still only requires 2 atomic_read() calls).
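
A rough sketch of that, with made-up names: the hot path keeps its two atomic
reads, and the per-queue limit is only adjusted in the (rare)
allocation-failure path:

#include <stdatomic.h>
#include <stdbool.h>

struct lun_queue {
	atomic_uint allowed_tags;	/* precomputed share of the shared tag set */
	atomic_uint in_flight;		/* tags currently held by this queue */
};

/* Hot path: still just two atomic reads, like today's hctx_may_queue(). */
static bool lun_may_queue(struct lun_queue *q)
{
	return atomic_load(&q->in_flight) < atomic_load(&q->allowed_tags);
}

/*
 * Slow path, called only when a tag allocation fails: a queue that keeps
 * failing is given a slightly larger share, up to the full depth. A
 * periodic worker (not shown) would shrink shares again so an idle queue
 * does not keep a large allowance forever.
 */
static void lun_note_alloc_failure(struct lun_queue *q, unsigned int tagset_depth)
{
	unsigned int cur = atomic_load(&q->allowed_tags);

	if (cur < tagset_depth)
		atomic_store(&q->allowed_tags, cur + 1);
}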

Thanks,
Bart
Bart Van Assche May 18, 2023, 2:23 a.m. UTC | #16
On 5/17/23 18:49, Yu Kuai wrote:
> Currently, the fair-share check in hctx_may_queue() requires two
> atomic_read() calls (active_queues and active_requests). I think this
> smoothing method can be placed in the get_tag failure path: for example,
> the more often a disk fails to get a tag during a period of time, the more
> tags this disk can get, and all of the information can be updated there
> (perhaps by directly recording how many tags a disk may get, so that
> hctx_may_queue() still only requires 2 atomic_read() calls).

That sounds interesting to me. Do you perhaps plan to implement this 
approach and to post it as a patch?

Thanks,

Bart.
Yu Kuai May 18, 2023, 7:55 a.m. UTC | #17
Hi,

On 2023/05/18 10:23, Bart Van Assche wrote:
> On 5/17/23 18:49, Yu Kuai wrote:
>> Currently, the fair-share check in hctx_may_queue() requires two
>> atomic_read() calls (active_queues and active_requests). I think this
>> smoothing method can be placed in the get_tag failure path: for example,
>> the more often a disk fails to get a tag during a period of time, the more
>> tags this disk can get, and all of the information can be updated there
>> (perhaps by directly recording how many tags a disk may get, so that
>> hctx_may_queue() still only requires 2 atomic_read() calls).
> 
> That sounds interesting to me. Do you perhaps plan to implement this 
> approach and to post it as a patch?

Of course, I'll try to send an RFC patch.

Thanks,
Kuai
> 
> Thanks,
> 
> Bart.
> 
> 
Bart Van Assche June 13, 2023, 2:07 p.m. UTC | #18
On 5/18/23 00:55, Yu Kuai wrote:
> On 2023/05/18 10:23, Bart Van Assche wrote:
>> On 5/17/23 18:49, Yu Kuai wrote:
>>> Currently, fair share from hctx_may_queue() requires two
>>> atomic_read(active_queues and active_requests), I think this smoothing
>>> method can be placed into get_tag fail path, for example, the more times
>>> a disk failed to get tag in a period of time, the more tag this disk can
>>> get, and all the information can be updated here(perhaps directly
>>> record how many tags a disk can get, then hctx_may_queue() still only
>>> require 2 atomic_read()).
>>
>> That sounds interesting to me. Do you perhaps plan to implement this 
>> approach and to post it as a patch?
> 
> Of course, I'll try to send a RFC patch.

Hi Kuai,

Has this RFC patch already been posted or did I perhaps miss it?

Thanks,

Bart.
Yu Kuai June 14, 2023, 1:58 a.m. UTC | #19
Hi,

On 2023/06/13 22:07, Bart Van Assche wrote:
> On 5/18/23 00:55, Yu Kuai wrote:
>> On 2023/05/18 10:23, Bart Van Assche wrote:
>>> On 5/17/23 18:49, Yu Kuai wrote:
>>>> Currently, fair share from hctx_may_queue() requires two
>>>> atomic_read(active_queues and active_requests), I think this smoothing
>>>> method can be placed into get_tag fail path, for example, the more 
>>>> times
>>>> a disk failed to get tag in a period of time, the more tag this disk 
>>>> can
>>>> get, and all the information can be updated here(perhaps directly
>>>> record how many tags a disk can get, then hctx_may_queue() still only
>>>> require 2 atomic_read()).
>>>
>>> That sounds interesting to me. Do you perhaps plan to implement this 
>>> approach and to post it as a patch?
>>
>> Of course, I'll try to send a RFC patch.
> 
> Hi Kuai,
> 
> Has this RFC patch already been posted or did I perhaps miss it?

Sorry for the delay. I finished the version that switches directly from
no sharing to fair sharing; however, I found that it's not that simple to
add a smoothing method, and I'm stuck here for now.

I'll try to send an RFC version soon. The smoothing method may not be
flexible, but I'll make sure it can work for your simple case where 2
devices share tags, and one only issues a few I/Os while the other issues
lots of I/O.

Thanks,
Kuai
> 
> Thanks,
> 
> Bart.
> 
> 

Patch

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 17d7bb875fee..e96a50265285 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5149,6 +5149,9 @@  static int ufshcd_slave_configure(struct scsi_device *sdev)
 	blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
 	if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
 		blk_queue_update_dma_alignment(q, 4096 - 1);
+
+	blk_queue_flag_clear(QUEUE_FLAG_FAIR_TAG_SHARING, q);
+
 	/*
 	 * Block runtime-pm until all consumers are added.
 	 * Refer ufshcd_setup_links().