[v4,0/3] Support disabling fair tag sharing

Message ID 20231023203643.3209592-1-bvanassche@acm.org (mailing list archive)

Message

Bart Van Assche Oct. 23, 2023, 8:36 p.m. UTC
Hi Jens,

Performance of UFS devices is reduced significantly by the fair tag sharing
algorithm. This is because UFS devices have multiple logical units and a
limited queue depth (32 for UFS 3.1 devices) and also because it takes time to
give tags back after activity on a request queue has stopped. This patch series
addresses this issue by introducing a flag that allows block drivers to
disable fair sharing.

Please consider this patch series for the next merge window.

Thanks,

Bart.

Changes compared to v3:
 - Instead of disabling fair tag sharing for all block drivers, introduce a flag
   for disabling it conditionally.

Changes between v2 and v3:
 - Rebased on top of the latest kernel.

Changes between v1 and v2:
 - Restored the tags->active_queues variable and thereby fixed the
   "uninitialized variable" warning reported by the kernel test robot.

Bart Van Assche (3):
  block: Introduce flag BLK_MQ_F_DISABLE_FAIR_TAG_SHARING
  scsi: core: Support disabling fair tag sharing
  scsi: ufs: Disable fair tag sharing

 block/blk-mq-debugfs.c    | 1 +
 block/blk-mq.h            | 3 ++-
 drivers/scsi/hosts.c      | 1 +
 drivers/scsi/scsi_lib.c   | 2 ++
 drivers/ufs/core/ufshcd.c | 1 +
 include/linux/blk-mq.h    | 1 +
 include/scsi/scsi_host.h  | 6 ++++++
 7 files changed, 14 insertions(+), 1 deletion(-)
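
For readers who are not familiar with blk-mq's fair tag sharing, the small
userspace model below illustrates the mechanism this series targets: when a
tag set is shared, each active request queue is limited to roughly
depth / active_queues tags, and the proposed BLK_MQ_F_DISABLE_FAIR_TAG_SHARING
flag would let a driver skip that limit. Only the flag name comes from patch
1/3; the structure, helper and constants below are simplified assumptions,
not the actual kernel code.

/* Simplified userspace model of blk-mq fair tag sharing; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define BLK_MQ_F_TAG_QUEUE_SHARED          (1u << 0)
/* Flag proposed by patch 1/3 of this series (modelled here). */
#define BLK_MQ_F_DISABLE_FAIR_TAG_SHARING  (1u << 1)

struct model_hctx {
        unsigned int flags;             /* BLK_MQ_F_* flags */
        unsigned int active_queues;     /* request queues sharing the tag set */
        unsigned int active_requests;   /* requests in flight on this queue */
};

/* May this queue allocate another tag from a tag set with @depth tags? */
static bool may_queue(const struct model_hctx *h, unsigned int depth)
{
        unsigned int users, limit;

        if (!(h->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
                return true;
        /* The proposed opt-out: the driver asked to skip fair sharing. */
        if (h->flags & BLK_MQ_F_DISABLE_FAIR_TAG_SHARING)
                return true;

        users = h->active_queues;
        if (!users)
                return true;
        /* Fair sharing: divide the tag space evenly, allow at least 4 tags. */
        limit = (depth + users - 1) / users;
        if (limit < 4)
                limit = 4;
        return h->active_requests < limit;
}

int main(void)
{
        struct model_hctx h = {
                .flags = BLK_MQ_F_TAG_QUEUE_SHARED,
                .active_queues = 2,     /* e.g. two UFS LUNs doing I/O */
                .active_requests = 20,  /* more than half the 32 tags in use */
        };

        printf("fair sharing enabled : may_queue = %d\n", may_queue(&h, 32));
        h.flags |= BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
        printf("fair sharing disabled: may_queue = %d\n", may_queue(&h, 32));
        return 0;
}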

Comments

Ming Lei Oct. 24, 2023, 2:28 a.m. UTC | #1
Hi Bart,

On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
> Hi Jens,
> 
> Performance of UFS devices is reduced significantly by the fair tag sharing
> algorithm. This is because UFS devices have multiple logical units and a
> limited queue depth (32 for UFS 3.1 devices) and also because it takes time to
> give tags back after activity on a request queue has stopped. This patch series
> addresses this issue by introducing a flag that allows block drivers to
> disable fair sharing.
> 
> Please consider this patch series for the next merge window.

In a previous post [1], you mentioned that the issue is caused by the
non-I/O queue of the WLUN, but in this version that explanation seems
to be gone.

IMO, it isn't reasonable to count a non-I/O LUN towards tag fairness,
so a solution could be to leave the non-I/O queue out of the fair tag
sharing accounting. Disabling fair tag sharing for the whole tag set
could be overkill.

And if you mean normal I/O LUNs, can you share more details about the
performance drop, such as the test case, how many I/O LUNs are involved,
and how the drop is observed? It isn't simple any more, since the
performance of multiple LUNs has to be considered.


[1] https://lore.kernel.org/linux-block/20231018180056.2151711-1-bvanassche@acm.org/

Thanks,
Ming
Bart Van Assche Oct. 24, 2023, 4:41 p.m. UTC | #2
On 10/23/23 19:28, Ming Lei wrote:
> On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
>> Performance of UFS devices is reduced significantly by the fair tag sharing
>> algorithm. This is because UFS devices have multiple logical units and a
>> limited queue depth (32 for UFS 3.1 devices) and also because it takes time to
>> give tags back after activity on a request queue has stopped. This patch series
>> addresses this issue by introducing a flag that allows block drivers to
>> disable fair sharing.
>>
>> Please consider this patch series for the next merge window.
> 
> In previous post[1], you mentioned that the issue is caused by non-IO
> queue of WLUN, but in this version, looks there isn't such story any more.
> 
> IMO, it isn't reasonable to account non-IO LUN for tag fairness, so
> solution could be to not take non-IO queue into account for fair tag
> sharing. But disabling fair tag sharing for this whole tagset could be
> too over-kill.
> 
> And if you mean normal IO LUNs, can you share more details about the
> performance drop? such as the test case, how many IO LUNs, and how to
> observe performance drop, cause it isn't simple any more since multiple
> LUN's perf has to be considered.
> 
> [1] https://lore.kernel.org/linux-block/20231018180056.2151711-1-bvanassche@acm.org/

Hi Ming,

Submitting I/O to a WLUN is only one example of a use case that
activates the fair sharing algorithm for UFS devices. Another use
case is simultaneous activity for multiple data LUNs. Conventional
UFS devices typically have four data LUNs and zoned UFS devices
typically have five data LUNs. From an Android device with a zoned UFS
device:

$ adb shell ls /sys/class/scsi_device
0:0:0:0
0:0:0:1
0:0:0:2
0:0:0:3
0:0:0:4
0:0:0:49456
0:0:0:49476
0:0:0:49488

The first five are data logical units. The last three are WLUNs.

For a block size of 4 KiB, I see 144 K IOPS at queue depth 31 and
107 K IOPS at queue depth 15 (the queue depth is reduced from 31 to 15
if I/O is being submitted to two LUNs simultaneously). In other words,
disabling fair sharing results in up to 35% higher IOPS for small reads
when two logical units are active simultaneously. I think that's a very
significant performance difference.

Thanks,

Bart.
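
To put those numbers in perspective, the toy program below prints how a
per-LUN limit shrinks as more LUNs become active when a 32-tag set is split
as ceil(depth / active_queues); that split is an assumption of the
illustration, and the 31/15 figures above differ from it slightly,
presumably because ufshcd also reserves a tag for device management
commands, which the model ignores.

/* Toy calculation of the per-LUN fair-share limit; not kernel code. */
#include <stdio.h>

int main(void)
{
        const unsigned int depth = 32;  /* UFS 3.1 tag set size */

        for (unsigned int users = 1; users <= 5; users++) {
                unsigned int limit = (depth + users - 1) / users;

                if (limit < 4)
                        limit = 4;
                printf("%u active LUN(s): at most %u tags per LUN\n",
                       users, limit);
        }
        return 0;
}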
Ming Lei Oct. 25, 2023, 1:33 a.m. UTC | #3
On Tue, Oct 24, 2023 at 09:41:50AM -0700, Bart Van Assche wrote:
> On 10/23/23 19:28, Ming Lei wrote:
> > On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
> > > Performance of UFS devices is reduced significantly by the fair tag sharing
> > > algorithm. This is because UFS devices have multiple logical units and a
> > > limited queue depth (32 for UFS 3.1 devices) and also because it takes time to
> > > give tags back after activity on a request queue has stopped. This patch series
> > > addresses this issue by introducing a flag that allows block drivers to
> > > disable fair sharing.
> > > 
> > > Please consider this patch series for the next merge window.
> > 
> > In previous post[1], you mentioned that the issue is caused by non-IO
> > queue of WLUN, but in this version, looks there isn't such story any more.
> > 
> > IMO, it isn't reasonable to account non-IO LUN for tag fairness, so
> > solution could be to not take non-IO queue into account for fair tag
> > sharing. But disabling fair tag sharing for this whole tagset could be
> > too over-kill.
> > 
> > And if you mean normal IO LUNs, can you share more details about the
> > performance drop? such as the test case, how many IO LUNs, and how to
> > observe performance drop, cause it isn't simple any more since multiple
> > LUN's perf has to be considered.
> > 
> > [1] https://lore.kernel.org/linux-block/20231018180056.2151711-1-bvanassche@acm.org/
> 
> Hi Ming,
> 
> Submitting I/O to a WLUN is only one example of a use case that
> activates the fair sharing algorithm for UFS devices. Another use
> case is simultaneous activity for multiple data LUNs. Conventional
> UFS devices typically have four data LUNs and zoned UFS devices
> typically have five data LUNs. From an Android device with a zoned UFS
> device:
> 
> $ adb shell ls /sys/class/scsi_device
> 0:0:0:0
> 0:0:0:1
> 0:0:0:2
> 0:0:0:3
> 0:0:0:4
> 0:0:0:49456
> 0:0:0:49476
> 0:0:0:49488
> 
> The first five are data logical units. The last three are WLUNs.
> 
> For a block size of 4 KiB, I see 144 K IOPS for queue depth 31 and
> 107 K IOPS for queue depth 15 (queue depth is reduced from 31 to 15
> if I/O is being submitted to two LUNs simultaneously). In other words,
> disabling fair sharing results in up to 35% higher IOPS for small reads
> and in case two logical units are active simultaneously. I think that's
> a very significant performance difference.

Yeah, performance does drop when the queue depth is cut in half, if the
queue depth is low enough to begin with.

However, it isn't enough to just test performance on one LUN; what is
the effect when running I/O on the 2 or 5 data LUNs concurrently?

SATA should have a similar issue too, and I think a more generic
improvement may be to bypass fair tag sharing whenever the queue depth
is low (such as < 32), if it turns out that fair tag sharing doesn't
work well at low queue depth.

Also, the 'fairness' could be enforced via each SCSI LUN's queue depth,
which can be adjusted dynamically.


Thanks, 
Ming
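
As a rough illustration of the dynamic adjustment Ming hints at, the toy
policy below redistributes a 32-tag budget across LUNs in proportion to
their recent I/O. The policy, names and numbers are invented for the
example; in the kernel such a policy would presumably act through
scsi_change_queue_depth() on each scsi_device rather than being
open-coded like this.

/* Toy policy: shrink idle LUNs' depth, grow busy LUNs', within the tag set. */
#include <stdio.h>

#define TAGSET_DEPTH    32
#define MIN_DEPTH       4

/* Hypothetical helper: redistribute depth proportionally to recent I/O. */
static void rebalance(const unsigned int recent_io[], unsigned int depth[],
                      unsigned int nr_luns)
{
        unsigned int total = 0;

        for (unsigned int i = 0; i < nr_luns; i++)
                total += recent_io[i];

        for (unsigned int i = 0; i < nr_luns; i++) {
                unsigned int d = total ?
                        TAGSET_DEPTH * recent_io[i] / total : TAGSET_DEPTH;

                depth[i] = d < MIN_DEPTH ? MIN_DEPTH : d;
        }
}

int main(void)
{
        unsigned int recent_io[2] = { 900, 100 }; /* busy vs. mostly idle LUN */
        unsigned int depth[2];

        rebalance(recent_io, depth, 2);
        printf("LUN0 depth %u, LUN1 depth %u\n", depth[0], depth[1]);
        return 0;
}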
Avri Altman Oct. 25, 2023, 6:50 p.m. UTC | #4
Hi,
> On Tue, Oct 24, 2023 at 09:41:50AM -0700, Bart Van Assche wrote:
> > On 10/23/23 19:28, Ming Lei wrote:
> > > On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
> > > > Performance of UFS devices is reduced significantly by the fair
> > > > tag sharing algorithm. This is because UFS devices have multiple
> > > > logical units and a limited queue depth (32 for UFS 3.1 devices)
> > > > and also because it takes time to give tags back after activity on
> > > > a request queue has stopped. This patch series addresses this
> > > > issue by introducing a flag that allows block drivers to disable fair
> sharing.
> > > >
> > > > Please consider this patch series for the next merge window.
> > >
> > > In previous post[1], you mentioned that the issue is caused by
> > > non-IO queue of WLUN, but in this version, looks there isn't such story
> any more.
> > >
> > > IMO, it isn't reasonable to account non-IO LUN for tag fairness, so
> > > solution could be to not take non-IO queue into account for fair tag
> > > sharing. But disabling fair tag sharing for this whole tagset could
> > > be too over-kill.
> > >
> > > And if you mean normal IO LUNs, can you share more details about the
> > > performance drop? such as the test case, how many IO LUNs, and how
> > > to observe performance drop, cause it isn't simple any more since
> > > multiple LUN's perf has to be considered.
> > >
> > > [1]
> > > https://lore.kernel.org/linux-block/20231018180056.2151711-1-bvanass
> > > che@acm.org/
> >
> > Hi Ming,
> >
> > Submitting I/O to a WLUN is only one example of a use case that
> > activates the fair sharing algorithm for UFS devices. Another use case
> > is simultaneous activity for multiple data LUNs. Conventional UFS
> > devices typically have four data LUNs and zoned UFS devices typically
> > have five data LUNs. From an Android device with a zoned UFS
> > device:
> >
> > $ adb shell ls /sys/class/scsi_device
> > 0:0:0:0
> > 0:0:0:1
> > 0:0:0:2
> > 0:0:0:3
> > 0:0:0:4
> > 0:0:0:49456
> > 0:0:0:49476
> > 0:0:0:49488
> >
> > The first five are data logical units. The last three are WLUNs.
> >
> > For a block size of 4 KiB, I see 144 K IOPS for queue depth 31 and
> > 107 K IOPS for queue depth 15 (queue depth is reduced from 31 to 15 if
> > I/O is being submitted to two LUNs simultaneously). In other words,
> > disabling fair sharing results in up to 35% higher IOPS for small
> > reads and in case two logical units are active simultaneously. I think
> > that's a very significant performance difference.
This issue has been known for a long time now.
Whenever we needed to provide performance measurements,
we used to disable that mechanism by hacking the kernel locally, one way
or the other. I know that my peers are doing this as well.

> 
> Yeah, performance does drop when queue depth is cut to half if queue depth
> is low enough.
> 
> However, it isn't enough to just test perf over one LUN, what is the perf
> effect when running IOs over the 2 or 5 data LUNs concurrently?
> 
> SATA should have similar issue too, and I think the improvement may be
> more generic to bypass fair tag sharing in case of low queue depth (such as <
> 32) if turns out the fair tag sharing doesn't work well in case low queue
> depth.
> 
> Also the 'fairness' could be enhanced dynamically by scsi LUN's queue depth,
> which can be adjusted dynamically.
As far as we are concerned, as UFS device manufacturers & users,
we find the current proposal a clean and elegant solution that we would like to adopt.
Further, more sophisticated schemes can be built on top of disabling fair tag sharing.

Thanks,
Avri

> 
> 
> Thanks,
> Ming
Bart Van Assche Oct. 25, 2023, 7:01 p.m. UTC | #5
On 10/24/23 18:33, Ming Lei wrote:
> Yeah, performance does drop when queue depth is cut to half if queue
> depth is low enough.
> 
> However, it isn't enough to just test perf over one LUN, what is the
> perf effect when running IOs over the 2 or 5 data LUNs
> concurrently?

I think that the results I shared are sufficient because these show the
worst possible performance impact of fair tag sharing (two active
logical units and much more activity on one logical unit than on the
other).

> SATA should have similar issue too, and I think the improvement may 
> be more generic to bypass fair tag sharing in case of low queue depth
> (such as < 32) if turns out the fair tag sharing doesn't work well in
> case low queue depth.
> 
> Also the 'fairness' could be enhanced dynamically by scsi LUN's
> queue depth, which can be adjusted dynamically.

Most SATA devices are hard disks. Hard disk IOPS are constrained by the
speed with which the head of a hard disk can move. That speed hasn't
changed much during the past 40 years. I'm not sure that hard disks are
impacted as much as SSD devices by fair tag sharing.

Any algorithm that is more complicated than what I posted probably would
have a negative performance impact on storage devices that use NAND
technology, e.g. UFS devices. So I prefer to proceed with this patch
series and solve any issues with ATA devices separately. Once this patch
series has been merged, it could be used as a basis for a solution for
ATA devices. A solution for ATA devices does not have to be implemented
in the block layer core - it could e.g. be implemented in the ATA subsystem.

Thanks,

Bart.
Ming Lei Oct. 25, 2023, 11:37 p.m. UTC | #6
On Wed, Oct 25, 2023 at 12:01:33PM -0700, Bart Van Assche wrote:
> 
> On 10/24/23 18:33, Ming Lei wrote:
> > Yeah, performance does drop when queue depth is cut to half if queue
> > depth is low enough.
> > 
> > However, it isn't enough to just test perf over one LUN, what is the
> > perf effect when running IOs over the 2 or 5 data LUNs
> > concurrently?
> 
> I think that the results I shared are sufficient because these show the
> worst possible performance impact of fair tag sharing (two active
> logical units and much more activity on one logical unit than on the
> other).

You are talking about the multi-LUN case, and your change does affect
the multi-LUN code path, but your test result doesn't cover multi-LUN
workloads. Is that enough?

At least your patch shouldn't cause a performance regression on
multi-LUN I/O workloads, right?

> 
> > SATA should have similar issue too, and I think the improvement may be
> > more generic to bypass fair tag sharing in case of low queue depth
> > (such as < 32) if turns out the fair tag sharing doesn't work well in
> > case low queue depth.
> > 
> > Also the 'fairness' could be enhanced dynamically by scsi LUN's
> > queue depth, which can be adjusted dynamically.
> 
> Most SATA devices are hard disks. Hard disk IOPS are constrained by the
> speed with which the head of a hard disk can move. That speed hasn't
> changed much during the past 40 years. I'm not sure that hard disks are
> impacted as much as SSD devices by fair tag sharing.

What I meant is that SATA's queue depth is often 32 or 31, and there
are still multi-LUN cases.

At least from what you shared, fair tag sharing doesn't work well simply
because of the low queue depth; nothing is actually specific to UFS.

That is why I am wondering why fair sharing isn't simply forced off
whenever the queue depth is low.

> 
> Any algorithm that is more complicated than what I posted probably would
> have a negative performance impact on storage devices that use NAND
> technology, e.g. UFS devices. So I prefer to proceed with this patch
> series and solve any issues with ATA devices separately. Once this patch
> series has been merged, it could be used as a basis for a solution for
> ATA devices. A solution for ATA devices does not have to be implemented
> in the block layer core - it could e.g. be implemented in the ATA subsystem.

I don't object to taking the disabling of fair sharing first; what I
meant is that fairness could be brought back later by adjusting the
scsi_device queue depth.


Thanks,
Ming
Bart Van Assche Oct. 26, 2023, 4:29 p.m. UTC | #7
On 10/25/23 16:37, Ming Lei wrote:
> You are talking about multi-lun case, and your change does affect 
> multi-lun code path, but your test result doesn't cover multi-lun, is
> it enough?
I think that the results I shared are sufficient because these show the
worst possible performance impact of fair tag sharing. If there are two
active logical units and much more I/O is sent to one logical unit than
to the other, then the aggregate performance will be very close to what
I shared in a previous email.

> At least your patch shouldn't cause performance regression on
> multi-lun IO workloads, right?
If blk_mq_get_tag() can't allocate a tag, and if multiple threads are
waiting for a tag, the thread that called blk_mq_get_tag() first is
granted the first tag that is released. I think this guarantees fairness
if all requests have a similar latency. There will be some unfairness if
there are significant differences in latency per logical unit, e.g.
because all requests sent to one logical unit are small and because all
requests sent to another logical unit are large. Whether or not this
matters depends on the use case.

Thanks,

Bart.
Bart Van Assche Oct. 26, 2023, 4:37 p.m. UTC | #8
On 10/25/23 11:50, Avri Altman wrote:
> This issue is known for a long time now.
> Whenever we needed to provide performance measures,
> We used to disable that mechanism, hacking the kernel locally - one way or the other.
> I know that my peers are doing this as well.

The Android 14 kernels have this patch since May (five months ago). None
of the OEMs who use this kernel branch have complained so far.

Thanks,

Bart.
Yu Kuai Oct. 31, 2023, 2:01 a.m. UTC | #9
Hi,

On 2023/10/27 0:29, Bart Van Assche wrote:
> If blk_mq_get_tag() can't allocate a tag, and if multiple threads are
> waiting for a tag, the thread that called blk_mq_get_tag() first is
> granted the first tag that is released. I think this guarantees fairness
> if all requests have a similar latency. There will be some unfairness if
> there are significant differences in latency per logical unit, e.g.
> because all requests sent to one logical unit are small and because all
> requests sent to another logical unit are large. Whether or not this
> matters depends on the use case.

I'm afraid that is not correct; fairness can't be guaranteed at all, not
even with just one SCSI disk. This is because there are 8 wait queues in
sbitmap and waiters are served in round-robin order: each time
wake_batch tags are released, wake_batch threads of one wait queue are
woken up, regardless of whether they can actually grab a tag after
waking up. What's worse, those threads are then added back to the tail
of the wait queue.

Under high I/O pressure on a slow disk, this behaviour makes I/O tail
latency quite bad compared to the single-queue path in old kernels.

AFAIC, disabling tag sharing will definitely cause some regression. For
example, with one disk under high I/O pressure and another disk issuing
only one I/O at a time, disabling tag sharing can improve the bandwidth
of the first disk, but for the latter disk the I/O latency will
definitely be much worse.

Thanks,
Kuai

> 
> Thanks,
> 
> Bart.
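
For readers who are not familiar with the sbitmap behaviour Kuai describes,
the small model below shows the batched, round-robin wake-up pattern. The
constants and the batch formula are assumptions meant to mirror
lib/sbitmap.c; the rest is deliberately simplified and is not the real
implementation.

/* Toy model of sbitmap's batched wake-ups; not the real lib/sbitmap.c. */
#include <stdio.h>

#define SBQ_WAIT_QUEUES 8       /* wait queues per sbitmap_queue (assumed) */
#define SBQ_WAKE_BATCH  8       /* upper bound on the wake batch (assumed) */

static unsigned int wake_batch(unsigned int depth)
{
        unsigned int batch = depth / SBQ_WAIT_QUEUES;

        if (batch < 1)
                batch = 1;
        if (batch > SBQ_WAKE_BATCH)
                batch = SBQ_WAKE_BATCH;
        return batch;
}

int main(void)
{
        unsigned int wake_index = 0;

        printf("depth 32 -> wake_batch %u\n", wake_batch(32));

        /*
         * Each time wake_batch tags are freed, one wait queue is picked in
         * round-robin order and a batch of its waiters is woken, whether or
         * not they can actually grab a tag; the ones that lose the race are
         * re-added at the tail, which is the tail-latency problem described
         * above.
         */
        for (int released = 0; released < 4; released++) {
                printf("batch %d wakes wait queue %u\n",
                       released, wake_index);
                wake_index = (wake_index + 1) % SBQ_WAIT_QUEUES;
        }
        return 0;
}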
Bart Van Assche Oct. 31, 2023, 4:25 p.m. UTC | #10
On 10/30/23 19:01, Yu Kuai wrote:
> I'm afraid that is not correct, fairness can't be guranteed at all, not
> even with just one scsi disk. This is because there are 8 wait queues in
> sbitmap, and threads are waiting in roundrobin mode, and each time
> wake_batch tags are released, wake_batch threads of one wait queue will
> be woke up, regardless that some threads can't grab tag after woken up,
> what's worse, thoese thread will be added to the tail of waitqueue
> again.
> 
> In the case that high io pressure under a slow disk, this behaviour will
> cause that io tail latency will be quite bad compared to sq from old
> kernel.
> 
> AFAIC, disable tag sharing will definitely case some regresion, for
> example, one disk will high io pressure, and another disk only issure
> one IO at a time, disable tag sharing can improve brandwith of fist
> disk, however, for the latter disk, IO latency will definitely be much
> worse.

Hi Yu,

All UFS users I know prefer that fair tag sharing is disabled for UFS
devices. So I propose to proceed with this patch series, and to improve
the fair sharing algorithm separately for the devices and transport
layers that need it.

Thanks,

Bart.