Message ID: 20231023203643.3209592-1-bvanassche@acm.org (mailing list archive)
Series: Support disabling fair tag sharing
Hi Bart,

On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
> Hi Jens,
>
> Performance of UFS devices is reduced significantly by the fair tag sharing
> algorithm. This is because UFS devices have multiple logical units and a
> limited queue depth (32 for UFS 3.1 devices) and also because it takes time to
> give tags back after activity on a request queue has stopped. This patch series
> addresses this issue by introducing a flag that allows block drivers to
> disable fair sharing.
>
> Please consider this patch series for the next merge window.

In the previous post [1] you mentioned that the issue is caused by the non-I/O
queue of the WLUN, but that story no longer appears in this version.

IMO it isn't reasonable to account a non-I/O LUN for tag fairness, so a
solution could be to not take the non-I/O queue into account for fair tag
sharing. Disabling fair tag sharing for the whole tagset could be overkill.

And if you mean normal I/O LUNs, can you share more details about the
performance drop? For example the test case, how many I/O LUNs are involved,
and how the performance drop is observed, because it isn't simple any more
once the performance of multiple LUNs has to be considered.

[1] https://lore.kernel.org/linux-block/20231018180056.2151711-1-bvanassche@acm.org/

Thanks,
Ming
On 10/23/23 19:28, Ming Lei wrote:
> On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
>> Performance of UFS devices is reduced significantly by the fair tag sharing
>> algorithm. This is because UFS devices have multiple logical units and a
>> limited queue depth (32 for UFS 3.1 devices) and also because it takes time to
>> give tags back after activity on a request queue has stopped. This patch series
>> addresses this issue by introducing a flag that allows block drivers to
>> disable fair sharing.
>>
>> Please consider this patch series for the next merge window.
>
> In previous post[1], you mentioned that the issue is caused by non-IO
> queue of WLUN, but in this version, looks there isn't such story any more.
>
> IMO, it isn't reasonable to account non-IO LUN for tag fairness, so
> solution could be to not take non-IO queue into account for fair tag
> sharing. But disabling fair tag sharing for this whole tagset could be
> too over-kill.
>
> And if you mean normal IO LUNs, can you share more details about the
> performance drop? such as the test case, how many IO LUNs, and how to
> observe performance drop, cause it isn't simple any more since multiple
> LUN's perf has to be considered.
>
> [1] https://lore.kernel.org/linux-block/20231018180056.2151711-1-bvanassche@acm.org/

Hi Ming,

Submitting I/O to a WLUN is only one example of a use case that activates
the fair sharing algorithm for UFS devices. Another use case is simultaneous
activity for multiple data LUNs. Conventional UFS devices typically have four
data LUNs and zoned UFS devices typically have five data LUNs. From an
Android device with a zoned UFS device:

$ adb shell ls /sys/class/scsi_device
0:0:0:0
0:0:0:1
0:0:0:2
0:0:0:3
0:0:0:4
0:0:0:49456
0:0:0:49476
0:0:0:49488

The first five are data logical units. The last three are WLUNs.

For a block size of 4 KiB, I see 144 K IOPS for queue depth 31 and 107 K IOPS
for queue depth 15 (the queue depth is reduced from 31 to 15 if I/O is being
submitted to two LUNs simultaneously). In other words, disabling fair sharing
results in up to 35% higher IOPS for small reads in case two logical units
are active simultaneously. I think that's a very significant performance
difference.

Thanks,

Bart.
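For reference, the roughly halved queue depth that Bart mentions comes from
the way blk-mq splits the tag space between active queues: each active queue
is allowed about total_tags / active_queues tags, with a small lower bound.
The following userspace sketch is modeled loosely on the kernel's
hctx_may_queue() logic and deliberately simplified, so it does not reproduce
the exact 31 -> 15 figures above, which also depend on reserved tags and
rounding details.

#include <stdio.h>

/*
 * Simplified model of blk-mq fair tag sharing: each active queue gets
 * roughly an equal share of the tag space, with a small minimum.
 */
static unsigned int fair_share(unsigned int total_tags, unsigned int active_users)
{
        unsigned int depth;

        if (active_users <= 1)
                return total_tags;
        depth = (total_tags + active_users - 1) / active_users;
        return depth < 4 ? 4 : depth;
}

int main(void)
{
        unsigned int tags = 31; /* usable tags on the UFS 3.1 device discussed above */

        for (unsigned int users = 1; users <= 5; users++)
                printf("active LUNs: %u -> per-LUN tag limit: %u\n",
                       users, fair_share(tags, users));
        return 0;
}

With two active LUNs the per-LUN limit drops to about half of the 31 tags,
which is the effect the IOPS numbers above are measuring.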
On Tue, Oct 24, 2023 at 09:41:50AM -0700, Bart Van Assche wrote:
> On 10/23/23 19:28, Ming Lei wrote:
> > On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
> > > [...]
> >
> > [...]
>
> Hi Ming,
>
> Submitting I/O to a WLUN is only one example of a use case that
> activates the fair sharing algorithm for UFS devices. Another use
> case is simultaneous activity for multiple data LUNs. Conventional
> UFS devices typically have four data LUNs and zoned UFS devices
> typically have five data LUNs. From an Android device with a zoned UFS
> device:
>
> $ adb shell ls /sys/class/scsi_device
> 0:0:0:0
> 0:0:0:1
> 0:0:0:2
> 0:0:0:3
> 0:0:0:4
> 0:0:0:49456
> 0:0:0:49476
> 0:0:0:49488
>
> The first five are data logical units. The last three are WLUNs.
>
> For a block size of 4 KiB, I see 144 K IOPS for queue depth 31 and
> 107 K IOPS for queue depth 15 (queue depth is reduced from 31 to 15
> if I/O is being submitted to two LUNs simultaneously). In other words,
> disabling fair sharing results in up to 35% higher IOPS for small reads
> in case two logical units are active simultaneously. I think that's
> a very significant performance difference.

Yeah, performance does drop when the queue depth is cut in half, if the
queue depth is low enough to begin with.

However, it isn't enough to just test performance over one LUN: what is the
performance effect when running I/O over the 2 or 5 data LUNs concurrently?

SATA should have a similar issue too, and I think the improvement could be
more generic: bypass fair tag sharing in case of low queue depth (such as
< 32), if it turns out that fair tag sharing doesn't work well at low queue
depth.

Also, the 'fairness' could be enhanced dynamically via the SCSI LUN's queue
depth, which can be adjusted dynamically.

Thanks,
Ming
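Ming's low-queue-depth bypass could be prototyped as a small tweak to the
same kind of per-queue limit calculation. The sketch below is purely
illustrative: the 32-tag threshold comes from the mail above, and the policy
is hypothetical, not kernel code.

#include <stdio.h>

/*
 * Toy illustration of "bypass fair sharing at low queue depth": below the
 * threshold every active queue may use the whole tag space; above it, the
 * tags are split between active queues as before.
 */
#define FAIR_SHARING_DEPTH_THRESHOLD 32U

static unsigned int per_queue_limit(unsigned int total_tags, unsigned int active_users)
{
        if (total_tags < FAIR_SHARING_DEPTH_THRESHOLD || active_users <= 1)
                return total_tags;      /* shallow queue: no throttling */
        return (total_tags + active_users - 1) / active_users;
}

int main(void)
{
        printf("31 tags, 2 users  -> %u\n", per_queue_limit(31, 2));  /* 31: bypassed */
        printf("256 tags, 2 users -> %u\n", per_queue_limit(256, 2)); /* 128: still shared */
        return 0;
}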
Hi,

> On Tue, Oct 24, 2023 at 09:41:50AM -0700, Bart Van Assche wrote:
> > On 10/23/23 19:28, Ming Lei wrote:
> > > On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
> > > > [...]
> > >
> > > [...]
> >
> > Hi Ming,
> >
> > Submitting I/O to a WLUN is only one example of a use case that
> > activates the fair sharing algorithm for UFS devices. Another use case
> > is simultaneous activity for multiple data LUNs. Conventional UFS
> > devices typically have four data LUNs and zoned UFS devices typically
> > have five data LUNs. From an Android device with a zoned UFS device:
> >
> > $ adb shell ls /sys/class/scsi_device
> > 0:0:0:0
> > 0:0:0:1
> > 0:0:0:2
> > 0:0:0:3
> > 0:0:0:4
> > 0:0:0:49456
> > 0:0:0:49476
> > 0:0:0:49488
> >
> > The first five are data logical units. The last three are WLUNs.
> >
> > For a block size of 4 KiB, I see 144 K IOPS for queue depth 31 and
> > 107 K IOPS for queue depth 15 (queue depth is reduced from 31 to 15 if
> > I/O is being submitted to two LUNs simultaneously). In other words,
> > disabling fair sharing results in up to 35% higher IOPS for small
> > reads in case two logical units are active simultaneously. I think
> > that's a very significant performance difference.

This issue has been known for a long time now. Whenever we needed to provide
performance measurements, we used to disable that mechanism, hacking the
kernel locally - one way or the other. I know that my peers are doing this
as well.

> Yeah, performance does drop when queue depth is cut to half if queue depth
> is low enough.
>
> However, it isn't enough to just test perf over one LUN, what is the perf
> effect when running IOs over the 2 or 5 data LUNs concurrently?
>
> SATA should have similar issue too, and I think the improvement may be
> more generic to bypass fair tag sharing in case of low queue depth (such
> as < 32) if turns out the fair tag sharing doesn't work well in case low
> queue depth.
>
> Also the 'fairness' could be enhanced dynamically by scsi LUN's queue
> depth, which can be adjusted dynamically.

As far as our concerns as a UFS device manufacturer & user go, we find the
current proposal a clean and elegant solution that we would like to adopt.
Further, a more sophisticated scheme can be adopted on top of disabling fair
tag sharing.

Thanks,
Avri
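For readers who have not looked at the patches: the series exposes the
opt-out as a flag on the blk-mq tag set, so a SCSI host driver such as
ufshcd can exempt all of its hardware queues from the fair-share limit. The
sketch below only shows the general shape; the flag name
BLK_MQ_F_DISABLE_FAIR_TAG_SHARING and the call site are assumptions and may
not match the posted patches exactly.

#include <linux/blk-mq.h>
#include <scsi/scsi_host.h>

/*
 * Hypothetical sketch, not the actual patch: a SCSI host driver opting its
 * shared tag set out of fair tag sharing. The flag name is assumed; consult
 * the posted series for the real identifier and the intended call site.
 */
static void example_disable_fair_sharing(struct Scsi_Host *shost)
{
        /* shost->tag_set is the blk-mq tag set shared by every LUN of this host. */
        shost->tag_set.flags |= BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
}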
On 10/24/23 18:33, Ming Lei wrote:
> Yeah, performance does drop when queue depth is cut to half if queue
> depth is low enough.
>
> However, it isn't enough to just test perf over one LUN, what is the
> perf effect when running IOs over the 2 or 5 data LUNs
> concurrently?

I think that the results I shared are sufficient because they show the worst
possible performance impact of fair tag sharing (two active logical units
and much more activity on one logical unit than on the other).

> SATA should have similar issue too, and I think the improvement may
> be more generic to bypass fair tag sharing in case of low queue depth
> (such as < 32) if turns out the fair tag sharing doesn't work well in
> case low queue depth.
>
> Also the 'fairness' could be enhanced dynamically by scsi LUN's
> queue depth, which can be adjusted dynamically.

Most SATA devices are hard disks. Hard disk IOPS are constrained by the
speed with which the head of a hard disk can move, and that speed hasn't
changed much during the past 40 years. I'm not sure that hard disks are
impacted as much as SSD devices by fair tag sharing.

Any algorithm that is more complicated than what I posted would probably
have a negative performance impact on storage devices that use NAND
technology, e.g. UFS devices. So I prefer to proceed with this patch series
and to solve any issues with ATA devices separately. Once this patch series
has been merged, it could be used as a basis for a solution for ATA devices.
A solution for ATA devices does not have to be implemented in the block
layer core - it could e.g. be implemented in the ATA subsystem.

Thanks,

Bart.
On Wed, Oct 25, 2023 at 12:01:33PM -0700, Bart Van Assche wrote:
> On 10/24/23 18:33, Ming Lei wrote:
> > Yeah, performance does drop when queue depth is cut to half if queue
> > depth is low enough.
> >
> > However, it isn't enough to just test perf over one LUN, what is the
> > perf effect when running IOs over the 2 or 5 data LUNs
> > concurrently?
>
> I think that the results I shared are sufficient because these show the
> worst possible performance impact of fair tag sharing (two active
> logical units and much more activity on one logical unit than on the
> other).

You are talking about the multi-LUN case, and your change does affect the
multi-LUN code path, but your test results don't cover multi-LUN - is that
enough? At the very least your patch shouldn't cause a performance
regression for multi-LUN I/O workloads, right?

> > SATA should have similar issue too, and I think the improvement may be
> > more generic to bypass fair tag sharing in case of low queue depth
> > (such as < 32) if turns out the fair tag sharing doesn't work well in
> > case low queue depth.
> >
> > Also the 'fairness' could be enhanced dynamically by scsi LUN's
> > queue depth, which can be adjusted dynamically.
>
> Most SATA devices are hard disks. Hard disk IOPS are constrained by the
> speed with which the head of a hard disk can move. That speed hasn't
> changed much during the past 40 years. I'm not sure that hard disks are
> impacted as much as SSD devices by fair tag sharing.

What I meant is that SATA's queue depth is often 32 or 31, and multi-LUN
cases exist there as well. At least from what you shared, fair tag sharing
doesn't work well simply because of the low queue depth; nothing is actually
specific to UFS. That is why I am wondering why fair sharing isn't forcibly
disabled in case of low queue depth.

> Any algorithm that is more complicated than what I posted probably would
> have a negative performance impact on storage devices that use NAND
> technology, e.g. UFS devices. So I prefer to proceed with this patch
> series and solve any issues with ATA devices separately. Once this patch
> series has been merged, it could be used as a basis for a solution for
> ATA devices. A solution for ATA devices does not have to be implemented
> in the block layer core - it could e.g. be implemented in the ATA subsystem.

I don't object to taking the disabling of fair sharing first; what I meant
is that the fairness may be brought back in the future by adjusting the
scsi_device's queue depth.

Thanks,
Ming
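As background for the dynamic queue-depth adjustment Ming mentions: the SCSI
midlayer already exports scsi_change_queue_depth(), which a low-level driver
(or sysfs) can call at runtime on a per-LUN basis. A hypothetical sketch of
using it to throttle one LUN follows; the priority policy and the depth of 8
are made up for illustration.

#include <linux/types.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/*
 * Illustrative only: cap the queue depth of a low-priority LUN so that other
 * LUNs effectively see more of the shared tag space, without changing the
 * fair tag sharing code in blk-mq. The policy and the value 8 are
 * assumptions made for this sketch.
 */
static void example_throttle_lun(struct scsi_device *sdev, bool low_priority)
{
        int depth = low_priority ? 8 : sdev->host->can_queue;

        scsi_change_queue_depth(sdev, depth);
}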
On 10/25/23 16:37, Ming Lei wrote:
> You are talking about multi-lun case, and your change does affect
> multi-lun code path, but your test result doesn't cover multi-lun, is
> it enough?

I think that the results I shared are sufficient because they show the worst
possible performance impact of fair tag sharing. If there are two active
logical units and much more I/O is sent to one logical unit than to the
other, then the aggregate performance will be very close to what I shared in
a previous email.

> At least your patch shouldn't cause performance regression on
> multi-lun IO workloads, right?

If blk_mq_get_tag() can't allocate a tag, and if multiple threads are
waiting for a tag, the thread that called blk_mq_get_tag() first is granted
the first tag that is released. I think this guarantees fairness if all
requests have a similar latency. There will be some unfairness if there are
significant differences in latency per logical unit, e.g. because all
requests sent to one logical unit are small and all requests sent to another
logical unit are large. Whether or not this matters depends on the use case.

Thanks,

Bart.
On 10/25/23 11:50, Avri Altman wrote:
> This issue is known for a long time now.
> Whenever we needed to provide performance measures,
> We used to disable that mechanism, hacking the kernel locally - one way or the other.
> I know that my peers are doing this as well.

The Android 14 kernels have had this patch since May (five months ago). None
of the OEMs who use this kernel branch have complained so far.

Thanks,

Bart.
Hi,

On 2023/10/27 0:29, Bart Van Assche wrote:
> If blk_mq_get_tag() can't allocate a tag, and if multiple threads are
> waiting for a tag, the thread that called blk_mq_get_tag() first is
> granted the first tag that is released. I think this guarantees fairness
> if all requests have a similar latency. There will be some unfairness if
> there are significant differences in latency per logical unit, e.g.
> because all requests sent to one logical unit are small and because all
> requests sent to another logical unit are large. Whether or not this
> matters depends on the use case.

I'm afraid that is not correct: fairness can't be guaranteed at all, not
even with just one SCSI disk. This is because there are 8 wait queues in
sbitmap and threads wait in round-robin mode. Each time wake_batch tags are
released, wake_batch threads of one wait queue are woken up, regardless of
the fact that some of those threads can't grab a tag after being woken up;
what's worse, those threads are then added to the tail of a wait queue
again.

Under high I/O pressure on a slow disk, this behaviour causes I/O tail
latency to be quite bad compared to the single-queue code path in old
kernels.

AFAIC, disabling tag sharing will definitely cause some regression. For
example, with one disk under high I/O pressure and another disk that only
issues one I/O at a time, disabling tag sharing can improve the bandwidth of
the first disk, but for the latter disk the I/O latency will definitely be
much worse.

Thanks,
Kuai
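To make the wakeup behaviour Kuai describes easier to follow, here is a toy
userspace model of the sbitmap scheme (assumptions: 8 wait queues, a strict
round-robin wake index, a fixed wake_batch; the real code in lib/sbitmap.c
has additional details such as per-queue wake counters).

#include <stdio.h>

#define SBQ_WAIT_QUEUES 8   /* sbitmap_queue keeps 8 wait queues */

/*
 * Toy model: waiters[i] is the number of threads sleeping on wait queue i.
 * Each time 'wake_batch' tags are freed, one wait queue is picked
 * round-robin and up to wake_batch of its waiters are woken, whether or not
 * enough tags are actually available for all of them.
 */
struct toy_sbq {
        unsigned int waiters[SBQ_WAIT_QUEUES];
        unsigned int wake_index;
        unsigned int wake_batch;
};

static void toy_sbq_wake_batch(struct toy_sbq *sbq)
{
        unsigned int i = sbq->wake_index;
        unsigned int n = sbq->waiters[i] < sbq->wake_batch ?
                         sbq->waiters[i] : sbq->wake_batch;

        sbq->waiters[i] -= n;
        sbq->wake_index = (i + 1) % SBQ_WAIT_QUEUES;
        printf("woke %u waiter(s) from wait queue %u\n", n, i);
}

int main(void)
{
        struct toy_sbq sbq = {
                .waiters = { 4, 4, 4, 4, 4, 4, 4, 4 },
                .wake_index = 0,
                .wake_batch = 4,
        };

        /* Simulate several completion batches arriving. */
        for (int batch = 0; batch < 10; batch++)
                toy_sbq_wake_batch(&sbq);
        return 0;
}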
On 10/30/23 19:01, Yu Kuai wrote:
> I'm afraid that is not correct: fairness can't be guaranteed at all, not
> even with just one SCSI disk. This is because there are 8 wait queues in
> sbitmap and threads wait in round-robin mode. Each time wake_batch tags
> are released, wake_batch threads of one wait queue are woken up,
> regardless of the fact that some of those threads can't grab a tag after
> being woken up; what's worse, those threads are then added to the tail of
> a wait queue again.
>
> Under high I/O pressure on a slow disk, this behaviour causes I/O tail
> latency to be quite bad compared to the single-queue code path in old
> kernels.
>
> AFAIC, disabling tag sharing will definitely cause some regression. For
> example, with one disk under high I/O pressure and another disk that only
> issues one I/O at a time, disabling tag sharing can improve the bandwidth
> of the first disk, but for the latter disk the I/O latency will definitely
> be much worse.

Hi Yu,

All UFS users I know of prefer that fair tag sharing be disabled for UFS
devices. So I propose to proceed with this patch series and, additionally,
to improve fair tag sharing for the devices and transport layers that need
an improved sharing algorithm.

Thanks,

Bart.