Message ID | 81b399f6-55f5-4aa2-0f31-8b4f8a44e6a4@redhat.com
---|---
State | New
Series | blk-settings: round down io_opt to at least 4K
On Mon, Jan 20, 2025 at 04:16:26PM +0100, Mikulas Patocka wrote:
> Some SATA SSDs and most NVMe SSDs report physical block size 512 bytes,
> but they use 4K remapping table internally and they do slow
> read-modify-write cycle for requests that are not aligned on 4K boundary.
> Therefore, io_opt should be aligned on 4K.

Not really.  I mean it's always smart to not do tiny unaligned I/O
unless you have to.  So we're not just going to cap an exported value
to a magic number because of something.

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
> Fixes: 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size")

Please explain how this actually is a fix.
On 1/20/25 4:16 PM, Mikulas Patocka wrote:
> Some SATA SSDs and most NVMe SSDs report physical block size 512 bytes,
> but they use 4K remapping table internally and they do slow
> read-modify-write cycle for requests that are not aligned on 4K boundary.
> Therefore, io_opt should be aligned on 4K.

This also changes how various (usually broken) USB enclosures report
opt-io (as you already found, it breaks our cryptsetup align test
simulating these, resulting in much larger LUKS data alignment).

Not sure if the patch is acceptable, but if so, please add a comment
that it can have side effects - it apparently affects more systems than
NVMe and SSDs, definitely some USB storage devices.

Milan

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
> Fixes: 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size")
> Cc: stable@vger.kernel.org # v6.11+
>
> ---
>  block/blk-settings.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/block/blk-settings.c
> ===================================================================
> --- linux-2.6.orig/block/blk-settings.c	2025-01-03 21:10:56.000000000 +0100
> +++ linux-2.6/block/blk-settings.c	2025-01-20 15:59:13.000000000 +0100
> @@ -269,8 +269,12 @@ int blk_validate_limits(struct queue_lim
>  	 * The optimal I/O size may not be aligned to physical block size
>  	 * (because it may be limited by dma engines which have no clue about
>  	 * block size of the disks attached to them), so we round it down here.
> +	 *
> +	 * Note that some SSDs erroneously report physical_block_size 512
> +	 * despite the fact that they have remapping table granularity 4K and
> +	 * they perform read-modify-write for unaligned requests.
>  	 */
> -	lim->io_opt = round_down(lim->io_opt, lim->physical_block_size);
> +	lim->io_opt = round_down(lim->io_opt, max(4096, lim->physical_block_size));
>
>  	/*
>  	 * max_hw_sectors has a somewhat weird default for historical reason,
On Tue, 21 Jan 2025, Christoph Hellwig wrote:

> On Mon, Jan 20, 2025 at 04:16:26PM +0100, Mikulas Patocka wrote:
> > Some SATA SSDs and most NVMe SSDs report physical block size 512 bytes,
> > but they use 4K remapping table internally and they do slow
> > read-modify-write cycle for requests that are not aligned on 4K boundary.
> > Therefore, io_opt should be aligned on 4K.
>
> Not really.  I mean it's always smart to not do tiny unaligned I/O
> unless you have to.  So we're not just going to cap an exported value
> to a magic number because of something.

The purpose of this patch is to avoid doing I/O not aligned on a 4k
boundary.

The 512-byte value that some SSDs report is just a lie.

> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
> > Fixes: 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size")
>
> Please explain how this actually is a fix.

Some USB-SATA bridges report an optimal I/O size of 33553920 bytes (that
is 512*65535). If you connect a SATA SSD that reports a 512-byte physical
sector size to this kind of USB-SATA bridge, the kernel will believe
that the value 33553920 is a valid optimal I/O size and it will attempt
to align I/O to this boundary - the result will be that most of the
I/O will not be aligned on 4k, causing performance degradation.

An optimal I/O size of 33553920 may also confuse userspace tools (lvm,
cryptsetup) into aligning logical volumes on a 33553920-byte boundary.

Mikulas
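[A quick stand-alone illustration of the arithmetic above - a userspace
sketch with a stand-in for the kernel's round_down() macro, not part of
the patch: 33553920 is a multiple of 512 but not of 4096, so anything
aligned to that boundary lands 3584 bytes past a 4K boundary, while
rounding it down as the patch does yields a 4K-aligned 33550336.]

#include <stdio.h>

/* Userspace stand-in for the kernel's round_down(); valid for
 * power-of-two alignments, which is all that is needed here. */
#define round_down(x, a) ((x) & ~((__typeof__(x))(a) - 1))

int main(void)
{
	unsigned int io_opt = 512u * 65535u;	/* 33553920, as reported by the bridge */

	printf("io_opt %% 4096        = %u\n", io_opt % 4096u);	   /* 3584 -> not 4K aligned */
	printf("round_down(.., 512)  = %u\n", round_down(io_opt, 512u));  /* 33553920, unchanged */
	printf("round_down(.., 4096) = %u\n", round_down(io_opt, 4096u)); /* 33550336, 4K aligned */
	return 0;
}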
Hi Mikulas!

> The purpose of this patch is to avoid doing I/O not aligned on a 4k
> boundary.
>
> The 512-byte value that some SSDs report is just a lie.

So is 4K, though.

> Some USB-SATA bridges report an optimal I/O size of 33553920 bytes (that
> is 512*65535). If you connect a SATA SSD that reports a 512-byte physical
> sector size to this kind of USB-SATA bridge, the kernel will believe
> that the value 33553920 is a valid optimal I/O size and it will attempt
> to align I/O to this boundary - the result will be that most of the
> I/O will not be aligned on 4k, causing performance degradation.

SCSI tries to make sure the characteristics reported by the device make
sense and are consistent with each other. If we encounter a device which
reports something incorrect despite passing the sanity checks, we quirk
it.

But I wonder why we're even reading the block limits if this is a USB
device?
On Mon, 3 Feb 2025, Martin K. Petersen wrote:

> Hi Mikulas!
>
> > The purpose of this patch is to avoid doing I/O not aligned on a 4k
> > boundary.
> >
> > The 512-byte value that some SSDs report is just a lie.
>
> So is 4K, though.

Do you think that SSDs benefit from more alignment than 4K?

Based on my understanding (and some IOPS testing), SSDs have a remapping
table that maps every 4K block to its physical location. So, 8K (or
larger) I/Os don't have higher IOPS than 4K I/Os. 2K and smaller I/Os
have worse IOPS than 4K I/Os, because they perform a read-modify-write
cycle.

If you have some SSD that has higher IOPS with alignment larger than 4K,
let me know.

Mikulas
Mikulas,

> Do you think that SSDs benefit from more alignment than 4K?

It really depends on the SSD.

> If you have some SSD that has higher IOPS with alignment larger than
> 4K, let me know.

Absolutely.

There are devices out there reporting characteristics that are not
intuitively correct, i.e. not obvious powers of two, due to internal
striping or allocation units. Some devices prefer an alignment of 768KB,
for instance. Also, wide RAID stripes can end up causing peculiar
alignment to be reported.

Instead of assuming that 4KB is the correct value for all devices, I'd
rather address the cases where one particular device is reporting
something wrong.

I'm willing to entertain having a quirk to ignore reported values of
0xffff (have also seen 0xfffe) for USB-ATA bridge devices. But I don't
think we should blindly distrust what devices using other transport
classes happen to report; it's usually only USB bridges that are spewing
nonsense. Which is why we typically avoid trying to make sense of VPD
pages on USB in the first place.
On Mon, Feb 03, 2025 at 02:38:25PM +0100, Mikulas Patocka wrote:
> > Not really.  I mean it's always smart to not do tiny unaligned I/O
> > unless you have to.  So we're not just going to cap an exported value
> > to a magic number because of something.
>
> The purpose of this patch is to avoid doing I/O not aligned on a 4k
> boundary.
>
> The 512-byte value that some SSDs report is just a lie.

That's very strong words.  NVMe until fairly recently had no way to
report this value at all, so defaulting to the LBA size is not a lie.
Similarly, the device characteristics page is supposed to be skipped
for USB-attached SCSI devices as it is buggy much more often than not.

You still haven't mentioned which consumer of the value that you care
about is affected, but it probably needs to be taught that io_opt ==
logical_block_size means that no optimal I/O size is provided at all.

> Some USB-SATA bridges report an optimal I/O size of 33553920 bytes (that
> is 512*65535).

Well, that's clearly bogus and we'll need a tweak.  That being said I
was pretty sure we wouldn't even read that value for USB attachments.
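[A minimal sketch of the consumer-side check being alluded to above; the
helper name is hypothetical, not an existing kernel API, and it assumes a
kernel build environment for struct queue_limits.]

#include <linux/blkdev.h>

/* Hypothetical helper, for illustration only: treat a reported optimal
 * I/O size as absent when it carries no more information than the
 * logical block size (or is zero). */
static inline bool io_opt_is_meaningful(const struct queue_limits *lim)
{
	return lim->io_opt > lim->logical_block_size;
}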
On Mon, Feb 03, 2025 at 10:05:27PM +0100, Mikulas Patocka wrote:
> Do you think that SSDs benefit from more alignment than 4K?

SSDs are a very broad category of devices.  But yes, many will.

> Based on my understanding (and some IOPS testing), SSDs have a remapping
> table that maps every 4K block to its physical location. So, 8K (or
> larger) I/Os don't have higher IOPS than 4K I/Os.

4K is the traditional value, but it has been increasing at least for
some classes of SSDs for a while.
On Mon, Feb 03, 2025 at 10:30:42PM -0500, Martin K. Petersen wrote:
> I'm willing to entertain having a quirk to ignore reported values of
> 0xffff (have also seen 0xfffe) for USB-ATA bridge devices.

0xffff as the max and a weirdly uneven number even for parity RAID
seems like a good value to ignore by default.  But then again just
dealing with USB will probably catch the majority of cases.
Christoph,

> 0xffff as the max and a weirdly uneven number even for parity RAID
> seems like a good value to ignore by default.

Quite a few SCSI devices report 0xffff to indicate that the optimal
transfer length is the same as the maximum transfer length, which for
low-byte commands is capped at 0xffff. That's where the odd value comes
from in some cases.

> But then again just dealing with USB will probably catch the majority
> of cases.

Yep.
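[A rough sketch of the quirk being discussed; the helper and its
placement are hypothetical, not existing sd.c code.  The idea: treat
OPTIMAL TRANSFER LENGTH values of 0xffff/0xfffe blocks from a
USB-attached device as "not provided" rather than propagating them into
io_opt.]

#include <linux/types.h>

/* Hypothetical quirk check, for illustration only: 0xffff and 0xfffe
 * block counts from USB bridges are almost certainly residue of the
 * 16-bit transfer length field, not a real optimal I/O size. */
static bool opt_xfer_blocks_is_bogus(u32 opt_xfer_blocks, bool usb_attached)
{
	if (!usb_attached)
		return false;
	return opt_xfer_blocks == 0xffff || opt_xfer_blocks == 0xfffe;
}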
On Tue, Feb 04, 2025 at 08:02:17AM -0500, Martin K. Petersen wrote:
>
> Christoph,
>
> > 0xffff as the max and a weirdly uneven number even for parity RAID
> > seems like a good value to ignore by default.
>
> Quite a few SCSI devices report 0xffff to indicate that the optimal
> transfer length is the same as the maximum transfer length, which for
> low-byte commands is capped at 0xffff. That's where the odd value comes
> from in some cases.

Hmm, optimal == max is odd, and I'm pretty sure it's going to confuse
the hell out of whatever non-sophisticated users of the optimal I/O
size Mikulas is dealing with.
On Mon, 3 Feb 2025, Martin K. Petersen wrote:

> Mikulas,
>
> > Do you think that SSDs benefit from more alignment than 4K?
>
> It really depends on the SSD.
>
> > If you have some SSD that has higher IOPS with alignment larger than
> > 4K, let me know.
>
> Absolutely.
>
> There are devices out there reporting characteristics that are not
> intuitively correct, i.e. not obvious powers of two, due to internal
> striping or allocation units. Some devices prefer an alignment of 768KB,
> for instance. Also, wide RAID stripes can end up causing peculiar
> alignment to be reported.

Sure, RAID4/5/6 performs better when you write a full stripe. But I'd
like to know whether individual SSDs have increased IOPS when the
request size is increased.

The write tests that I did (some years ago, I don't remember the
details) showed that SSDs have the best write IOPS for 4k requests. For
smaller requests they perform worse (because of read-modify-write); for
larger requests they also have slightly lower IOPS (because they write
more data per request).

If there is some particular SSD that has more write IOPS for 8k requests
than for 4k requests, I'd like to know about it - out of curiosity.

When I was developing dm-integrity, I just aligned metadata and data on
a 4k boundary, regardless of what the underlying device advertised.

> Instead of assuming that 4KB is the correct value for all devices, I'd
> rather address the cases where one particular device is reporting
> something wrong.

The patch that I made rounds the optimal I/O size down to a multiple of
4k; it doesn't force the optimal I/O size to be 4k.

> I'm willing to entertain having a quirk to ignore reported values of
> 0xffff (have also seen 0xfffe) for USB-ATA bridge devices. But I don't
> think we should blindly distrust what devices using other transport
> classes happen to report; it's usually only USB bridges that are spewing
> nonsense. Which is why we typically avoid trying to make sense of VPD
> pages on USB in the first place.

Mikulas
Index: linux-2.6/block/blk-settings.c
===================================================================
--- linux-2.6.orig/block/blk-settings.c	2025-01-03 21:10:56.000000000 +0100
+++ linux-2.6/block/blk-settings.c	2025-01-20 15:59:13.000000000 +0100
@@ -269,8 +269,12 @@ int blk_validate_limits(struct queue_lim
 	 * The optimal I/O size may not be aligned to physical block size
 	 * (because it may be limited by dma engines which have no clue about
 	 * block size of the disks attached to them), so we round it down here.
+	 *
+	 * Note that some SSDs erroneously report physical_block_size 512
+	 * despite the fact that they have remapping table granularity 4K and
+	 * they perform read-modify-write for unaligned requests.
 	 */
-	lim->io_opt = round_down(lim->io_opt, lim->physical_block_size);
+	lim->io_opt = round_down(lim->io_opt, max(4096, lim->physical_block_size));

 	/*
 	 * max_hw_sectors has a somewhat weird default for historical reason,
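[For reference, a small userspace sketch - with stand-ins for the
kernel's round_down() and max() - of how the changed line behaves when
physical_block_size is below and above 4K; the values are made up for
illustration.]

#include <stdio.h>

/* Userspace stand-ins for the kernel helpers used in the hunk above. */
#define round_down(x, a) ((x) & ~((__typeof__(x))(a) - 1))
#define max(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	/* physical_block_size 512: the rounding granularity is lifted to 4096 */
	printf("%u\n", round_down(33553920u, max(4096u, 512u)));	/* 33550336 */
	/* physical_block_size 16384: the larger block size still wins */
	printf("%u\n", round_down(33562624u, max(4096u, 16384u)));	/* 33554432 */
	return 0;
}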
Some SATA SSDs and most NVMe SSDs report physical block size 512 bytes,
but they use 4K remapping table internally and they do slow
read-modify-write cycle for requests that are not aligned on 4K boundary.
Therefore, io_opt should be aligned on 4K.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
Fixes: 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size")
Cc: stable@vger.kernel.org # v6.11+

---
 block/blk-settings.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)