blk-settings: round down io_opt to at least 4K

Message ID 81b399f6-55f5-4aa2-0f31-8b4f8a44e6a4@redhat.com (mailing list archive)
State New
Headers show
Series blk-settings: round down io_opt to at least 4K

Commit Message

Mikulas Patocka Jan. 20, 2025, 3:16 p.m. UTC
Some SATA SSDs and most NVMe SSDs report a physical block size of 512
bytes, but they use a 4K remapping table internally and perform a slow
read-modify-write cycle for requests that are not aligned on a 4K boundary.
Therefore, io_opt should be aligned to 4K.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
Fixes: 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size")
Cc: stable@vger.kernel.org	# v6.11+

---
 block/blk-settings.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig Jan. 22, 2025, 6:12 a.m. UTC | #1
On Mon, Jan 20, 2025 at 04:16:26PM +0100, Mikulas Patocka wrote:
> Some SATA SSDs and most NVMe SSDs report a physical block size of 512
> bytes, but they use a 4K remapping table internally and perform a slow
> read-modify-write cycle for requests that are not aligned on a 4K boundary.
> Therefore, io_opt should be aligned to 4K.

Not really.  I mean it's always smart to not do tiny unaligned I/O
unless you have to.  So we're not just going to cap an exported value
to a magic number because of something.


> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
> Fixes: 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size")

Please explain how this actually is a fix.
Milan Broz Jan. 23, 2025, 12:24 p.m. UTC | #2
On 1/20/25 4:16 PM, Mikulas Patocka wrote:
> Some SATA SSDs and most NVMe SSDs report a physical block size of 512
> bytes, but they use a 4K remapping table internally and perform a slow
> read-modify-write cycle for requests that are not aligned on a 4K boundary.
> Therefore, io_opt should be aligned to 4K.

This also changes how various (usually broken) USB enclosures report opt-io
(as you already found, it breaks our cryptsetup alignment test that simulates
these devices, resulting in much larger LUKS data alignment).

Not sure if the patch is acceptable, but if so, please add a comment that
it can have side effects - it apparently affects more than just NVMe drives
and SSDs, definitely also some USB storage devices.

Milan


> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
> Fixes: 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size")
> Cc: stable@vger.kernel.org	# v6.11+
> 
> ---
>   block/blk-settings.c |    6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/block/blk-settings.c
> ===================================================================
> --- linux-2.6.orig/block/blk-settings.c	2025-01-03 21:10:56.000000000 +0100
> +++ linux-2.6/block/blk-settings.c	2025-01-20 15:59:13.000000000 +0100
> @@ -269,8 +269,12 @@ int blk_validate_limits(struct queue_lim
>   	 * The optimal I/O size may not be aligned to physical block size
>   	 * (because it may be limited by dma engines which have no clue about
>   	 * block size of the disks attached to them), so we round it down here.
> +	 *
> +	 * Note that some SSDs erroneously report physical_block_size 512
> +	 * despite the fact that they have remapping table granularity 4K and
> +	 * they perform read-modify-write for unaligned requests.
>   	 */
> -	lim->io_opt = round_down(lim->io_opt, lim->physical_block_size);
> +	lim->io_opt = round_down(lim->io_opt, max(4096, lim->physical_block_size));
>   
>   	/*
>   	 * max_hw_sectors has a somewhat weird default for historical reason,
>
Mikulas Patocka Feb. 3, 2025, 1:38 p.m. UTC | #3
On Tue, 21 Jan 2025, Christoph Hellwig wrote:

> On Mon, Jan 20, 2025 at 04:16:26PM +0100, Mikulas Patocka wrote:
> > Some SATA SSDs and most NVMe SSDs report a physical block size of 512
> > bytes, but they use a 4K remapping table internally and perform a slow
> > read-modify-write cycle for requests that are not aligned on a 4K boundary.
> > Therefore, io_opt should be aligned to 4K.
> 
> Not really.  I mean it's always smart to not do tiny unaligned I/O
> unless you have to.  So we're not just going to cap an exported value
> to a magic number because of something.

The purpose of this patch is to avoid doing I/O that is not aligned on a
4k boundary.

The 512-byte value that some SSDs report is just a lie.

> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
> > Fixes: 9c0ba14828d6 ("blk-settings: round down io_opt to physical_block_size")
> 
> Please explain how this actually is a fix.

Some USB-SATA bridges report an optimal I/O size of 33553920 bytes (that
is 512*65535). If you connect a SATA SSD that reports a 512-byte physical
sector size to this kind of USB-SATA bridge, the kernel will believe that
33553920 is a valid optimal I/O size and will attempt to align I/O to this
boundary - the result is that most of the I/O will not be aligned on 4k,
causing performance degradation.

An optimal I/O size of 33553920 may also confuse userspace tools (lvm,
cryptsetup) into aligning logical volumes on a 33553920-byte boundary.
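
To make the numbers concrete, here is a minimal standalone sketch (plain
userspace C, not the kernel code; the ROUND_DOWN macro just mimics
round_down() for these power-of-two alignments) of what the current and
the patched rounding do to that value:

#include <stdio.h>

#define ROUND_DOWN(x, a) ((x) / (a) * (a))	/* same result as round_down() here */

int main(void)
{
	unsigned int io_opt = 512u * 65535u;	/* 33553920, as reported by the bridge */
	unsigned int phys_bs = 512;		/* physical block size reported by the SSD */

	/* current code: rounds io_opt down to the physical block size, so the
	 * bogus value survives unchanged */
	printf("current: %u\n", ROUND_DOWN(io_opt, phys_bs));

	/* patched code: rounds io_opt down to max(4096, physical block size) */
	printf("patched: %u\n", ROUND_DOWN(io_opt, phys_bs > 4096u ? phys_bs : 4096u));

	return 0;
}

This prints 33553920 and then 33550336 - the patched value is still an
odd-looking number, but at least it is a multiple of 4k.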

Mikulas
Martin K. Petersen Feb. 3, 2025, 7:02 p.m. UTC | #4
Hi Mikulas!

> The purpose of this patch is to avoid doing I/O that is not aligned on a
> 4k boundary.
>
> The 512-byte value that some SSDs report is just a lie.

So is 4K, though.

> Some USB-SATA bridges report an optimal I/O size of 33553920 bytes (that
> is 512*65535). If you connect a SATA SSD that reports a 512-byte physical
> sector size to this kind of USB-SATA bridge, the kernel will believe that
> 33553920 is a valid optimal I/O size and will attempt to align I/O to this
> boundary - the result is that most of the I/O will not be aligned on 4k,
> causing performance degradation.

SCSI tries to make sure the characteristics reported by the device make
sense and are consistent with each other. If we encounter a device which
reports something incorrect despite passing the sanity checks, we quirk
it.

But I wonder why we're even reading the block limits if this is a USB
device?
Mikulas Patocka Feb. 3, 2025, 9:05 p.m. UTC | #5
On Mon, 3 Feb 2025, Martin K. Petersen wrote:

> 
> Hi Mikulas!
> 
> > The purpose of this patch is to avoid doing I/O that is not aligned on
> > a 4k boundary.
> >
> > The 512-byte value that some SSDs report is just a lie.
> 
> So is 4K, though.

Do you think that SSDs benefit from more alignment than 4K?

Based on my understanding (and some IOPS testing), SSDs have a remapping 
table that maps every 4K block to the physical location. So, 8K (or 
larger) I/Os don't have higher IOPS than 4K I/Os.

2K and smaller I/Os have worse IOPS than 4K I/Os, because they perform a
read-modify-write cycle.

If you have some SSD that has higher IOPS with alignment larger than 4K, 
let me know.

Mikulas
Martin K. Petersen Feb. 4, 2025, 3:30 a.m. UTC | #6
Mikulas,

> Do you think that SSDs benefit from more alignment than 4K?

It really depends on the SSD.

> If you have some SSD that has higher IOPS with alignment larger than
> 4K, let me know.

Absolutely.

There are devices out there reporting characteristics that are not
intuitively correct. I.e. not obvious powers of two due to internal
striping or allocation units. Some devices prefer an alignment of 768KB,
for instance. Also, wide RAID stripes can end up causing peculiar
alignment to be reported.

Instead of assuming that 4KB is the correct value for all devices, I'd
rather address the cases where one particular device is reporting
something wrong.

I'm willing to entertain having a quirk to ignore reported values of
0xffff (have also seen 0xfffe) for USB-ATA bridge devices. But I don't
think we should blindly distrust what devices using other transport
classes happen to report, it's usually only USB bridges that are spewing
nonsense. Which is why we typically avoid trying to make sense of VPD
pages on USB in the first place.
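
Purely as a sketch of what such a quirk check could look like (hypothetical
code, not the existing sd.c logic; the helper name and the transport flag
are made up):

#include <stdbool.h>

/*
 * opt_xfer_blocks: OPTIMAL TRANSFER LENGTH from the Block Limits VPD page,
 * in logical blocks.  is_usb_bridge: assumed flag set by the transport.
 */
static bool opt_xfer_is_bogus(unsigned int opt_xfer_blocks, bool is_usb_bridge)
{
	/* only distrust transports that are known to report nonsense */
	if (!is_usb_bridge)
		return false;

	/* 0xffff / 0xfffe look like "whatever fits in two bytes", not a real hint */
	return opt_xfer_blocks == 0xffff || opt_xfer_blocks == 0xfffe;
}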
Christoph Hellwig Feb. 4, 2025, 5:55 a.m. UTC | #7
On Mon, Feb 03, 2025 at 02:38:25PM +0100, Mikulas Patocka wrote:
> > Not really.  I mean it's always smart to not do tiny unaligned I/O
> > unless you have to.  So we're not just going to cap an exported value
> > to a magic number because of something.
> 
> The purpose of this patch is to avoid doing I/O not aligned on 4k 
> boundary.
> 
> The 512-byte value that some SSDs report is just lie.

That's very strong wording.  NVMe until fairly recently had no way
to report this value at all, so defaulting to the LBA size is not a lie.
Similarly, the device characteristics page is supposed to be skipped
for USB-attached SCSI devices, as it is buggy much more often than not.

You still haven't mentioned which consumer of the value you care about
is affected, but it probably needs to be taught that
opt_io == logical_block_size means that no opt_io size is provided at
all.
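
A rough sketch of what that could mean on the consumer side (hypothetical
userspace code; the function name and the 1 MiB fallback are assumptions,
not anything lvm or cryptsetup actually does):

/* pick the alignment a tool could use for data placement */
static unsigned long pick_data_alignment(unsigned long io_opt,
					 unsigned long logical_block_size)
{
	/* io_opt of 0 or equal to the logical block size: no real hint given */
	if (io_opt == 0 || io_opt == logical_block_size)
		return 1024 * 1024;	/* fall back to a default alignment */

	return io_opt;
}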

> Some USB-SATA bridges report an optimal I/O size of 33553920 bytes (that
> is 512*65535).

Well, that's clearly bogus and we'll need a tweak.  That being said
I was pretty sure we wouldn't even read that value for USB attachments.
Christoph Hellwig Feb. 4, 2025, 5:56 a.m. UTC | #8
On Mon, Feb 03, 2025 at 10:05:27PM +0100, Mikulas Patocka wrote:
> Do you think that SSDs benefit from more alignment than 4K?

SSDs are a very broad category of devices.  But yes, many will.

> Based on my understanding (and some IOPS testing), SSDs have a remapping 
> table that maps every 4K block to the physical location. So, 8K (or 
> larger) I/Os don't have higher IOPS than 4K I/Os.

4K is the traditional value, but it has been increasing at least for
some classes of SSDs for a while.
Christoph Hellwig Feb. 4, 2025, 5:57 a.m. UTC | #9
On Mon, Feb 03, 2025 at 10:30:42PM -0500, Martin K. Petersen wrote:
> I'm willing to entertain having a quirk to ignore reported values of
> 0xffff (have also seen 0xfffe) for USB-ATA bridge devices.

0xffff as the max and a weirdly uneven number even for parity RAID
seems like a good value to ignore by default.  But then again, just
dealing with USB will probably catch the majority of cases.
Martin K. Petersen Feb. 4, 2025, 1:02 p.m. UTC | #10
Christoph,

> 0xffff as the max and a weirdly uneven number even for parity RAID
> seems like a good value to ignore by default.

Quite a few SCSI devices report 0xffff to indicate that the optimal
transfer length is the same as the maximum transfer length which for
low-byte commands is capped at 0xffff. That's where the odd value comes
from in some cases.

> But then again just dealing with USB will probably catch the majority
> of cases.

Yep.
Christoph Hellwig Feb. 4, 2025, 1:50 p.m. UTC | #11
On Tue, Feb 04, 2025 at 08:02:17AM -0500, Martin K. Petersen wrote:
> 
> Christoph,
> 
> > 0xffff as the max and a weirdly uneven number even for parity RAID
> > seems like a good value to ignore by default.
> 
> Quite a few SCSI devices report 0xffff to indicate that the optimal
> transfer length is the same as the maximum transfer length which for
> low-byte commands is capped at 0xffff. That's where the odd value comes
> from in some cases.

Hmm, optimal == max is odd, and I'm pretty sure it's going to confuse
the hell out of whatever non-sophisticated users of the optimal I/O
size that Mikulas is dealing with.
Mikulas Patocka Feb. 4, 2025, 2:56 p.m. UTC | #12
On Mon, 3 Feb 2025, Martin K. Petersen wrote:

> 
> Mikulas,
> 
> > Do you think that SSDs benefit from more alignment than 4K?
> 
> It really depends on the SSD.
> 
> > If you have some SSD that has higher IOPS with alignment larger than
> > 4K, let me know.
> 
> Absolutely.
> 
> There are devices out there reporting characteristics that are not
> intuitively correct. I.e. not obvious powers of two due to internal
> striping or allocation units. Some devices prefer an alignment of 768KB,
> for instance. Also, wide RAID stripes can end up causing peculiar
> alignment to be reported.

Sure, RAID4/5/6 performs better when you write a full stripe. But
I'd like to know whether individual SSDs have higher IOPS when the
request size is increased.

The write tests that I did (some years ago, I don't remember the details)
showed that SSDs have the best write IOPS for 4k requests. For smaller
requests they perform worse (because of read-modify-write); for larger
requests they also have slightly lower IOPS (because they write more data
per request).

If there is some particular SSD that has more write IOPS for 8k requests 
than for 4k requests, I'd like to know about it - out of curiosity.

When I was developing dm-integrity, I just aligned metadata and data on a
4k boundary, regardless of what the underlying device advertised.

> Instead of assuming that 4KB is the correct value for all devices, I'd
> rather address the cases where one particular device is reporting
> something wrong.

The patch that I made rounds the optimal I/O size down to a multiple of 4k;
it doesn't force the optimal I/O size to be 4k.
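
For example (a standalone sketch with assumed values): a 4K-multiple hint
such as the 768 KiB mentioned above passes through unchanged, and only
hints that are not 4k aligned get adjusted:

#include <stdio.h>

#define ROUND_DOWN(x, a) ((x) / (a) * (a))

int main(void)
{
	unsigned int hints[] = { 768u * 1024u, 512u * 65535u };

	for (int i = 0; i < 2; i++)
		printf("%u -> %u\n", hints[i], ROUND_DOWN(hints[i], 4096u));
	/* prints: 786432 -> 786432 and 33553920 -> 33550336 */

	return 0;
}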

> I'm willing to entertain having a quirk to ignore reported values of
> 0xffff (have also seen 0xfffe) for USB-ATA bridge devices. But I don't
> think we should blindly distrust what devices using other transport
> classes happen to report, it's usually only USB bridges that are spewing
> nonsense. Which is why we typically avoid trying to make sense of VPD
> pages on USB in the first place.

Mikulas
Patch

Index: linux-2.6/block/blk-settings.c
===================================================================
--- linux-2.6.orig/block/blk-settings.c	2025-01-03 21:10:56.000000000 +0100
+++ linux-2.6/block/blk-settings.c	2025-01-20 15:59:13.000000000 +0100
@@ -269,8 +269,12 @@  int blk_validate_limits(struct queue_lim
 	 * The optimal I/O size may not be aligned to physical block size
 	 * (because it may be limited by dma engines which have no clue about
 	 * block size of the disks attached to them), so we round it down here.
+	 *
+	 * Note that some SSDs erroneously report physical_block_size 512
+	 * despite the fact that they have remapping table granularity 4K and
+	 * they perform read-modify-write for unaligned requests.
 	 */
-	lim->io_opt = round_down(lim->io_opt, lim->physical_block_size);
+	lim->io_opt = round_down(lim->io_opt, max(4096, lim->physical_block_size));
 
 	/*
 	 * max_hw_sectors has a somewhat weird default for historical reason,