mkfs: use stx_blksize for dev block size by default

Message ID 20250206-min-io-default-blocksize-v1-1-2312e0bb8809@samsung.com (mailing list archive)
State Not Applicable, archived
Series mkfs: use stx_blksize for dev block size by default

Commit Message

Daniel Gomez Feb. 6, 2025, 7 p.m. UTC
From: Daniel Gomez <da.gomez@samsung.com>

In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
devices will now report their preferred minimum I/O size for optimal
performance in the stx_blksize field of the statx data structure. This
change updates the current default 4 KiB block size for all devices
reporting a minimum I/O larger than 4 KiB, opting instead to query for
its advertised minimum I/O value in the statx data struct.

[1]:
https://lore.kernel.org/all/20250204231209.429356-9-mcgrof@kernel.org/

Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
---
Set MIN-IO from statx as the default filesystem fundamental block size.
This ensures that, for devices reporting values within the supported
XFS block size range, we do not incur RMW. If the MIN-IO reported
value is lower than the current default of 4 KiB, then 4 KiB will be
used instead.
---
 mkfs/xfs_mkfs.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)


---
base-commit: 90d6da68ee54e6d4ef99eca4a82cac6036a34b00
change-id: 20250206-min-io-default-blocksize-13334f54899e

Best regards,

Comments

Darrick J. Wong Feb. 6, 2025, 10:27 p.m. UTC | #1
On Thu, Feb 06, 2025 at 07:00:55PM +0000, da.gomez@kernel.org wrote:
> From: Daniel Gomez <da.gomez@samsung.com>
> 
> In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
> devices will now report their preferred minimum I/O size for optimal
> performance in the stx_blksize field of the statx data structure. This
> change updates the current default 4 KiB block size for all devices
> reporting a minimum I/O larger than 4 KiB, opting instead to query for
> its advertised minimum I/O value in the statx data struct.
> 
> [1]:
> https://lore.kernel.org/all/20250204231209.429356-9-mcgrof@kernel.org/

This isn't even upstream yet...

> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> ---
> Set MIN-IO from statx as the default filesystem fundamental block size.
> This ensures that, for devices reporting values within the supported
> XFS block size range, we do not incur RMW. If the MIN-IO reported
> value is lower than the current default of 4 KiB, then 4 KiB will be
> used instead.

I don't think this is a good idea -- assuming you mean the same MIN-IO
as what lsblk puts out:

NAME                     MIN-IO
sda                         512
├─sda1                      512
├─sda2                      512
│ └─node0.boot              512
├─sda3                      512
│ └─node0.swap              512
└─sda4                      512
  └─node0.lvm               512
    └─node0-root            512
sdb                        4096
└─sdb1                     4096
nvme1n1                     512
└─md0                    524288
  └─node0.raid           524288
    └─node0_raid-storage 524288
nvme0n1                     512
└─md0                    524288
  └─node0.raid           524288
    └─node0_raid-storage 524288
nvme2n1                     512
└─md0                    524288
  └─node0.raid           524288
    └─node0_raid-storage 524288
nvme3n1                     512
└─md0                    524288
  └─node0.raid           524288
    └─node0_raid-storage 524288

Now you've decreased the default blocksize to 512 on sda, and md0 gets
an impossible 512k blocksize.  Also, disrupting the default 4k blocksize
will introduce portability problems with distros that aren't yet
shipping 6.12.

--D

> ---
>  mkfs/xfs_mkfs.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index bbd0dbb6c80ab63ebf9edbe0a9a304149770f89d..e17389622682bc23f9b20c207db7e22181e2fe6f 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -2178,6 +2178,26 @@ _("block size %d cannot be smaller than sector size %d\n"),
>  	}
>  }
>  
> +void
> +get_dev_blocksize(
> +	struct cli_params	*cli,
> +	struct mkfs_default_params *dft)
> +{
> +	struct statx stx;
> +	int ret;
> +
> +	if (!cli->xi->data.name)
> +		return;
> +
> +	ret = statx(AT_FDCWD, cli->xi->data.name,
> +		    AT_STATX_DONT_SYNC | AT_NO_AUTOMOUNT,
> +		    STATX_DIOALIGN | STATX_TYPE | STATX_MODE | STATX_INO, &stx);
> +	if (!ret)
> +		dft->blocksize =
> +			min(max(1 << XFS_DFL_BLOCKSIZE_LOG, stx.stx_blksize),
> +			    XFS_MAX_BLOCKSIZE);
> +}
> +
>  static void
>  validate_blocksize(
>  	struct mkfs_params	*cfg,
> @@ -2189,6 +2209,7 @@ validate_blocksize(
>  	 * For RAID4/5/6 we want to align sector size and block size,
>  	 * so we need to start with the device geometry extraction too.
>  	 */
> +	get_dev_blocksize(cli, dft);
>  	if (!cli->blocksize)
>  		cfg->blocksize = dft->blocksize;
>  	else
> 
> ---
> base-commit: 90d6da68ee54e6d4ef99eca4a82cac6036a34b00
> change-id: 20250206-min-io-default-blocksize-13334f54899e
> 
> Best regards,
> -- 
> Daniel Gomez <da.gomez@samsung.com>
> 
>
Luis Chamberlain Feb. 6, 2025, 10:50 p.m. UTC | #2
On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> Now you've decreased the default blocksize to 512 on sda, and md0 gets
> an impossible 512k blocksize.  Also, disrupting the default 4k blocksize
> will introduce portability problems with distros that aren't yet
> shipping 6.12.

Our default should be 4k, and to address the latter we should sanity
check and use an upper limit of what XFS supports, 64k.

Thoughts?

 Luis
Darrick J. Wong Feb. 6, 2025, 11:07 p.m. UTC | #3
On Thu, Feb 06, 2025 at 02:50:28PM -0800, Luis Chamberlain wrote:
> On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > Now you've decreased the default blocksize to 512 on sda, and md0 gets
> > an impossible 512k blocksize.  Also, disrupting the default 4k blocksize
> > will introduce portability problems with distros that aren't yet
> > shipping 6.12.
> 
> Our default should be 4k, and to address the latter we should sanity
> check and use an upper limit of what XFS supports, 64k.

I don't think it's a good idea to boost the default fsblock size beyond
4k until we get further into the era where the major distros are
shipping 6.12 kernels.  I wouldn't want to deal with people accidentally
ending up with an 8k fsblock filesystem that they can't mount on fairly
new things like RHEL9/Debian12/etc.

--D

> Thoughts?
> 
>  Luis
>
Christoph Hellwig Feb. 7, 2025, 4:30 a.m. UTC | #4
On Thu, Feb 06, 2025 at 07:00:55PM +0000, da.gomez@kernel.org wrote:
> From: Daniel Gomez <da.gomez@samsung.com>
> 
> In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
> devices will now report their preferred minimum I/O size for optimal
> performance in the stx_blksize field of the statx data structure. This
> change updates the current default 4 KiB block size for all devices
> reporting a minimum I/O larger than 4 KiB, opting instead to query for
> its advertised minimum I/O value in the statx data struct.

UUuh, no.  Larger block sizes have their use cases, but this will
regress performance for a lot of (most?) common setups.  A lot of
devices report fairly high values there, but say increasing the
directory and bmap btree block size unconditionally using the current
algorithms will dramatically increase write amplification.  Similarly
for small buffered writes.
Daniel Gomez Feb. 7, 2025, 9:12 a.m. UTC | #5
On Thu, Feb 06, 2025 at 02:50:28PM -0800, Luis Chamberlain wrote:
> On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > Now you've decreased the default blocksize to 512 on sda, and md0 gets
> > an impossible 512k blocksize.  Also, disrupting the default 4k blocksize
> > will introduce portability problems with distros that aren't yet
> > shipping 6.12.
> 
> Our default should be 4k, and to address the latter we should sanity
> check and use an upper limit of what XFS supports, 64k.

To clarify, the patch addresses the cases described above. It sets mkfs.xfs's
default block size to the device’s reported minimum I/O size (via stx_blksize),
clamping the value between 4k and 64k: if the reported size is less than 4k, 4k
is used, and if it exceeds 64k (XFS_MAX_BLOCKSIZE), 64k is applied.

> 
> Thoughts?
> 
>  Luis
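
A minimal standalone sketch of the clamp Daniel describes, assuming the
xfsprogs constants XFS_DFL_BLOCKSIZE_LOG = 12 (4 KiB) and
XFS_MAX_BLOCKSIZE = 65536 match current sources; the harness below is
illustrative only, not part of the posted patch:

	#include <stdio.h>

	#define XFS_DFL_BLOCKSIZE_LOG	12	/* default log2 block size: 4096 */
	#define XFS_MAX_BLOCKSIZE	65536	/* largest block size XFS supports */

	#define min(a, b)	((a) < (b) ? (a) : (b))
	#define max(a, b)	((a) > (b) ? (a) : (b))

	/* Clamp a device-reported stx_blksize into [4096, 65536]. */
	static unsigned int clamp_blocksize(unsigned int stx_blksize)
	{
		return min(max(1U << XFS_DFL_BLOCKSIZE_LOG, stx_blksize),
			   XFS_MAX_BLOCKSIZE);
	}

	int main(void)
	{
		unsigned int samples[] = { 512, 4096, 16384, 524288 };

		for (int i = 0; i < 4; i++)
			printf("%6u -> %u\n", samples[i],
			       clamp_blocksize(samples[i]));
		return 0;
	}

With those bounds, Darrick's 512-byte sda resolves to 4096 and the
524288-byte md0 to 65536, rather than the out-of-range values quoted above.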
Daniel Gomez Feb. 7, 2025, 9:39 a.m. UTC | #6
On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 06, 2025 at 07:00:55PM +0000, da.gomez@kernel.org wrote:
> > From: Daniel Gomez <da.gomez@samsung.com>
> > 
> > In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
> > devices will now report their preferred minimum I/O size for optimal
> > performance in the stx_blksize field of the statx data structure. This
> > change updates the current default 4 KiB block size for all devices
> > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > its advertised minimum I/O value in the statx data struct.
> > 
> > [1]:
> > https://lore.kernel.org/all/20250204231209.429356-9-mcgrof@kernel.org/
> 
> This isn't even upstream yet...
> 
> > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > ---
> > Set MIN-IO from statx as the default filesystem fundamental block size.
> > This ensures that, for devices reporting values within the supported
> > XFS block size range, we do not incur RMW. If the MIN-IO reported
> > value is lower than the current default of 4 KiB, then 4 KiB will be
> > used instead.
> 
> I don't think this is a good idea -- assuming you mean the same MIN-IO
> as what lsblk puts out:

This is just about matching the values in code and documentation across all
layers (to guarantee writes do not incur RMW when possible and supported by
the fs): minimum_io_size (block layer) -> stx_blksize (statx) -> lsblk MIN-IO
(minimum I/O size) -> Filesystem fundamental block size (mkfs.xfs -b size).

* MIN-IO is the minimum I/O size in lsblk [1] which should be the queue-sysfs
minimum_io_size [2] [3] ("This is the smallest preferred IO size reported by the
device").

* From statx [4] manual (and kernel statx data struct description), stx_blksize is
'The "preferred" block size for efficient filesystem I/O (Writing to a file in
smaller chunks may cause an inefficient read-modify-rewrite.)'

[1] https://github.com/util-linux/util-linux/blob/master/misc-utils/lsblk.c#L199
[2] minimum_io_size: https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt
[3] https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-block

What:		/sys/block/<disk>/queue/minimum_io_size
Date:		April 2009
Contact:	Martin K. Petersen <martin.petersen@oracle.com>
Description:
		[RO] Storage devices may report a granularity or preferred
		minimum I/O size which is the smallest request the device can
		perform without incurring a performance penalty.  For disk
		drives this is often the physical block size.  For RAID arrays
		it is often the stripe chunk size.  A properly aligned multiple
		of minimum_io_size is the preferred request size for workloads
		where a high number of I/O operations is desired.

[4] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man/man2/statx.2?id=master#n369
kernel:	__u32	stx_blksize;	/* Preferred general I/O size [uncond] */

> nvme1n1                     512
> └─md0                    524288
>   └─node0.raid           524288
>     └─node0_raid-storage 524288

Is the MIN-IO correctly reported in RAID arrays here? I guess it should match
the stripe chunk size as per description above?
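
As a sketch of reading that value from userspace, a minimal statx(2) call
against a block device (the device path is just an example; glibc 2.28+
provides the wrapper, and stx_blksize is filled in unconditionally whatever
mask is requested, per the [uncond] annotation above):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct statx stx;

		/*
		 * Same lookup flags as the posted mkfs.xfs patch; the mask
		 * differs, but stx_blksize is returned regardless.
		 */
		if (statx(AT_FDCWD, "/dev/sda",
			  AT_STATX_DONT_SYNC | AT_NO_AUTOMOUNT,
			  STATX_BASIC_STATS, &stx) != 0) {
			perror("statx");
			return 1;
		}
		printf("stx_blksize: %u\n", stx.stx_blksize);
		return 0;
	}

On a kernel with [1] applied this prints the device's MIN-IO; on older
kernels it prints 4096 for block devices, as Darrick confirms later in the
thread.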
Daniel Gomez Feb. 7, 2025, 10:04 a.m. UTC | #7
On Thu, Feb 06, 2025 at 08:30:22PM -0800, Christoph Hellwig wrote:
> On Thu, Feb 06, 2025 at 07:00:55PM +0000, da.gomez@kernel.org wrote:
> > From: Daniel Gomez <da.gomez@samsung.com>
> > 
> > In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
> > devices will now report their preferred minimum I/O size for optimal
> > performance in the stx_blksize field of the statx data structure. This
> > change updates the current default 4 KiB block size for all devices
> > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > its advertised minimum I/O value in the statx data struct.
> 
> UUuh, no.  Larger block sizes have their use cases, but this will
> regress performance for a lot of (most?) common setups.  A lot of
> devices report fairly high values there, but say increasing the

Are these devices reporting the correct value? As I mentioned in my discussion
with Darrick, matching the minimum_io_size with the "fs fundamental blocksize"
actually allows us to avoid RMW operations (when using the default path in mkfs.xfs
and the value reported is within boundaries).

> directory and bmap btree block size unconditionally using the current
> algorithms will dramatically increase write amplification.  Similarly

I agree, but it seems to be a consequence of using such a large minimum_io_size.

> for small buffered writes.
> 

Exactly. Even though write amplification happens with a 512-byte minimum_io_size
and a 4k default block size, it doesn't incur a performance penalty. But we will
incur one when minimum_io_size exceeds 4k. So, this solves the issue but comes
at the cost of write amplification.
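
To put rough numbers on that trade-off: in the worst case, a 512-byte write
into a 4k-block filesystem rewrites 4 KiB (8x amplification), while the same
write into a filesystem whose block size followed a 64 KiB minimum_io_size
would rewrite 64 KiB (128x).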
Luis Chamberlain Feb. 7, 2025, 7:16 p.m. UTC | #8
On Fri, Feb 07, 2025 at 10:12:07AM +0100, Daniel Gomez wrote:
> On Thu, Feb 06, 2025 at 02:50:28PM -0800, Luis Chamberlain wrote:
> > On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > > Now you've decreased the default blocksize to 512 on sda, and md0 gets
> > > an impossible 512k blocksize.  Also, disrupting the default 4k blocksize
> > > will introduce portability problems with distros that aren't yet
> > > shipping 6.12.
> > 
> > Our default should be 4k, and to address the latter we should sanity
> > check and use an upper limit of what XFS supports, 64k.
> 
> To clarify, the patch addresses the cases described above. It sets mkfs.xfs's
> default block size to the device’s reported minimum I/O size (via stx_blksize),
> clamping the value between 4k and 64k: if the reported size is less than 4k, 4k
> is used, and if it exceeds 64k (XFS_MAX_BLOCKSIZE), 64k is applied.

Ah, neat.

Then the only other concern expressed by Darrick is using this for
distros not yet shipping v6.12. Let me address that next in reply
to another part of this thread.

  Luis
Luis Chamberlain Feb. 7, 2025, 7:26 p.m. UTC | #9
On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> NAME                     MIN-IO
> sda                         512
> ├─sda1                      512
> ├─sda2                      512
> │ └─node0.boot              512
> ├─sda3                      512
> │ └─node0.swap              512
> └─sda4                      512
>   └─node0.lvm               512
>     └─node0-root            512
> sdb                        4096
> └─sdb1                     4096
> nvme1n1                     512
> └─md0                    524288
>   └─node0.raid           524288
>     └─node0_raid-storage 524288
> nvme0n1                     512
> └─md0                    524288
>   └─node0.raid           524288
>     └─node0_raid-storage 524288
> nvme2n1                     512
> └─md0                    524288
>   └─node0.raid           524288
>     └─node0_raid-storage 524288
> nvme3n1                     512
> └─md0                    524288
>   └─node0.raid           524288
>     └─node0_raid-storage 524288

Can you try this for each of these:

stat --print=%o 

I believe that without that new patch I posted [0] you will get 4 KiB
here. Then the blocksize passed won't be the min-io until that patch
gets applied.

The above is:

statx(AT_FDCWD, "/dev/nvme0n1", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, 0,
{stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0,
stx_mode=S_IFBLK|0660, stx_size=0, ...}) = 0

So if we use this instead at mkfs, then even older kernels will get 4
KiB, and if distros want to automatically lift the value at mkfs, they
could cherry pick that simple patch.

[0] https://lkml.kernel.org/r/20250204231209.429356-9-mcgrof@kernel.org

  Luis
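
Concretely, for one of the md0 members listed above the check Luis suggests
would look like this (output as Darrick confirms below for a 6.13.0 kernel
without [0] applied):

	$ stat --print=%o /dev/md0
	4096

and 524288 would be expected once the statx change in [0] is merged.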
Darrick J. Wong Feb. 7, 2025, 7:31 p.m. UTC | #10
On Fri, Feb 07, 2025 at 11:26:20AM -0800, Luis Chamberlain wrote:
> On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > NAME                     MIN-IO
> > sda                         512
> > ├─sda1                      512
> > ├─sda2                      512
> > │ └─node0.boot              512
> > ├─sda3                      512
> > │ └─node0.swap              512
> > └─sda4                      512
> >   └─node0.lvm               512
> >     └─node0-root            512
> > sdb                        4096
> > └─sdb1                     4096
> > nvme1n1                     512
> > └─md0                    524288
> >   └─node0.raid           524288
> >     └─node0_raid-storage 524288
> > nvme0n1                     512
> > └─md0                    524288
> >   └─node0.raid           524288
> >     └─node0_raid-storage 524288
> > nvme2n1                     512
> > └─md0                    524288
> >   └─node0.raid           524288
> >     └─node0_raid-storage 524288
> > nvme3n1                     512
> > └─md0                    524288
> >   └─node0.raid           524288
> >     └─node0_raid-storage 524288
> 
> Can you try this for each of these:
> 
> stat --print=%o 
> 
> I believe that without that new patch I posted [0] you will get 4 KiB
> here. Then the blocksize passed won't be the min-io until that patch
> gets applied.

Yes, that returns 4K on 6.13.0 for every device in the list.  I think
you're saying that stat will start returning 512K for the blocksize if
your patch is merged?

> The above is:
> 
> statx(AT_FDCWD, "/dev/nvme0n1", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, 0,
> {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0,
> stx_mode=S_IFBLK|0660, stx_size=0, ...}) = 0
> 
> So if we use this instead at mkfs, then even older kernels will get 4
> KiB, and if distros want to automatically lift the value at mkfs, they
> could cherry pick that simple patch.

How well does that work if the gold master image creator machine has a
new kernel and a RAID setup, but the kernel written into the gold master
image is something older than a 6.12 kernel?

--D

> 
> [0] https://lkml.kernel.org/r/20250204231209.429356-9-mcgrof@kernel.org
> 
>   Luis
Luis Chamberlain Feb. 7, 2025, 7:44 p.m. UTC | #11
On Fri, Feb 07, 2025 at 11:31:17AM -0800, Darrick J. Wong wrote:
> On Fri, Feb 07, 2025 at 11:26:20AM -0800, Luis Chamberlain wrote:
> > On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > > NAME                     MIN-IO
> > > sda                         512
> > > ├─sda1                      512
> > > ├─sda2                      512
> > > │ └─node0.boot              512
> > > ├─sda3                      512
> > > │ └─node0.swap              512
> > > └─sda4                      512
> > >   └─node0.lvm               512
> > >     └─node0-root            512
> > > sdb                        4096
> > > └─sdb1                     4096
> > > nvme1n1                     512
> > > └─md0                    524288
> > >   └─node0.raid           524288
> > >     └─node0_raid-storage 524288
> > > nvme0n1                     512
> > > └─md0                    524288
> > >   └─node0.raid           524288
> > >     └─node0_raid-storage 524288
> > > nvme2n1                     512
> > > └─md0                    524288
> > >   └─node0.raid           524288
> > >     └─node0_raid-storage 524288
> > > nvme3n1                     512
> > > └─md0                    524288
> > >   └─node0.raid           524288
> > >     └─node0_raid-storage 524288
> > 
> > Can you try this for each of these:
> > 
> > stat --print=%o 
> > 
> > I believe that without that new patch I posted [0] you will get 4 KiB
> > here. Then the blocksize passed won't be the min-io until that patch
> > gets applied.
> 
> Yes, that returns 4K on 6.13.0 for every device in the list.  I think
> you're saying that stat will start returning 512K for the blocksize if
> your patch is merged?

Yes, as that *is* min-io for block devices.

> > The above is:
> > 
> > statx(AT_FDCWD, "/dev/nvme0n1", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, 0,
> > {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0,
> > stx_mode=S_IFBLK|0660, stx_size=0, ...}) = 0
> > 
> > So if we use this instead at mkfs, then even older kernels will get 4
> > KiB, and if distros want to automatically lift the value at mkfs, they
> > could cherry pick that simple patch.
> 
> How well does that work if the gold master image creator machine has a
> new kernel and a RAID setup, but the kernel written into the gold master
> image is something older than a 6.12 kernel?

I think you're asking about an image creator for an old release, where that
old release uses an old kernel. Isn't that the same concern as what if
the same new kernel creates a filesystem on an image for an old release?
If so, the risk is that the new kernel defaults take precedence.

And the only thing I can think of for this case is an option to ignore this
heuristic for blocksize, which could be used in configs to override
defaults for older target kernels.

  Luis
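
Note that the posted patch only applies the heuristic when no block size is
given on the command line (the "if (!cli->blocksize)" branch in the diff), so
an image build targeting older kernels can already pin the historical default
explicitly, e.g. with an illustrative device name:

	mkfs.xfs -b size=4096 /dev/sdX

The open question raised here is only what the unspecified default should do.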
Christoph Hellwig Feb. 13, 2025, 4:10 a.m. UTC | #12
On Fri, Feb 07, 2025 at 11:04:06AM +0100, Daniel Gomez wrote:
> > > performance in the stx_blksize field of the statx data structure. This
> > > change updates the current default 4 KiB block size for all devices
> > > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > > its advertised minimum I/O value in the statx data struct.
> > 
> > UUuh, no.  Larger block sizes have their use cases, but this will
> > regress performance for a lot of (most?) common setups.  A lot of
> > devices report fairly high values there, but say increasing the
> 
> Are these devices reporting the correct value?

Who defines what "correct" means to start with?

> As I mentioned in my
> discussion with Darrick, matching the minimum_io_size with the "fs
> fundamental blocksize" actually allows to avoid RMW operations (when
> using the default path in mkfs.xfs and the value reported is within
> boundaries).

At least for buffered I/O it does not remove RMW operations at all,
it just moves them up.

And for plenty of devices the min_io size might be set to larger than
LBA size, but the device is still reasonably efficient at handling
smaller I/O (e.g. because it has power loss protection and stages
writes in powerfail protected memory).
Daniel Gomez Feb. 13, 2025, 1:26 p.m. UTC | #13
On Wed, Feb 12, 2025 at 08:10:04PM -0800, Christoph Hellwig wrote:
> On Fri, Feb 07, 2025 at 11:04:06AM +0100, Daniel Gomez wrote:
> > > > performance in the stx_blksize field of the statx data structure. This
> > > > change updates the current default 4 KiB block size for all devices
> > > > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > > > its advertised minimum I/O value in the statx data struct.
> > > 
> > > UUuh, no.  Larger block sizes have their use cases, but this will
> > > regress performance for a lot of (most?) common setups.  A lot of
> > > devices report fairly high values there, but say increasing the
> > 
> > Are these devices reporting the correct value?
> 
> Who defines what "correct" means to start with?

That's a good question. The stx_blksize field description indicates the value
should be referring to the fs block size that avoids RMW.

* From statx manual:

	DESCRIPTION
	           struct statx {
	               __u32 stx_blksize;     /* Block size for filesystem I/O */
	{...}

	The returned information
	       stx_blksize
	              The "preferred" block size for efficient filesystem I/O.  (Writing
	              to a file in smaller chunks may cause an inefficient read-modify-rewrite.)

* From include/uapi/linux/stat.h:

	struct statx {
	
		/* Preferred general I/O size [uncond] */
		__u32	stx_blksize;

So I think, if devices report high values in stx_blksize, it is either because
smaller values than the reported one cause RMW or they are incorrectly reporting
a value in the wrong statx field.

The commit I refer to in the commit message maps the minimum_io_size reported
by the block layer to stx_blksize.

* Documentation/ABI/stable/sysfs-block

	What:		/sys/block/<disk>/queue/minimum_io_size
	Date:		April 2009
	Contact:	Martin K. Petersen <martin.petersen@oracle.com>
	Description:
			[RO] Storage devices may report a granularity or preferred
			minimum I/O size which is the smallest request the device can
			perform without incurring a performance penalty.  For disk
			drives this is often the physical block size.  For RAID arrays
			it is often the stripe chunk size.  A properly aligned multiple
			of minimum_io_size is the preferred request size for workloads
			where a high number of I/O operations is desired.

I guess the correct approach is to ensure the mapping only occurs for "disk
drives" and we avoid mapping RAID strip chunk size to stx_blksize.

In addition, we also have the optimal I/O size:

What:		/sys/block/<disk>/queue/optimal_io_size
Date:		April 2009
Contact:	Martin K. Petersen <martin.petersen@oracle.com>
Description:
		[RO] Storage devices may report an optimal I/O size, which is
		the device's preferred unit for sustained I/O.  This is rarely
		reported for disk drives.  For RAID arrays it is usually the
		stripe width or the internal track size.  A properly aligned
		multiple of optimal_io_size is the preferred request size for
		workloads where sustained throughput is desired.  If no optimal
		I/O size is reported this file contains 0.

Optimal I/O is not preferred I/O (minimum_io_size). However, I think xfs mount
option mixes both values:

* From Documentation/admin-guide/xfs.rst

	Mount Options
	=============
	
	  largeio or nolargeio (default)
		If ``nolargeio`` is specified, the optimal I/O reported in
		``st_blksize`` by **stat(2)** will be as small as possible to allow
		user applications to avoid inefficient read/modify/write
		I/O.  This is typically the page size of the machine, as
		this is the granularity of the page cache.

But it must be referring to the minimum_io_size as this is the limit "to
avoid inefficient read/modify/write I/O" as per sysfs-block minimum_io_size
description, right?

> 
> > As I mentioned in my
> > discussion with Darrick, matching the minimum_io_size with the "fs
> > fundamental blocksize" actually allows to avoid RMW operations (when
> > using the default path in mkfs.xfs and the value reported is within
> > boundaries).
> 
> At least for buffered I/O it does not remove RMW operations at all,
> it just moves them up.

But only if the RMW boundary is smaller than the fs bs. If they are matched,
it should not move them up.

> 
> And for plenty of devices the min_io size might be set to larger than
> LBA size,

Yes, I agree.

> but the device is still reasonably efficient at handling
> smaller I/O (e.g. because it has power loss protection and stages

That is device specific and covered in the stx_blksize and minimum_io_size
descriptions.

> writes in powerfail protected memory).

Assuming min_io is the MIN-IO field that lsblk reports, this is the block
layer's minimum_io_size.

* From util-linux code:

	case COL_MINIO:
		ul_path_read_string(dev->sysfs, &str, "queue/minimum_io_size");
		if (rawdata)
			str2u64(str, rawdata);
		break;

I think this might be a bit trickier. The value should report the disk
"preferred I/O minimum granularity" and "the stripe chunk size" for RAID arrays.
Christoph Hellwig Feb. 18, 2025, 8:30 a.m. UTC | #14
On Thu, Feb 13, 2025 at 02:26:37PM +0100, Daniel Gomez wrote:
> That's a good question. The stx_blksize field description indicates the value
> should be referring to the fs block size that avoids RMW.

One that is optimal; the RMW is an example.  This is what POSIX says:

blksize_t st_blksize    A file system-specific preferred I/O block size
                        for this object. In some file system types, this
                        may vary from file to file.

> So I think, if devices report high values in stx_blksize, it is either because
> smaller values than the reported one cause RMW or they are incorrectly reporting
> a value in the wrong statx field.

Or it is just more efficient.  E.g. on NFS or XFS you'll get fairly
big values.

> The commit I refer to in the commit message maps the minimum_io_size reported
> by the block layer to stx_blksize.

Yes, and that was my question again - minimum_io_size for block
devices is specified even more handwavily.

In other words, I'm really sceptical this is going to be a net win.
To win me over you'll need to show an improvement over a wide variety
of devices and workloads.

Patch

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index bbd0dbb6c80ab63ebf9edbe0a9a304149770f89d..e17389622682bc23f9b20c207db7e22181e2fe6f 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -2178,6 +2178,26 @@  _("block size %d cannot be smaller than sector size %d\n"),
 	}
 }
 
+void
+get_dev_blocksize(
+	struct cli_params	*cli,
+	struct mkfs_default_params *dft)
+{
+	struct statx stx;
+	int ret;
+
+	if (!cli->xi->data.name)
+		return;
+
+	ret = statx(AT_FDCWD, cli->xi->data.name,
+		    AT_STATX_DONT_SYNC | AT_NO_AUTOMOUNT,
+		    STATX_DIOALIGN | STATX_TYPE | STATX_MODE | STATX_INO, &stx);
+	if (!ret)
+		dft->blocksize =
+			min(max(1 << XFS_DFL_BLOCKSIZE_LOG, stx.stx_blksize),
+			    XFS_MAX_BLOCKSIZE);
+}
+
 static void
 validate_blocksize(
 	struct mkfs_params	*cfg,
@@ -2189,6 +2209,7 @@  validate_blocksize(
 	 * For RAID4/5/6 we want to align sector size and block size,
 	 * so we need to start with the device geometry extraction too.
 	 */
+	get_dev_blocksize(cli, dft);
 	if (!cli->blocksize)
 		cfg->blocksize = dft->blocksize;
 	else