mbox series

[GIT,PULL] vfs dio

Message ID 20250118-vfs-dio-3ca805947186@brauner (mailing list archive)
State New
Headers show
Series [GIT,PULL] vfs dio | expand

Pull-request

git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.14-rc1.statx.dio

Message

Christian Brauner Jan. 18, 2025, 1:09 p.m. UTC
Hey Linus,

/* Summary */

File systems that write out of place usually require different alignment
for direct I/O writes than what they can do for reads.

Add a separate dio read align field to statx, as many out of place write
file systems can easily do reads aligned to the device sector size, but
require bigger alignment for writes.

This is usually papered over by falling back to buffered I/O for smaller
writes and doing read-modify-write cycles, but performance for this
sucks, so applications benefit from knowing the actual write alignment.

/* Testing */

gcc version 14.2.0 (Debian 14.2.0-6)
Debian clang version 16.0.6 (27+b1)

No build failures or warnings were observed.

/* Conflicts */

Merge conflicts with mainline
=============================

No known conflicts.

Merge conflicts with other trees
================================

No known conflicts.

The following changes since commit 40384c840ea1944d7c5a392e8975ed088ecf0b37:

  Linux 6.13-rc1 (2024-12-01 14:28:56 -0800)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.14-rc1.statx.dio

for you to fetch changes up to cf40ebb2ed9fde24195260637e00e47a6f0e7c15:

  Merge patch series "add STATX_DIO_READ_ALIGN v3" (2025-01-09 16:23:26 +0100)

Please consider pulling these changes from the signed vfs-6.14-rc1.statx.dio tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.14-rc1.statx.dio

----------------------------------------------------------------
Christian Brauner (1):
      Merge patch series "add STATX_DIO_READ_ALIGN v3"

Christoph Hellwig (5):
      fs: reformat the statx definition
      fs: add STATX_DIO_READ_ALIGN
      xfs: cleanup xfs_vn_getattr
      xfs: report the correct read/write dio alignment for reflinked inodes
      xfs: report larger dio alignment for COW inodes

 fs/stat.c                 |  1 +
 fs/xfs/xfs_ioctl.c        | 11 +++++-
 fs/xfs/xfs_iops.c         | 62 +++++++++++++++++------------
 include/linux/stat.h      |  1 +
 include/uapi/linux/stat.h | 99 +++++++++++++++++++++++++++++++++++------------
 5 files changed, 125 insertions(+), 49 deletions(-)

Comments

Linus Torvalds Jan. 20, 2025, 7:24 p.m. UTC | #1
On Sat, 18 Jan 2025 at 05:09, Christian Brauner <brauner@kernel.org> wrote:
>
> Add a separate dio read align field to statx, as many out of place write
> file systems can easily do reads aligned to the device sector size, but
> require bigger alignment for writes.

I've pulled this, but it needs some fixing.

You added the 'dio_read_offset_align' field to 'struct kstat', and
that structure is *critical*, because it's used even for the real
'stat()' calls that people actually use (as opposed to the statx side
that is seldom a real issue).

And that field was added in a way that causes the struct to grow due
to alignment issues.  For no good reason, because there were existing
holes in there.

So please just fix it.

I despise the whole statx thing exactly because it has (approximately)
five specialized users, while slowing down regular stat/fstat that is
used widely absolutely *evertwhere*.

Of course, judging by past performance, I wouldn't be surprised if
glibc has screwed the pooch, and decided to use 'statx()' to implement
stat, together with extra pointless user space overhead to convert one
into the other. Because that's the glibc way (ie the whole "turn
fstat() into the much slower fstatat() call, just because").

So here's the deal: 'statx()' is *not* an "improved stat". It's an
actively worse stat() for people who need very unusual and specialized
information, and it makes everything else worse.

              Linus
Christian Brauner Jan. 20, 2025, 7:38 p.m. UTC | #2
On Mon, Jan 20, 2025 at 11:24:56AM -0800, Linus Torvalds wrote:
> On Sat, 18 Jan 2025 at 05:09, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Add a separate dio read align field to statx, as many out of place write
> > file systems can easily do reads aligned to the device sector size, but
> > require bigger alignment for writes.
> 
> I've pulled this, but it needs some fixing.
> 
> You added the 'dio_read_offset_align' field to 'struct kstat', and
> that structure is *critical*, because it's used even for the real
> 'stat()' calls that people actually use (as opposed to the statx side
> that is seldom a real issue).
> 
> And that field was added in a way that causes the struct to grow due
> to alignment issues.  For no good reason, because there were existing
> holes in there.
> 
> So please just fix it.

Right, sorry I should've noticed that during review.

> I despise the whole statx thing exactly because it has (approximately)
> five specialized users, while slowing down regular stat/fstat that is
> used widely absolutely *evertwhere*.
> 
> Of course, judging by past performance, I wouldn't be surprised if
> glibc has screwed the pooch, and decided to use 'statx()' to implement
> stat, together with extra pointless user space overhead to convert one
> into the other. Because that's the glibc way (ie the whole "turn
> fstat() into the much slower fstatat() call, just because").
> 
> So here's the deal: 'statx()' is *not* an "improved stat". It's an
> actively worse stat() for people who need very unusual and specialized
> information, and it makes everything else worse.

Yes, I'm well aware that you dislike statx(). :) It is heavily used
nowadays though because there's a few additional bits in there that
don't require calling into filesystems but are heavily used.

I want to reiterate that if I'd been involved in statx() I would've done
it as an extensible struct. So v1 of the struct would just be exactly
what stat() was.

This way neither copy-in nor copy-out would have done any unnecessary
work and perf would've been the same. Only when userspace actually
needed additional information it would have to pass in a larger struct
and pay the price for copy-in and copy-out.

I'll get you a new PR soon!
Linus Torvalds Jan. 20, 2025, 7:39 p.m. UTC | #3
On Mon, 20 Jan 2025 at 11:24, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And that field was added in a way that causes the struct to grow due
> to alignment issues.  For no good reason, because there were existing
> holes in there.
>
> So please just fix it.

Side note: for extra bonus points, just make those fields be 'u8' or
even smaller, knowing that the values

 (a) have to be powers of two anyway because nobody can deal with anything else

 (b) the powers are already limited to 31 on the top end (by the
existing use of 'u32') and 9 on the low end (by SECTOR_SIZE)

And yeah, maybe somebody wants to have a "no alignment possible"
value, so that would make it a total of 24 different values.

You really don't need 32 bits to encode 24 values.

So wasting three 32-bit words for this all - and an extra one for the
bad alignment - is just a crime, and only makes kstat bigger on stack.

It also just adds extra code, since most *sources* have the size in
bits anyway: it typically comes from things i_blocksize(), which just
does

        return (1 << node->i_blkbits);

and it would be better to just give the bit number.

(XFS uses a special XFS_FSB_TO_B() thing to turn its bit-numbers to
byte sizes - I'm not claiming that everybody uses i_blkbits, I'm just
claiming that all sane users already are in terms of bits)

               Linus
Linus Torvalds Jan. 20, 2025, 7:50 p.m. UTC | #4
On Mon, 20 Jan 2025 at 11:38, Christian Brauner <brauner@kernel.org> wrote:
>
> It is heavily used nowadays though because there's a few
> additional bits in there that don't require calling into filesystems
> but are heavily used.

By "heavily used" you mean "there is probably a 1:1000 ratio between
statx:stat in reality".

Those "few additional bits" are all very specialized. No normal
application cares about things like "mount ID" or subvolume data. I
can't imagine what other fields you think are so important.

          Linus
pr-tracker-bot@kernel.org Jan. 20, 2025, 7:51 p.m. UTC | #5
The pull request you sent on Sat, 18 Jan 2025 14:09:26 +0100:

> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.14-rc1.statx.dio

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/47c9f2b3c838a33552dbd41db6c5d93377842fcd

Thank you!
Christian Brauner Jan. 22, 2025, 1:50 p.m. UTC | #6
On Mon, Jan 20, 2025 at 11:50:09AM -0800, Linus Torvalds wrote:
> On Mon, 20 Jan 2025 at 11:38, Christian Brauner <brauner@kernel.org> wrote:
> >
> > It is heavily used nowadays though because there's a few
> > additional bits in there that don't require calling into filesystems
> > but are heavily used.
> 
> By "heavily used" you mean "there is probably a 1:1000 ratio between
> statx:stat in reality".
> 
> Those "few additional bits" are all very specialized. No normal
> application cares about things like "mount ID" or subvolume data. I
> can't imagine what other fields you think are so important.

Sorry, I wasn't talking about new fields I was talking about actual bits
that statx has.

#define STATX_ATTR_COMPRESSED		0x00000004 /* [I] File is compressed by the fs */
#define STATX_ATTR_IMMUTABLE		0x00000010 /* [I] File is marked immutable */
#define STATX_ATTR_APPEND		0x00000020 /* [I] File is append-only */
#define STATX_ATTR_NODUMP		0x00000040 /* [I] File is not to be dumped */
#define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
#define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
#define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
#define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
#define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
#define STATX_ATTR_WRITE_ATOMIC		0x00400000 /* File supports atomic write operations */

I think that such an attribute field could've been added to struct stat
itself because afaict there are two unused fields in there currently:

struct stat {
	unsigned long	st_dev;		/* Device.  */
	unsigned long	st_ino;		/* File serial number.  */
	unsigned int	st_mode;	/* File mode.  */
	unsigned int	st_nlink;	/* Link count.  */
	unsigned int	st_uid;		/* User ID of the file's owner.  */
	unsigned int	st_gid;		/* Group ID of the file's group. */
	[...]
	unsigned int	__unused4;
	unsigned int	__unused5;
};

I happily concede that stat() will exceed the usage of statx() for most
application that e.g., just want to do the basic mode, uid, gid thing.
So compared to that statx() usage is probably not that important.

And nothing would stop us from adding a statx2()...

int statx(int dfd, const char *path, unsigned int flags, unsigned int mask
          struct statx *st, size_t st_size)

#define STATX_SIZE_VER0 // compatible with struct stat

and then only the stat portion is filled in. This could allow glibc to
use statx() for everything without perf impact especially when paired
with allowing NULL with AT_EMPTY_PATH.
Christoph Hellwig Jan. 29, 2025, 6:34 a.m. UTC | #7
[Sorry for the delay, I've been distrracted by a fun mix of personal
business and urgend customer work]

On Mon, Jan 20, 2025 at 11:24:56AM -0800, Linus Torvalds wrote:
> On Sat, 18 Jan 2025 at 05:09, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Add a separate dio read align field to statx, as many out of place write
> > file systems can easily do reads aligned to the device sector size, but
> > require bigger alignment for writes.
> 
> I've pulled this, but it needs some fixing.
> 
> You added the 'dio_read_offset_align' field to 'struct kstat', and
> that structure is *critical*, because it's used even for the real
> 'stat()' calls that people actually use (as opposed to the statx side
> that is seldom a real issue).
> 
> And that field was added in a way that causes the struct to grow due
> to alignment issues.  For no good reason, because there were existing
> holes in there.

Indeed, sorry.  I put all attention on the user visible struct statx
and missed optimizing the in-kernel kstat.

> I despise the whole statx thing exactly because it has (approximately)
> five specialized users, while slowing down regular stat/fstat that is
> used widely absolutely *evertwhere*.

Yeah.  I still hate the statx design as it overload the critical
stat information with all the misc based path information that is
useful in some cases but not actually needed for most.  Not much
we can do about that now, though.