
[RFC] statx.2: Add stx_atomic_write_unit_max_opt

Message ID 20250319114402.3757248-1-john.g.garry@oracle.com (mailing list archive)
State New

Commit Message

John Garry March 19, 2025, 11:44 a.m. UTC
XFS supports atomic writes - or untorn writes - based on different methods:
- HW offload in the disk
- Software emulation

The value reported in stx_atomic_write_unit_max will be the max of the
software emulation method.

The max atomic write unit size of the software emulated atomic writes will
generally be much larger than the HW offload. However, software emulated
atomic writes will also be typically much slower.

The filesystem will transparently support both methods. HW offload is
the preferred method when possible, e.g. if the write size is small
enough then HW offload will be used.

Advertise this HW offload limit to the user in a new statx member,
stx_atomic_write_unit_max_opt.

We want STATX_WRITE_ATOMIC to get this new member in addition to the
already-existing members, so mention that a value of 0 means that
stx_atomic_write_unit_max holds this limit.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
I'm sending this as an RFC as I am not sure if we need to bother with it.

Maybe it's better to update the man page to mention that software
emulated atomic writes are available, and the user should check the
mounted bdev atomic write limits instead to know this opt limit.
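
As a rough illustration of the "0 means stx_atomic_write_unit_max holds this limit" rule from the commit message, userspace could resolve the effective optimised limit along these lines. This is only a sketch: the helper names are hypothetical and not part of any kernel or libc API, and the two parameters simply stand in for the values read from stx_atomic_write_unit_max and the proposed stx_atomic_write_unit_max_opt.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of any kernel or libc API.
 * unit_max and unit_max_opt stand for the statx fields
 * stx_atomic_write_unit_max and stx_atomic_write_unit_max_opt.
 */
uint32_t atomic_write_opt_limit(uint32_t unit_max, uint32_t unit_max_opt)
{
    /* 0 means there is no separate optimised limit: unit_max is the limit. */
    return unit_max_opt ? unit_max_opt : unit_max;
}

/* True if a write of len bytes is within the optimised limit. */
bool atomic_write_is_optimised(uint32_t len, uint32_t unit_max,
                               uint32_t unit_max_opt)
{
    return len != 0 && len <= atomic_write_opt_limit(unit_max, unit_max_opt);
}
```

With a 16M hard limit and a 64K optimised limit, a 16K write would fall within the optimised limit while a 1M write would not.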

Comments

hch March 20, 2025, 7 a.m. UTC | #1
On Wed, Mar 19, 2025 at 11:44:02AM +0000, John Garry wrote:
> XFS supports atomic writes - or untorn writes - based on different methods:
> - HW offload in the disk
> - Software emulation
> 
> The value reported in stx_atomic_write_unit_max will be the max of the
> software emulation method.

I don't think emulation is a good word.  A file system implementing
file system things is not emulation.

> We want STATX_WRITE_ATOMIC to get this new member in addition to the
> already-existing members, so mention that a value of 0 means that
> stx_atomic_write_unit_max holds this limit.

Does that actually work?  Can userspace assume all unknown statx
fields are padded to zero?  If so my dio read align change could have
done away with the extra flag.
John Garry March 20, 2025, 9:19 a.m. UTC | #2
On 20/03/2025 07:00, Christoph Hellwig wrote:
> On Wed, Mar 19, 2025 at 11:44:02AM +0000, John Garry wrote:
>> XFS supports atomic writes - or untorn writes - based on different methods:
>> - HW offload in the disk
>> - Software emulation
>>
>> The value reported in stx_atomic_write_unit_max will be the max of the
>> software emulation method.
> 
> I don't think emulation is a good word.  A file system implementing
> file system things is not emulation.

Sure, I am still in the mindset that a filesystem-based atomic write is 
a 2nd-class citizen and just trying to emulate what can be done in the disk.

> 
>> We want STATX_WRITE_ATOMIC to get this new member in addition to the
>> already-existing members, so mention that a value of 0 means that
>> stx_atomic_write_unit_max holds this limit.
> 
> Does that actually work?  Can userspace assume all unknown statx
> fields are padded to zero?  If so my dio read align change could have
> done away with the extra flag.

I will double check that, but if we needed to add another mask just for 
getting this, then yuck.

> 
> 
But is there value in reporting this limit? I am not sure. I am not sure 
what the user would do with this info.

Maybe, for example, they want to write 1K consecutive 16K pages, each 
atomically, and decide to do a big 16M atomic write but find that it is 
slow as bdev atomic limit is < 16M.

Maybe I should just update the documentation to mention that for XFS 
they should check the mounted bdev atomic limits.
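
To make the arithmetic in the example above concrete: 1K pages of 16K each is 16M in total. A sketch of how an application might instead pick a per-write size that stays within the optimised limit, and how many writes that implies, could look like the following. The function names are hypothetical, and the sketch assumes the page size is a power of two (atomic write lengths must be a power of two).

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest power-of-two multiple of page_size that
 * does not exceed opt_limit. Assumes page_size is itself a power of two.
 * Returns 0 if even a single page is larger than the optimised limit.
 */
uint32_t pick_atomic_chunk(uint32_t page_size, uint32_t opt_limit)
{
    uint32_t chunk;

    if (page_size == 0 || page_size > opt_limit)
        return 0;

    chunk = page_size;
    /* Doubling is safe here: chunk <= opt_limit / 2 implies no overflow. */
    while (chunk <= opt_limit / 2)
        chunk *= 2;
    return chunk;
}

/* Number of chunk-sized writes needed to cover total bytes. */
uint64_t writes_needed(uint64_t total, uint32_t chunk)
{
    return chunk ? (total + chunk - 1) / chunk : 0;
}
```

For instance, with 16K pages and an assumed optimised limit of 64K, the 16M batch would be issued as 256 writes of 64K rather than as one large write.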
hch March 20, 2025, 2:12 p.m. UTC | #3
On Thu, Mar 20, 2025 at 09:19:40AM +0000, John Garry wrote:
> But is there value in reporting this limit? I am not sure. I am not sure 
> what the user would do with this info.

Align their data structures to it, e.g. size the log buffers to it.

> Maybe, for example, they want to write 1K consecutive 16K pages, each 
> atomically, and decide to do a big 16M atomic write but find that it is 
> slow as bdev atomic limit is < 16M.
>
> Maybe I should just update the documentation to mention that for XFS they 
> should check the mounted bdev atomic limits.

For something working on files having to figure out the underlying
block device (which is non-trivial given the various methods of
multi-device support) and then looking into block sysfs is a no-go.

So if we have any sort of use case for it we should expose the limit.
John Garry March 21, 2025, 10:20 a.m. UTC | #4
On 20/03/2025 14:12, Christoph Hellwig wrote:
> On Thu, Mar 20, 2025 at 09:19:40AM +0000, John Garry wrote:
>> But is there value in reporting this limit? I am not sure. I am not sure
>> what the user would do with this info.
> 
> Align their data structures to it, e.g. size the log buffers to it.
> 

Sure, there may be a usecase there.

So far I am just considering the DB usecase, and they know the atomic 
write size which they want to do, i.e. their internal page size, and 
align to that. If that internal page size <= this opt limit, then good.

>> Maybe, for example, they want to write 1K consecutive 16K pages, each
>> atomically, and decide to do a big 16M atomic write but find that it is
>> slow as bdev atomic limit is < 16M.
>>
>> Maybe I should just update the documentation to mention that for XFS they
>> should check the mounted bdev atomic limits.
> 
> For something working on files having to figure out the underlying
> block device (which is non-trivial given the various methods of
> multi-device support) and then looking into block sysfs is a no-go.
> 
> So if we have any sort of use case for it we should expose the limit.
> 

Coming back to what was discussed about not adding a new flag to fetch 
this limit:

 > Does that actually work?  Can userspace assume all unknown statx
 > fields are padded to zero?

In cp_statx, we do pre-zero the statx structure. As such, the rule "if 
zero, just use hard limit unit max" seems to hold.

 > If so my dio read align change could have
 > done away with the extra flag.

Sounds like it. Maybe this practice is not preferred, i.e. changing what 
the request/result mask returns.
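
The pre-zeroing argument can be modelled in plain userspace C. The following is only a toy model, not kernel code: struct model_statx and model_fill_old_fs() are hypothetical stand-ins, and the real struct statx layout is different. It shows that when the structure is zeroed before the known fields are filled in, a member that the filler does not know about reads back as 0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a few statx members; not the real layout. */
struct model_statx {
    uint32_t unit_min;
    uint32_t unit_max;
    uint32_t unit_max_opt;  /* models the newly proposed member */
};

/*
 * Models a filler that predates the new member: the structure is zeroed
 * first, then only the members this filler knows about are set.
 */
void model_fill_old_fs(struct model_statx *out, uint32_t unit_min,
                       uint32_t unit_max)
{
    memset(out, 0, sizeof(*out));
    out->unit_min = unit_min;
    out->unit_max = unit_max;
    /* unit_max_opt is never written, so it stays 0. */
}
```

A caller that then sees unit_max_opt == 0 would fall back to unit_max, which is the rule quoted above.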
hch March 23, 2025, 6:40 a.m. UTC | #5
On Fri, Mar 21, 2025 at 10:20:21AM +0000, John Garry wrote:
> Coming back to what was discussed about not adding a new flag to fetch this 
> limit:
>
> > Does that actually work?  Can userspace assume all unknown statx
> > fields are padded to zero?
>
> In cp_statx, we do pre-zero the statx structure. As such, the rule "if 
> zero, just use hard limit unit max" seems to hold.

Ok, can we document this somewhere?

Patch

diff --git a/man/man2/statx.2 b/man/man2/statx.2
index 0abac75c1..c3872f05d 100644
--- a/man/man2/statx.2
+++ b/man/man2/statx.2
@@ -79,6 +79,9 @@  struct statx {
 \&
     /* File offset alignment for direct I/O reads */
     __u32   stx_dio_read_offset_align;
+\&
+    /* Optimised max atomic write unit for direct I/O, in bytes */
+    __u32   stx_atomic_write_unit_max_opt;
 };
 .EE
 .in
@@ -271,7 +274,8 @@  STATX_SUBVOL	Want stx_subvol
 	(since Linux 6.10; support varies by filesystem)
 STATX_WRITE_ATOMIC	Want stx_atomic_write_unit_min,
 	stx_atomic_write_unit_max,
-	and stx_atomic_write_segments_max.
+	stx_atomic_write_segments_max,
+	and stx_atomic_write_unit_max_opt.
 	(since Linux 6.11; support varies by filesystem)
 STATX_DIO_READ_ALIGN	Want stx_dio_read_offset_align.
 	(since Linux 6.14; support varies by filesystem)
@@ -519,6 +523,15 @@  is supported on block devices since Linux 6.11.
 The support on regular files varies by filesystem;
 it is supported by xfs and ext4 since Linux 6.13.
 .TP
+.I stx_atomic_write_unit_max_opt
+The maximum size (in bytes) which is optimised for fast
+untorn writes.
+This value must not exceed the value in
+.IR stx_atomic_write_unit_max .
+A value of 0 indicates that
+.I stx_atomic_write_unit_max
+is the optimised limit.
+.TP
 .I stx_atomic_write_segments_max
 The maximum number of elements in an array of vectors
 for a write with torn-write protection enabled.