
[v6,11/13] xfs: add xfs_file_dio_write_atomic()

Message ID 20250313171310.1886394-12-john.g.garry@oracle.com
State New
Series large atomic writes for xfs with CoW

Commit Message

John Garry March 13, 2025, 5:13 p.m. UTC
Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

If -EAGAIN is returned from iomap_dio_rw(), reissue the write in
CoW-based atomic write mode.

For the CoW-based mode, ensure that there are no outstanding IOs which
we might trample on.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

Comments

hch March 17, 2025, 6:41 a.m. UTC | #1
On Thu, Mar 13, 2025 at 05:13:08PM +0000, John Garry wrote:
> + * REQ_ATOMIC-based is the preferred method, and is attempted first. If this
> + * method fails due to REQ_ATOMIC-related constraints, then we retry with the
> + * COW-based method. The REQ_ATOMIC-based method typically will fail if the
> + * write spans multiple extents or the disk blocks are misaligned.

It is only preferred if actually supported by the underlying hardware.
If it isn't, it really shouldn't even be tried, as that is just a waste
of cycles.

Also, a lot of this comment should probably be near the code, not on top
of the function, as that's where people would look for it.

> +static noinline ssize_t
> +xfs_file_dio_write_atomic(
> +	struct xfs_inode	*ip,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from)
> +{
> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> +	unsigned int		dio_flags = 0;
> +	const struct iomap_ops	*dops = &xfs_direct_write_iomap_ops;
> +	ssize_t			ret;
> +
> +retry:
> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> +	if (ret)
> +		return ret;
> +
> +	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
> +	if (ret)
> +		goto out_unlock;
> +
> +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
> +		inode_dio_wait(VFS_I(ip));
> +
> +	trace_xfs_file_direct_write(iocb, from);
> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
> +			dio_flags, NULL, 0);

The normal direct I/O path downgrades the iolock to shared before
doing the I/O here.  Why isn't that done here?

> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
> +	    dops == &xfs_direct_write_iomap_ops) {

This should probably explain the unusual use of EAGAIN.  Although I
still feel that picking a different error code for the fallback would
be much more maintainable.

> +		xfs_iunlock(ip, iolock);
> +		dio_flags = IOMAP_DIO_FORCE_WAIT;

I notice the top of function comment mentions the IOMAP_DIO_FORCE_WAIT
flag.  Maybe use the chance to write a full sentence here or where
it is checked to explain the logic a bit better?

>   * Handle block unaligned direct I/O writes
>   *
> @@ -840,6 +909,10 @@ xfs_file_dio_write(
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
>  	if (xfs_is_zoned_inode(ip))
>  		return xfs_file_dio_write_zoned(ip, iocb, from);
> +
> +	if (iocb->ki_flags & IOCB_ATOMIC)
> +		return xfs_file_dio_write_atomic(ip, iocb, from);
> +

Either keep space between all the conditional calls or none.  I doubt
just stick to the existing style.
John Garry March 17, 2025, 9:36 a.m. UTC | #2
On 17/03/2025 06:41, Christoph Hellwig wrote:
> On Thu, Mar 13, 2025 at 05:13:08PM +0000, John Garry wrote:
>> + * REQ_ATOMIC-based is the preferred method, and is attempted first. If this
>> + * method fails due to REQ_ATOMIC-related constraints, then we retry with the
>> + * COW-based method. The REQ_ATOMIC-based method typically will fail if the
>> + * write spans multiple extents or the disk blocks are misaligned.
> 
> It is only preferred if actually supported by the underlying hardware.
> If it isn't, it really shouldn't even be tried, as that is just a waste
> of cycles.

We should not even call this function if atomics are not supported by HW 
- please see IOCB_ATOMIC checks in xfs_file_write_iter(). So maybe I 
will mention that the caller must ensure atomics are supported for the 
write size.

> 
> Also, a lot of this comment should probably be near the code, not on top
> of the function, as that's where people would look for it.

sure, if you prefer

> 
>> +static noinline ssize_t
>> +xfs_file_dio_write_atomic(
>> +	struct xfs_inode	*ip,
>> +	struct kiocb		*iocb,
>> +	struct iov_iter		*from)
>> +{
>> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
>> +	unsigned int		dio_flags = 0;
>> +	const struct iomap_ops	*dops = &xfs_direct_write_iomap_ops;
>> +	ssize_t			ret;
>> +
>> +retry:
>> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
>> +	if (ret)
>> +		goto out_unlock;
>> +
>> +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
>> +		inode_dio_wait(VFS_I(ip));
>> +
>> +	trace_xfs_file_direct_write(iocb, from);
>> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
>> +			dio_flags, NULL, 0);
> 
> The normal direct I/O path downgrades the iolock to shared before
> doing the I/O here.  Why isn't that done here?

OK, I can do that. But we still always require an exclusive lock for the 
CoW-based method.

> 
>> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
>> +	    dops == &xfs_direct_write_iomap_ops) {
> 
> This should probably explain the unusual use of EAGAIN.  Although I
> still feel that picking a different error code for the fallback would
> be much more maintainable.

I could try another error code - can you suggest one? Is it going to be 
something unrelated to the storage stack, like EREMOTEIO?

> 
>> +		xfs_iunlock(ip, iolock);
>> +		dio_flags = IOMAP_DIO_FORCE_WAIT;
> 
> I notice the top of function comment mentions the IOMAP_DIO_FORCE_WAIT
> flag.  Maybe use the chance to write a full sentence here or where
> it is checked to explain the logic a bit better?

ok, fine

> 
>>    * Handle block unaligned direct I/O writes
>>    *
>> @@ -840,6 +909,10 @@ xfs_file_dio_write(
>>   		return xfs_file_dio_write_unaligned(ip, iocb, from);
>>   	if (xfs_is_zoned_inode(ip))
>>   		return xfs_file_dio_write_zoned(ip, iocb, from);
>> +
>> +	if (iocb->ki_flags & IOCB_ATOMIC)
>> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>> +
> 
> Either keep space between all the conditional calls or none.  I doubt
> just stick to the existing style.

Sure

>
hch March 18, 2025, 5:43 a.m. UTC | #3
On Mon, Mar 17, 2025 at 09:36:13AM +0000, John Garry wrote:
>> It is only preferred if actually supported by the underlying hardware.
>> If it isn't it really shouldn't even be tried, as that is just a waste
>> of cycles.
>
> We should not even call this function if atomics are not supported by HW - 
> please see IOCB_ATOMIC checks in xfs_file_write_iter(). So maybe I will 
> mention that the caller must ensure atomics are supported for the write 
> size.

I see that this is what's done in the current series now.  But that feels
very wrong.  Why do you want to deprive the user of this nice and useful
code if they don't have the right hardware?  Why do we limit ourselves to
the hardware-supported size when we support more in software?  How do you
force test the software code if you require the hardware support?

>>> +	trace_xfs_file_direct_write(iocb, from);
>>> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
>>> +			dio_flags, NULL, 0);
>>
>> The normal direct I/O path downgrades the iolock to shared before
>> doing the I/O here.  Why isn't that done here?
>
> OK, I can do that. But we still always require an exclusive lock for the 
> CoW-based method.

If you can do away with the lock that's great and useful to get good
performance.  But if not at least document why this is different from
others.  Similarly if the COW path needs an exclusive lock document why
in the code.


>
>>
>>> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
>>> +	    dops == &xfs_direct_write_iomap_ops) {
>>
>> This should probably explain the unusual use of EAGAIN.  Although I
>> still feel that picking a different error code for the fallback would
>> be much more maintainable.
>
> I could try another error code - can you suggest one? Is it going to be 
> something unrelated to the storage stack, like EREMOTEIO?

Yes, the funky networking codes tend to be good candidates.  E.g.
ENOPROTOOPT for something that sounds at least vaguely related.

>>> +
>>> +	if (iocb->ki_flags & IOCB_ATOMIC)
>>> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>>> +
>>
>> Either keep space between all the conditional calls or none.  I doubt
>> just stick to the existing style.
>
> Sure

FYI, that "I doubt" should have been "in doubt".  I was just so happy to
finally get the mail out after a flakey connection on the train.
John Garry March 18, 2025, 8:42 a.m. UTC | #4
On 18/03/2025 05:43, Christoph Hellwig wrote:
> On Mon, Mar 17, 2025 at 09:36:13AM +0000, John Garry wrote:
>>> It is only preferred if actually supported by the underlying hardware.
>>> If it isn't, it really shouldn't even be tried, as that is just a waste
>>> of cycles.
>>
>> We should not even call this function if atomics are not supported by HW -
>> please see IOCB_ATOMIC checks in xfs_file_write_iter(). So maybe I will
>> mention that the caller must ensure atomics are supported for the write
>> size.
> 
> I see that this is what's done in the current series now.  But that feels
> very wrong.  Why do you want to deprive the user of this nice and useful
> code if they don't have the right hardware? 

I don't think it's fair to say that we deprive the user - so far we just 
don't and nobody has asked for atomics without HW support.

> Why do we limit ourselves to the
> hardware-supported size when we support more in software?

As I see it, HW offload gives fast and predictable performance.

The COW method is just a (slow) fallback for when HW offload is not possible.

If we want to allow the user to avail of atomics greater than what the 
mounted bdev supports, then we should have a method to tell the user the 
optimised threshold. They could read the bdev atomic limits and infer 
this, but that is not a good user experience.

> How do you
> force test the software code if you require the hardware support?
> 
>>>> +	trace_xfs_file_direct_write(iocb, from);
>>>> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
>>>> +			dio_flags, NULL, 0);
>>>
>>> The normal direct I/O path downgrades the iolock to shared before
>>> doing the I/O here.  Why isn't that done here?
>>
>> OK, I can do that. But we still always require an exclusive lock for the
>> CoW-based method.
> 
> If you can do away with the lock that's great and useful to get good
> performance.  But if not at least document why this is different from
> others.  Similarly if the COW path needs an exclusive lock document why
> in the code.

ok, I'll do that.

> 
> 
>>
>>>
>>>> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
>>>> +	    dops == &xfs_direct_write_iomap_ops) {
>>>
>>> This should probably explain the unusual use of EAGAIN.  Although I
>>> still feel that picking a different error code for the fallback would
>>> be much more maintainable.
>>
>> I could try another error code - can you suggest one? Is it going to be
>> something unrelated to the storage stack, like EREMOTEIO?
> 
> Yes, the funky networking codes tend to be good candidates.  E.g.
> ENOPROTOOPT for something that sounds at least vaguely related.

ok

> 
>>>> +
>>>> +	if (iocb->ki_flags & IOCB_ATOMIC)
>>>> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>>>> +
>>>
>>> Either keep space between all the conditional calls or none.  I doubt
>>> just stick to the existing style.
>>
>> Sure
> 
> FYI, that "I doubt" should have been "in doubt".  I was just so happy to
> finally get the mail out after a flakey connection on the train.
> 

thanks
hch March 18, 2025, 8:46 a.m. UTC | #5
On Tue, Mar 18, 2025 at 08:42:36AM +0000, John Garry wrote:
>> I see that this is what's done in the current series now.  But that feels
>> very wrong.  Why do you want to deprive the user of this nice and useful
>> code if they don't have the right hardware? 
>
> I don't think it's fair to say that we deprive the user - so far we just 
> don't and nobody has asked for atomics without HW support.

You're still keeping this nice functionality from the users.

>
>> Why do we limit ourselves to the
>> hardware-supported size when we support more in software?
>
> As I see it, HW offload gives fast and predictable performance.
>
> The COW method is just a (slow) fallback for when HW offload is not possible.
>
> If we want to allow the user to avail of atomics greater than what the 
> mounted bdev supports, then we should have a method to tell the user the 
> optimised threshold. They could read the bdev atomic limits and infer this, 
> but that is not a good user experience.

Yes, there should be an interface to expose that.  But even without
the hardware acceleration a guaranteed untorn write is a really nice
feature to have.
John Garry March 18, 2025, 9:12 a.m. UTC | #6
On 18/03/2025 08:46, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 08:42:36AM +0000, John Garry wrote:
>>> I see that this is what's done in the current series now.  But that feels
>>> very wrong.  Why do you want to deprive the user of this nice and useful
>>> code if they don't have the right hardware?
>>
>> I don't think it's fair to say that we deprive the user - so far we just
>> don't and nobody has asked for atomics without HW support.
> 
> You're still keeping this nice functionality from the users..
> 
>>
>>> Why do we limit ourselves to the
>>> hardware-supported size when we support more in software?
>>
>> As I see it, HW offload gives fast and predictable performance.
>>
>> The COW method is just a (slow) fallback for when HW offload is not possible.
>>
>> If we want to allow the user to avail of atomics greater than what the
>> mounted bdev supports, then we should have a method to tell the user the
>> optimised threshold. They could read the bdev atomic limits and infer this,
>> but that is not a good user experience.
> 
> Yes, there should be an interface to expose that. 

So we could add statx.atomic_write_unit_max_opt, which would be the 
optimised/fast limit, i.e. the bdev limit. Is that sane?

Note that we already have STATX_ATTR_WRITE_ATOMIC to get the atomic 
limits. I wouldn't want to add a new flag just to get this new member. 
So I suppose that if we added statx.atomic_write_unit_max_opt, the API 
would be: if statx.atomic_write_unit_max_opt is zero, then 
statx.atomic_write_unit_max is the optimised limit also.

> But even without
> the hardware acceleration a guaranteed untorn write is a really nice
> feature to have.
>

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7a56ddb86fd2..029684b54dda 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -725,6 +725,75 @@  xfs_file_dio_write_zoned(
 	return ret;
 }
 
+/*
+ * Handle block atomic writes
+ *
+ * Two methods of atomic writes are supported:
+ * - REQ_ATOMIC-based, which would typically use some form of HW offload in the
+ *   disk
+ * - COW-based, which uses a COW fork as a staging extent for data updates
+ *   before atomically updating extent mappings for the range being written
+ *
+ * REQ_ATOMIC-based is the preferred method, and is attempted first. If this
+ * method fails due to REQ_ATOMIC-related constraints, then we retry with the
+ * COW-based method. The REQ_ATOMIC-based method typically will fail if the
+ * write spans multiple extents or the disk blocks are misaligned.
+ *
+ * Similar to xfs_file_dio_write_unaligned(), the retry mechanism is based on
+ * the ->iomap_begin method returning -EAGAIN, which indicates that the
+ * REQ_ATOMIC-based write is not possible. If IOCB_NOWAIT is set, we do not
+ * retry with the COW-based method, and instead pass that error code back to
+ * the caller immediately.
+ *
+ * REQ_ATOMIC-based atomic writes behave such that a racing read which
+ * overlaps with the range being atomically written will see all or none of
+ * the old data. Emulate this behaviour for COW-based atomic writes by using
+ * IOMAP_DIO_FORCE_WAIT and inode_dio_wait() to drain any in-flight IO first.
+ * This also locks out racing writes, which could trample on the COW fork extent.
+ */
+
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	unsigned int		iolock = XFS_IOLOCK_SHARED;
+	unsigned int		dio_flags = 0;
+	const struct iomap_ops	*dops = &xfs_direct_write_iomap_ops;
+	ssize_t			ret;
+
+retry:
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	if (ret)
+		return ret;
+
+	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
+	if (ret)
+		goto out_unlock;
+
+	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
+		inode_dio_wait(VFS_I(ip));
+
+	trace_xfs_file_direct_write(iocb, from);
+	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
+			dio_flags, NULL, 0);
+
+	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
+	    dops == &xfs_direct_write_iomap_ops) {
+		xfs_iunlock(ip, iolock);
+		dio_flags = IOMAP_DIO_FORCE_WAIT;
+		dops = &xfs_atomic_write_cow_iomap_ops;
+		iolock = XFS_IOLOCK_EXCL;
+		goto retry;
+	}
+
+out_unlock:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+	return ret;
+}
+
 /*
  * Handle block unaligned direct I/O writes
  *
@@ -840,6 +909,10 @@  xfs_file_dio_write(
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
 	if (xfs_is_zoned_inode(ip))
 		return xfs_file_dio_write_zoned(ip, iocb, from);
+
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		return xfs_file_dio_write_atomic(ip, iocb, from);
+
 	return xfs_file_dio_write_aligned(ip, iocb, from,
 			&xfs_direct_write_iomap_ops, &xfs_dio_write_ops, NULL);
 }