diff mbox series

[v5,07/10] xfs: Commit CoW-based atomic writes atomically

Message ID 20250310183946.932054-8-john.g.garry@oracle.com (mailing list archive)
State New
Headers show
Series large atomic writes for xfs with CoW | expand

Commit Message

John Garry March 10, 2025, 6:39 p.m. UTC
When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c    |  5 ++++-
 fs/xfs/xfs_reflink.c | 49 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |  3 +++
 3 files changed, 56 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig March 12, 2025, 7:39 a.m. UTC | #1
On Mon, Mar 10, 2025 at 06:39:43PM +0000, John Garry wrote:
> When completing a CoW-based write, each extent range mapping update is
> covered by a separate transaction.
> 
> For a CoW-based atomic write, all mappings must be changed at once, so
> change to use a single transaction.

As already mentioned in a previous reply:  "all" might be to much.
The code can only support a (relatively low) number of extents
in a single transaction safely.

> +int
> +xfs_reflink_end_atomic_cow(
> +	struct xfs_inode		*ip,
> +	xfs_off_t			offset,
> +	xfs_off_t			count)

Assuming we could actually to the multi extent per transaction
commit safely, what would be the reason to not always do it?
John Garry March 12, 2025, 9:04 a.m. UTC | #2
On 12/03/2025 07:39, Christoph Hellwig wrote:
> On Mon, Mar 10, 2025 at 06:39:43PM +0000, John Garry wrote:
>> When completing a CoW-based write, each extent range mapping update is
>> covered by a separate transaction.
>>
>> For a CoW-based atomic write, all mappings must be changed at once, so
>> change to use a single transaction.
> 
> As already mentioned in a previous reply:  "all" might be to much.
> The code can only support a (relatively low) number of extents
> in a single transaction safely.

Then we would need to limit the awu max to whatever can be guaranteed
(to fit).


> 
>> +int
>> +xfs_reflink_end_atomic_cow(
>> +	struct xfs_inode		*ip,
>> +	xfs_off_t			offset,
>> +	xfs_off_t			count)
> 
> Assuming we could actually to the multi extent per transaction
> commit safely, what would be the reason to not always do it?
> 

Yes, I suppose that it could always be used. I would suggest that as a 
later improvement, if you agree.

Thanks,
John
Christoph Hellwig March 12, 2025, 1:54 p.m. UTC | #3
On Wed, Mar 12, 2025 at 09:04:07AM +0000, John Garry wrote:
> > As already mentioned in a previous reply:  "all" might be to much.
> > The code can only support a (relatively low) number of extents
> > in a single transaction safely.
> 
> Then we would need to limit the awu max to whatever can be guaranteed
> (to fit).

Yes.  And please add a testcase that creates a badly fragmented file
and verifies that we can handle the worst case for this limit.

(although being able to reproduce the worst case btree splits might
be hard, but at least the worst case fragmentation should be doable)

> > Assuming we could actually to the multi extent per transaction
> > commit safely, what would be the reason to not always do it?
> > 
> 
> Yes, I suppose that it could always be used. I would suggest that as a later
> improvement, if you agree.

I remember running into some problems with my earlier version, but I'd
have to dig into it.  Maybe it will resurface with the above testing,
or it was due to my optimizations for the extent lookups.
John Garry March 12, 2025, 3:01 p.m. UTC | #4
On 12/03/2025 13:54, Christoph Hellwig wrote:

+

> On Wed, Mar 12, 2025 at 09:04:07AM +0000, John Garry wrote:
>>> As already mentioned in a previous reply:  "all" might be to much.
>>> The code can only support a (relatively low) number of extents
>>> in a single transaction safely.
>>
>> Then we would need to limit the awu max to whatever can be guaranteed
>> (to fit).
> 
> Yes.  And please add a testcase that creates a badly fragmented file
> and verifies that we can handle the worst case for this limit.

ok, we can do that.

I have my own stress test for atomic writes on fragmented FSes, but I 
have not encountered a such a problem. But we need to formalize 
something though.


> 
> (although being able to reproduce the worst case btree splits might
> be hard, but at least the worst case fragmentation should be doable)
> 
>>> Assuming we could actually to the multi extent per transaction
>>> commit safely, what would be the reason to not always do it?
>>>
>>
>> Yes, I suppose that it could always be used. I would suggest that as a later
>> improvement, if you agree.
> 
> I remember running into some problems with my earlier version, but I'd
> have to dig into it.  Maybe it will resurface with the above testing,
> or it was due to my optimizations for the extent lookups.
> 

ok.

Thanks,
John
diff mbox series

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ddcf95ce741e..16739c408af3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -576,7 +576,10 @@  xfs_dio_write_end_io(
 	nofs_flag = memalloc_nofs_save();
 
 	if (flags & IOMAP_DIO_COW) {
-		error = xfs_reflink_end_cow(ip, offset, size);
+		if (iocb->ki_flags & IOCB_ATOMIC)
+			error = xfs_reflink_end_atomic_cow(ip, offset, size);
+		else
+			error = xfs_reflink_end_cow(ip, offset, size);
 		if (error)
 			goto out;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 0949d6ba2b3b..ce1fd58dff35 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -987,6 +987,55 @@  xfs_reflink_end_cow(
 		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
 	return error;
 }
+int
+xfs_reflink_end_atomic_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	int				error = 0;
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_trans		*tp;
+	unsigned int			resblks;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	end_fsb = XFS_B_TO_FSB(mp, offset + count);
+
+	/*
+	 * Each remapping operation could cause a btree split, so in the worst
+	 * case that's one for each block.
+	 */
+	resblks = (end_fsb - offset_fsb) *
+			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	while (end_fsb > offset_fsb && !error) {
+		error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
+				end_fsb);
+	}
+	if (error) {
+		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+		goto out_cancel;
+	}
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
 
 /*
  * Free all CoW staging blocks that are still referenced by the ondisk refcount
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index dfd94e51e2b4..4cb2ee53cd8d 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -49,6 +49,9 @@  extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count, bool cancel_real);
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+		int
+xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, loff_t len,