diff mbox series

[v4,05/12] iomap: Support SW-based atomic writes

Message ID 20250303171120.2837067-6-john.g.garry@oracle.com (mailing list archive)
State New
Headers show
Series large atomic writes for xfs with CoW | expand

Commit Message

John Garry March 3, 2025, 5:11 p.m. UTC
Currently atomic write support requires dedicated HW support. This imposes
a restriction on the filesystem that disk blocks need to be aligned and
contiguously mapped to FS blocks to issue atomic writes.

XFS has no method to guarantee FS block alignment for regular,
non-RT files. As such, atomic writes are currently limited to 1x FS block
there.

To deal with the scenario that we are issuing an atomic write over
misaligned or discontiguous data blocks - and raise the atomic write size
limit - support a SW-based software emulated atomic write mode. For XFS,
this SW-based atomic writes would use CoW support to issue emulated untorn
writes.

It is the responsibility of the FS to detect discontiguous atomic writes
and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
SW-based atomic writes could be used always when the mounted bdev does
not support HW offload, but this strategy is not initially expected to be
used.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/filesystems/iomap/operations.rst | 16 ++++++++++++++--
 fs/iomap/direct-io.c                           |  4 +++-
 include/linux/iomap.h                          |  8 +++++++-
 3 files changed, 24 insertions(+), 4 deletions(-)

Comments

Dave Chinner March 9, 2025, 9:51 p.m. UTC | #1
On Mon, Mar 03, 2025 at 05:11:13PM +0000, John Garry wrote:
> Currently atomic write support requires dedicated HW support. This imposes
> a restriction on the filesystem that disk blocks need to be aligned and
> contiguously mapped to FS blocks to issue atomic writes.
> 
> XFS has no method to guarantee FS block alignment for regular,
> non-RT files. As such, atomic writes are currently limited to 1x FS block
> there.
> 
> To deal with the scenario that we are issuing an atomic write over
> misaligned or discontiguous data blocks - and raise the atomic write size
> limit - support a SW-based software emulated atomic write mode. For XFS,
> this SW-based atomic writes would use CoW support to issue emulated untorn
> writes.
> 
> It is the responsibility of the FS to detect discontiguous atomic writes
> and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
> SW-based atomic writes could be used always when the mounted bdev does
> not support HW offload, but this strategy is not initially expected to be
> used.

So now seeing how these are are to be used, these aren't "hardware"
and "software" atomic IOs. They are block layer vs filesystem atomic
IOs.

We can do atomic IOs in software in the block layer drivers (think
loop or dm-thinp) rather than off-loading to storage hardware.

Hence I think these really need to be named after the layer that
will provide the atomic IO guarantees, because "hw" and "sw" as they
are currently used are not correct. e.g something like
IOMAP_FS_ATOMIC and IOMAP_BDEV_ATOMIC which indicates which layer
should be providing the atomic IO constraints and guarantees.

-Dave.
John Garry March 10, 2025, 10:44 a.m. UTC | #2
On 09/03/2025 21:51, Dave Chinner wrote:
> Mon, Mar 03, 2025 at 05:11:13PM +0000, John Garry wrote:
>> Currently atomic write support requires dedicated HW support. This imposes
>> a restriction on the filesystem that disk blocks need to be aligned and
>> contiguously mapped to FS blocks to issue atomic writes.
>>
>> XFS has no method to guarantee FS block alignment for regular,
>> non-RT files. As such, atomic writes are currently limited to 1x FS block
>> there.
>>
>> To deal with the scenario that we are issuing an atomic write over
>> misaligned or discontiguous data blocks - and raise the atomic write size
>> limit - support a SW-based software emulated atomic write mode. For XFS,
>> this SW-based atomic writes would use CoW support to issue emulated untorn
>> writes.
>>
>> It is the responsibility of the FS to detect discontiguous atomic writes
>> and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
>> SW-based atomic writes could be used always when the mounted bdev does
>> not support HW offload, but this strategy is not initially expected to be
>> used.
> So now seeing how these are are to be used, these aren't "hardware"
> and "software" atomic IOs. They are block layer vs filesystem atomic
> IOs.
> 
> We can do atomic IOs in software in the block layer drivers (think
> loop or dm-thinp) rather than off-loading to storage hardware.
> 
> Hence I think these really need to be named after the layer that
> will provide the atomic IO guarantees, because "hw" and "sw" as they
> are currently used are not correct. e.g something like
> IOMAP_FS_ATOMIC and IOMAP_BDEV_ATOMIC which indicates which layer
> should be providing the atomic IO constraints and guarantees.

I'd prefer IOMAP_REQ_ATOMIC instead (of IOMAP_BDEV_ATOMIC), as we are 
using REQ_ATOMIC for those BIOs only. Anything which the block layer and 
below does with REQ_ATOMIC is its business, as long as it guarantees 
atomic submission. But I am not overly keen on that name, as it clashes 
with block layer names (naturally).

And IOMAP_FS_ATOMIC seems a bit vague, but I can't think of anything else.

Darrick, any opinion on this?

Cheers,
John
Darrick J. Wong March 10, 2025, 5:21 p.m. UTC | #3
On Mon, Mar 10, 2025 at 10:44:47AM +0000, John Garry wrote:
> On 09/03/2025 21:51, Dave Chinner wrote:
> > Mon, Mar 03, 2025 at 05:11:13PM +0000, John Garry wrote:
> > > Currently atomic write support requires dedicated HW support. This imposes
> > > a restriction on the filesystem that disk blocks need to be aligned and
> > > contiguously mapped to FS blocks to issue atomic writes.
> > > 
> > > XFS has no method to guarantee FS block alignment for regular,
> > > non-RT files. As such, atomic writes are currently limited to 1x FS block
> > > there.
> > > 
> > > To deal with the scenario that we are issuing an atomic write over
> > > misaligned or discontiguous data blocks - and raise the atomic write size
> > > limit - support a SW-based software emulated atomic write mode. For XFS,
> > > this SW-based atomic writes would use CoW support to issue emulated untorn
> > > writes.
> > > 
> > > It is the responsibility of the FS to detect discontiguous atomic writes
> > > and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
> > > SW-based atomic writes could be used always when the mounted bdev does
> > > not support HW offload, but this strategy is not initially expected to be
> > > used.
> > So now seeing how these are are to be used, these aren't "hardware"
> > and "software" atomic IOs. They are block layer vs filesystem atomic
> > IOs.
> > 
> > We can do atomic IOs in software in the block layer drivers (think
> > loop or dm-thinp) rather than off-loading to storage hardware.
> > 
> > Hence I think these really need to be named after the layer that
> > will provide the atomic IO guarantees, because "hw" and "sw" as they
> > are currently used are not correct. e.g something like
> > IOMAP_FS_ATOMIC and IOMAP_BDEV_ATOMIC which indicates which layer
> > should be providing the atomic IO constraints and guarantees.
> 
> I'd prefer IOMAP_REQ_ATOMIC instead (of IOMAP_BDEV_ATOMIC), as we are using
> REQ_ATOMIC for those BIOs only. Anything which the block layer and below
> does with REQ_ATOMIC is its business, as long as it guarantees atomic
> submission. But I am not overly keen on that name, as it clashes with block
> layer names (naturally).

I don't like encoding "REQ_ATOMIC" in iomap flags.  If we're changing
the names, they ought to reflect who's making the guarantees:

IOMAP_DIO_BDEV_ATOMIC vs. IOMAP_DIO_FS_ATOMIC.

Not sure why the flags lost the "_DIO" part.

--D

> And IOMAP_FS_ATOMIC seems a bit vague, but I can't think of anything else.
> 
> Darrick, any opinion on this?
> 
> Cheers,
> John
>
diff mbox series

Patch

diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 0b9d7be23bce..b08a79d11d9f 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -526,8 +526,20 @@  IOMAP_WRITE`` with any combination of the following enhancements:
    conversion or copy on write), all updates for the entire file range
    must be committed atomically as well.
    Only one space mapping is allowed per untorn write.
-   Untorn writes must be aligned to, and must not be longer than, a
-   single file block.
+   Untorn writes may be longer than a single file block. In all cases,
+   the mapping start disk block must have at least the same alignment as
+   the write offset.
+
+ * ``IOMAP_ATOMIC_SW``: This write is being issued with torn-write
+   protection via a software mechanism provided by the filesystem.
+   All the disk block alignment and single bio restrictions which apply
+   to IOMAP_ATOMIC_HW do not apply here.
+   SW-based untorn writes would typically be used as a fallback when
+   HW-based untorn writes may not be issued, e.g. the range of the write
+   covers multiple extents, meaning that it is not possible to issue
+   a single bio.
+   All filesystem metadata updates for the entire file range must be
+   committed atomically as well.
 
 Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
 calling this function.
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c696ce980796..c594f2cf3ab4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -686,7 +686,9 @@  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			iomi.flags |= IOMAP_OVERWRITE_ONLY;
 		}
 
-		if (iocb->ki_flags & IOCB_ATOMIC)
+		if (dio_flags & IOMAP_DIO_ATOMIC_SW)
+			iomi.flags |= IOMAP_ATOMIC_SW;
+		else if (iocb->ki_flags & IOCB_ATOMIC)
 			iomi.flags |= IOMAP_ATOMIC_HW;
 
 		/* for data sync or sync, we need sync completion processing */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 87cd7079aaf3..9cd93530013c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -189,8 +189,9 @@  struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
-#define IOMAP_ATOMIC_HW		(1 << 9)
+#define IOMAP_ATOMIC_HW		(1 << 9) /* HW-based torn-write protection */
 #define IOMAP_DONTCACHE		(1 << 10)
+#define IOMAP_ATOMIC_SW		(1 << 11)/* SW-based torn-write protection */
 
 struct iomap_ops {
 	/*
@@ -502,6 +503,11 @@  struct iomap_dio_ops {
  */
 #define IOMAP_DIO_PARTIAL		(1 << 2)
 
+/*
+ * Use software-based torn-write protection.
+ */
+#define IOMAP_DIO_ATOMIC_SW		(1 << 3)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);