Message ID: 20250313171310.1886394-12-john.g.garry@oracle.com
State: New
Series: large atomic writes for xfs with CoW
On Thu, Mar 13, 2025 at 05:13:08PM +0000, John Garry wrote:
> + * REQ_ATOMIC-based is the preferred method, and is attempted first. If this
> + * method fails due to REQ_ATOMIC-related constraints, then we retry with the
> + * COW-based method. The REQ_ATOMIC-based method typically will fail if the
> + * write spans multiple extents or the disk blocks are misaligned.

It is only preferred if actually supported by the underlying hardware.
If it isn't, it really shouldn't even be tried, as that is just a waste
of cycles.

Also, a lot of this comment should probably be near the code, not on top
of the function, as that's where people would look for it.

> +static noinline ssize_t
> +xfs_file_dio_write_atomic(
> +	struct xfs_inode	*ip,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from)
> +{
> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> +	unsigned int		dio_flags = 0;
> +	const struct iomap_ops	*dops = &xfs_direct_write_iomap_ops;
> +	ssize_t			ret;
> +
> +retry:
> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> +	if (ret)
> +		return ret;
> +
> +	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
> +	if (ret)
> +		goto out_unlock;
> +
> +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
> +		inode_dio_wait(VFS_I(ip));
> +
> +	trace_xfs_file_direct_write(iocb, from);
> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
> +			dio_flags, NULL, 0);

The normal direct I/O path downgrades the iolock to shared before
doing the I/O. Why isn't that done here?

> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
> +	    dops == &xfs_direct_write_iomap_ops) {

This should probably explain the unusual use of -EAGAIN. Although I
still feel that picking a different error code for the fallback would
be much more maintainable.

> +		xfs_iunlock(ip, iolock);
> +		dio_flags = IOMAP_DIO_FORCE_WAIT;

I notice the top-of-function comment mentions the IOMAP_DIO_FORCE_WAIT
flag. Maybe use the chance to write a full sentence here, or where it
is checked, to explain the logic a bit better?

>   * Handle block unaligned direct I/O writes
>   *
> @@ -840,6 +909,10 @@ xfs_file_dio_write(
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
>  	if (xfs_is_zoned_inode(ip))
>  		return xfs_file_dio_write_zoned(ip, iocb, from);
> +
> +	if (iocb->ki_flags & IOCB_ATOMIC)
> +		return xfs_file_dio_write_atomic(ip, iocb, from);
> +

Either keep space between all the conditional calls or none. I doubt
just stick to the existing style.
On 17/03/2025 06:41, Christoph Hellwig wrote:
> On Thu, Mar 13, 2025 at 05:13:08PM +0000, John Garry wrote:
>> + * REQ_ATOMIC-based is the preferred method, and is attempted first. If this
>> + * method fails due to REQ_ATOMIC-related constraints, then we retry with the
>> + * COW-based method. The REQ_ATOMIC-based method typically will fail if the
>> + * write spans multiple extents or the disk blocks are misaligned.
>
> It is only preferred if actually supported by the underlying hardware.
> If it isn't, it really shouldn't even be tried, as that is just a waste
> of cycles.

We should not even call this function if atomics are not supported by the
HW - please see the IOCB_ATOMIC checks in xfs_file_write_iter(). So maybe
I will mention that the caller must ensure atomics are supported for the
write size.

> Also, a lot of this comment should probably be near the code, not on top
> of the function, as that's where people would look for it.

Sure, if you prefer.

>> +static noinline ssize_t
>> +xfs_file_dio_write_atomic(
>> +	struct xfs_inode	*ip,
>> +	struct kiocb		*iocb,
>> +	struct iov_iter		*from)
>> +{
>> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
>> +	unsigned int		dio_flags = 0;
>> +	const struct iomap_ops	*dops = &xfs_direct_write_iomap_ops;
>> +	ssize_t			ret;
>> +
>> +retry:
>> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
>> +	if (ret)
>> +		goto out_unlock;
>> +
>> +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
>> +		inode_dio_wait(VFS_I(ip));
>> +
>> +	trace_xfs_file_direct_write(iocb, from);
>> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
>> +			dio_flags, NULL, 0);
>
> The normal direct I/O path downgrades the iolock to shared before
> doing the I/O. Why isn't that done here?

OK, I can do that. But we still require an exclusive lock always for the
CoW-based method.

>> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
>> +	    dops == &xfs_direct_write_iomap_ops) {
>
> This should probably explain the unusual use of -EAGAIN. Although I
> still feel that picking a different error code for the fallback would
> be much more maintainable.

I could try another error code - can you suggest one? Is it going to be
something unrelated to the storage stack, like EREMOTEIO?

>> +		xfs_iunlock(ip, iolock);
>> +		dio_flags = IOMAP_DIO_FORCE_WAIT;
>
> I notice the top-of-function comment mentions the IOMAP_DIO_FORCE_WAIT
> flag. Maybe use the chance to write a full sentence here, or where it
> is checked, to explain the logic a bit better?

OK, fine.

>>   * Handle block unaligned direct I/O writes
>>   *
>> @@ -840,6 +909,10 @@ xfs_file_dio_write(
>>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
>>  	if (xfs_is_zoned_inode(ip))
>>  		return xfs_file_dio_write_zoned(ip, iocb, from);
>> +
>> +	if (iocb->ki_flags & IOCB_ATOMIC)
>> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>> +
>
> Either keep space between all the conditional calls or none. I doubt
> just stick to the existing style.

Sure.
On Mon, Mar 17, 2025 at 09:36:13AM +0000, John Garry wrote:
>> It is only preferred if actually supported by the underlying hardware.
>> If it isn't, it really shouldn't even be tried, as that is just a waste
>> of cycles.
>
> We should not even call this function if atomics are not supported by the
> HW - please see the IOCB_ATOMIC checks in xfs_file_write_iter(). So maybe
> I will mention that the caller must ensure atomics are supported for the
> write size.

I see that this is what's done in the current series now. But that feels
very wrong. Why do you want to deprive the user of this nice and useful
code if they don't have the right hardware? Why do we limit ourselves to
the hardware-supported size when we support more in software? How do you
force-test the software code path if you require the hardware support?

>>> +	trace_xfs_file_direct_write(iocb, from);
>>> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
>>> +			dio_flags, NULL, 0);
>>
>> The normal direct I/O path downgrades the iolock to shared before
>> doing the I/O. Why isn't that done here?
>
> OK, I can do that. But we still require an exclusive lock always for the
> CoW-based method.

If you can do away with the exclusive lock, that's great and useful for
getting good performance. But if not, at least document why this is
different from the other paths. Similarly, if the COW path needs an
exclusive lock, document why in the code.

>>> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
>>> +	    dops == &xfs_direct_write_iomap_ops) {
>>
>> This should probably explain the unusual use of -EAGAIN. Although I
>> still feel that picking a different error code for the fallback would
>> be much more maintainable.
>
> I could try another error code - can you suggest one? Is it going to be
> something unrelated to the storage stack, like EREMOTEIO?

Yes, the funky networking codes tend to be good candidates - e.g.
ENOPROTOOPT for something that sounds at least vaguely related.
>>> +
>>> +	if (iocb->ki_flags & IOCB_ATOMIC)
>>> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>>> +
>>
>> Either keep space between all the conditional calls or none. I doubt
>> just stick to the existing style.
>
> Sure

FYI, that "I doubt" should have been "in doubt". I was just so happy to
finally get the mail out after a flaky connection on the train.
On 18/03/2025 05:43, Christoph Hellwig wrote:
> On Mon, Mar 17, 2025 at 09:36:13AM +0000, John Garry wrote:
>>> It is only preferred if actually supported by the underlying hardware.
>>> If it isn't, it really shouldn't even be tried, as that is just a waste
>>> of cycles.
>>
>> We should not even call this function if atomics are not supported by the
>> HW - please see the IOCB_ATOMIC checks in xfs_file_write_iter(). So maybe
>> I will mention that the caller must ensure atomics are supported for the
>> write size.
>
> I see that this is what's done in the current series now. But that feels
> very wrong. Why do you want to deprive the user of this nice and useful
> code if they don't have the right hardware?

I don't think it's fair to say that we deprive the user - so far we just
don't support it, and nobody has asked for atomics without HW support.

> Why do we limit ourselves to the hardware-supported size when we support
> more in software?

As I see it, HW offload gives fast and predictable performance.

The COW method is just a (slow) fallback for when HW offload is not
possible.

If we want to allow the user to avail of atomics greater than what the
mounted bdev supports, then we should have a method to tell the user of
the optimised threshold. They could read the bdev atomic limits and infer
this, but that is not a good user experience.

> How do you force-test the software code path if you require the hardware
> support?
>
>>>> +	trace_xfs_file_direct_write(iocb, from);
>>>> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
>>>> +			dio_flags, NULL, 0);
>>>
>>> The normal direct I/O path downgrades the iolock to shared before
>>> doing the I/O. Why isn't that done here?
>>
>> OK, I can do that. But we still require an exclusive lock always for the
>> CoW-based method.
>
> If you can do away with the exclusive lock, that's great and useful for
> getting good performance. But if not, at least document why this is
> different from the other paths. Similarly, if the COW path needs an
> exclusive lock, document why in the code.

OK, I'll do that.

>>>> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
>>>> +	    dops == &xfs_direct_write_iomap_ops) {
>>>
>>> This should probably explain the unusual use of -EAGAIN. Although I
>>> still feel that picking a different error code for the fallback would
>>> be much more maintainable.
>>
>> I could try another error code - can you suggest one? Is it going to be
>> something unrelated to the storage stack, like EREMOTEIO?
>
> Yes, the funky networking codes tend to be good candidates - e.g.
> ENOPROTOOPT for something that sounds at least vaguely related.

OK.

>>>> +
>>>> +	if (iocb->ki_flags & IOCB_ATOMIC)
>>>> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>>>> +
>>>
>>> Either keep space between all the conditional calls or none. I doubt
>>> just stick to the existing style.
>>
>> Sure
>
> FYI, that "I doubt" should have been "in doubt". I was just so happy to
> finally get the mail out after a flaky connection on the train.

Thanks.
On Tue, Mar 18, 2025 at 08:42:36AM +0000, John Garry wrote:
>> I see that this is what's done in the current series now. But that feels
>> very wrong. Why do you want to deprive the user of this nice and useful
>> code if they don't have the right hardware?
>
> I don't think it's fair to say that we deprive the user - so far we just
> don't support it, and nobody has asked for atomics without HW support.

You're still keeping this nice functionality from the users.

>> Why do we limit ourselves to the hardware-supported size when we support
>> more in software?
>
> As I see it, HW offload gives fast and predictable performance.
>
> The COW method is just a (slow) fallback for when HW offload is not
> possible.
>
> If we want to allow the user to avail of atomics greater than what the
> mounted bdev supports, then we should have a method to tell the user of
> the optimised threshold. They could read the bdev atomic limits and infer
> this, but that is not a good user experience.

Yes, there should be an interface to expose that. But even without the
hardware acceleration, a guaranteed untorn write is a really nice feature
to have.
On 18/03/2025 08:46, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 08:42:36AM +0000, John Garry wrote:
>>> I see that this is what's done in the current series now. But that feels
>>> very wrong. Why do you want to deprive the user of this nice and useful
>>> code if they don't have the right hardware?
>>
>> I don't think it's fair to say that we deprive the user - so far we just
>> don't support it, and nobody has asked for atomics without HW support.
>
> You're still keeping this nice functionality from the users.
>
>>> Why do we limit ourselves to the hardware-supported size when we support
>>> more in software?
>>
>> As I see it, HW offload gives fast and predictable performance.
>>
>> The COW method is just a (slow) fallback for when HW offload is not
>> possible.
>>
>> If we want to allow the user to avail of atomics greater than what the
>> mounted bdev supports, then we should have a method to tell the user of
>> the optimised threshold. They could read the bdev atomic limits and infer
>> this, but that is not a good user experience.
>
> Yes, there should be an interface to expose that.

So we could add statx.atomic_write_unit_max_opt, which is the
optimised/fast limit, i.e. the bdev limit. Is that sane?

Note that we already have STATX_ATTR_WRITE_ATOMIC to get the atomic
limits. I wouldn't want to add a new flag just to get this new member. So
I suppose that if we added statx.atomic_write_unit_max_opt, the API would
be: if statx.atomic_write_unit_max_opt is zero, then
statx.atomic_write_unit_max is the optimised limit also.

> But even without the hardware acceleration, a guaranteed untorn write is
> a really nice feature to have.
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7a56ddb86fd2..029684b54dda 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -725,6 +725,75 @@ xfs_file_dio_write_zoned(
 	return ret;
 }
 
+/*
+ * Handle block atomic writes
+ *
+ * Two methods of atomic writes are supported:
+ * - REQ_ATOMIC-based, which would typically use some form of HW offload in the
+ *   disk
+ * - COW-based, which uses a COW fork as a staging extent for data updates
+ *   before atomically updating extent mappings for the range being written
+ *
+ * REQ_ATOMIC-based is the preferred method, and is attempted first. If this
+ * method fails due to REQ_ATOMIC-related constraints, then we retry with the
+ * COW-based method. The REQ_ATOMIC-based method typically will fail if the
+ * write spans multiple extents or the disk blocks are misaligned.
+ *
+ * Similar to xfs_file_dio_write_unaligned(), the retry mechanism is based on
+ * the ->iomap_begin method returning -EAGAIN, which would be when the
+ * REQ_ATOMIC-based write is not possible. In the case of IOCB_NOWAIT being set,
+ * then we will not retry with the COW-based method, and instead pass that
+ * error code back to the caller immediately.
+ *
+ * REQ_ATOMIC-based atomic writes behave such that a racing read which overlaps
+ * with the range being atomically written will see all or none of the old data.
+ * Emulate this behaviour for COW-based atomic writes by using
+ * IOMAP_DIO_FORCE_WAIT and inode_dio_wait() to wait for any in-flight reads to
+ * complete. This also locks out racing writes, which could trample on the COW
+ * fork extent.
+ */
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	unsigned int		iolock = XFS_IOLOCK_SHARED;
+	unsigned int		dio_flags = 0;
+	const struct iomap_ops	*dops = &xfs_direct_write_iomap_ops;
+	ssize_t			ret;
+
+retry:
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	if (ret)
+		return ret;
+
+	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
+	if (ret)
+		goto out_unlock;
+
+	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
+		inode_dio_wait(VFS_I(ip));
+
+	trace_xfs_file_direct_write(iocb, from);
+	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
+			dio_flags, NULL, 0);
+
+	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
+	    dops == &xfs_direct_write_iomap_ops) {
+		xfs_iunlock(ip, iolock);
+		dio_flags = IOMAP_DIO_FORCE_WAIT;
+		dops = &xfs_atomic_write_cow_iomap_ops;
+		iolock = XFS_IOLOCK_EXCL;
+		goto retry;
+	}
+
+out_unlock:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+	return ret;
+}
+
 /*
  * Handle block unaligned direct I/O writes
  *
@@ -840,6 +909,10 @@ xfs_file_dio_write(
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
 	if (xfs_is_zoned_inode(ip))
 		return xfs_file_dio_write_zoned(ip, iocb, from);
+
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		return xfs_file_dio_write_atomic(ip, iocb, from);
+
 	return xfs_file_dio_write_aligned(ip, iocb, from,
 			&xfs_direct_write_iomap_ops, &xfs_dio_write_ops, NULL);
 }