Message ID | 20241204154344.3034362-2-john.g.garry@oracle.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Series | large atomic writes for xfs |
On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
>
> Filesystems like ext4 can submit writes in multiples of blocksizes.
> But we still can't allow the writes to be split into multiple BIOs. Hence
> let's check if the iomap_length() is same as iter->len or not.
>
> It is the responsibility of userspace to ensure that a write does not span
> mixed unwritten and mapped extents (which would lead to multiple BIOs).

How is "userspace" supposed to do this?

No existing utility in userspace is aware of atomic write limits or rtextsize configs, so how does "userspace" ensure everything is laid out in a manner compatible with atomic writes?

e.g. restoring a backup (or other disaster recovery procedures) is going to have to lay the files out correctly for atomic writes. Backup tools often sparsify the data set and so what gets restored will not have the same layout as the original data set...

Where's the documentation that outlines all the restrictions on userspace behaviour to prevent this sort of problem being triggered? Common operations such as truncate, hole punch, buffered writes, reflinks, etc will trip over this, so application developers, users and admins really need to know what they should be doing to avoid stepping on this landmine...

Further to that, what is the correction process for users to get rid of this error? They'll need some help from an atomic write constraint aware utility that can resilver the file such that these failures do not occur again. Can xfs_fsr do this? Or maybe the new exchange-range code? Or does none of this infrastructure yet exist?

> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> jpg: tweak commit message
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..3dd883dd77d2 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	size_t copied = 0;
>  	size_t orig_count;
>
> -	if (atomic && length != fs_block_size)
> +	if (atomic && length != iter->len)
>  		return -EINVAL;

Given this is now rejecting IOs that are otherwise well formed from userspace, this situation needs to have a different errno now. The userspace application has not provided an invalid set of IO parameters for this IO - the IO has been rejected because the previously set up persistent file layout was screwed up by something in the past.

i.e. This is not an application IO submission error anymore, hence EINVAL is the wrong error to be returning to userspace here.

-Dave.
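For reference, the per-file atomic write limits Dave refers to are currently only discoverable via statx(). Below is a minimal sketch of such a query, assuming a kernel and headers that expose the STATX_WRITE_ATOMIC mask and the stx_atomic_write_unit_* fields (Linux 6.11+ uapi); adjust the includes to whatever your libc ships.

/* Sketch: report the untorn-write geometry for a file.
 * Uses the raw statx syscall plus <linux/stat.h> so the newer fields are
 * visible even where the libc wrapper has not caught up yet. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/stat.h>

int main(int argc, char **argv)
{
	struct statx stx = { 0 };
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	if (syscall(SYS_statx, fd, "", AT_EMPTY_PATH,
		    STATX_BASIC_STATS | STATX_WRITE_ATOMIC, &stx) < 0)
		return 1;

	printf("atomic write unit min/max: %u/%u bytes, max segments: %u\n",
	       stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max,
	       stx.stx_atomic_write_segments_max);
	close(fd);
	return 0;
}

A backup or restore tool would additionally need to know the allocation unit / forcealign geometry of the file in order to recreate a layout that keeps these limits usable, which is part of what this thread is debating.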
On Thu, Dec 05, 2024 at 07:35:45AM +1100, Dave Chinner wrote: > On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote: > > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> > > > > Filesystems like ext4 can submit writes in multiples of blocksizes. > > But we still can't allow the writes to be split into multiple BIOs. Hence > > let's check if the iomap_length() is same as iter->len or not. > > > > It is the responsibility of userspace to ensure that a write does not span > > mixed unwritten and mapped extents (which would lead to multiple BIOs). > > How is "userspace" supposed to do this? > > No existing utility in userspace is aware of atomic write limits or > rtextsize configs, so how does "userspace" ensure everything is > laid out in a manner compatible with atomic writes? > > e.g. restoring a backup (or other disaster recovery procedures) is > going to have to lay the files out correctly for atomic writes. > backup tools often sparsify the data set and so what gets restored > will not have the same layout as the original data set... > > Where's the documentation that outlines all the restrictions on > userspace behaviour to prevent this sort of problem being triggered? > Common operations such as truncate, hole punch, buffered writes, > reflinks, etc will trip over this, so application developers, users > and admins really need to know what they should be doing to avoid > stepping on this landmine... I'm kinda assuming that this requires forcealign to get the extent alignments correct, and writing zeroes non-atomically if the extent state gets mixed up before retrying the untorn write. John? > Further to that, what is the correction process for users to get rid > of this error? They'll need some help from an atomic write > constraint aware utility that can resilver the file such that these > failures do not occur again. Can xfs_fsr do this? Or maybe the new > exchangr-range code? Or does none of this infrastructure yet exist? TBH aside from databases doing pure overwrites to storage hardware, I think everyone else would be better served by start_commit / commit_range since it's /much/ more flexible. > > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> > > jpg: tweak commit message > > Signed-off-by: John Garry <john.g.garry@oracle.com> > > --- > > fs/iomap/direct-io.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > index b521eb15759e..3dd883dd77d2 100644 > > --- a/fs/iomap/direct-io.c > > +++ b/fs/iomap/direct-io.c > > @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, > > size_t copied = 0; > > size_t orig_count; > > > > - if (atomic && length != fs_block_size) > > + if (atomic && length != iter->len) > > return -EINVAL; > > Given this is now rejecting IOs that are otherwise well formed from > userspace, this situation needs to have a different errno now. The > userspace application has not provided an invalid set of > IO parameters for this IO - the IO has been rejected because > the previously set up persistent file layout was screwed up by > something in the past. > > i.e. This is not an application IO submission error anymore, hence > EINVAL is the wrong error to be returning to userspace here. Admittedly it would be useful to add at least a couple of new errnos for alignment issues and incorrect file storage mapping states. How difficult is it to get a new errno code added to uapi and then plumbed into glibc? 
Are new errno additions to libc still gate-kept by Ulrich or is my knowledge 15 years out of date?

--D

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
On 04/12/2024 20:35, Dave Chinner wrote: > On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote: >> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> >> >> Filesystems like ext4 can submit writes in multiples of blocksizes. >> But we still can't allow the writes to be split into multiple BIOs. Hence >> let's check if the iomap_length() is same as iter->len or not. >> >> It is the responsibility of userspace to ensure that a write does not span >> mixed unwritten and mapped extents (which would lead to multiple BIOs). > > How is "userspace" supposed to do this? If an atomic write spans mixed unwritten and mapped extents, then it should manually zero the unwritten extents beforehand. > > No existing utility in userspace is aware of atomic write limits or > rtextsize configs, so how does "userspace" ensure everything is > laid out in a manner compatible with atomic writes? > > e.g. restoring a backup (or other disaster recovery procedures) is > going to have to lay the files out correctly for atomic writes. > backup tools often sparsify the data set and so what gets restored > will not have the same layout as the original data set... I am happy to support whatever is needed to make atomic writes work over mixed extents if that is really an expected use case and it is a pain for an application writer/admin to deal with this (by manually zeroing extents). JFYI, I did originally support the extent pre-zeroing for this. That was to support a real-life scenario which we saw where we were attempting atomic writes over mixed extents. The mixed extents were coming from userspace punching holes and then attempting an atomic write over that space. However that was using an early experimental and buggy forcealign; it was buggy as it did not handle punching holes properly - it punched out single blocks and not only full alloc units. > > Where's the documentation that outlines all the restrictions on > userspace behaviour to prevent this sort of problem being triggered? I would provide a man page update. > Common operations such as truncate, hole punch, So how would punch hole be a problem? The atomic write unit max is limited by the alloc unit, and we can only punch out full alloc units. > buffered writes, > reflinks, etc will trip over this, so application developers, users > and admins really need to know what they should be doing to avoid > stepping on this landmine... If this is not a real-life scenario which we expect to see, then I don't see why we would add the complexity to the kernel for this. My motivation for atomic writes support is to support atomically writing large database internal page size. If the database only writes at a fixed internal page size, then we should not see mixed mappings. But you see potential problems elsewhere .. > > Further to that, what is the correction process for users to get rid > of this error? They'll need some help from an atomic write > constraint aware utility that can resilver the file such that these > failures do not occur again. Can xfs_fsr do this? Or maybe the new > exchangr-range code? Or does none of this infrastructure yet exist? Nothing exists yet. 
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> >> jpg: tweak commit message >> Signed-off-by: John Garry <john.g.garry@oracle.com> >> --- >> fs/iomap/direct-io.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c >> index b521eb15759e..3dd883dd77d2 100644 >> --- a/fs/iomap/direct-io.c >> +++ b/fs/iomap/direct-io.c >> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, >> size_t copied = 0; >> size_t orig_count; >> >> - if (atomic && length != fs_block_size) >> + if (atomic && length != iter->len) >> return -EINVAL; > > Given this is now rejecting IOs that are otherwise well formed from > userspace, this situation needs to have a different errno now. The > userspace application has not provided an invalid set of > IO parameters for this IO - the IO has been rejected because > the previously set up persistent file layout was screwed up by > something in the past. > > i.e. This is not an application IO submission error anymore, hence > EINVAL is the wrong error to be returning to userspace here. > Understood, but let's come back to this (if needed..). Thanks, John
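To make John's "manually zero the unwritten extents beforehand" concrete, here is a rough userspace sketch. It assumes the file is opened with O_DIRECT (as RWF_ATOMIC requires), that offset, length and buffer obey the statx atomic write and direct I/O alignment limits, and that RWF_ATOMIC is visible from the headers (it is in the Linux 6.11+ uapi); the helper name is made up for illustration.

/* Sketch only: force the target range into a single written mapping by
 * writing real zeroes (fallocate FALLOC_FL_ZERO_RANGE is not enough, as
 * it leaves the blocks unwritten), then issue the untorn overwrite. */
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>

static int prezero_then_atomic_write(int fd, const void *buf,
				     size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	void *zeroes;
	ssize_t ret;

	/* phase 1: a plain (non-atomic) zeroing write converts any
	 * unwritten extents in the range to written extents */
	if (posix_memalign(&zeroes, 4096, len))
		return -1;
	memset(zeroes, 0, len);
	ret = pwrite(fd, zeroes, len, off);
	free(zeroes);
	if (ret != (ssize_t)len)
		return -1;

	/* phase 2: the range is now backed by a single mapping, so the
	 * filesystem can build the untorn write as a single bio */
	if (pwritev2(fd, &iov, 1, off, RWF_ATOMIC) != (ssize_t)len)
		return -1;
	return 0;
}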
On 05/12/2024 06:30, Darrick J. Wong wrote: > On Thu, Dec 05, 2024 at 07:35:45AM +1100, Dave Chinner wrote: >> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote: >>> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> >>> >>> Filesystems like ext4 can submit writes in multiples of blocksizes. >>> But we still can't allow the writes to be split into multiple BIOs. Hence >>> let's check if the iomap_length() is same as iter->len or not. >>> >>> It is the responsibility of userspace to ensure that a write does not span >>> mixed unwritten and mapped extents (which would lead to multiple BIOs). >> >> How is "userspace" supposed to do this? >> >> No existing utility in userspace is aware of atomic write limits or >> rtextsize configs, so how does "userspace" ensure everything is >> laid out in a manner compatible with atomic writes? >> >> e.g. restoring a backup (or other disaster recovery procedures) is >> going to have to lay the files out correctly for atomic writes. >> backup tools often sparsify the data set and so what gets restored >> will not have the same layout as the original data set... >> >> Where's the documentation that outlines all the restrictions on >> userspace behaviour to prevent this sort of problem being triggered? >> Common operations such as truncate, hole punch, buffered writes, >> reflinks, etc will trip over this, so application developers, users >> and admins really need to know what they should be doing to avoid >> stepping on this landmine... > > I'm kinda assuming that this requires forcealign to get the extent > alignments correct, and writing zeroes non-atomically if the extent > state gets mixed up before retrying the untorn write. John? Sure, the code to do the automatic pre-zeroing and retry the atomic write is not super complicated. It's just a matter or whether we add it or not. Thanks, John
On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote: > On 04/12/2024 20:35, Dave Chinner wrote: > > On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote: > > > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> > > > > > > Filesystems like ext4 can submit writes in multiples of blocksizes. > > > But we still can't allow the writes to be split into multiple BIOs. Hence > > > let's check if the iomap_length() is same as iter->len or not. > > > > > > It is the responsibility of userspace to ensure that a write does not span > > > mixed unwritten and mapped extents (which would lead to multiple BIOs). > > > > How is "userspace" supposed to do this? > > If an atomic write spans mixed unwritten and mapped extents, then it should > manually zero the unwritten extents beforehand. > > > > > No existing utility in userspace is aware of atomic write limits or > > rtextsize configs, so how does "userspace" ensure everything is > > laid out in a manner compatible with atomic writes? > > > > e.g. restoring a backup (or other disaster recovery procedures) is > > going to have to lay the files out correctly for atomic writes. > > backup tools often sparsify the data set and so what gets restored > > will not have the same layout as the original data set... > > I am happy to support whatever is needed to make atomic writes work over > mixed extents if that is really an expected use case and it is a pain for an > application writer/admin to deal with this (by manually zeroing extents). > > JFYI, I did originally support the extent pre-zeroing for this. That was to > support a real-life scenario which we saw where we were attempting atomic > writes over mixed extents. The mixed extents were coming from userspace > punching holes and then attempting an atomic write over that space. However > that was using an early experimental and buggy forcealign; it was buggy as > it did not handle punching holes properly - it punched out single blocks and > not only full alloc units. > > > > > Where's the documentation that outlines all the restrictions on > > userspace behaviour to prevent this sort of problem being triggered? > > I would provide a man page update. I think, at this point, we need an better way of documenting all the atomic write stuff in one place. Not just the user interface and what is expected of userspace, but also all the things the filesystems need to do to ensure atomic writes work correctly. I was thinking that a document somewhere in the Documentation/ directory, rather than random pieces of information splattered across random man pages would be a much better way of explaining all this. Don't get me wrong - man pages explaining the programmatic API are necessary, but there's a whole lot more to understanding and making effective use of atomic writes than what has been added to the man pages so far. > > Common operations such as truncate, hole punch, > > So how would punch hole be a problem? The atomic write unit max is limited > by the alloc unit, and we can only punch out full alloc units. I was under the impression that this was a feature of the force-align code, not a feature of atomic writes. i.e. force-align is what ensures the BMBT aligns correctly with the underlying extents. Or did I miss the fact that some of the force-align semantics bleed back into the original atomic write patch set? 
> > buffered writes,
> > reflinks, etc will trip over this, so application developers, users
> > and admins really need to know what they should be doing to avoid
> > stepping on this landmine...
>
> If this is not a real-life scenario which we expect to see, then I don't see
> why we would add the complexity to the kernel for this.

I gave you one above - restoring a data set as a result of disaster recovery.

> My motivation for atomic writes support is to support atomically writing
> large database internal page size. If the database only writes at a fixed
> internal page size, then we should not see mixed mappings.

Yup, that's the problem here. Once atomic writes are supported by the kernel and userspace, all sorts of applications are going to start using them in all sorts of ways you didn't think of.

> But you see potential problems elsewhere ..

That's my job as a senior engineer with 20+ years of experience in filesystems and storage related applications. I see far because I stand on the shoulders of giants - I don't try to be a giant myself.

Other people become giants by implementing ground-breaking features (e.g. like atomic writes), but without the people who can see far enough ahead just adding features ends up with an incoherent mess of special interest niche features rather than a neatly integrated set of widely usable generic features.

e.g. look at MySQL's use of fallocate(hole punch) for transparent data compression - nobody had foreseen that hole punching would be used like this, but it's a massive win for the applications which store bulk compressible data in the database even though it does bad things to the filesystem.

Spend some time looking outside the proprietary database application box and think a little harder about the implications of atomic write functionality. i.e. what happens when we have ubiquitous support for guaranteeing only the old or the new data will be seen after a crash *without the need for using fsync*.

Think about the implications of that for a minute - for any full file overwrite up to the hardware atomic limits, we won't need fsync to guarantee the integrity of overwritten data anymore. We only need a mechanism to flush the journal and device caches once all the data has been written (e.g. syncfs)...

Want to overwrite a bunch of small files safely? Atomic write the new data, then syncfs(). There's no need to run fdatasync after each write to ensure individual files are not corrupted if we crash in the middle of the operation. Indeed, atomic writes actually provide better overwrite integrity semantics than fdatasync as it will be all or nothing. fdatasync does not provide that guarantee if we crash during the fdatasync operation.

Further, with COW data filesystems like XFS, btrfs and bcachefs, we can emulate atomic writes for any size larger than what the hardware supports.

At this point we actually provide app developers with what they've been repeatedly asking kernel filesystem engineers to provide them for the past 20 years: a way of overwriting arbitrary file data safely without needing an expensive fdatasync operation on every file that gets modified.

Put simply: atomic writes have a huge potential to fundamentally change the way applications interact with Linux filesystems and to make it *much* simpler for applications to safely overwrite user data. Hence there is an imperative here to make the foundational support for this technology solid and robust because atomic writes are going to be with us for the next few decades...

-Dave.
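As a concrete illustration of the "overwrite a bunch of small files, then syncfs()" pattern described above, here is a hedged sketch. It assumes each file's new contents fit within the advertised stx_atomic_write_unit_max, that buffers and offsets meet the O_DIRECT alignment rules, and that RWF_ATOMIC is available from the headers.

/* Sketch: a batch of all-or-nothing overwrites with one flush at the end
 * instead of an fdatasync per file. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

static int overwrite_batch(const char *const paths[],
			   const struct iovec bufs[], int nr_files)
{
	int i, fd, ret;

	for (i = 0; i < nr_files; i++) {
		fd = open(paths[i], O_WRONLY | O_DIRECT);
		if (fd < 0)
			return -1;
		/* untorn overwrite: after a crash we see either the old
		 * or the new file contents, never a mix */
		ret = pwritev2(fd, &bufs[i], 1, 0, RWF_ATOMIC);
		close(fd);
		if (ret < 0)
			return -1;
	}

	/* one journal/device cache flush for the whole update set;
	 * any fd on the same filesystem will do */
	fd = open(paths[0], O_RDONLY);
	if (fd < 0)
		return -1;
	ret = syncfs(fd);
	close(fd);
	return ret;
}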
>>> Where's the documentation that outlines all the restrictions on >>> userspace behaviour to prevent this sort of problem being triggered? >> >> I would provide a man page update. > > I think, at this point, we need an better way of documenting all the > atomic write stuff in one place. Not just the user interface and > what is expected of userspace, but also all the things the > filesystems need to do to ensure atomic writes work correctly. I was > thinking that a document somewhere in the Documentation/ directory, > rather than random pieces of information splattered across random man pages > would be a much better way of explaining all this. > > Don't get me wrong - man pages explaining the programmatic API are > necessary, but there's a whole lot more to understanding and making > effective use of atomic writes than what has been added to the man > pages so far. Sure, maybe that would be useful. I think that the final piece of the jigsaw is large atomic write support, and then any kernel documentation can be further considered. > >>> Common operations such as truncate, hole punch, >> >> So how would punch hole be a problem? The atomic write unit max is limited >> by the alloc unit, and we can only punch out full alloc units. > > I was under the impression that this was a feature of the > force-align code, not a feature of atomic writes. i.e. force-align > is what ensures the BMBT aligns correctly with the underlying > extents. > > Or did I miss the fact that some of the force-align semantics bleed > back into the original atomic write patch set? Not really. As I mentioned, if we can only punch out a full allocation unit and atomic write unit max is limited by the allocation unit size, then punching out a hole should not create a new range of mixed extents that we can legally attempt to atomic write. > >>> buffered writes, >>> reflinks, etc will trip over this, so application developers, users >>> and admins really need to know what they should be doing to avoid >>> stepping on this landmine... >> >> If this is not a real-life scenario which we expect to see, then I don't see >> why we would add the complexity to the kernel for this. > > I gave you one above - restoring a data set as a result of disaster > recovery. ack > >> My motivation for atomic writes support is to support atomically writing >> large database internal page size. If the database only writes at a fixed >> internal page size, then we should not see mixed mappings. > > Yup, that's the problem here. Once atomic writes are supported by > the kernel and userspace, all sorts of applications are going to > start using them for in all sorts of ways you didn't think of. > >> But you see potential problems elsewhere .. > > That's my job as a senior engineer with 20+ years of experience in > filesystems and storage related applications. I see far because I > stand on the shoulders of giants - I don't try to be a giant myself. > > Other people become giants by implementing ground-breaking features > (e.g. like atomic writes), but without the people who can see far > enough ahead just adding features ends up with an incoherent mess of > special interest niche features rather than a neatly integrated set > of widely usable generic features. yes > > e.g. 
look at MySQL's use of fallocate(hole punch) for transparent > data compression - nobody had forseen that hole punching would be > used like this, but it's a massive win for the applications which > store bulk compressible data in the database even though it does bad > things to the filesystem. > > Spend some time looking outside the proprietary database application > box and think a little harder about the implications of atomic write > functionality. i.e. what happens when we have ubiquitous support > for guaranteeing only the old or the new data will be seen after > a crash *without the need for using fsync*. > > Think about the implications of that for a minute - for any full > file overwrite up to the hardware atomic limits, we won't need fsync > to guarantee the integrity of overwritten data anymore. We only need > a mechanism to flush the journal and device caches once all the data > has been written (e.g. syncfs)... > > Want to overwrite a bunch of small files safely? Atomic write the > new data, then syncfs(). There's no need to run fdatasync after each > write to ensure individual files are not corrupted if we crash in > the middle of the operation. Indeed, atomic writes actually provide > better overwrite integrity semantics that fdatasync as it will be > all or nothing. fdatasync does not provide that guarantee if we > crash during the fdatasync operation. > > Further, with COW data filesystems like XFS, btrfs and bcachefs, we > can emulate atomic writes for any size larger than what the hardware > supports. > > At this point we actually provide app developers with what they've > been repeatedly asking kernel filesystem engineers to provide them > for the past 20 years: a way of overwriting arbitrary file data > safely without needing an expensive fdatasync operation on every > file that gets modified. Understood, you see that there are many applications of atomic writes beyond the scope of DBs. > > Put simply: atomic writes have a huge potential to fundamentally > change the way applications interact with Linux filesystems and to > make it *much* simpler for applications to safely overwrite user > data. Hence there is an imperitive here to make the foundational > support for this technology solid and robust because atomic writes > are going to be with us for the next few decades... > Thanks for going further in describing the possible use cases. Now let's talk again about the implementation of kernel extent zeroing for atomic writes. Firstly I will mention the obvious and that is we so far can automatically atomically write a single FS block and there was no FS_XFLAG_ATOMICWRITES flag introduced for enabling this. Furthermore, the range of data does not need to be in mapped state. Then we need to consider how to decide to do the extent zeroing for the following scenarios for atomic writes: a. For forcealign, we can decide to always zero the full alloc unit, same as [0]. So that is good for atomic writes. But that does involve further work to zero extents for buffered IO. b. For rtvol without forcealign, we will not start to always zero the full alloc unit as that would be major regression in performance. c. For rtvol with forcealign, we can do the same as a. I can suggest 2x options for solving b: 1. Introduce FS_XFLAG_LARGE_ATOMICWRITES to control whether we do the same as a. 2. Introduce a method to pre-zero extents for atomic writes only Option 2. 
would work like this:
- when we find that the atomic write covers a mix of unwritten and mapped extent mappings, we have 2x phases of the atomic write:
  - phase 1. will pre-zero the unwritten extents and update the extent mappings
  - phase 2. will retry the atomic write, and we should find a single mapping

Option 2. could also be leveraged for a. and c., above.

Please let me know your thoughts on this.

[0] https://lore.kernel.org/linux-xfs/20240607143919.2622319-3-john.g.garry@oracle.com/

John
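For readers following along, option 2 above amounts to the following two-phase flow. This is illustrative pseudocode only, not the posted patches; the helper names (covers_mixed_extents(), zero_unwritten_subranges(), submit_atomic_dio()) are hypothetical.

/* Pseudocode sketch of option 2: pre-zero, then retry the untorn write. */
ssize_t atomic_write_with_prezero(struct file *file, struct iov_iter *from,
				  loff_t pos)
{
	size_t len = iov_iter_count(from);

	if (covers_mixed_extents(file, pos, len)) {
		/* phase 1: non-atomic zeroing of the unwritten pieces so
		 * the range collapses into one contiguous written mapping */
		int err = zero_unwritten_subranges(file, pos, len);

		if (err)
			return err;
	}

	/* phase 2: retry; with a single mapping the whole iter is covered
	 * by one iomap and a single REQ_ATOMIC bio can be built */
	return submit_atomic_dio(file, from, pos);
}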
On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote: > On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote: > > On 04/12/2024 20:35, Dave Chinner wrote: > > > On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote: > > > > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> > > > > > > > > Filesystems like ext4 can submit writes in multiples of blocksizes. > > > > But we still can't allow the writes to be split into multiple BIOs. Hence > > > > let's check if the iomap_length() is same as iter->len or not. > > > > > > > > It is the responsibility of userspace to ensure that a write does not span > > > > mixed unwritten and mapped extents (which would lead to multiple BIOs). > > > > > > How is "userspace" supposed to do this? > > > > If an atomic write spans mixed unwritten and mapped extents, then it should > > manually zero the unwritten extents beforehand. > > > > > > > > No existing utility in userspace is aware of atomic write limits or > > > rtextsize configs, so how does "userspace" ensure everything is > > > laid out in a manner compatible with atomic writes? > > > > > > e.g. restoring a backup (or other disaster recovery procedures) is > > > going to have to lay the files out correctly for atomic writes. > > > backup tools often sparsify the data set and so what gets restored > > > will not have the same layout as the original data set... > > > > I am happy to support whatever is needed to make atomic writes work over > > mixed extents if that is really an expected use case and it is a pain for an > > application writer/admin to deal with this (by manually zeroing extents). > > > > JFYI, I did originally support the extent pre-zeroing for this. That was to > > support a real-life scenario which we saw where we were attempting atomic > > writes over mixed extents. The mixed extents were coming from userspace > > punching holes and then attempting an atomic write over that space. However > > that was using an early experimental and buggy forcealign; it was buggy as > > it did not handle punching holes properly - it punched out single blocks and > > not only full alloc units. > > > > > > > > Where's the documentation that outlines all the restrictions on > > > userspace behaviour to prevent this sort of problem being triggered? > > > > I would provide a man page update. > > I think, at this point, we need an better way of documenting all the > atomic write stuff in one place. Not just the user interface and > what is expected of userspace, but also all the things the > filesystems need to do to ensure atomic writes work correctly. I was > thinking that a document somewhere in the Documentation/ directory, > rather than random pieces of information splattered across random man pages > would be a much better way of explaining all this. > > Don't get me wrong - man pages explaining the programmatic API are > necessary, but there's a whole lot more to understanding and making > effective use of atomic writes than what has been added to the man > pages so far. > > > > Common operations such as truncate, hole punch, > > > > So how would punch hole be a problem? The atomic write unit max is limited > > by the alloc unit, and we can only punch out full alloc units. > > I was under the impression that this was a feature of the > force-align code, not a feature of atomic writes. i.e. force-align > is what ensures the BMBT aligns correctly with the underlying > extents. 
> > Or did I miss the fact that some of the force-align semantics bleed > back into the original atomic write patch set? > > > > buffered writes, > > > reflinks, etc will trip over this, so application developers, users > > > and admins really need to know what they should be doing to avoid > > > stepping on this landmine... > > > > If this is not a real-life scenario which we expect to see, then I don't see > > why we would add the complexity to the kernel for this. > > I gave you one above - restoring a data set as a result of disaster > recovery. > > > My motivation for atomic writes support is to support atomically writing > > large database internal page size. If the database only writes at a fixed > > internal page size, then we should not see mixed mappings. > > Yup, that's the problem here. Once atomic writes are supported by > the kernel and userspace, all sorts of applications are going to > start using them for in all sorts of ways you didn't think of. > > > But you see potential problems elsewhere .. > > That's my job as a senior engineer with 20+ years of experience in > filesystems and storage related applications. I see far because I > stand on the shoulders of giants - I don't try to be a giant myself. > > Other people become giants by implementing ground-breaking features > (e.g. like atomic writes), but without the people who can see far > enough ahead just adding features ends up with an incoherent mess of > special interest niche features rather than a neatly integrated set > of widely usable generic features. > > e.g. look at MySQL's use of fallocate(hole punch) for transparent > data compression - nobody had forseen that hole punching would be > used like this, but it's a massive win for the applications which > store bulk compressible data in the database even though it does bad > things to the filesystem. > > Spend some time looking outside the proprietary database application > box and think a little harder about the implications of atomic write > functionality. i.e. what happens when we have ubiquitous support > for guaranteeing only the old or the new data will be seen after > a crash *without the need for using fsync*. IOWs, the program either wants an old version or a new version of the files that it wrote, and the commit boundary is syncfs() after updating all the files? > Think about the implications of that for a minute - for any full > file overwrite up to the hardware atomic limits, we won't need fsync > to guarantee the integrity of overwritten data anymore. We only need > a mechanism to flush the journal and device caches once all the data > has been written (e.g. syncfs)... "up to the hardware atomic limits" -- that's a big limitation. What if I need to write 256K but the device only supports up to 64k? RWF_ATOMIC won't work. Or what if the file range I want to dirty isn't aligned with the atomic write alignment? What if the awu geometry changes online due to a device change, how do programs detect that? Programs that aren't 100% block-based should use exchange-range. There are no alignment restrictions, no limits on the size you can exchange, no file mapping state requiments to trip over, and you can update arbitrary sparse ranges. As long as you don't tell exchange-range to flush the log itself, programs can use syncfs to amortize the log and cache flush across a bunch of file content exchanges. 
Even better, if you still wanted to use untorn block writes to persist the temporary file's dirty data to disk, you don't even need forcealign because the exchange-range will take care of restarting the operation during log recovery. I don't know that there's much point in doing that but the idea is there. > Want to overwrite a bunch of small files safely? Atomic write the > new data, then syncfs(). There's no need to run fdatasync after each > write to ensure individual files are not corrupted if we crash in > the middle of the operation. Indeed, atomic writes actually provide > better overwrite integrity semantics that fdatasync as it will be > all or nothing. fdatasync does not provide that guarantee if we > crash during the fdatasync operation. > > Further, with COW data filesystems like XFS, btrfs and bcachefs, we > can emulate atomic writes for any size larger than what the hardware > supports. > > At this point we actually provide app developers with what they've > been repeatedly asking kernel filesystem engineers to provide them > for the past 20 years: a way of overwriting arbitrary file data > safely without needing an expensive fdatasync operation on every > file that gets modified. > > Put simply: atomic writes have a huge potential to fundamentally > change the way applications interact with Linux filesystems and to > make it *much* simpler for applications to safely overwrite user > data. Hence there is an imperitive here to make the foundational > support for this technology solid and robust because atomic writes > are going to be with us for the next few decades... I agree that we need to make the interface solid and robust, but I don't agree that the current RWF_ATOMIC, with its block-oriented storage device quirks is the way to go here. Maybe a byte-oriented RWF_ATOMIC would work, but the only way I can think of to do that is (say) someone implements Christoph's suggestion to change the COW code to allow multiple writes to a staging extent, and only commit the remapping operations at sync time... and you'd still have problems if you have to do multiple remappings if there's not also a way to restart the ioend chains. Exchange-range already solved all of that, and it's already merged. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >
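For comparison with the RWF_ATOMIC path, here is a rough sketch of the exchange-range flow Darrick describes, assuming the XFS_IOC_EXCHANGE_RANGE ioctl and struct xfs_exchange_range layout as merged in the XFS uapi header (check your installed headers for the exact include path and field names). The temporary file must live on the same filesystem as the destination.

/* Sketch: stage the new bytes in a temp file, then exchange them into place.
 * No alignment or extent-state constraints apply, and omitting the DSYNC
 * flag lets a later syncfs() amortise the flush across many exchanges. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>		/* struct xfs_exchange_range (assumed path) */

static int exchange_in_new_contents(int dest_fd, const char *tmpdir,
				    const void *buf, size_t len, off_t off)
{
	struct xfs_exchange_range xchg;
	int tmp_fd = open(tmpdir, O_TMPFILE | O_RDWR, 0600);

	if (tmp_fd < 0)
		return -1;
	/* write the replacement bytes off to the side first */
	if (pwrite(tmp_fd, buf, len, off) != (ssize_t)len)
		goto fail;

	memset(&xchg, 0, sizeof(xchg));
	xchg.file1_fd = tmp_fd;		/* donor file holding the new data */
	xchg.file1_offset = off;
	xchg.file2_offset = off;	/* range to replace in dest_fd */
	xchg.length = len;
	xchg.flags = 0;			/* no XFS_EXCHANGE_RANGE_DSYNC */
	if (ioctl(dest_fd, XFS_IOC_EXCHANGE_RANGE, &xchg) < 0)
		goto fail;
	close(tmp_fd);
	return 0;
fail:
	close(tmp_fd);
	return -1;
}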
On 05/12/2024 21:15, Dave Chinner wrote: > On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote: >> On 04/12/2024 20:35, Dave Chinner wrote: >>> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote: >>>> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> >>>> >>>> Filesystems like ext4 can submit writes in multiples of blocksizes. >>>> But we still can't allow the writes to be split into multiple BIOs. Hence >>>> let's check if the iomap_length() is same as iter->len or not. >>>> >>>> It is the responsibility of userspace to ensure that a write does not span >>>> mixed unwritten and mapped extents (which would lead to multiple BIOs). >>> >>> How is "userspace" supposed to do this? >> >> If an atomic write spans mixed unwritten and mapped extents, then it should >> manually zero the unwritten extents beforehand. >> >>> >>> No existing utility in userspace is aware of atomic write limits or >>> rtextsize configs, so how does "userspace" ensure everything is >>> laid out in a manner compatible with atomic writes? >>> >>> e.g. restoring a backup (or other disaster recovery procedures) is >>> going to have to lay the files out correctly for atomic writes. >>> backup tools often sparsify the data set and so what gets restored >>> will not have the same layout as the original data set... >> >> I am happy to support whatever is needed to make atomic writes work over >> mixed extents if that is really an expected use case and it is a pain for an >> application writer/admin to deal with this (by manually zeroing extents). >> >> JFYI, I did originally support the extent pre-zeroing for this. That was to >> support a real-life scenario which we saw where we were attempting atomic >> writes over mixed extents. The mixed extents were coming from userspace >> punching holes and then attempting an atomic write over that space. However >> that was using an early experimental and buggy forcealign; it was buggy as >> it did not handle punching holes properly - it punched out single blocks and >> not only full alloc units. >> >>> >>> Where's the documentation that outlines all the restrictions on >>> userspace behaviour to prevent this sort of problem being triggered? >> >> I would provide a man page update. > > I think, at this point, we need an better way of documenting all the > atomic write stuff in one place. Not just the user interface and > what is expected of userspace, but also all the things the > filesystems need to do to ensure atomic writes work correctly. I was > thinking that a document somewhere in the Documentation/ directory, > rather than random pieces of information splattered across random man pages > would be a much better way of explaining all this. > > Don't get me wrong - man pages explaining the programmatic API are > necessary, but there's a whole lot more to understanding and making > effective use of atomic writes than what has been added to the man > pages so far. > >>> Common operations such as truncate, hole punch, >> >> So how would punch hole be a problem? The atomic write unit max is limited >> by the alloc unit, and we can only punch out full alloc units. > > I was under the impression that this was a feature of the > force-align code, not a feature of atomic writes. i.e. force-align > is what ensures the BMBT aligns correctly with the underlying > extents. > > Or did I miss the fact that some of the force-align semantics bleed > back into the original atomic write patch set? 
> >>> buffered writes, >>> reflinks, etc will trip over this, so application developers, users >>> and admins really need to know what they should be doing to avoid >>> stepping on this landmine... >> >> If this is not a real-life scenario which we expect to see, then I don't see >> why we would add the complexity to the kernel for this. > > I gave you one above - restoring a data set as a result of disaster > recovery. > >> My motivation for atomic writes support is to support atomically writing >> large database internal page size. If the database only writes at a fixed >> internal page size, then we should not see mixed mappings. > > Yup, that's the problem here. Once atomic writes are supported by > the kernel and userspace, all sorts of applications are going to > start using them for in all sorts of ways you didn't think of. > >> But you see potential problems elsewhere .. > > That's my job as a senior engineer with 20+ years of experience in > filesystems and storage related applications. I see far because I > stand on the shoulders of giants - I don't try to be a giant myself. > > Other people become giants by implementing ground-breaking features > (e.g. like atomic writes), but without the people who can see far > enough ahead just adding features ends up with an incoherent mess of > special interest niche features rather than a neatly integrated set > of widely usable generic features. > > e.g. look at MySQL's use of fallocate(hole punch) for transparent > data compression - nobody had forseen that hole punching would be > used like this, but it's a massive win for the applications which > store bulk compressible data in the database even though it does bad > things to the filesystem. > > Spend some time looking outside the proprietary database application > box and think a little harder about the implications of atomic write > functionality. i.e. what happens when we have ubiquitous support > for guaranteeing only the old or the new data will be seen after > a crash *without the need for using fsync*. > > Think about the implications of that for a minute - for any full > file overwrite up to the hardware atomic limits, we won't need fsync > to guarantee the integrity of overwritten data anymore. We only need > a mechanism to flush the journal and device caches once all the data > has been written (e.g. syncfs)... > > Want to overwrite a bunch of small files safely? Atomic write the > new data, then syncfs(). There's no need to run fdatasync after each > write to ensure individual files are not corrupted if we crash in > the middle of the operation. Indeed, atomic writes actually provide > better overwrite integrity semantics that fdatasync as it will be > all or nothing. fdatasync does not provide that guarantee if we > crash during the fdatasync operation. > > Further, with COW data filesystems like XFS, btrfs and bcachefs, we > can emulate atomic writes for any size larger than what the hardware > supports. > > At this point we actually provide app developers with what they've > been repeatedly asking kernel filesystem engineers to provide them > for the past 20 years: a way of overwriting arbitrary file data > safely without needing an expensive fdatasync operation on every > file that gets modified. > > Put simply: atomic writes have a huge potential to fundamentally > change the way applications interact with Linux filesystems and to > make it *much* simpler for applications to safely overwrite user > data. 
Hence there is an imperitive here to make the foundational > support for this technology solid and robust because atomic writes > are going to be with us for the next few decades... >

Dave,

I provided a proposal to solve this issue in https://lore.kernel.org/lkml/20241210125737.786928-3-john.g.garry@oracle.com/ (there is also a v3, which is much the same), but I can't make progress, as there is no agreement upon how this should be implemented, if at all. Any input there would be appreciated...

Cheers
On Wed, Dec 11, 2024 at 05:34:33PM -0800, Darrick J. Wong wrote: > On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote: > > On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote: > > e.g. look at MySQL's use of fallocate(hole punch) for transparent > > data compression - nobody had forseen that hole punching would be > > used like this, but it's a massive win for the applications which > > store bulk compressible data in the database even though it does bad > > things to the filesystem. > > > > Spend some time looking outside the proprietary database application > > box and think a little harder about the implications of atomic write > > functionality. i.e. what happens when we have ubiquitous support > > for guaranteeing only the old or the new data will be seen after > > a crash *without the need for using fsync*. > > IOWs, the program either wants an old version or a new version of the > files that it wrote, and the commit boundary is syncfs() after updating > all the files? Yes, though there isn't a need for syncfs() to guarantee old-or-new. That's the sort of thing an application can choose to do at the end of it's update set... > > Think about the implications of that for a minute - for any full > > file overwrite up to the hardware atomic limits, we won't need fsync > > to guarantee the integrity of overwritten data anymore. We only need > > a mechanism to flush the journal and device caches once all the data > > has been written (e.g. syncfs)... > > "up to the hardware atomic limits" -- that's a big limitation. What if > I need to write 256K but the device only supports up to 64k? RWF_ATOMIC > won't work. Or what if the file range I want to dirty isn't aligned > with the atomic write alignment? What if the awu geometry changes > online due to a device change, how do programs detect that? If awu geometry changes dynamically in an incompatible way, then filesystem RWF_ATOMIC alignment guarantees are fundamentally broken. This is not a problem the filesystem can solve. IMO, RAID device hotplug should reject new device replacement that has incompatible atomic write support with the existing device set. With that constraint, the whole mess of "awu can randomly change" problems go away. > Programs that aren't 100% block-based should use exchange-range. There > are no alignment restrictions, no limits on the size you can exchange, > no file mapping state requiments to trip over, and you can update > arbitrary sparse ranges. As long as you don't tell exchange-range to > flush the log itself, programs can use syncfs to amortize the log and > cache flush across a bunch of file content exchanges. Right - that's kinda my point - I was assuming that we'd be using something like xchg-range as the "unaligned slow path" for RWF_ATOMIC. i.e. RWF_ATOMIC as implemented by a COW capable filesystem should always be able to succeed regardless of IO alignment. In these situations, the REQ_ATOMIC block layer offload to the hardware is a fast path that is enabled when the user IO and filesystem extent alignment matches the constraints needed to do a hardware atomic write. In all other cases, we implement RWF_ATOMIC something like always-cow or prealloc-beyond-eof-then-xchg-range-on-io-completion for anything that doesn't correctly align to hardware REQ_ATOMIC. That said, there is nothing that prevents us from first implementing RWF_ATOMIC constraints as "must match hardware requirements exactly" and then relaxing them to be less stringent as filesystems implementations improve. 
We've relaxed the direct IO hardware alignment constraints multiple times over the years, so there's nothing that really prevents us from doing so with RWF_ATOMIC, either. Especially as we have statx to tell the application exactly what alignment will get fast hardware offloads... > Even better, if you still wanted to use untorn block writes to persist > the temporary file's dirty data to disk, you don't even need forcealign > because the exchange-range will take care of restarting the operation > during log recovery. I don't know that there's much point in doing that > but the idea is there. *nod* > > Want to overwrite a bunch of small files safely? Atomic write the > > new data, then syncfs(). There's no need to run fdatasync after each > > write to ensure individual files are not corrupted if we crash in > > the middle of the operation. Indeed, atomic writes actually provide > > better overwrite integrity semantics that fdatasync as it will be > > all or nothing. fdatasync does not provide that guarantee if we > > crash during the fdatasync operation. > > > > Further, with COW data filesystems like XFS, btrfs and bcachefs, we > > can emulate atomic writes for any size larger than what the hardware > > supports. > > > > At this point we actually provide app developers with what they've > > been repeatedly asking kernel filesystem engineers to provide them > > for the past 20 years: a way of overwriting arbitrary file data > > safely without needing an expensive fdatasync operation on every > > file that gets modified. > > > > Put simply: atomic writes have a huge potential to fundamentally > > change the way applications interact with Linux filesystems and to > > make it *much* simpler for applications to safely overwrite user > > data. Hence there is an imperitive here to make the foundational > > support for this technology solid and robust because atomic writes > > are going to be with us for the next few decades... > > I agree that we need to make the interface solid and robust, but I don't > agree that the current RWF_ATOMIC, with its block-oriented storage > device quirks is the way to go here. > Maybe a byte-oriented RWF_ATOMIC > would work, but the only way I can think of to do that is (say) someone > implements Christoph's suggestion to change the COW code to allow > multiple writes to a staging extent, and only commit the remapping > operations at sync time... and you'd still have problems if you have to > do multiple remappings if there's not also a way to restart the ioend > chains. > > Exchange-range already solved all of that, and it's already merged. Yes, I agree that the block-device quirks need to go away from RWF_ATOMIC, but I think it's the right interface for applications that want to use atomic overwrite semantics. Hiding exchange-range under the XFS covers for unaligned atomic IO would mean applications won't need to target XFS specific ioctls to do reliable atomic overwrites. i.e. the API really needs to be simple and filesystem independent, and RWF_ATOMIC gives us that... -Dave.
On Mon, Jan 13, 2025 at 09:35:01PM +0000, John Garry wrote: > Dave, > > I provided an proposal to solve this issue in > https://lore.kernel.org/lkml/20241210125737.786928-3-john.g.garry@oracle.com/ > (there is also a v3, which is much the same. > > but I can't make progress, as there is no agreement upon how this should be > implemented, if at all. Any input there would be appreciated... I've been on PTO since mid december. I'm trying to catch up now. -Dave.
On Tue, Jan 14, 2025 at 03:41:13PM +1100, Dave Chinner wrote: > On Wed, Dec 11, 2024 at 05:34:33PM -0800, Darrick J. Wong wrote: > > On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote: > > > On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote: > > > e.g. look at MySQL's use of fallocate(hole punch) for transparent > > > data compression - nobody had forseen that hole punching would be > > > used like this, but it's a massive win for the applications which > > > store bulk compressible data in the database even though it does bad > > > things to the filesystem. > > > > > > Spend some time looking outside the proprietary database application > > > box and think a little harder about the implications of atomic write > > > functionality. i.e. what happens when we have ubiquitous support > > > for guaranteeing only the old or the new data will be seen after > > > a crash *without the need for using fsync*. > > > > IOWs, the program either wants an old version or a new version of the > > files that it wrote, and the commit boundary is syncfs() after updating > > all the files? > > Yes, though there isn't a need for syncfs() to guarantee old-or-new. > That's the sort of thing an application can choose to do at the end > of it's update set... Well yes, there has to be a caches flush somewhere -- last I checked, RWF_ATOMIC doesn't require that the written data be persisted after the call completes. > > > Think about the implications of that for a minute - for any full > > > file overwrite up to the hardware atomic limits, we won't need fsync > > > to guarantee the integrity of overwritten data anymore. We only need > > > a mechanism to flush the journal and device caches once all the data > > > has been written (e.g. syncfs)... > > > > "up to the hardware atomic limits" -- that's a big limitation. What if > > I need to write 256K but the device only supports up to 64k? RWF_ATOMIC > > won't work. Or what if the file range I want to dirty isn't aligned > > with the atomic write alignment? What if the awu geometry changes > > online due to a device change, how do programs detect that? > > If awu geometry changes dynamically in an incompatible way, then > filesystem RWF_ATOMIC alignment guarantees are fundamentally broken. > This is not a problem the filesystem can solve. > > IMO, RAID device hotplug should reject new device replacement that > has incompatible atomic write support with the existing device set. > With that constraint, the whole mess of "awu can randomly change" > problems go away. Assuming device mapper is subject to that too, I agree. > > Programs that aren't 100% block-based should use exchange-range. There > > are no alignment restrictions, no limits on the size you can exchange, > > no file mapping state requiments to trip over, and you can update > > arbitrary sparse ranges. As long as you don't tell exchange-range to > > flush the log itself, programs can use syncfs to amortize the log and > > cache flush across a bunch of file content exchanges. > > Right - that's kinda my point - I was assuming that we'd be using > something like xchg-range as the "unaligned slow path" for > RWF_ATOMIC. > > i.e. RWF_ATOMIC as implemented by a COW capable filesystem should > always be able to succeed regardless of IO alignment. In these > situations, the REQ_ATOMIC block layer offload to the hardware is a > fast path that is enabled when the user IO and filesystem extent > alignment matches the constraints needed to do a hardware atomic > write. 
> > In all other cases, we implement RWF_ATOMIC something like > always-cow or prealloc-beyond-eof-then-xchg-range-on-io-completion > for anything that doesn't correctly align to hardware REQ_ATOMIC. > > That said, there is nothing that prevents us from first implementing > RWF_ATOMIC constraints as "must match hardware requirements exactly" > and then relaxing them to be less stringent as filesystems > implementations improve. We've relaxed the direct IO hardware > alignment constraints multiple times over the years, so there's > nothing that really prevents us from doing so with RWF_ATOMIC, > either. Especially as we have statx to tell the application exactly > what alignment will get fast hardware offloads... Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct write that's correctly aligned and targets a single mapping in the correct state, we can build the untorn bio and submit it. For everything else, prealloc some post EOF blocks, write them there, and exchange-range them. Tricky questions: How do we avoid collisions between overlapping writes? I guess we find a free file range at the top of the file that is long enough to stage the write, and put it there? And purge it later? Also, does this imply that the maximum file size is less than the usual 8EB? (There's also the question about how to do this with buffered writes, but I guess we could skip that for now.) > > Even better, if you still wanted to use untorn block writes to persist > > the temporary file's dirty data to disk, you don't even need forcealign > > because the exchange-range will take care of restarting the operation > > during log recovery. I don't know that there's much point in doing that > > but the idea is there. > > *nod* > > > > Want to overwrite a bunch of small files safely? Atomic write the > > > new data, then syncfs(). There's no need to run fdatasync after each > > > write to ensure individual files are not corrupted if we crash in > > > the middle of the operation. Indeed, atomic writes actually provide > > > better overwrite integrity semantics that fdatasync as it will be > > > all or nothing. fdatasync does not provide that guarantee if we > > > crash during the fdatasync operation. > > > > > > Further, with COW data filesystems like XFS, btrfs and bcachefs, we > > > can emulate atomic writes for any size larger than what the hardware > > > supports. > > > > > > At this point we actually provide app developers with what they've > > > been repeatedly asking kernel filesystem engineers to provide them > > > for the past 20 years: a way of overwriting arbitrary file data > > > safely without needing an expensive fdatasync operation on every > > > file that gets modified. > > > > > > Put simply: atomic writes have a huge potential to fundamentally > > > change the way applications interact with Linux filesystems and to > > > make it *much* simpler for applications to safely overwrite user > > > data. Hence there is an imperitive here to make the foundational > > > support for this technology solid and robust because atomic writes > > > are going to be with us for the next few decades... > > > > I agree that we need to make the interface solid and robust, but I don't > > agree that the current RWF_ATOMIC, with its block-oriented storage > > device quirks is the way to go here. 
> > > Maybe a byte-oriented RWF_ATOMIC > > would work, but the only way I can think of to do that is (say) someone > > implements Christoph's suggestion to change the COW code to allow > > multiple writes to a staging extent, and only commit the remapping > > operations at sync time... and you'd still have problems if you have to > > do multiple remappings if there's not also a way to restart the ioend > > chains. > > > > Exchange-range already solved all of that, and it's already merged. > > Yes, I agree that the block-device quirks need to go away from > RWF_ATOMIC, but I think it's the right interface for applications > that want to use atomic overwrite semantics. Ok. > Hiding exchange-range under the XFS covers for unaligned atomic IO > would mean applications won't need to target XFS specific ioctls to > do reliable atomic overwrites. i.e. the API really needs to be > simple and filesystem independent, and RWF_ATOMIC gives us that... <nod> --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >
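For readers following the statx discussion above, here is a minimal userspace sketch of querying the untorn-write geometry that the thread refers to. It assumes kernel and libc headers that carry the statx atomic-write uapi (STATX_WRITE_ATOMIC, STATX_ATTR_WRITE_ATOMIC and the stx_atomic_write_unit_* fields); the file path, O_DIRECT usage and error handling are purely illustrative.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Sketch (not part of the patches in this thread): report the atomic
 * write unit min/max and maximum segment count that statx() exposes
 * for an open file.
 */
int main(void)
{
	struct statx stx;
	int fd = open("mnt/file", O_RDWR | O_DIRECT);	/* illustrative path */

	if (fd < 0 ||
	    statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) < 0)
		return 1;

	if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
		printf("awu min %u max %u, max segments %u\n",
		       stx.stx_atomic_write_unit_min,
		       stx.stx_atomic_write_unit_max,
		       stx.stx_atomic_write_segments_max);
	else
		printf("untorn writes not reported for this file\n");
	return 0;
}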
On 14/01/2025 23:57, Darrick J. Wong wrote: > On Tue, Jan 14, 2025 at 03:41:13PM +1100, Dave Chinner wrote: >> On Wed, Dec 11, 2024 at 05:34:33PM -0800, Darrick J. Wong wrote: >>> On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote: >>>> On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote: >>>> e.g. look at MySQL's use of fallocate(hole punch) for transparent >>>> data compression - nobody had foreseen that hole punching would be >>>> used like this, but it's a massive win for the applications which >>>> store bulk compressible data in the database even though it does bad >>>> things to the filesystem. >>>> >>>> Spend some time looking outside the proprietary database application >>>> box and think a little harder about the implications of atomic write >>>> functionality. i.e. what happens when we have ubiquitous support >>>> for guaranteeing only the old or the new data will be seen after >>>> a crash *without the need for using fsync*. >>> >>> IOWs, the program either wants an old version or a new version of the >>> files that it wrote, and the commit boundary is syncfs() after updating >>> all the files? >> >> Yes, though there isn't a need for syncfs() to guarantee old-or-new. >> That's the sort of thing an application can choose to do at the end >> of its update set... > > Well yes, there has to be a cache flush somewhere -- last I checked, > RWF_ATOMIC doesn't require that the written data be persisted after the > call completes. Correct, RWF_ATOMIC | RWF_SYNC is required to guarantee persistence > >>>> Think about the implications of that for a minute - for any full >>>> file overwrite up to the hardware atomic limits, we won't need fsync >>>> to guarantee the integrity of overwritten data anymore. We only need >>>> a mechanism to flush the journal and device caches once all the data >>>> has been written (e.g. syncfs)... >>> >>> "up to the hardware atomic limits" -- that's a big limitation. What if >>> I need to write 256K but the device only supports up to 64k? RWF_ATOMIC >>> won't work. Or what if the file range I want to dirty isn't aligned >>> with the atomic write alignment? What if the awu geometry changes >>> online due to a device change, how do programs detect that? >> >> If awu geometry changes dynamically in an incompatible way, then >> filesystem RWF_ATOMIC alignment guarantees are fundamentally broken. >> This is not a problem the filesystem can solve. >> >> IMO, RAID device hotplug should reject new device replacement that >> has incompatible atomic write support with the existing device set. >> With that constraint, the whole mess of "awu can randomly change" >> problems go away. > > Assuming device mapper is subject to that too, I agree. If a device is added to an md raid array which does not support atomic writes, then atomic writes are disabled (for the block device). I need to verify that hotplug behaves like this. And dm does behave like this also, i.e. atomic writes are disabled for the dm block device. > >>> Programs that aren't 100% block-based should use exchange-range. There >>> are no alignment restrictions, no limits on the size you can exchange, >>> no file mapping state requirements to trip over, and you can update >>> arbitrary sparse ranges. As long as you don't tell exchange-range to >>> flush the log itself, programs can use syncfs to amortize the log and >>> cache flush across a bunch of file content exchanges.
>> Right - that's kinda my point - I was assuming that we'd be using >> something like xchg-range as the "unaligned slow path" for >> RWF_ATOMIC. >> >> i.e. RWF_ATOMIC as implemented by a COW capable filesystem should >> always be able to succeed regardless of IO alignment. In these >> situations, the REQ_ATOMIC block layer offload to the hardware is a >> fast path that is enabled when the user IO and filesystem extent >> alignment matches the constraints needed to do a hardware atomic >> write. >> >> In all other cases, we implement RWF_ATOMIC something like >> always-cow or prealloc-beyond-eof-then-xchg-range-on-io-completion >> for anything that doesn't correctly align to hardware REQ_ATOMIC. >> >> That said, there is nothing that prevents us from first implementing >> RWF_ATOMIC constraints as "must match hardware requirements exactly" >> and then relaxing them to be less stringent as filesystems >> implementations improve. We've relaxed the direct IO hardware >> alignment constraints multiple times over the years, so there's >> nothing that really prevents us from doing so with RWF_ATOMIC, >> either. Especially as we have statx to tell the application exactly >> what alignment will get fast hardware offloads... > Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct > write that's correctly aligned and targets a single mapping in the > correct state, we can build the untorn bio and submit it. For > everything else, prealloc some post EOF blocks, write them there, and > exchange-range them. > That makes my life easier ... today, anyway. For RWF_ATOMIC, our targeted users will want guaranteed performance, so they would really need to know about anything which is doing software-based atomic writes behind the scenes. JFYI, I did rework the zeroing code to leverage what we already have in iomap, and it looks better to me: https://github.com/johnpgarry/linux/commits/atomic-write-large-atomics-v6.13-v4/ There is a problem with atomic writes over EOF, but that can be solved. > Tricky questions: How do we avoid collisions between overlapping writes? > I guess we find a free file range at the top of the file that is long > enough to stage the write, and put it there? And purge it later? > > Also, does this imply that the maximum file size is less than the usual > 8EB? > > (There's also the question about how to do this with buffered writes, > but I guess we could skip that for now.) > >>> Even better, if you still wanted to use untorn block writes to persist >>> the temporary file's dirty data to disk, you don't even need forcealign >>> because the exchange-range will take care of restarting the operation >>> during log recovery. I don't know that there's much point in doing that >>> but the idea is there. >> >> *nod*
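To make the persistence distinction above concrete, here is a hedged sketch of the two submission patterns being contrasted: RWF_ATOMIC alone (untorn, durability deferred to a later syncfs()/fdatasync() over the whole update set) versus RWF_ATOMIC | RWF_SYNC (untorn and durable before the call returns). The helper name is made up for illustration; the fd is assumed to be O_DIRECT with an offset/length within the advertised atomic write unit limits.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>

/*
 * Sketch: an untorn overwrite. RWF_ATOMIC may need to be pulled from
 * <linux/fs.h> on userspace headers that do not yet define it.
 */
static ssize_t atomic_overwrite(int fd, void *buf, size_t len, off_t off,
				int durable)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, off,
			RWF_ATOMIC | (durable ? RWF_SYNC : 0));
}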
On Tue, Jan 14, 2025 at 03:57:26PM -0800, Darrick J. Wong wrote: > Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct > write that's correctly aligned and targets a single mapping in the > correct state, we can build the untorn bio and submit it. For > everything else, prealloc some post EOF blocks, write them there, and > exchange-range them. > > Tricky questions: How do we avoid collisions between overlapping writes? > I guess we find a free file range at the top of the file that is long > enough to stage the write, and put it there? And purge it later? > > Also, does this imply that the maximum file size is less than the usual > 8EB? I think literally using the exchrange code for anything but an initial prototype is a bad idea for the above reasons. If we go beyond proving this is possible you'd want a version of exchrange where the exchange partner is not a file mapping, but a cow staging record.
On 14/01/2025 23:57, Darrick J. Wong wrote: >> i.e. RWF_ATOMIC as implemented by a COW capable filesystem should >> always be able to succeed regardless of IO alignment. In these >> situations, the REQ_ATOMIC block layer offload to the hardware is a >> fast path that is enabled when the user IO and filesystem extent >> alignment matches the constraints needed to do a hardware atomic >> write. >> >> In all other cases, we implement RWF_ATOMIC something like >> always-cow or prealloc-beyond-eof-then-xchg-range-on-io-completion >> for anything that doesn't correctly align to hardware REQ_ATOMIC. >> >> That said, there is nothing that prevents us from first implementing >> RWF_ATOMIC constraints as "must match hardware requirements exactly" >> and then relaxing them to be less stringent as filesystems >> implementations improve. We've relaxed the direct IO hardware >> alignment constraints multiple times over the years, so there's >> nothing that really prevents us from doing so with RWF_ATOMIC, >> either. Especially as we have statx to tell the application exactly >> what alignment will get fast hardware offloads... > Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct > write that's correctly aligned and targets a single mapping in the > correct state, we can build the untorn bio and submit it. For > everything else, prealloc some post EOF blocks, write them there, and > exchange-range them. I have some doubt about this, but I may be misunderstanding the concept: So is there any guarantee that what we write into is aligned (after the exchange-range routine)? If not, surely every subsequent write with RWF_ATOMIC to that logical range will require this exchange-range routine until we get something aligned (and correct granularity) - correct? I know that getting unaligned blocks continuously is unlikely, unless a heavily fragmented disk. However, databases prefer guaranteed performance (which HW offload gives). We can use extszhint to hint at granularity, but that does not help with alignment (AFAIK). > > Tricky questions: How do we avoid collisions between overlapping writes? > I guess we find a free file range at the top of the file that is long > enough to stage the write, and put it there? And purge it later? > > Also, does this imply that the maximum file size is less than the usual > 8EB? > > (There's also the question about how to do this with buffered writes, > but I guess we could skip that for now.)
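Since the extent size hint (extszhint) comes up here, a short sketch of setting it programmatically may help; it is the same operation as xfs_io's "extsize" command. The 64k value is illustrative, and, as John notes, this only hints at allocation granularity; it does not by itself make allocations start on an aligned block.

#include <linux/fs.h>
#include <sys/ioctl.h>

/*
 * Sketch: set an extent size allocation hint on an open file.
 * This influences allocation granularity only; the allocator is still
 * free to return extents whose start block is not aligned to the hint.
 */
static int set_extsize_hint(int fd, unsigned int bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_extsize = bytes;		/* e.g. 65536 for a 64k hint */
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}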
On Fri, Jan 17, 2025 at 10:26:34AM +0000, John Garry wrote: > On 14/01/2025 23:57, Darrick J. Wong wrote: > > > i.e. RWF_ATOMIC as implemented by a COW capable filesystem should > > > always be able to succeed regardless of IO alignment. In these > > > situations, the REQ_ATOMIC block layer offload to the hardware is a > > > fast path that is enabled when the user IO and filesystem extent > > > alignment matches the constraints needed to do a hardware atomic > > > write. > > > > > > In all other cases, we implement RWF_ATOMIC something like > > > always-cow or prealloc-beyond-eof-then-xchg-range-on-io-completion > > > for anything that doesn't correctly align to hardware REQ_ATOMIC. > > > > > > That said, there is nothing that prevents us from first implementing > > > RWF_ATOMIC constraints as "must match hardware requirements exactly" > > > and then relaxing them to be less stringent as filesystems > > > implementations improve. We've relaxed the direct IO hardware > > > alignment constraints multiple times over the years, so there's > > > nothing that really prevents us from doing so with RWF_ATOMIC, > > > either. Especially as we have statx to tell the application exactly > > > what alignment will get fast hardware offloads... > > Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct > > write that's correctly aligned and targets a single mapping in the > > correct state, we can build the untorn bio and submit it. For > > everything else, prealloc some post EOF blocks, write them there, and > > exchange-range them. > > I have some doubt about this, but I may be misunderstanding the concept: > > So is there any guarantee that what we write into is aligned (after the > exchange-range routine)? If not, surely every subsequent write with > RWF_ATOMIC to that logical range will require this exchange-range routine > until we get something aligned (and correct granularity) - correct? Correct, you'd still need forcealign to make sure that the new allocations for exchange-range are aligned to awumin. --D > I know that getting unaligned blocks continuously is unlikely, unless a > heavily fragmented disk. However, databases prefer guaranteed performance > (which HW offload gives). > > We can use extszhint to hint at granularity, but that does not help with > alignment (AFAIK). > > > > > Tricky questions: How do we avoid collisions between overlapping writes? > > I guess we find a free file range at the top of the file that is long > > enough to stage the write, and put it there? And purge it later? > > > > Also, does this imply that the maximum file size is less than the usual > > 8EB? > > > > (There's also the question about how to do this with buffered writes, > > but I guess we could skip that for now.) > >
On Thu, Jan 16, 2025 at 07:52:25AM +0100, Christoph Hellwig wrote: > On Tue, Jan 14, 2025 at 03:57:26PM -0800, Darrick J. Wong wrote: > > Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct > > write that's correctly aligned and targets a single mapping in the > > correct state, we can build the untorn bio and submit it. For > > everything else, prealloc some post EOF blocks, write them there, and > > exchange-range them. > > > > Tricky questions: How do we avoid collisions between overlapping writes? > > I guess we find a free file range at the top of the file that is long > > enough to stage the write, and put it there? And purge it later? > > > > Also, does this imply that the maximum file size is less than the usual > > 8EB? > > I think literally using the exchrange code for anything but an > initial prototype is a bad idea for the above reasons. If we go > beyond proving this is possible you'd want a version of exchrange > where the exchange partner is not a file mapping, but a cow staging > record. The trouble is that the br_startoff attribute of cow staging mappings isn't persisted on disk anywhere, which is why exchange-range can't handle the cow fork. You could open an O_TMPFILE and swap between the two files, though that gets expensive per-io unless you're willing to stash that temp file somewhere. At this point I think we should slap the usual EXPERIMENTAL warning on atomic writes through xfs and let John land the simplest multi-fsblock untorn write support, which only handles the corner case where all the stars are <cough> aligned; and then make an exchange-range prototype and/or all the other forcealign stuff. (Lifting in smaller pieces sounds a lot better than having John carry around an increasingly large patchset...) --D
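As a point of reference for the O_TMPFILE-and-swap idea, this is a sketch of what the already-merged exchange-range interface looks like from userspace, assuming a new enough xfs_fs.h uapi header; the helper name, descriptors and zero flags are illustrative assumptions and this is the userspace-visible analogue, not what a kernel-internal fallback would actually do.

#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>		/* struct xfs_exchange_range, XFS_IOC_EXCHANGE_RANGE */

/*
 * Sketch: atomically exchange [off, off+len) between a staging file
 * (tmp_fd, "file1") and the target file the ioctl is issued against
 * ("file2"). Durability flags and error handling are omitted; a caller
 * wanting persistence would pair this with fsync()/syncfs() as
 * discussed elsewhere in the thread.
 */
static int commit_staged_range(int target_fd, int tmp_fd, __u64 off, __u64 len)
{
	struct xfs_exchange_range xchg = {
		.file1_fd	= tmp_fd,
		.file1_offset	= off,
		.file2_offset	= off,
		.length		= len,
	};

	return ioctl(target_fd, XFS_IOC_EXCHANGE_RANGE, &xchg);
}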
On 17/01/2025 18:29, Darrick J. Wong wrote: >>> Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct >>> write that's correctly aligned and targets a single mapping in the >>> correct state, we can build the untorn bio and submit it. For >>> everything else, prealloc some post EOF blocks, write them there, and >>> exchange-range them. >> I have some doubt about this, but I may be misunderstanding the concept: >> >> So is there any guarantee that what we write into is aligned (after the >> exchange-range routine)? If not, surely every subsequent write with >> RWF_ATOMIC to that logical range will require this exchange-range routine >> until we get something aligned (and correct granularity) - correct? > Correct, you'd still need forcealign to make sure that the new > allocations for exchange-range are aligned to awumin. Sure, but I think that if we align sb_agblocks to the bdev atomic write limits at mkfs time and also set extszhint appropriately, then the probability of getting the desired alignment and granularity is appreciably higher. Thanks, John
On Fri, Jan 17, 2025 at 10:49:34AM -0800, Darrick J. Wong wrote: > The trouble is that the br_startoff attribute of cow staging mappings > isn't persisted on disk anywhere, which is why exchange-range can't > handle the cow fork. You could open an O_TMPFILE and swap between the > two files, though that gets expensive per-io unless you're willing to > stash that temp file somewhere. Needing another inode is better than trying to steal ranges from the actual inode we're operating on. But we might just need a different kind of COW staging for that. > > At this point I think we should slap the usual EXPERIMENTAL warning on > atomic writes through xfs and let John land the simplest multi-fsblock > untorn write support, which only handles the corner case where all the > stars are <cough> aligned; and then make an exchange-range prototype > and/or all the other forcealign stuff. That is the worst of all possible outcomes. Coming up with an atomic API that fails for random reasons only on aged file systems is literally the worst thing we can do. NAK.
On 22/01/2025 06:42, Christoph Hellwig wrote: > On Fri, Jan 17, 2025 at 10:49:34AM -0800, Darrick J. Wong wrote: >> The trouble is that the br_startoff attribute of cow staging mappings >> isn't persisted on disk anywhere, which is why exchange-range can't >> handle the cow fork. You could open an O_TMPFILE and swap between the >> two files, though that gets expensive per-io unless you're willing to >> stash that temp file somewhere. > > Needing another inode is better than trying to steal ranges from the > actual inode we're operating on. But we might just need a different > kind of COW staging for that. > >> >> At this point I think we should slap the usual EXPERIMENTAL warning on >> atomic writes through xfs and let John land the simplest multi-fsblock >> untorn write support, which only handles the corner case where all the >> stars are <cough> aligned; and then make an exchange-range prototype >> and/or all the other forcealign stuff. > > That is the worst of all possible outcomes. Coming up with an > atomic API that fails for random reasons only on aged file systems > is literally the worst thing we can do. NAK. > > I did my own quick PoC to use CoW for misaligned blocks atomic writes fallback. I am finding that the block allocator is often giving misaligned blocks wrt atomic write length, like this:

# xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
#
#
#xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
#xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..127]:        20672..20799      0 (20672..20799)     128 000000
   1: [128..20479]:    320..20671        0 (320..20671)     20352 000000
#
#xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
#xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..127]:        20928..21055      0 (20928..21055)     128 000000
   1: [128..20479]:    320..20671        0 (320..20671)     20352 000000

In this case we would not use HW offload (as no start blocks are 64K-aligned), which will affect performance. Since we are not considering forcealign ATM, can we still consider some other alignment hint to the block allocator? It could be similar to how stripe alignment is handled.

Some other thoughts:
- I am not sure what atomic write unit max we would now use.
- Anything written back with CoW/exchange range will need FUA to ensure that the write is fully persisted. Otherwise I think that not using FUA could mean that the data is reported written by the disk but may only be partially persisted after a power failure later.
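A quick sanity check of the numbers above, as a sketch: xfs_bmap -v prints BLOCK-RANGE in 512-byte basic blocks, so a 64k atomic write unit corresponds to 128 of them; any device or partition offset is ignored here for simplicity.

#include <stdio.h>

/*
 * Sketch: is an extent start from "xfs_bmap -v" (512-byte basic blocks)
 * aligned to a given atomic write unit? For the output above, 20672 and
 * 20928 are both 64 basic blocks past a 64k boundary, so neither write
 * could have been issued as a single hardware REQ_ATOMIC bio.
 */
static int bmap_start_aligned(unsigned long long startblock,
			      unsigned int awu_bytes)
{
	return (startblock % (awu_bytes / 512)) == 0;
}

int main(void)
{
	printf("%d %d\n", bmap_start_aligned(20672, 65536),
	       bmap_start_aligned(20928, 65536));	/* prints "0 0" */
	return 0;
}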
On Tue, Jan 14, 2025 at 03:57:26PM -0800, Darrick J. Wong wrote: > On Tue, Jan 14, 2025 at 03:41:13PM +1100, Dave Chinner wrote: > > On Wed, Dec 11, 2024 at 05:34:33PM -0800, Darrick J. Wong wrote: > > > On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote: > > > > On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote: > Tricky questions: How do we avoid collisions between overlapping writes? > I guess we find a free file range at the top of the file that is long > enough to stage the write, and put it there? And purge it later? Use xfs_bmap_last_offset() to find the last used block in the file, then locate the block we are operating on beyond that. If done under the ILOCK_EXCL, then this can't race against other attempts to create a post-EOF extent for an atomic write swap, and it's safe against other atomic writes in progress. i.e. we don't really need anything special to create a temporary post-EOF extent. > Also, does this imply that the maximum file size is less than the usual > 8EB? Maybe. Depends on implementation details, I think. I think the BMBT on-disk record format can handle offsets beyond 8EiB, so the question remains how we stage the data for the IO into that extent before we swap it over. The mapping tree index is actually unsigned; we are limited to 8EiB by loff_t being signed, not by XFS or the mapping tree being unable to index beyond 8EiB safely... > (There's also the question about how to do this with buffered writes, > but I guess we could skip that for now.) yup, that's kinda what I meant by "implementation details"... -Dave.
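For readers unfamiliar with the XFS internals being referenced, a rough kernel-side sketch of the lookup Dave describes; only xfs_bmap_last_offset() and the fork/lock constants are existing XFS interfaces, while the helper name and the rounding policy are assumptions, and the caller is assumed to hold XFS_ILOCK_EXCL across this lookup and the subsequent allocation.

/*
 * Sketch only: choose a staging file offset beyond the last block mapped
 * in the data fork, so a post-EOF extent can be allocated there for an
 * atomic write swap. Holding XFS_ILOCK_EXCL is what prevents two
 * in-flight atomic writes from picking the same staging range.
 */
static int
xfs_atomic_write_pick_staging(
	struct xfs_inode	*ip,
	xfs_extlen_t		awu_fsb,	/* atomic write unit, fsblocks */
	xfs_fileoff_t		*staging_fsb)
{
	xfs_fileoff_t		last_fsb;
	int			error;

	error = xfs_bmap_last_offset(ip, &last_fsb, XFS_DATA_FORK);
	if (error)
		return error;

	/* place the staging extent just past everything already mapped */
	*staging_fsb = roundup_64(last_fsb, awu_fsb);
	return 0;
}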
On Wed, Jan 22, 2025 at 10:45:34AM +0000, John Garry wrote: > On 22/01/2025 06:42, Christoph Hellwig wrote: > > On Fri, Jan 17, 2025 at 10:49:34AM -0800, Darrick J. Wong wrote: > > > The trouble is that the br_startoff attribute of cow staging mappings > > > isn't persisted on disk anywhere, which is why exchange-range can't > > > handle the cow fork. You could open an O_TMPFILE and swap between the > > > two files, though that gets expensive per-io unless you're willing to > > > stash that temp file somewhere. > > > > Needing another inode is better than trying to steal ranges from the > > actual inode we're operating on. But we might just need a different > > kind of COW staging for that. > > > > > > > > At this point I think we should slap the usual EXPERIMENTAL warning on > > > atomic writes through xfs and let John land the simplest multi-fsblock > > > untorn write support, which only handles the corner case where all the > > > stars are <cough> aligned; and then make an exchange-range prototype > > > and/or all the other forcealign stuff. > > > > That is the worst of all possible outcomes. Coming up with an > > atomic API that fails for random reasons only on aged file systems > > is literally the worst thing we can do. NAK. > > > > > > I did my own quick PoC to use CoW for misaligned blocks atomic writes > fallback. > > I am finding that the block allocator is often giving misaligned blocks wrt > atomic write length, like this: Of course - I'm pretty sure this needs force-align to ensure that the large allocated extent is aligned to file offset and hardware atomic write alignment constraints.... > Since we are not considering forcealign ATM, can we still consider some > other alignment hint to the block allocator? It could be similar to how > stripe alignment is handled. Perhaps we should finish off the remaining bits needed to make force-align work everywhere before going any further? > Some other thoughts: > - I am not sure what atomic write unit max we would now use. What statx exposes should be the size/alignment for hardware offload to take place (i.e. no change), regardless of what the filesystem can do software offloads for. i.e. like statx->stx_blksize is the "preferred block size for efficient IO", the atomic write unit information is the "preferred atomic write size and alignment for efficient IO", not the maximum sizes supported... > - Anything written back with CoW/exchange range will need FUA to ensure that > the write is fully persisted. I don't think so. The journal commit for the exchange range operation will issue a cache flush before the journal IO is submitted. That will make the new data stable before the first xchgrange transaction becomes stable. Hence we get the correct data/metadata ordering on stable storage simply by doing the exchange-range operation at data IO completion. This is the same data/metadata ordering semantics that unwritten extent conversion is based on.... -Dave.
On 22/01/2025 23:51, Dave Chinner wrote: >> I did my own quick PoC to use CoW for misaligned blocks atomic writes >> fallback. >> >> I am finding that the block allocator is often giving misaligned blocks wrt >> atomic write length, like this: > Of course - I'm pretty sure this needs force-align to ensure that > the large allocated extent is aligned to file offset and hardware > atomic write alignment constraints.... > >> Since we are not considering forcealign ATM, can we still consider some >> other alignment hint to the block allocator? It could be similar to how >> stripe alignment is handled. > Perhaps we should finish off the remaining bits needed to make > force-align work everywhere before going any further? forcealign implementation is just about finished from my PoV, but needs more review. However, there has been push back on that feature - https://lore.kernel.org/linux-xfs/20240923120715.GA13585@lst.de/ So we will try this PoC for the unaligned software-emulated fallback, see how it looks - especially in terms of performance - and go from there. > >> Some other thoughts: >> - I am not sure what atomic write unit max we would now use. > What statx exposes should be the size/alignment for hardware offload > to take place (i.e. no change) The user could get that from statx on the block device on which we are mounted, if that is not too inconvenient. >, regardless of what the filesystem > can do software offloads for. i.e. like statx->stx_blksize is the > "preferred block size for efficient IO", the atomic write unit > information is the "preferred atomic write size and alignment for > efficient IO", not the maximum sizes supported... It's already documented that an atomic write which exceeds the unit max will be rejected. I don't like the idea of relaxing the API to "an atomic write which exceeds unit max may be rejected". Indeed, in that case, the user may still want to know the non-optimal unit max. > >> - Anything written back with CoW/exchange range will need FUA to ensure that >> the write is fully persisted. > I don't think so. The journal commit for the exchange range > operation will issue a cache flush before the journal IO is > submitted. That will make the new data stable before the first > xchgrange transaction becomes stable. > > Hence we get the correct data/metadata ordering on stable storage > simply by doing the exchange-range operation at data IO completion. > This is the same data/metadata ordering semantics that unwritten extent > conversion is based on.... I am not sure if we will use exchange range, but we need such behavior (in terms of persistence) described for whatever we do. Thanks, John
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b521eb15759e..3dd883dd77d2 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if (atomic && length != fs_block_size)
+	if (atomic && length != iter->len)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||