[1/4] iomap: Lift blocksize restriction on atomic writes

Message ID 20241204154344.3034362-2-john.g.garry@oracle.com
State Not Applicable, archived
Series: large atomic writes for xfs

Commit Message

John Garry Dec. 4, 2024, 3:43 p.m. UTC
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

Filesystems like ext4 can submit writes in multiples of the blocksize,
but we still can't allow the writes to be split into multiple BIOs. Hence
let's check whether the iomap_length() is the same as iter->len.

It is the responsibility of userspace to ensure that a write does not span
mixed unwritten and mapped extents (which would lead to multiple BIOs).

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
jpg: tweak commit message
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Dave Chinner Dec. 4, 2024, 8:35 p.m. UTC | #1
On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> 
> Filesystems like ext4 can submit writes in multiples of the blocksize,
> but we still can't allow the writes to be split into multiple BIOs. Hence
> let's check whether the iomap_length() is the same as iter->len.
> 
> It is the responsibility of userspace to ensure that a write does not span
> mixed unwritten and mapped extents (which would lead to multiple BIOs).

How is "userspace" supposed to do this?

No existing utility in userspace is aware of atomic write limits or
rtextsize configs, so how does "userspace" ensure everything is
laid out in a manner compatible with atomic writes?

e.g. restoring a backup (or other disaster recovery procedures) is
going to have to lay the files out correctly for atomic writes.
backup tools often sparsify the data set and so what gets restored
will not have the same layout as the original data set...

Where's the documentation that outlines all the restrictions on
userspace behaviour to prevent this sort of problem being triggered?
Common operations such as truncate, hole punch, buffered writes,
reflinks, etc will trip over this, so application developers, users
and admins really need to know what they should be doing to avoid
stepping on this landmine...

Further to that, what is the correction process for users to get rid
of this error? They'll need some help from an atomic write
constraint aware utility that can resilver the file such that these
failures do not occur again. Can xfs_fsr do this? Or maybe the new
exchange-range code? Or does none of this infrastructure yet exist?

> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> jpg: tweak commit message
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..3dd883dd77d2 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	size_t copied = 0;
>  	size_t orig_count;
>  
> -	if (atomic && length != fs_block_size)
> +	if (atomic && length != iter->len)
>  		return -EINVAL;

Given this is now rejecting IOs that are otherwise well formed from
userspace, this situation needs to have a different errno now. The
userspace application has not provided an invalid set of
IO parameters for this IO - the IO has been rejected because
the previously set up persistent file layout was screwed up by
something in the past.

i.e. This is not an application IO submission error anymore, hence
EINVAL is the wrong error to be returning to userspace here.

-Dave.
Darrick J. Wong Dec. 5, 2024, 6:30 a.m. UTC | #2
On Thu, Dec 05, 2024 at 07:35:45AM +1100, Dave Chinner wrote:
> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
> > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> > 
> > Filesystems like ext4 can submit writes in multiples of the blocksize,
> > but we still can't allow the writes to be split into multiple BIOs. Hence
> > let's check whether the iomap_length() is the same as iter->len.
> > 
> > It is the responsibility of userspace to ensure that a write does not span
> > mixed unwritten and mapped extents (which would lead to multiple BIOs).
> 
> How is "userspace" supposed to do this?
> 
> No existing utility in userspace is aware of atomic write limits or
> rtextsize configs, so how does "userspace" ensure everything is
> laid out in a manner compatible with atomic writes?
> 
> e.g. restoring a backup (or other disaster recovery procedures) is
> going to have to lay the files out correctly for atomic writes.
> backup tools often sparsify the data set and so what gets restored
> will not have the same layout as the original data set...
> 
> Where's the documentation that outlines all the restrictions on
> userspace behaviour to prevent this sort of problem being triggered?
> Common operations such as truncate, hole punch, buffered writes,
> reflinks, etc will trip over this, so application developers, users
> and admins really need to know what they should be doing to avoid
> stepping on this landmine...

I'm kinda assuming that this requires forcealign to get the extent
alignments correct, and writing zeroes non-atomically if the extent
state gets mixed up before retrying the untorn write.  John?

> Further to that, what is the correction process for users to get rid
> of this error? They'll need some help from an atomic write
> constraint aware utility that can resilver the file such that these
> failures do not occur again. Can xfs_fsr do this? Or maybe the new
> exchange-range code? Or does none of this infrastructure yet exist?

TBH aside from databases doing pure overwrites to storage hardware, I
think everyone else would be better served by start_commit /
commit_range since it's /much/ more flexible.

> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > jpg: tweak commit message
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > ---
> >  fs/iomap/direct-io.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index b521eb15759e..3dd883dd77d2 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >  	size_t copied = 0;
> >  	size_t orig_count;
> >  
> > -	if (atomic && length != fs_block_size)
> > +	if (atomic && length != iter->len)
> >  		return -EINVAL;
> 
> Given this is now rejecting IOs that are otherwise well formed from
> userspace, this situation needs to have a different errno now. The
> userspace application has not provided an invalid set of
> IO parameters for this IO - the IO has been rejected because
> the previously set up persistent file layout was screwed up by
> something in the past.
> 
> i.e. This is not an application IO submission error anymore, hence
> EINVAL is the wrong error to be returning to userspace here.

Admittedly it would be useful to add at least a couple of new errnos for
alignment issues and incorrect file storage mapping states.  How
difficult is it to get a new errno code added to uapi and then plumbed
into glibc?  Are new errno additions to libc still gate-kept by Ulrich
or is my knowledge 15 years out of date?

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
John Garry Dec. 5, 2024, 10:52 a.m. UTC | #3
On 04/12/2024 20:35, Dave Chinner wrote:
> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
>> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
>>
>> Filesystems like ext4 can submit writes in multiples of the blocksize,
>> but we still can't allow the writes to be split into multiple BIOs. Hence
>> let's check whether the iomap_length() is the same as iter->len.
>>
>> It is the responsibility of userspace to ensure that a write does not span
>> mixed unwritten and mapped extents (which would lead to multiple BIOs).
> 
> How is "userspace" supposed to do this?

If an atomic write spans mixed unwritten and mapped extents, then the
application should manually zero the unwritten extents beforehand.
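
For illustration, a minimal userspace sketch of that manual pre-zeroing
(sketch only; error handling trimmed; assumes the RWF_ATOMIC flag from the
Linux 6.11 uapi and that len is within the file's atomic write limits):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* pwritev2() flag, Linux 6.11 uapi */
#endif

/*
 * Collapse the target range to a single mapped extent by writing real
 * zeroes (an unwritten->written conversion), then issue the untorn
 * write. Real code must also mind O_DIRECT alignment rules and the
 * statx() atomic write unit limits.
 */
static ssize_t zero_then_atomic_write(int fd, const void *buf, size_t len,
				      off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	char *zeroes = calloc(1, len);
	ssize_t ret = -1;

	if (!zeroes)
		return -1;
	if (pwrite(fd, zeroes, len, off) == (ssize_t)len && !fdatasync(fd))
		ret = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
	free(zeroes);
	return ret;
}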

> 
> No existing utility in userspace is aware of atomic write limits or
> rtextsize configs, so how does "userspace" ensure everything is
> laid out in a manner compatible with atomic writes?
> 
> e.g. restoring a backup (or other disaster recovery procedures) is
> going to have to lay the files out correctly for atomic writes.
> backup tools often sparsify the data set and so what gets restored
> will not have the same layout as the original data set...

I am happy to support whatever is needed to make atomic writes work over
mixed extents if that is really an expected use case and it is a pain 
for an application writer/admin to deal with this (by manually zeroing 
extents).
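
For what it's worth, the problematic layout is also detectable up front
with FIEMAP; a minimal sketch (the helper name is hypothetical):

#include <linux/fiemap.h>
#include <linux/fs.h>
#include <sys/ioctl.h>

/*
 * Return 1 if [off, off + len) mixes unwritten and written extents -
 * the layout the iomap check above rejects - 0 if uniform, -1 on error.
 */
static int range_is_mixed(int fd, __u64 off, __u64 len)
{
	__u64 buf[512] = { 0 };		/* fiemap header + extent records */
	struct fiemap *fm = (struct fiemap *)buf;
	int unwritten = 0, written = 0;

	fm->fm_start = off;
	fm->fm_length = len;
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = (sizeof(buf) - sizeof(*fm)) /
			      sizeof(struct fiemap_extent);
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return -1;
	for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
		if (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_UNWRITTEN)
			unwritten = 1;
		else
			written = 1;
	}
	return unwritten && written;
}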

JFYI, I did originally support the extent pre-zeroing for this. That was
to support a real-life scenario we saw, where atomic writes were being
attempted over mixed extents. The mixed extents were coming from
userspace punching holes and then attempting an atomic write over that
space. However that was using an early, experimental and buggy
forcealign; it was buggy as it did not handle punching holes properly -
it punched out single blocks rather than only full alloc units.

> 
> Where's the documentation that outlines all the restrictions on
> userspace behaviour to prevent this sort of problem being triggered?

I would provide a man page update.

> Common operations such as truncate, hole punch,

So how would punch hole be a problem? The atomic write unit max is 
limited by the alloc unit, and we can only punch out full alloc units.

> buffered writes,
> reflinks, etc will trip over this, so application developers, users
> and admins really need to know what they should be doing to avoid
> stepping on this landmine...

If this is not a real-life scenario which we expect to see, then I don't 
see why we would add the complexity to the kernel for this.

My motivation for atomic writes support is to support atomically writing
large database internal pages. If the database only writes at a
fixed internal page size, then we should not see mixed mappings.

But you see potential problems elsewhere ..

> 
> Further to that, what is the correction process for users to get rid
> of this error? They'll need some help from an atomic write
> constraint aware utility that can resilver the file such that these
> failures do not occur again. Can xfs_fsr do this? Or maybe the new
> exchange-range code? Or does none of this infrastructure yet exist?

Nothing exists yet.

> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> jpg: tweak commit message
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/iomap/direct-io.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index b521eb15759e..3dd883dd77d2 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	size_t copied = 0;
>>   	size_t orig_count;
>>   
>> -	if (atomic && length != fs_block_size)
>> +	if (atomic && length != iter->len)
>>   		return -EINVAL;
> 
> Given this is now rejecting IOs that are otherwise well formed from
> userspace, this situation needs to have a different errno now. The
> userspace application has not provided an invalid set of
> IO parameters for this IO - the IO has been rejected because
> the previously set up persistent file layout was screwed up by
> something in the past.
> 
> i.e. This is not an application IO submission error anymore, hence
> EINVAL is the wrong error to be returning to userspace here.
> 

Understood, but let's come back to this (if needed..).

Thanks,
John
John Garry Dec. 5, 2024, 11:51 a.m. UTC | #4
On 05/12/2024 06:30, Darrick J. Wong wrote:
> On Thu, Dec 05, 2024 at 07:35:45AM +1100, Dave Chinner wrote:
>> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
>>> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
>>>
>>> Filesystems like ext4 can submit writes in multiples of the blocksize,
>>> but we still can't allow the writes to be split into multiple BIOs. Hence
>>> let's check whether the iomap_length() is the same as iter->len.
>>>
>>> It is the responsibility of userspace to ensure that a write does not span
>>> mixed unwritten and mapped extents (which would lead to multiple BIOs).
>>
>> How is "userspace" supposed to do this?
>>
>> No existing utility in userspace is aware of atomic write limits or
>> rtextsize configs, so how does "userspace" ensure everything is
>> laid out in a manner compatible with atomic writes?
>>
>> e.g. restoring a backup (or other disaster recovery procedures) is
>> going to have to lay the files out correctly for atomic writes.
>> backup tools often sparsify the data set and so what gets restored
>> will not have the same layout as the original data set...
>>
>> Where's the documentation that outlines all the restrictions on
>> userspace behaviour to prevent this sort of problem being triggered?
>> Common operations such as truncate, hole punch, buffered writes,
>> reflinks, etc will trip over this, so application developers, users
>> and admins really need to know what they should be doing to avoid
>> stepping on this landmine...
> 
> I'm kinda assuming that this requires forcealign to get the extent
> alignments correct, and writing zeroes non-atomically if the extent
> state gets mixed up before retrying the untorn write.  John?

Sure, the code to do the automatic pre-zeroing and retry the atomic 
write is not super complicated.

It's just a matter of whether we add it or not.

Thanks,
John
Dave Chinner Dec. 5, 2024, 9:15 p.m. UTC | #5
On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote:
> On 04/12/2024 20:35, Dave Chinner wrote:
> > On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
> > > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> > > 
> > > Filesystems like ext4 can submit writes in multiples of the blocksize,
> > > but we still can't allow the writes to be split into multiple BIOs. Hence
> > > let's check whether the iomap_length() is the same as iter->len.
> > > 
> > > It is the responsibility of userspace to ensure that a write does not span
> > > mixed unwritten and mapped extents (which would lead to multiple BIOs).
> > 
> > How is "userspace" supposed to do this?
> 
> If an atomic write spans mixed unwritten and mapped extents, then the
> application should manually zero the unwritten extents beforehand.
> 
> > 
> > No existing utility in userspace is aware of atomic write limits or
> > rtextsize configs, so how does "userspace" ensure everything is
> > laid out in a manner compatible with atomic writes?
> > 
> > e.g. restoring a backup (or other disaster recovery procedures) is
> > going to have to lay the files out correctly for atomic writes.
> > backup tools often sparsify the data set and so what gets restored
> > will not have the same layout as the original data set...
> 
> I am happy to support whatever is needed to make atomic writes work over
> mixed extents if that is really an expected use case and it is a pain for an
> application writer/admin to deal with this (by manually zeroing extents).
> 
> JFYI, I did originally support the extent pre-zeroing for this. That was to
> support a real-life scenario we saw, where atomic writes were being
> attempted over mixed extents. The mixed extents were coming from userspace
> punching holes and then attempting an atomic write over that space. However
> that was using an early, experimental and buggy forcealign; it was buggy as
> it did not handle punching holes properly - it punched out single blocks
> rather than only full alloc units.
> 
> > 
> > Where's the documentation that outlines all the restrictions on
> > userspace behaviour to prevent this sort of problem being triggered?
> 
> I would provide a man page update.

I think, at this point, we need a better way of documenting all the
atomic write stuff in one place. Not just the user interface and
what is expected of userspace, but also all the things the
filesystems need to do to ensure atomic writes work correctly. I was
thinking that a document somewhere in the Documentation/ directory,
rather than random pieces of information splattered across random man pages
would be a much better way of explaining all this.

Don't get me wrong - man pages explaining the programmatic API are
necessary, but there's a whole lot more to understanding and making
effective use of atomic writes than what has been added to the man
pages so far.

> > Common operations such as truncate, hole punch,
> 
> So how would punch hole be a problem? The atomic write unit max is limited
> by the alloc unit, and we can only punch out full alloc units.

I was under the impression that this was a feature of the
force-align code, not a feature of atomic writes. i.e. force-align
is what ensures the BMBT aligns correctly with the underlying
extents.

Or did I miss the fact that some of the force-align semantics bleed
back into the original atomic write patch set?

> > buffered writes,
> > reflinks, etc will trip over this, so application developers, users
> > and admins really need to know what they should be doing to avoid
> > stepping on this landmine...
> 
> If this is not a real-life scenario which we expect to see, then I don't see
> why we would add the complexity to the kernel for this.

I gave you one above - restoring a data set as a result of disaster
recovery. 

> My motivation for atomic writes support is to support atomically writing
> large database internal pages. If the database only writes at a fixed
> internal page size, then we should not see mixed mappings.

Yup, that's the problem here. Once atomic writes are supported by
the kernel and userspace, all sorts of applications are going to
start using them in all sorts of ways you didn't think of.

> But you see potential problems elsewhere ..

That's my job as a senior engineer with 20+ years of experience in
filesystems and storage related applications. I see far because I
stand on the shoulders of giants - I don't try to be a giant myself.

Other people become giants by implementing ground-breaking features
(e.g. like atomic writes), but without the people who can see far
enough ahead just adding features ends up with an incoherent mess of
special interest niche features rather than a neatly integrated set
of widely usable generic features.

e.g. look at MySQL's use of fallocate(hole punch) for transparent
data compression - nobody had foreseen that hole punching would be
used like this, but it's a massive win for the applications which
store bulk compressible data in the database even though it does bad
things to the filesystem.

Spend some time looking outside the proprietary database application
box and think a little harder about the implications of atomic write
functionality.  i.e. what happens when we have ubiquitous support
for guaranteeing only the old or the new data will be seen after
a crash *without the need for using fsync*.

Think about the implications of that for a minute - for any full
file overwrite up to the hardware atomic limits, we won't need fsync
to guarantee the integrity of overwritten data anymore. We only need
a mechanism to flush the journal and device caches once all the data
has been written (e.g. syncfs)...

Want to overwrite a bunch of small files safely?  Atomic write the
new data, then syncfs(). There's no need to run fdatasync after each
write to ensure individual files are not corrupted if we crash in
the middle of the operation. Indeed, atomic writes actually provide
better overwrite integrity semantics than fdatasync as it will be
all or nothing. fdatasync does not provide that guarantee if we
crash during the fdatasync operation.
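
As a rough illustration of that pattern (sketch only; assumes RWF_ATOMIC
from the Linux 6.11 uapi and that each write fits within the per-file
atomic limits):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* pwritev2() flag, Linux 6.11 uapi */
#endif

/*
 * Atomically overwrite a batch of small files, then make the whole set
 * durable with one syncfs() instead of a per-file fdatasync().
 */
static int overwrite_batch(int fds[], struct iovec new_data[], int nr)
{
	for (int i = 0; i < nr; i++) {
		/* all-or-nothing: after a crash we see old or new data */
		if (pwritev2(fds[i], &new_data[i], 1, 0, RWF_ATOMIC) < 0)
			return -1;
	}
	/* one journal + device cache flush covers every write above */
	return syncfs(fds[0]);	/* any fd on the same filesystem */
}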

Further, with COW data filesystems like XFS, btrfs and bcachefs, we
can emulate atomic writes for any size larger than what the hardware
supports.

At this point we actually provide app developers with what they've
been repeatedly asking kernel filesystem engineers to provide them
for the past 20 years: a way of overwriting arbitrary file data
safely without needing an expensive fdatasync operation on every
file that gets modified.

Put simply: atomic writes have a huge potential to fundamentally
change the way applications interact with Linux filesystems and to
make it *much* simpler for applications to safely overwrite user
data.  Hence there is an imperative here to make the foundational
support for this technology solid and robust because atomic writes
are going to be with us for the next few decades...

-Dave.
John Garry Dec. 6, 2024, 9:43 a.m. UTC | #6
>>> Where's the documentation that outlines all the restrictions on
>>> userspace behaviour to prevent this sort of problem being triggered?
>>
>> I would provide a man page update.
> 
> I think, at this point, we need a better way of documenting all the
> atomic write stuff in one place. Not just the user interface and
> what is expected of userspace, but also all the things the
> filesystems need to do to ensure atomic writes work correctly. I was
> thinking that a document somewhere in the Documentation/ directory,
> rather than random pieces of information splattered across random man pages
> would be a much better way of explaining all this.
> 
> Don't get me wrong - man pages explaining the programmatic API are
> necessary, but there's a whole lot more to understanding and making
> effective use of atomic writes than what has been added to the man
> pages so far.

Sure, maybe that would be useful. I think that the final piece of the 
jigsaw is large atomic write support, and then any kernel documentation 
can be further considered.

> 
>>> Common operations such as truncate, hole punch,
>>
>> So how would punch hole be a problem? The atomic write unit max is limited
>> by the alloc unit, and we can only punch out full alloc units.
> 
> I was under the impression that this was a feature of the
> force-align code, not a feature of atomic writes. i.e. force-align
> is what ensures the BMBT aligns correctly with the underlying
> extents.
> 
> Or did I miss the fact that some of the force-align semantics bleed
> back into the original atomic write patch set?

Not really.

As I mentioned, if we can only punch out a full allocation unit and 
atomic write unit max is limited by the allocation unit size, then 
punching out a hole should not create a new range of mixed extents that 
we can legally attempt to atomic write.

> 
>>> buffered writes,
>>> reflinks, etc will trip over this, so application developers, users
>>> and admins really need to know what they should be doing to avoid
>>> stepping on this landmine...
>>
>> If this is not a real-life scenario which we expect to see, then I don't see
>> why we would add the complexity to the kernel for this.
> 
> I gave you one above - restoring a data set as a result of disaster
> recovery.

ack

> 
>> My motivation for atomic writes support is to support atomically writing
>> large database internal pages. If the database only writes at a fixed
>> internal page size, then we should not see mixed mappings.
> 
> Yup, that's the problem here. Once atomic writes are supported by
> the kernel and userspace, all sorts of applications are going to
> start using them in all sorts of ways you didn't think of.
> 
>> But you see potential problems elsewhere ..
> 
> That's my job as a senior engineer with 20+ years of experience in
> filesystems and storage related applications. I see far because I
> stand on the shoulders of giants - I don't try to be a giant myself.
> 
> Other people become giants by implementing ground-breaking features
> (e.g. like atomic writes), but without the people who can see far
> enough ahead just adding features ends up with an incoherent mess of
> special interest niche features rather than a neatly integrated set
> of widely usable generic features.

yes

> 
> e.g. look at MySQL's use of fallocate(hole punch) for transparent
> data compression - nobody had foreseen that hole punching would be
> used like this, but it's a massive win for the applications which
> store bulk compressible data in the database even though it does bad
> things to the filesystem.
> 
> Spend some time looking outside the proprietary database application
> box and think a little harder about the implications of atomic write
> functionality.  i.e. what happens when we have ubiquitous support
> for guaranteeing only the old or the new data will be seen after
> a crash *without the need for using fsync*.
> 
> Think about the implications of that for a minute - for any full
> file overwrite up to the hardware atomic limits, we won't need fsync
> to guarantee the integrity of overwritten data anymore. We only need
> a mechanism to flush the journal and device caches once all the data
> has been written (e.g. syncfs)...
> 
> Want to overwrite a bunch of small files safely?  Atomic write the
> new data, then syncfs(). There's no need to run fdatasync after each
> write to ensure individual files are not corrupted if we crash in
> the middle of the operation. Indeed, atomic writes actually provide
> better overwrite integrity semantics than fdatasync as it will be
> all or nothing. fdatasync does not provide that guarantee if we
> crash during the fdatasync operation.
> 
> Further, with COW data filesystems like XFS, btrfs and bcachefs, we
> can emulate atomic writes for any size larger than what the hardware
> supports.
> 
> At this point we actually provide app developers with what they've
> been repeatedly asking kernel filesystem engineers to provide them
> for the past 20 years: a way of overwriting arbitrary file data
> safely without needing an expensive fdatasync operation on every
> file that gets modified.

Understood, you see that there are many applications of atomic writes 
beyond the scope of DBs.

> 
> Put simply: atomic writes have a huge potential to fundamentally
> change the way applications interact with Linux filesystems and to
> make it *much* simpler for applications to safely overwrite user
> data.  Hence there is an imperative here to make the foundational
> support for this technology solid and robust because atomic writes
> are going to be with us for the next few decades...
> 

Thanks for going further in describing the possible use cases.

Now let's talk again about the implementation of kernel extent zeroing 
for atomic writes.

Firstly, to mention the obvious: so far we can automatically write a
single FS block atomically, and no FS_XFLAG_ATOMICWRITES flag was
introduced for enabling this. Furthermore, the range of data does not
need to be in a mapped state.

Then we need to consider how to handle extent zeroing for the
following atomic write scenarios:
a. For forcealign, we can decide to always zero the full alloc unit,
same as [0]. So that is good for atomic writes, but it does involve
further work to zero extents for buffered IO.
b. For rtvol without forcealign, we will not start to always zero the
full alloc unit, as that would be a major performance regression.
c. For rtvol with forcealign, we can do the same as a.

I can suggest 2x options for solving b:
1. Introduce FS_XFLAG_LARGE_ATOMICWRITES to control whether we do the 
same as a.
2. Introduce a method to pre-zero extents for atomic writes only

Option 2. would work like this:
- when we find that the atomic write covers a mix of unwritten and 
mapped extent mappings, we have 2x phases of the atomic write:
- phase 1. will pre-zero the unwritten extents and update the extent 
mappings
- phase 2. will retry the atomic write, and we should find a single mapping

Option 2. could also be leveraged for a. and c., above.
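
A rough shape of that flow as userspace-style pseudocode (every helper
here is a hypothetical placeholder - none of these functions exist today;
the mixed-mapping check could be something like the FIEMAP probe sketched
earlier in the thread):

#include <stddef.h>
#include <sys/types.h>

/* hypothetical placeholders */
int range_has_mixed_mappings(int fd, off_t pos, size_t len);
int zero_unwritten_subranges(int fd, off_t pos, size_t len);
int issue_atomic_write(int fd, const void *buf, size_t len, off_t pos);

/*
 * Phase 1 pre-zeroes the unwritten extents so the range collapses to
 * one mapped extent; phase 2 retries the atomic write, which should
 * then see a single mapping.
 */
int atomic_write_option2(int fd, const void *buf, size_t len, off_t pos)
{
	if (range_has_mixed_mappings(fd, pos, len)) {
		int err = zero_unwritten_subranges(fd, pos, len); /* phase 1 */

		if (err)
			return err;
	}
	return issue_atomic_write(fd, buf, len, pos); /* phase 2 */
}
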

Please let me know your thoughts on this.

[0] 
https://lore.kernel.org/linux-xfs/20240607143919.2622319-3-john.g.garry@oracle.com/

John
Darrick J. Wong Dec. 12, 2024, 1:34 a.m. UTC | #7
On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote:
> On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote:
> > On 04/12/2024 20:35, Dave Chinner wrote:
> > > On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
> > > > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> > > > 
> > > > Filesystems like ext4 can submit writes in multiples of the blocksize,
> > > > but we still can't allow the writes to be split into multiple BIOs. Hence
> > > > let's check whether the iomap_length() is the same as iter->len.
> > > > 
> > > > It is the responsibility of userspace to ensure that a write does not span
> > > > mixed unwritten and mapped extents (which would lead to multiple BIOs).
> > > 
> > > How is "userspace" supposed to do this?
> > 
> > If an atomic write spans mixed unwritten and mapped extents, then the
> > application should manually zero the unwritten extents beforehand.
> > 
> > > 
> > > No existing utility in userspace is aware of atomic write limits or
> > > rtextsize configs, so how does "userspace" ensure everything is
> > > laid out in a manner compatible with atomic writes?
> > > 
> > > e.g. restoring a backup (or other disaster recovery procedures) is
> > > going to have to lay the files out correctly for atomic writes.
> > > backup tools often sparsify the data set and so what gets restored
> > > will not have the same layout as the original data set...
> > 
> > I am happy to support whatever is needed to make atomic writes work over
> > mixed extents if that is really an expected use case and it is a pain for an
> > application writer/admin to deal with this (by manually zeroing extents).
> > 
> > JFYI, I did originally support the extent pre-zeroing for this. That was to
> > support a real-life scenario we saw, where atomic writes were being
> > attempted over mixed extents. The mixed extents were coming from userspace
> > punching holes and then attempting an atomic write over that space. However
> > that was using an early, experimental and buggy forcealign; it was buggy as
> > it did not handle punching holes properly - it punched out single blocks
> > rather than only full alloc units.
> > 
> > > 
> > > Where's the documentation that outlines all the restrictions on
> > > userspace behaviour to prevent this sort of problem being triggered?
> > 
> > I would provide a man page update.
> 
> I think, at this point, we need a better way of documenting all the
> atomic write stuff in one place. Not just the user interface and
> what is expected of userspace, but also all the things the
> filesystems need to do to ensure atomic writes work correctly. I was
> thinking that a document somewhere in the Documentation/ directory,
> rather than random pieces of information splattered across random man pages
> would be a much better way of explaining all this.
> 
> Don't get me wrong - man pages explaining the programmatic API are
> necessary, but there's a whole lot more to understanding and making
> effective use of atomic writes than what has been added to the man
> pages so far.
> 
> > > Common operations such as truncate, hole punch,
> > 
> > So how would punch hole be a problem? The atomic write unit max is limited
> > by the alloc unit, and we can only punch out full alloc units.
> 
> I was under the impression that this was a feature of the
> force-align code, not a feature of atomic writes. i.e. force-align
> is what ensures the BMBT aligns correctly with the underlying
> extents.
> 
> Or did I miss the fact that some of the force-align semantics bleed
> back into the original atomic write patch set?
> 
> > > buffered writes,
> > > reflinks, etc will trip over this, so application developers, users
> > > and admins really need to know what they should be doing to avoid
> > > stepping on this landmine...
> > 
> > If this is not a real-life scenario which we expect to see, then I don't see
> > why we would add the complexity to the kernel for this.
> 
> I gave you one above - restoring a data set as a result of disaster
> recovery. 
> 
> > My motivation for atomic writes support is to support atomically writing
> > large database internal pages. If the database only writes at a fixed
> > internal page size, then we should not see mixed mappings.
> 
> Yup, that's the problem here. Once atomic writes are supported by
> the kernel and userspace, all sorts of applications are going to
> start using them in all sorts of ways you didn't think of.
> 
> > But you see potential problems elsewhere ..
> 
> That's my job as a senior engineer with 20+ years of experience in
> filesystems and storage related applications. I see far because I
> stand on the shoulders of giants - I don't try to be a giant myself.
> 
> Other people become giants by implementing ground-breaking features
> (e.g. like atomic writes), but without the people who can see far
> enough ahead just adding features ends up with an incoherent mess of
> special interest niche features rather than a neatly integrated set
> of widely usable generic features.
> 
> e.g. look at MySQL's use of fallocate(hole punch) for transparent
> data compression - nobody had foreseen that hole punching would be
> used like this, but it's a massive win for the applications which
> store bulk compressible data in the database even though it does bad
> things to the filesystem.
> 
> Spend some time looking outside the proprietary database application
> box and think a little harder about the implications of atomic write
> functionality.  i.e. what happens when we have ubiquitous support
> for guaranteeing only the old or the new data will be seen after
> a crash *without the need for using fsync*.

IOWs, the program either wants an old version or a new version of the
files that it wrote, and the commit boundary is syncfs() after updating
all the files?

> Think about the implications of that for a minute - for any full
> file overwrite up to the hardware atomic limits, we won't need fsync
> to guarantee the integrity of overwritten data anymore. We only need
> a mechanism to flush the journal and device caches once all the data
> has been written (e.g. syncfs)...

"up to the hardware atomic limits" -- that's a big limitation.  What if
I need to write 256K but the device only supports up to 64k?  RWF_ATOMIC
won't work.  Or what if the file range I want to dirty isn't aligned
with the atomic write alignment?  What if the awu geometry changes
online due to a device change, how do programs detect that?
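
(For reference, programs can at least discover the current geometry with
statx() and re-query it after a device change; a minimal probe, assuming
Linux 6.11+ headers that define STATX_WRITE_ATOMIC, with a hypothetical
path:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct statx stx;

	if (statx(AT_FDCWD, "/mnt/xfs/db.file", 0, STATX_WRITE_ATOMIC, &stx))
		return 1;
	if (stx.stx_mask & STATX_WRITE_ATOMIC)
		printf("awu min %u, awu max %u, max segments %u\n",
		       stx.stx_atomic_write_unit_min,
		       stx.stx_atomic_write_unit_max,
		       stx.stx_atomic_write_segments_max);
	return 0;
}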

Programs that aren't 100% block-based should use exchange-range.  There
are no alignment restrictions, no limits on the size you can exchange,
no file mapping state requirements to trip over, and you can update
arbitrary sparse ranges.  As long as you don't tell exchange-range to
flush the log itself, programs can use syncfs to amortize the log and
cache flush across a bunch of file content exchanges.

Even better, if you still wanted to use untorn block writes to persist
the temporary file's dirty data to disk, you don't even need forcealign
because the exchange-range will take care of restarting the operation
during log recovery.  I don't know that there's much point in doing that
but the idea is there.
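
A minimal sketch of that flow, assuming the exchange-range uapi as merged
in 6.10 (struct and ioctl names per fs/xfs/libxfs/xfs_fs.h; the include
path assumes a matching xfsprogs install):

#include <linux/types.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* assumed to pull in the xfs_fs.h uapi */

/*
 * Stage the new contents in a temp file, then swap them into place.
 * Leaving XFS_EXCHANGE_RANGE_DSYNC unset defers the log flush so a
 * later syncfs() can amortise it across many such updates.
 */
static int commit_staged_update(int target_fd, int staging_fd, __u64 len)
{
	struct xfs_exchange_range fxr = {
		.file1_fd = staging_fd,	/* donor file with the new data */
		.length   = len,	/* offsets left 0: swap from start */
	};

	return ioctl(target_fd, XFS_IOC_EXCHANGE_RANGE, &fxr);
}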

> Want to overwrite a bunch of small files safely?  Atomic write the
> new data, then syncfs(). There's no need to run fdatasync after each
> write to ensure individual files are not corrupted if we crash in
> the middle of the operation. Indeed, atomic writes actually provide
> better overwrite integrity semantics than fdatasync as it will be
> all or nothing. fdatasync does not provide that guarantee if we
> crash during the fdatasync operation.
> 
> Further, with COW data filesystems like XFS, btrfs and bcachefs, we
> can emulate atomic writes for any size larger than what the hardware
> supports.
> 
> At this point we actually provide app developers with what they've
> been repeatedly asking kernel filesystem engineers to provide them
> for the past 20 years: a way of overwriting arbitrary file data
> safely without needing an expensive fdatasync operation on every
> file that gets modified.
> 
> Put simply: atomic writes have a huge potential to fundamentally
> change the way applications interact with Linux filesystems and to
> make it *much* simpler for applications to safely overwrite user
> data.  Hence there is an imperative here to make the foundational
> support for this technology solid and robust because atomic writes
> are going to be with us for the next few decades...

I agree that we need to make the interface solid and robust, but I don't
agree that the current RWF_ATOMIC, with its block-oriented storage
device quirks is the way to go here.  Maybe a byte-oriented RWF_ATOMIC
would work, but the only way I can think of to do that is (say) someone
implements Christoph's suggestion to change the COW code to allow
multiple writes to a staging extent, and only commit the remapping
operations at sync time... and you'd still have problems if you have to
do multiple remappings if there's not also a way to restart the ioend
chains.

Exchange-range already solved all of that, and it's already merged.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b521eb15759e..3dd883dd77d2 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -306,7 +306,7 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if (atomic && length != fs_block_size)
+	if (atomic && length != iter->len)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||