[5/6] iomap: Lift blocksize restriction on atomic writes

Message ID	f5bd55d32031b49bdd9e2c6d073787d1ac4b6d78.1729825985.git.ritesh.list@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-pj1-f51.google.com (mail-pj1-f51.google.com [209.85.216.51]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BD3DF19939D; Fri, 25 Oct 2024 03:46:28 +0000 (UTC) From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> To: linux-ext4@vger.kernel.org Cc: Theodore Ts'o <tytso@mit.edu>, Jan Kara <jack@suse.cz>, "Darrick J . Wong" <djwong@kernel.org>, Christoph Hellwig <hch@infradead.org>, John Garry <john.g.garry@oracle.com>, Ojaswin Mujoo <ojaswin@linux.ibm.com>, Dave Chinner <david@fromorbit.com>, linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Subject: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes Date: Fri, 25 Oct 2024 09:15:54 +0530 Message-ID: <f5bd55d32031b49bdd9e2c6d073787d1ac4b6d78.1729825985.git.ritesh.list@gmail.com> In-Reply-To: <cover.1729825985.git.ritesh.list@gmail.com> References: <cover.1729825985.git.ritesh.list@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	ext4: Add atomic write support for DIO \| expand [0/6] ext4: Add atomic write support for DIO [1/6] ext4: Add statx support for atomic writes [2/6] ext4: Check for atomic writes support in write iter [3/6] ext4: Support setting FMODE_CAN_ATOMIC_WRITE [4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes [5/6] iomap: Lift blocksize restriction on atomic writes [6/6] ext4: Add atomic write support for bigalloc

Ritesh Harjani (IBM) Oct. 25, 2024, 3:45 a.m. UTC

Filesystems like ext4 can submit writes in multiples of blocksizes.
But we still can't allow the writes to be split. Hence let's check if
the iomap_length() is same as iter->len or not.

This shouldn't affect XFS since it anyways checks for this in
xfs_file_write_iter() to not support atomic write size request of more
than FS blocksize.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

John Garry Oct. 25, 2024, 8:52 a.m. UTC | #1

On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
> Filesystems like ext4 can submit writes in multiples of blocksizes.
> But we still can't allow the writes to be split. Hence let's check if
> the iomap_length() is same as iter->len or not.
> 
> This shouldn't affect XFS since it anyways checks for this in
> xfs_file_write_iter() to not support atomic write size request of more
> than FS blocksize.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>   fs/iomap/direct-io.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ed4764e3b8f0..1d33b4239b3e 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>   	size_t copied = 0;
>   	size_t orig_count;
>   
> -	if (atomic && length != fs_block_size)
> +	if (atomic && length != iter->len)
>   		return -EINVAL;

Here you expect just one iter for an atomic write always.

In 6/6, you are saying that iomap does not allow an atomic write which 
covers unwritten and written extents, right?

But for writing a single fs block atomically, we don't mandate it to be 
in unwritten state. So there is a difference in behavior in writing a 
single FS block vs multiple FS blocks atomically.

So we have 3x choices, as I see:
a. add a check now in iomap that the extent is in written state (for an 
atomic write)
b. add extent zeroing code, as I was trying for originally
c. document this peculiarity

Thanks,
John

Ritesh Harjani (IBM) Oct. 25, 2024, 9:31 a.m. UTC | #2

Hi John, 

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
>> Filesystems like ext4 can submit writes in multiples of blocksizes.
>> But we still can't allow the writes to be split. Hence let's check if
>> the iomap_length() is same as iter->len or not.
>> 
>> This shouldn't affect XFS since it anyways checks for this in
>> xfs_file_write_iter() to not support atomic write size request of more
>> than FS blocksize.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>   fs/iomap/direct-io.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index ed4764e3b8f0..1d33b4239b3e 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	size_t copied = 0;
>>   	size_t orig_count;
>>   
>> -	if (atomic && length != fs_block_size)
>> +	if (atomic && length != iter->len)
>>   		return -EINVAL;
>
> Here you expect just one iter for an atomic write always.

Here we are lifting the limitation of iomap to support entire iter->len
rather than just 1 fsblock. 

>
> In 6/6, you are saying that iomap does not allow an atomic write which 
> covers unwritten and written extents, right?

No, it's not that. If FS does not provide a full mapping to iomap in
->iomap_begin then the writes will get split. For atomic writes this
should not happen, hence the check in iomap above to return -EINVAL if
mapped length does not match iter->len.

>
> But for writing a single fs block atomically, we don't mandate it to be 
> in unwritten state. So there is a difference in behavior in writing a 
> single FS block vs multiple FS blocks atomically.

Same as mentioned above. We can't have atomic writes to get split.
This patch is just lifting the restriction of iomap to allow more than
blocksize but the mapped length should still meet iter->len, as
otherwise the writes can get split.

>
> So we have 3x choices, as I see:
> a. add a check now in iomap that the extent is in written state (for an 
> atomic write)
> b. add extent zeroing code, as I was trying for originally
> c. document this peculiarity
>
> Thanks,
> John

John Garry Oct. 25, 2024, 9:59 a.m. UTC | #3

On 25/10/2024 10:31, Ritesh Harjani (IBM) wrote:
>>>    
>>> -	if (atomic && length != fs_block_size)
>>> +	if (atomic && length != iter->len)
>>>    		return -EINVAL;
>> Here you expect just one iter for an atomic write always.
> Here we are lifting the limitation of iomap to support entire iter->len
> rather than just 1 fsblock.

Sure

> 
>> In 6/6, you are saying that iomap does not allow an atomic write which
>> covers unwritten and written extents, right?
> No, it's not that. If FS does not provide a full mapping to iomap in
> ->iomap_begin then the writes will get split. 

but why would it provide multiple mapping?

> For atomic writes this
> should not happen, hence the check in iomap above to return -EINVAL if
> mapped length does not match iter->len.
> 
>> But for writing a single fs block atomically, we don't mandate it to be
>> in unwritten state. So there is a difference in behavior in writing a
>> single FS block vs multiple FS blocks atomically.
> Same as mentioned above. We can't have atomic writes to get split.
> This patch is just lifting the restriction of iomap to allow more than
> blocksize but the mapped length should still meet iter->len, as
> otherwise the writes can get split.

Sure, I get this. But I wonder why would we be getting multiple 
mappings? Why cannot the FS always provide a single mapping?

Thanks,
John

Ritesh Harjani (IBM) Oct. 25, 2024, 10:35 a.m. UTC | #4

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 10:31, Ritesh Harjani (IBM) wrote:
>>>>    
>>>> -	if (atomic && length != fs_block_size)
>>>> +	if (atomic && length != iter->len)
>>>>    		return -EINVAL;
>>> Here you expect just one iter for an atomic write always.
>> Here we are lifting the limitation of iomap to support entire iter->len
>> rather than just 1 fsblock.
>
> Sure
>
>> 
>>> In 6/6, you are saying that iomap does not allow an atomic write which
>>> covers unwritten and written extents, right?
>> No, it's not that. If FS does not provide a full mapping to iomap in
>> ->iomap_begin then the writes will get split. 
>
> but why would it provide multiple mapping?
>
>> For atomic writes this
>> should not happen, hence the check in iomap above to return -EINVAL if
>> mapped length does not match iter->len.
>> 
>>> But for writing a single fs block atomically, we don't mandate it to be
>>> in unwritten state. So there is a difference in behavior in writing a
>>> single FS block vs multiple FS blocks atomically.
>> Same as mentioned above. We can't have atomic writes to get split.
>> This patch is just lifting the restriction of iomap to allow more than
>> blocksize but the mapped length should still meet iter->len, as
>> otherwise the writes can get split.
>
> Sure, I get this. But I wonder why would we be getting multiple 
> mappings? Why cannot the FS always provide a single mapping?

FS can decide to split the mappings when it couldn't allocate a single
large mapping of the requested length. Could be due to - 
- already allocated extent followed by EOF, 
- already allocated extent followed by a hole
- already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
- delalloc (not delalloc since we invalidate respective page cache pages before doing DIO).
- fragmentation or ENOSPC - For ext4 bigalloc this will not happen since
we reserve the entire cluster. So we know there should be space. But I
am not sure how other filesystems might end up implementing this functionality.

Thanks!

-ritesh

John Garry Oct. 25, 2024, 11:07 a.m. UTC | #5

On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>> Same as mentioned above. We can't have atomic writes to get split.
>>> This patch is just lifting the restriction of iomap to allow more than
>>> blocksize but the mapped length should still meet iter->len, as
>>> otherwise the writes can get split.
>> Sure, I get this. But I wonder why would we be getting multiple
>> mappings? Why cannot the FS always provide a single mapping?
> FS can decide to split the mappings when it couldn't allocate a single
> large mapping of the requested length. Could be due to -
> - already allocated extent followed by EOF,
> - already allocated extent followed by a hole
> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)

This is the sort of scenario which I am concerned with. This issue has 
been discussed at length for XFS forcealign support for atomic writes.

So far, the user can atomic write a single FS block regardless of 
whether the extent in which it would be part of is in written or 
unwritten state.

Now the rule will be to write multiple FS blocks atomically, all blocks 
need to be in same written or unwritten state.

This oddity at least needs to be documented.

Better yet would be to not have this restriction.

> - delalloc (not delalloc since we invalidate respective page cache pages before doing DIO).
> - fragmentation or ENOSPC - For ext4 bigalloc this will not happen since
> we reserve the entire cluster. So we know there should be space. But I
> am not sure how other filesystems might end up implementing this functionality.

Thanks,
John

Ritesh Harjani (IBM) Oct. 25, 2024, 11:19 a.m. UTC | #6

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>>> Same as mentioned above. We can't have atomic writes to get split.
>>>> This patch is just lifting the restriction of iomap to allow more than
>>>> blocksize but the mapped length should still meet iter->len, as
>>>> otherwise the writes can get split.
>>> Sure, I get this. But I wonder why would we be getting multiple
>>> mappings? Why cannot the FS always provide a single mapping?
>> FS can decide to split the mappings when it couldn't allocate a single
>> large mapping of the requested length. Could be due to -
>> - already allocated extent followed by EOF,
>> - already allocated extent followed by a hole
>> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
>
> This is the sort of scenario which I am concerned with. This issue has 
> been discussed at length for XFS forcealign support for atomic writes.

extsize and forcealign is being worked for ext4 as well where we can
add such support, sure.

>
> So far, the user can atomic write a single FS block regardless of 
> whether the extent in which it would be part of is in written or 
> unwritten state.
>
> Now the rule will be to write multiple FS blocks atomically, all blocks 
> need to be in same written or unwritten state.

FS needs to ensure that the writes does not get torned. So for whatever reason
FS splits the mapping then we need to return an -EINVAL error to not
allow such writes to get torned. This patch just does that.

But I get your point. More below.

>
> This oddity at least needs to be documented.

Got it. Yes, we can do that.

>
> Better yet would be to not have this restriction.
>

I haven't thought of a clever way where we don't have to zero out the
rest of the unwritten mapping. With ext4 bigalloc since the entire
cluster is anyway reserved - I was thinking if we can come up with a
clever way for doing atomic writes to the entire user requested size w/o
zeroing out.

Zeroing out the other unwritten extent is also a cost penalty to the
user anyways. So user will anyway will have to be made aware of not to
attempt writes of fashion which can cause them such penalties. 

As patch-6 mentions this is a base support for bs = ps systems for
enabling atomic writes using bigalloc. For now we return -EINVAL when we
can't allocate a continuous user requested mapping which means it won't
support operations of types 8k followed by 16k.

We can document this behavior as other things are documented for atomic
writes on ext4 with bigalloc e.g. pow_of_2 length writes, natural
alignment w.r.t pos and length etc.

Does that sound ok?

>> - delalloc (not delalloc since we invalidate respective page cache pages before doing DIO).
>> - fragmentation or ENOSPC - For ext4 bigalloc this will not happen since
>> we reserve the entire cluster. So we know there should be space. But I
>> am not sure how other filesystems might end up implementing this functionality.
>
> Thanks,
> John

-ritesh

John Garry Oct. 25, 2024, 12:23 p.m. UTC | #7

On 25/10/2024 12:19, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
>> On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>>>> Same as mentioned above. We can't have atomic writes to get split.
>>>>> This patch is just lifting the restriction of iomap to allow more than
>>>>> blocksize but the mapped length should still meet iter->len, as
>>>>> otherwise the writes can get split.
>>>> Sure, I get this. But I wonder why would we be getting multiple
>>>> mappings? Why cannot the FS always provide a single mapping?
>>> FS can decide to split the mappings when it couldn't allocate a single
>>> large mapping of the requested length. Could be due to -
>>> - already allocated extent followed by EOF,
>>> - already allocated extent followed by a hole
>>> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
>>
>> This is the sort of scenario which I am concerned with. This issue has
>> been discussed at length for XFS forcealign support for atomic writes.
> 
> extsize and forcealign is being worked for ext4 as well where we can
> add such support, sure.
> 
>>
>> So far, the user can atomic write a single FS block regardless of
>> whether the extent in which it would be part of is in written or
>> unwritten state.
>>
>> Now the rule will be to write multiple FS blocks atomically, all blocks
>> need to be in same written or unwritten state.
> 
> FS needs to ensure that the writes does not get torned. So for whatever reason
> FS splits the mapping then we need to return an -EINVAL error to not
> allow such writes to get torned. This patch just does that.
> 
> But I get your point. More below.
> 
>>
>> This oddity at least needs to be documented.
> 
> Got it. Yes, we can do that.
> 
>>
>> Better yet would be to not have this restriction.
>>
> 
> I haven't thought of a clever way where we don't have to zero out the
> rest of the unwritten mapping. With ext4 bigalloc since the entire
> cluster is anyway reserved - I was thinking if we can come up with a
> clever way for doing atomic writes to the entire user requested size w/o
> zeroing out.

This following was main method which was being attempted:

https://lore.kernel.org/linux-fsdevel/20240429174746.2132161-15-john.g.garry@oracle.com/

There were other ideas in different versions of the forcelign/xfs block 
atomic writes series.

> 
> Zeroing out the other unwritten extent is also a cost penalty to the
> user anyways.

Sure, unless we have a special inode flag to say "pre-zero the extent".

> So user will anyway will have to be made aware of not to
> attempt writes of fashion which can cause them such penalties.
> 
> As patch-6 mentions this is a base support for bs = ps systems for
> enabling atomic writes using bigalloc. For now we return -EINVAL when we
> can't allocate a continuous user requested mapping which means it won't
> support operations of types 8k followed by 16k.
> 

That's my least-preferred option.

I think better would be reject atomic writes that cover unwritten 
extents always - but that boat is about to sail...

Thanks,
John

Ritesh Harjani (IBM) Oct. 25, 2024, 12:36 p.m. UTC | #8

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 12:19, Ritesh Harjani (IBM) wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>>> On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>>>>> Same as mentioned above. We can't have atomic writes to get split.
>>>>>> This patch is just lifting the restriction of iomap to allow more than
>>>>>> blocksize but the mapped length should still meet iter->len, as
>>>>>> otherwise the writes can get split.
>>>>> Sure, I get this. But I wonder why would we be getting multiple
>>>>> mappings? Why cannot the FS always provide a single mapping?
>>>> FS can decide to split the mappings when it couldn't allocate a single
>>>> large mapping of the requested length. Could be due to -
>>>> - already allocated extent followed by EOF,
>>>> - already allocated extent followed by a hole
>>>> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
>>>
>>> This is the sort of scenario which I am concerned with. This issue has
>>> been discussed at length for XFS forcealign support for atomic writes.
>> 
>> extsize and forcealign is being worked for ext4 as well where we can
>> add such support, sure.
>> 
>>>
>>> So far, the user can atomic write a single FS block regardless of
>>> whether the extent in which it would be part of is in written or
>>> unwritten state.
>>>
>>> Now the rule will be to write multiple FS blocks atomically, all blocks
>>> need to be in same written or unwritten state.
>> 
>> FS needs to ensure that the writes does not get torned. So for whatever reason
>> FS splits the mapping then we need to return an -EINVAL error to not
>> allow such writes to get torned. This patch just does that.
>> 
>> But I get your point. More below.
>> 
>>>
>>> This oddity at least needs to be documented.
>> 
>> Got it. Yes, we can do that.
>> 
>>>
>>> Better yet would be to not have this restriction.
>>>
>> 
>> I haven't thought of a clever way where we don't have to zero out the
>> rest of the unwritten mapping. With ext4 bigalloc since the entire
>> cluster is anyway reserved - I was thinking if we can come up with a
>> clever way for doing atomic writes to the entire user requested size w/o
>> zeroing out.
>
> This following was main method which was being attempted:
>
> https://lore.kernel.org/linux-fsdevel/20240429174746.2132161-15-john.g.garry@oracle.com/
>
> There were other ideas in different versions of the forcelign/xfs block 
> atomic writes series.
>
>> 
>> Zeroing out the other unwritten extent is also a cost penalty to the
>> user anyways.
>
> Sure, unless we have a special inode flag to say "pre-zero the extent".
>
>> So user will anyway will have to be made aware of not to
>> attempt writes of fashion which can cause them such penalties.
>> 
>> As patch-6 mentions this is a base support for bs = ps systems for
>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
>> can't allocate a continuous user requested mapping which means it won't
>> support operations of types 8k followed by 16k.
>> 
>
> That's my least-preferred option.
>
> I think better would be reject atomic writes that cover unwritten 
> extents always - but that boat is about to sail...

That's what this patch does. For whatever reason if we couldn't allocate
a single contiguous region of requested size for atomic write, then we
reject the request always, isn't it. Or maybe I didn't understand your comment.

If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks()) 
for atomic writes in ext4_dio_write_checks(), similar to how we detect
overwrites case to decide whether we need a read v/s write semaphore. 
So this can check if the user has a partially allocated extent for the
user requested region and if yes, we can return -EINVAL from
ext4_dio_write_iter() itself. 

I think this maybe better option than waiting until ->iomap_begin().
This might also bring all atomic write constraints to be checked in one
place i.e. during ext4_file_write_iter() itself.

Thoughts?

-ritesh

John Garry Oct. 25, 2024, 2:04 p.m. UTC | #9

On 25/10/2024 13:36, Ritesh Harjani (IBM) wrote:
>>> So user will anyway will have to be made aware of not to
>>> attempt writes of fashion which can cause them such penalties.
>>>
>>> As patch-6 mentions this is a base support for bs = ps systems for
>>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
>>> can't allocate a continuous user requested mapping which means it won't
>>> support operations of types 8k followed by 16k.
>>>
>> That's my least-preferred option.
>>
>> I think better would be reject atomic writes that cover unwritten
>> extents always - but that boat is about to sail...
> That's what this patch does.

Not really.

Currently we have 2x iomap restrictions:
a. mapping length must equal fs block size
b. bio created must equal total write size

This patch just says that the mapping length must equal total write size 
(instead of a.). So quite similar to b.

> For whatever reason if we couldn't allocate
> a single contiguous region of requested size for atomic write, then we
> reject the request always, isn't it. Or maybe I didn't understand your comment.

As the simplest example, for an atomic write to an empty file, there 
should only be a single mapping returned to iomap_dio_bio_iter() and 
that would be of IOMAP_UNWRITTEN type. And we don't reject that.

> 
> If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks())
> for atomic writes in ext4_dio_write_checks(), similar to how we detect
> overwrites case to decide whether we need a read v/s write semaphore.
> So this can check if the user has a partially allocated extent for the
> user requested region and if yes, we can return -EINVAL from
> ext4_dio_write_iter() itself.
 > > I think this maybe better option than waiting until ->iomap_begin().
> This might also bring all atomic write constraints to be checked in one
> place i.e. during ext4_file_write_iter() itself.

Something like this can be done once we decide how atomic writing to 
regions which cover mixed unwritten and written extents is to be handled.

Thanks,
John

Ritesh Harjani (IBM) Oct. 25, 2024, 2:13 p.m. UTC | #10

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 13:36, Ritesh Harjani (IBM) wrote:
>>>> So user will anyway will have to be made aware of not to
>>>> attempt writes of fashion which can cause them such penalties.
>>>>
>>>> As patch-6 mentions this is a base support for bs = ps systems for
>>>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
>>>> can't allocate a continuous user requested mapping which means it won't
>>>> support operations of types 8k followed by 16k.
>>>>
>>> That's my least-preferred option.
>>>
>>> I think better would be reject atomic writes that cover unwritten
>>> extents always - but that boat is about to sail...
>> That's what this patch does.
>
> Not really.
>
> Currently we have 2x iomap restrictions:
> a. mapping length must equal fs block size
> b. bio created must equal total write size
>
> This patch just says that the mapping length must equal total write size 
> (instead of a.). So quite similar to b.
>
>> For whatever reason if we couldn't allocate
>> a single contiguous region of requested size for atomic write, then we
>> reject the request always, isn't it. Or maybe I didn't understand your comment.
>
> As the simplest example, for an atomic write to an empty file, there 
> should only be a single mapping returned to iomap_dio_bio_iter() and 
> that would be of IOMAP_UNWRITTEN type. And we don't reject that.
>

Ok. Maybe this is what I am missing. Could you please help me understand
why should such writes be rejected? 

For e.g. 
If FS could allocate a single contiguous IOMAP_UNWRITTEN extent of
atomic write request size, that means - 
1. FS will allocate an unwritten extent.
2. will do writes (using submit_bio) to the unwritten extent. 
3. will do unwritten to written conversion. 

It is ok if either of the above operations fail right? If (3) fails
then the region will still be marked unwritten that means it will read
zero (old contents). (2) can anyway fail and will not result into
partial writes. (1) will anyway not result into any write whatsoever.

So we can never have a situation where there is partial writes leading
to mix of old and new write contents right for such cases? Which is what the
requirement of atomic/untorn write also is?

Sorry am I missing something here?

>> 
>> If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks())
>> for atomic writes in ext4_dio_write_checks(), similar to how we detect
>> overwrites case to decide whether we need a read v/s write semaphore.
>> So this can check if the user has a partially allocated extent for the
>> user requested region and if yes, we can return -EINVAL from
>> ext4_dio_write_iter() itself.
>  > > I think this maybe better option than waiting until ->iomap_begin().
>> This might also bring all atomic write constraints to be checked in one
>> place i.e. during ext4_file_write_iter() itself.
>
> Something like this can be done once we decide how atomic writing to 
> regions which cover mixed unwritten and written extents is to be handled.

Mixed extent regions (written + unwritten) is a different case all
together (which can lead to mix of old and new contents).

But here what I am suggesting is to add following constraint in case of
ext4 with bigalloc - 

"Writes to a region which already has partially allocated extent is not supported."

That means we will return -EINVAL if we detect above case in
ext4_file_write_iter() and sure we can document this behavior.

In retrospect, I am not sure why we cannot add a constraint for atomic
writes (e.g. for ext4 bigalloc) and reject such writes outright,
instead of silently incurring a performance penalty by zeroing out the
partial regions by allowing such write request.

-ritesh

Darrick J. Wong Oct. 25, 2024, 6:28 p.m. UTC | #11

On Fri, Oct 25, 2024 at 07:43:17PM +0530, Ritesh Harjani wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
> > On 25/10/2024 13:36, Ritesh Harjani (IBM) wrote:
> >>>> So user will anyway will have to be made aware of not to
> >>>> attempt writes of fashion which can cause them such penalties.
> >>>>
> >>>> As patch-6 mentions this is a base support for bs = ps systems for
> >>>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
> >>>> can't allocate a continuous user requested mapping which means it won't
> >>>> support operations of types 8k followed by 16k.
> >>>>
> >>> That's my least-preferred option.
> >>>
> >>> I think better would be reject atomic writes that cover unwritten
> >>> extents always - but that boat is about to sail...
> >> That's what this patch does.
> >
> > Not really.
> >
> > Currently we have 2x iomap restrictions:
> > a. mapping length must equal fs block size
> > b. bio created must equal total write size
> >
> > This patch just says that the mapping length must equal total write size 
> > (instead of a.). So quite similar to b.
> >
> >> For whatever reason if we couldn't allocate
> >> a single contiguous region of requested size for atomic write, then we
> >> reject the request always, isn't it. Or maybe I didn't understand your comment.
> >
> > As the simplest example, for an atomic write to an empty file, there 
> > should only be a single mapping returned to iomap_dio_bio_iter() and 
> > that would be of IOMAP_UNWRITTEN type. And we don't reject that.
> >
> 
> Ok. Maybe this is what I am missing. Could you please help me understand
> why should such writes be rejected? 
> 
> For e.g. 
> If FS could allocate a single contiguous IOMAP_UNWRITTEN extent of
> atomic write request size, that means - 
> 1. FS will allocate an unwritten extent.
> 2. will do writes (using submit_bio) to the unwritten extent. 
> 3. will do unwritten to written conversion. 
> 
> It is ok if either of the above operations fail right? If (3) fails
> then the region will still be marked unwritten that means it will read
> zero (old contents). (2) can anyway fail and will not result into
> partial writes. (1) will anyway not result into any write whatsoever.
> 
> So we can never have a situation where there is partial writes leading
> to mix of old and new write contents right for such cases? Which is what the
> requirement of atomic/untorn write also is?
> 
> Sorry am I missing something here?

I must be missing something; to perform an untorn write, two things must
happen --

1. The kernel writes the data to the storage device, and the storage
device either persists all of it, or throws back an error having
persisted none of it.

2. If (1) completes successfully, all file mapping updates for the range
written must be persisted, or an error is thrown back and none of them
are persisted.

iomap doesn't have to know how the filesystem satisfies (2); it just has
to create a single bio containing all data pages or it rejects the
write.

Currently, it's an implementation detail that the XFS directio write
ioend code processes the file mapping updates for the range written by
walking every extent mapping for that range and issuing separate
transactions for each mapping update.  There's nothing that can restart
the walk if it is interrupted.  That's why XFS cannot support multi
fsblock untorn writes to blocks with different status.

As I've said before, the most general solution to this would be to add a
new log intent item that would track the "update all mappings in this
file range" operation so that recovery could restart the walk.  This is
the most technically challenging, so we decided not to implement it
until there is demand.

Having set aside the idea of redesigning ioend, the second-most general
solution is pre-zeroing unwritten extents and holes so that
->iomap_begin implementations can present a single mapping to the bio
constructor.  Technically if there's only one unwritten extent or hole
or cow, xfs can actually satisfy (2) because it only creates one
transaction.

This gets me to the third and much less general solution -- only allow
untorn writes if we know that the ioend only ever has to run a single
transaction.  That's why untorn writes are limited to a single fsblock
for now -- it's a simple solution so that we can get our downstream
customers to kick the tires and start on the next iteration instead of
spending years on waterfalling.

Did you notice that in all of these cases, the capabilities of the
filesystem's ioend processing determines the restrictions on the number
and type of mappings that ->iomap_begin can give to iomap?

Now that we have a second system trying to hook up to the iomap support,
it's clear to me that the restrictions on mappings are specific to each
filesystem.  Therefore, the iomap directio code should not impose
restrictions on the mappings it receives unless they would prevent the
creation of the single aligned bio.

Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
EINVAL or something if they look at the file mappings and discover that
they cannot perform the ioend without risking torn mapping updates.  In
the long run, ->iomap_begin is where this iomap->len <= iter->len check
really belongs, but hold that thought.

For the multi fsblock case, the ->iomap_begin functions would have to
check that only one metadata update would be necessary in the ioend.
That's where things get murky, since ext4/xfs drop their mapping locks
between calls to ->iomap_begin.  So you'd have to check all the mappings
for unsupported mixed state every time.  Yuck.

It might be less gross to retain the restriction that iomap accepts only
one mapping for the entire file range, like Ritesh has here.  Users
might be ok with us saying that you can't do a 16k atomic write to a
region where you previously did an 8k write until you write the other
8k, even if someone has to write zeroes.  Users might be ok with the
kernel allowing multi-fsblock writes but only if the stars align.  But
to learn the answers to those questions, we have to put /something/ in
the hands of our users.

For now (because we're already at -rc5), let's have xfs/ext4's
->write_iter implementations restrict atomic writes to a single fsblock,
and get both merged into the kernel.  Let's defer the multi fsblock work
to 6.14, though I think we could take this patch.

Does that sound cool?

--D

> >> 
> >> If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks())
> >> for atomic writes in ext4_dio_write_checks(), similar to how we detect
> >> overwrites case to decide whether we need a read v/s write semaphore.
> >> So this can check if the user has a partially allocated extent for the
> >> user requested region and if yes, we can return -EINVAL from
> >> ext4_dio_write_iter() itself.
> >  > > I think this maybe better option than waiting until ->iomap_begin().
> >> This might also bring all atomic write constraints to be checked in one
> >> place i.e. during ext4_file_write_iter() itself.
> >
> > Something like this can be done once we decide how atomic writing to 
> > regions which cover mixed unwritten and written extents is to be handled.
> 
> Mixed extent regions (written + unwritten) is a different case all
> together (which can lead to mix of old and new contents).
> 
> 
> But here what I am suggesting is to add following constraint in case of
> ext4 with bigalloc - 
> 
> "Writes to a region which already has partially allocated extent is not supported."
> 
> That means we will return -EINVAL if we detect above case in
> ext4_file_write_iter() and sure we can document this behavior.
> 
> In retrospect, I am not sure why we cannot add a constraint for atomic
> writes (e.g. for ext4 bigalloc) and reject such writes outright,
> instead of silently incurring a performance penalty by zeroing out the
> partial regions by allowing such write request.
> 
> -ritesh
>

[5/6] iomap: Lift blocksize restriction on atomic writes

Commit Message

Comments

Patch