diff mbox series

[RFC,5/6] fs: xfs: iomap atomic write support

Message ID 20240124142645.9334-6-john.g.garry@oracle.com (mailing list archive)
State New, archived
Headers show
Series block atomic writes for XFS | expand

Commit Message

John Garry Jan. 24, 2024, 2:26 p.m. UTC
Ensure that when creating a mapping that we adhere to all the atomic
write rules.

We check that the mapping covers the complete range of the write to ensure
that we'll be just creating a single mapping.

Currently minimum granularity is the FS block size, but it should be
possibly to support lower in future.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
I am setting this as an RFC as I am not sure on the change in
xfs_iomap_write_direct() - it gives the desired result AFAICS.

 fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

Comments

Darrick J. Wong Feb. 2, 2024, 6:47 p.m. UTC | #1
On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
> Ensure that when creating a mapping that we adhere to all the atomic
> write rules.
> 
> We check that the mapping covers the complete range of the write to ensure
> that we'll be just creating a single mapping.
> 
> Currently minimum granularity is the FS block size, but it should be
> possibly to support lower in future.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> I am setting this as an RFC as I am not sure on the change in
> xfs_iomap_write_direct() - it gives the desired result AFAICS.
> 
>  fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 18c8f168b153..758dc1c90a42 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
>  		}
>  	}
>  
> +	if (xfs_inode_atomicwrites(ip))
> +		bmapi_flags = XFS_BMAPI_ZERO;

Why do we want to write zeroes to the disk if we're allocating space
even if we're not sending an atomic write?

(This might want an explanation for why we're doing this at all -- it's
to avoid unwritten extent conversion, which defeats hardware untorn
writes.)

I think we should support IOCB_ATOMIC when the mapping is unwritten --
the data will land on disk in an untorn fashion, the unwritten extent
conversion on IO completion is itself atomic, and callers still have to
set O_DSYNC to persist anything.  Then we can avoid the cost of
BMAPI_ZERO, because double-writes aren't free.

> +
>  	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>  			rblocks, force, &tp);
>  	if (error)
> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>  	if (error)
>  		goto out_unlock;
>  
> +	if (flags & IOMAP_ATOMIC) {
> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
> +		unsigned int unit_min, unit_max;
> +
> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> +
> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		if ((offset & mp->m_blockmask) ||
> +		    (length & mp->m_blockmask)) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		if (imap.br_blockcount == unit_min_fsb ||
> +		    imap.br_blockcount == unit_max_fsb) {
> +			/* ok if exactly min or max */
> +		} else if (imap.br_blockcount < unit_min_fsb ||
> +			   imap.br_blockcount > unit_max_fsb) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		} else if (!is_power_of_2(imap.br_blockcount)) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		if (imap.br_startoff &&
> +		    imap.br_startoff & (imap.br_blockcount - 1)) {

Not sure why we care about the file position, it's br_startblock that
gets passed into the bio, not br_startoff.

I'm also still not convinced that any of this validation is useful here.
The block device stack underneath the filesystem can change at any time
without any particular notice to the fs, so the only way to find out if
the proposed IO would meet the alignment constraints is to submit_bio
and see what happens.

(The "one bio per untorn write request" thing in the direct-io.c patch
sound sane to me though.)

--D

> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +	}
> +
>  	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
>  		error = -EAGAIN;
>  		if (flags & IOMAP_NOWAIT)
> -- 
> 2.31.1
> 
>
John Garry Feb. 5, 2024, 1:36 p.m. UTC | #2
On 02/02/2024 18:47, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
>> Ensure that when creating a mapping that we adhere to all the atomic
>> write rules.
>>
>> We check that the mapping covers the complete range of the write to ensure
>> that we'll be just creating a single mapping.
>>
>> Currently minimum granularity is the FS block size, but it should be
>> possibly to support lower in future.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> I am setting this as an RFC as I am not sure on the change in
>> xfs_iomap_write_direct() - it gives the desired result AFAICS.
>>
>>   fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 41 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..758dc1c90a42 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
>>   		}
>>   	}
>>   
>> +	if (xfs_inode_atomicwrites(ip))
>> +		bmapi_flags = XFS_BMAPI_ZERO;
> 
> Why do we want to write zeroes to the disk if we're allocating space
> even if we're not sending an atomic write?
> 
> (This might want an explanation for why we're doing this at all -- it's
> to avoid unwritten extent conversion, which defeats hardware untorn
> writes.)

It's to handle the scenario where we have a partially written extent, 
and then try to issue an atomic write which covers the complete extent. 
In this scenario, the iomap code will issue 2x IOs, which is 
unacceptable. So we ensure that the extent is completely written 
whenever we allocate it. At least that is my idea.

> 
> I think we should support IOCB_ATOMIC when the mapping is unwritten --
> the data will land on disk in an untorn fashion, the unwritten extent
> conversion on IO completion is itself atomic, and callers still have to
> set O_DSYNC to persist anything. 

But does this work for the scenario above?

> Then we can avoid the cost of
> BMAPI_ZERO, because double-writes aren't free.

About double-writes not being free, I thought that this was acceptable 
to just have this write zero when initially allocating the extent as it 
should not add too much overhead in practice, i.e. it's one off.

> 
>> +
>>   	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>>   			rblocks, force, &tp);
>>   	if (error)
>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>>   	if (error)
>>   		goto out_unlock;
>>   
>> +	if (flags & IOMAP_ATOMIC) {
>> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
>> +		unsigned int unit_min, unit_max;
>> +
>> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
>> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
>> +
>> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if ((offset & mp->m_blockmask) ||
>> +		    (length & mp->m_blockmask)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if (imap.br_blockcount == unit_min_fsb ||
>> +		    imap.br_blockcount == unit_max_fsb) {
>> +			/* ok if exactly min or max */
>> +		} else if (imap.br_blockcount < unit_min_fsb ||
>> +			   imap.br_blockcount > unit_max_fsb) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		} else if (!is_power_of_2(imap.br_blockcount)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if (imap.br_startoff &&
>> +		    imap.br_startoff & (imap.br_blockcount - 1)) {
> 
> Not sure why we care about the file position, it's br_startblock that
> gets passed into the bio, not br_startoff.

We just want to ensure that the length of the write is valid w.r.t. to 
the offset within the extent, and br_startoff would be the offset within 
the aligned extent.

> 
> I'm also still not convinced that any of this validation is useful here.
> The block device stack underneath the filesystem can change at any time
> without any particular notice to the fs, so the only way to find out if
> the proposed IO would meet the alignment constraints is to submit_bio
> and see what happens.

I am not sure what submit_bio() would do differently. If the block 
device is changing underneath the block layer, then there is where these 
things need to be checked.

> 
> (The "one bio per untorn write request" thing in the direct-io.c patch
> sound sane to me though.)
> 

ok

Thanks,
John
Dave Chinner Feb. 6, 2024, 1:15 a.m. UTC | #3
On Mon, Feb 05, 2024 at 01:36:03PM +0000, John Garry wrote:
> On 02/02/2024 18:47, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
> > > Ensure that when creating a mapping that we adhere to all the atomic
> > > write rules.
> > > 
> > > We check that the mapping covers the complete range of the write to ensure
> > > that we'll be just creating a single mapping.
> > > 
> > > Currently minimum granularity is the FS block size, but it should be
> > > possibly to support lower in future.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > > I am setting this as an RFC as I am not sure on the change in
> > > xfs_iomap_write_direct() - it gives the desired result AFAICS.
> > > 
> > >   fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 41 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 18c8f168b153..758dc1c90a42 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
> > >   		}
> > >   	}
> > > +	if (xfs_inode_atomicwrites(ip))
> > > +		bmapi_flags = XFS_BMAPI_ZERO;

We really, really don't want to be doing this during allocation
unless we can avoid it. If the filesystem block size is 64kB, we
could be allocating up to 96GB per extent, and that becomes an
uninterruptable write stream inside a transaction context that holds
inode metadata locked.

IOWs, if the inode is already dirty, this data zeroing effectively
pins the tail of the journal until the data writes complete, and
hence can potentially stall the entire filesystem for that length of
time.

Historical note: XFS_BMAPI_ZERO was introduced for DAX where
unwritten extents are not used for initial allocation because the
direct zeroing overhead is typically much lower than unwritten
extent conversion overhead.  It was not intended as a general
purpose "zero data at allocation time" solution primarily because of
how easy it would be to DOS the storage with a single, unkillable
fallocate() call on slow storage.

> > Why do we want to write zeroes to the disk if we're allocating space
> > even if we're not sending an atomic write?
> > 
> > (This might want an explanation for why we're doing this at all -- it's
> > to avoid unwritten extent conversion, which defeats hardware untorn
> > writes.)
> 
> It's to handle the scenario where we have a partially written extent, and
> then try to issue an atomic write which covers the complete extent.

When/how would that ever happen with the forcealign bits being set
preventing unaligned allocation and writes?

> In this
> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> ensure that the extent is completely written whenever we allocate it. At
> least that is my idea.

So return an unaligned extent, and then the IOMAP_ATOMIC checks you
add below say "no" and then the application has to do things the
slow, safe way....

> > I think we should support IOCB_ATOMIC when the mapping is unwritten --
> > the data will land on disk in an untorn fashion, the unwritten extent
> > conversion on IO completion is itself atomic, and callers still have to
> > set O_DSYNC to persist anything.
> 
> But does this work for the scenario above?

Probably not, but if we want the mapping to return a single
contiguous extent mapping that spans both unwritten and written
states, then we should directly code that behaviour for atomic
IO and not try to hack around it via XFS_BMAPI_ZERO.

Unwritten extent conversion will already do the right thing in that
it will only convert unwritten regions to written in the larger
range that is passed to it, but if there are multiple regions that
need conversion then the conversion won't be atomic.

> > Then we can avoid the cost of
> > BMAPI_ZERO, because double-writes aren't free.
> 
> About double-writes not being free, I thought that this was acceptable to
> just have this write zero when initially allocating the extent as it should
> not add too much overhead in practice, i.e. it's one off.

The whole point about atomic writes is they are a performance
optimisation. If the cost of enabling atomic writes is that we
double the amount of IO we are doing, then we've lost more
performance than we gained by using atomic writes. That doesn't
seem desirable....

> 
> > 
> > > +
> > >   	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
> > >   			rblocks, force, &tp);
> > >   	if (error)
> > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
> > >   	if (error)
> > >   		goto out_unlock;
> > > +	if (flags & IOMAP_ATOMIC) {
> > > +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
> > > +		unsigned int unit_min, unit_max;
> > > +
> > > +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> > > +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> > > +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> > > +
> > > +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if ((offset & mp->m_blockmask) ||
> > > +		    (length & mp->m_blockmask)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}

That belongs in the iomap DIO setup code, not here. It's also only
checking the data offset/length is filesystem block aligned, not
atomic IO aligned, too.

> > > +
> > > +		if (imap.br_blockcount == unit_min_fsb ||
> > > +		    imap.br_blockcount == unit_max_fsb) {
> > > +			/* ok if exactly min or max */

Why? Exact sizing doesn't imply alignment is correct.

> > > +		} else if (imap.br_blockcount < unit_min_fsb ||
> > > +			   imap.br_blockcount > unit_max_fsb) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;

Why do this after an exact check?

> > > +		} else if (!is_power_of_2(imap.br_blockcount)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;

Why does this matter? If the extent mapping spans a range larger
than was asked for, who cares what size it is as the infrastructure
is only going to do IO for the sub-range in the mapping the user
asked for....

> > > +		}
> > > +
> > > +		if (imap.br_startoff &&
> > > +		    imap.br_startoff & (imap.br_blockcount - 1)) {
> > 
> > Not sure why we care about the file position, it's br_startblock that
> > gets passed into the bio, not br_startoff.
> 
> We just want to ensure that the length of the write is valid w.r.t. to the
> offset within the extent, and br_startoff would be the offset within the
> aligned extent.

I'm not sure why the filesystem extent mapping code needs to care
about IOMAP_ATOMIC like this - the extent allocation behaviour is
determined by the inode forcealign flag, not IOMAP_ATOMIC.
Everything else we have to do is just mapping the offset/len that
was passed to it from the iomap DIO layer. As long as we allocate
with correct alignment and return a mapping that spans the start
offset of the requested range, we've done our job here.

Actually determining if the mapping returned for IO is suitable for
the type of IO we are doing (i.e. IOMAP_ATOMIC) is the
responsibility of the iomap infrastructure. The same checks will
have to be done for every filesystem that implements atomic writes,
so these checks belong in the generic code, not the filesystem
mapping callouts.

-Dave
John Garry Feb. 6, 2024, 9:53 a.m. UTC | #4
Hi Dave,

>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>> index 18c8f168b153..758dc1c90a42 100644
>>>> --- a/fs/xfs/xfs_iomap.c
>>>> +++ b/fs/xfs/xfs_iomap.c
>>>> @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
>>>>    		}
>>>>    	}
>>>> +	if (xfs_inode_atomicwrites(ip))
>>>> +		bmapi_flags = XFS_BMAPI_ZERO;
> 
> We really, really don't want to be doing this during allocation
> unless we can avoid it. If the filesystem block size is 64kB, we
> could be allocating up to 96GB per extent, and that becomes an
> uninterruptable write stream inside a transaction context that holds
> inode metadata locked.

Where does that 96GB figure come from?

> 
> IOWs, if the inode is already dirty, this data zeroing effectively
> pins the tail of the journal until the data writes complete, and
> hence can potentially stall the entire filesystem for that length of
> time.
> 
> Historical note: XFS_BMAPI_ZERO was introduced for DAX where
> unwritten extents are not used for initial allocation because the
> direct zeroing overhead is typically much lower than unwritten
> extent conversion overhead.  It was not intended as a general
> purpose "zero data at allocation time" solution primarily because of
> how easy it would be to DOS the storage with a single, unkillable
> fallocate() call on slow storage.

Understood

> 
>>> Why do we want to write zeroes to the disk if we're allocating space
>>> even if we're not sending an atomic write?
>>>
>>> (This might want an explanation for why we're doing this at all -- it's
>>> to avoid unwritten extent conversion, which defeats hardware untorn
>>> writes.)
>>
>> It's to handle the scenario where we have a partially written extent, and
>> then try to issue an atomic write which covers the complete extent.
> 
> When/how would that ever happen with the forcealign bits being set
> preventing unaligned allocation and writes?

Consider this scenario:

# mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
# mount /dev/sda mnt -o rtdev=/dev/sdb
# touch  mnt/file
# /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
write, 4096B at pos 0
# filefrag -v mnt/file
Filesystem type is: 58465342
File size of mnt/file is 4096 (1 block of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected:
flags:
     0:        0..       0:         24..        24:      1:
last,eof
mnt/file: 1 extent found
# /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
wrote -1 bytes at pos 0 write_size=16384
#

For the 2nd write, which would cover a 16KB extent, the iomap code will 
iter twice and produce 2x BIOs, which we don't want - that's why it 
errors there.

With the change in this patch, instead we have something like this after 
the first write:

# /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
wrote 4096 bytes at pos 0 write_size=4096
# filefrag -v mnt/file
Filesystem type is: 58465342
File size of mnt/file is 4096 (1 block of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected:
flags:
     0:        0..       3:         24..        27:      4:
last,eof
mnt/file: 1 extent found
#

So the 16KB extent is in written state and the 2nd 16KB write would iter 
once, producing a single BIO.

> 
>> In this
>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
>> ensure that the extent is completely written whenever we allocate it. At
>> least that is my idea.
> 
> So return an unaligned extent, and then the IOMAP_ATOMIC checks you
> add below say "no" and then the application has to do things the
> slow, safe way....

We have been porting atomic write support to some database apps and they 
(database developers) have had to do something like manually zero the 
complete file to get around this issue, but that's not a good user 
experience.

Note that in their case the first 4KB write is non-atomic, but that does 
not change things. They do these 4KB writes in some DB setup phase.

> 
>>> I think we should support IOCB_ATOMIC when the mapping is unwritten --
>>> the data will land on disk in an untorn fashion, the unwritten extent
>>> conversion on IO completion is itself atomic, and callers still have to
>>> set O_DSYNC to persist anything.
>>
>> But does this work for the scenario above?
> 
> Probably not, but if we want the mapping to return a single
> contiguous extent mapping that spans both unwritten and written
> states, then we should directly code that behaviour for atomic
> IO and not try to hack around it via XFS_BMAPI_ZERO.
> 
> Unwritten extent conversion will already do the right thing in that
> it will only convert unwritten regions to written in the larger
> range that is passed to it, but if there are multiple regions that
> need conversion then the conversion won't be atomic.

We would need something atomic.

> 
>>> Then we can avoid the cost of
>>> BMAPI_ZERO, because double-writes aren't free.
>>
>> About double-writes not being free, I thought that this was acceptable to
>> just have this write zero when initially allocating the extent as it should
>> not add too much overhead in practice, i.e. it's one off.
> 
> The whole point about atomic writes is they are a performance
> optimisation. If the cost of enabling atomic writes is that we
> double the amount of IO we are doing, then we've lost more
> performance than we gained by using atomic writes. That doesn't
> seem desirable....

But the zero'ing is a one off per extent allocation, right? I would 
expect just an initial overhead when the database is being created/extended.

Anyway, I did mark this as an RFC for this same reason.

> 
>>
>>>
>>>> +
>>>>    	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>>>>    			rblocks, force, &tp);
>>>>    	if (error)
>>>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>>>>    	if (error)
>>>>    		goto out_unlock;
>>>> +	if (flags & IOMAP_ATOMIC) {
>>>> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
>>>> +		unsigned int unit_min, unit_max;
>>>> +
>>>> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>>>> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
>>>> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
>>>> +
>>>> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
>>>> +			error = -EINVAL;
>>>> +			goto out_unlock;
>>>> +		}
>>>> +
>>>> +		if ((offset & mp->m_blockmask) ||
>>>> +		    (length & mp->m_blockmask)) {
>>>> +			error = -EINVAL;
>>>> +			goto out_unlock;
>>>> +		}
> 
> That belongs in the iomap DIO setup code, not here. It's also only
> checking the data offset/length is filesystem block aligned, not
> atomic IO aligned, too.

hmmm... I'm not sure about that. Initially XFS will only support writes 
whose size is a multiple of FS block size, and this is what we are 
checking here, even if it is not obvious.

The idea is that we can first ensure size is a multiple of FS blocksize, 
and then can use br_blockcount directly, below.

> 
>>>> +
>>>> +		if (imap.br_blockcount == unit_min_fsb ||
>>>> +		    imap.br_blockcount == unit_max_fsb) {
>>>> +			/* ok if exactly min or max */
> 
> Why? Exact sizing doesn't imply alignment is correct.

We're not checking alignment specifically, but just checking that the 
size is ok.

> 
>>>> +		} else if (imap.br_blockcount < unit_min_fsb ||
>>>> +			   imap.br_blockcount > unit_max_fsb) {
>>>> +			error = -EINVAL;
>>>> +			goto out_unlock;
> 
> Why do this after an exact check?

And this is a continuation of the size check.

> 
>>>> +		} else if (!is_power_of_2(imap.br_blockcount)) {
>>>> +			error = -EINVAL;
>>>> +			goto out_unlock;
> 
> Why does this matter? If the extent mapping spans a range larger
> than was asked for, who cares what size it is as the infrastructure
> is only going to do IO for the sub-range in the mapping the user
> asked for....

ok, so where would be a better check for power-of-2 write length? In 
iomap DIO code?

I was thinking of doing that, but not so happy with sparse checks.

> 
>>>> +		}
>>>> +
>>>> +		if (imap.br_startoff &&
>>>> +		    imap.br_startoff & (imap.br_blockcount - 1)) {
>>>
>>> Not sure why we care about the file position, it's br_startblock that
>>> gets passed into the bio, not br_startoff.
>>
>> We just want to ensure that the length of the write is valid w.r.t. to the
>> offset within the extent, and br_startoff would be the offset within the
>> aligned extent.
> 
> I'm not sure why the filesystem extent mapping code needs to care
> about IOMAP_ATOMIC like this - the extent allocation behaviour is
> determined by the inode forcealign flag, not IOMAP_ATOMIC.
> Everything else we have to do is just mapping the offset/len that
> was passed to it from the iomap DIO layer. As long as we allocate
> with correct alignment and return a mapping that spans the start
> offset of the requested range, we've done our job here.
> 
> Actually determining if the mapping returned for IO is suitable for
> the type of IO we are doing (i.e. IOMAP_ATOMIC) is the
> responsibility of the iomap infrastructure. The same checks will
> have to be done for every filesystem that implements atomic writes,
> so these checks belong in the generic code, not the filesystem
> mapping callouts.

We can move some of these checks to the core iomap code.

However, the core iomap code does not know FS atomic write min and max 
per inode, so we need some checks here.

Thanks,
John
Dave Chinner Feb. 7, 2024, 12:06 a.m. UTC | #5
On Tue, Feb 06, 2024 at 09:53:11AM +0000, John Garry wrote:
> Hi Dave,
> 
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > index 18c8f168b153..758dc1c90a42 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
> > > > >    		}
> > > > >    	}
> > > > > +	if (xfs_inode_atomicwrites(ip))
> > > > > +		bmapi_flags = XFS_BMAPI_ZERO;
> > 
> > We really, really don't want to be doing this during allocation
> > unless we can avoid it. If the filesystem block size is 64kB, we
> > could be allocating up to 96GB per extent, and that becomes an
> > uninterruptable write stream inside a transaction context that holds
> > inode metadata locked.
> 
> Where does that 96GB figure come from?

My inability to do math. The actual number is 128GB.

Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size.
	        = 2^21 * fs block size.

So for a 4kB block size filesystem, that's 8GB max extent length,
and that's the most we will allocate in a single transaction (i.e.
one new BMBT record).

For 64kB block size, we can get 128GB of space allocated in a single
transaction.

> > > > Why do we want to write zeroes to the disk if we're allocating space
> > > > even if we're not sending an atomic write?
> > > > 
> > > > (This might want an explanation for why we're doing this at all -- it's
> > > > to avoid unwritten extent conversion, which defeats hardware untorn
> > > > writes.)
> > > 
> > > It's to handle the scenario where we have a partially written extent, and
> > > then try to issue an atomic write which covers the complete extent.
> > 
> > When/how would that ever happen with the forcealign bits being set
> > preventing unaligned allocation and writes?
> 
> Consider this scenario:
> 
> # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
> # mount /dev/sda mnt -o rtdev=/dev/sdb
> # touch  mnt/file
> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
> write, 4096B at pos 0

Please don't write one-off custom test programs to issue IO - please
use and enhance xfs_io so the test cases can then be put straight
into fstests without adding yet another "do some minor IO variant"
test program. This also means you don't need a random assortment of
other tools.

i.e.

# xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file

Should do an RWF_ATOMIC IO, and

# xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file

should do an RWF_ATOMIC|RWF_DSYNC IO...


> # filefrag -v mnt/file

xfs_io -c "fiemap" mnt/file

> Filesystem type is: 58465342
> File size of mnt/file is 4096 (1 block of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected:
> flags:
>     0:        0..       0:         24..        24:      1:
> last,eof
> mnt/file: 1 extent found
> # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
> wrote -1 bytes at pos 0 write_size=16384
> #

Whole test as one repeatable command:

# xfs_io -d -c "truncate 0" -c "chattr +r" \
	-c "pwrite -VAD 0 4096" \
	-c "fiemap" \
	-c "pwrite -VAD 0 16384" \
	/mnt/root/file
> 
> For the 2nd write, which would cover a 16KB extent, the iomap code will iter
> twice and produce 2x BIOs, which we don't want - that's why it errors there.

Yes, but I think that's a feature.  You've optimised the filesystem
layout for IO that is 64kB sized and aligned IO, but your test case
is mixing 4kB and 16KB IO. The filesystem should be telling you that
you're doing something that is sub-optimal for it's configuration,
and refusing to do weird overlapping sub-rtextsize atomic IO is a
pretty good sign that you've got something wrong.

The whole reason for rtextsize existing is to optimise the rtextent
allocation to the typical minimum IO size done to that volume. If
all your IO is sub-rtextsize size and alignment, then all that has
been done is forcing the entire rt device IO into a corner it was
never really intended nor optimised for.

Why should we jump through crazy hoops to try to make filesystems
optimised for large IOs with mismatched, overlapping small atomic
writes?

> With the change in this patch, instead we have something like this after the
> first write:
> 
> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
> wrote 4096 bytes at pos 0 write_size=4096
> # filefrag -v mnt/file
> Filesystem type is: 58465342
> File size of mnt/file is 4096 (1 block of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected:
> flags:
>     0:        0..       3:         24..        27:      4:
> last,eof
> mnt/file: 1 extent found
> #
> 
> So the 16KB extent is in written state and the 2nd 16KB write would iter
> once, producing a single BIO.

Sure, I know how it works. My point is that it's a terrible way to
go about allowing that second atomic write to succeed.

> > > In this
> > > scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> > > ensure that the extent is completely written whenever we allocate it. At
> > > least that is my idea.
> > 
> > So return an unaligned extent, and then the IOMAP_ATOMIC checks you
> > add below say "no" and then the application has to do things the
> > slow, safe way....
> 
> We have been porting atomic write support to some database apps and they
> (database developers) have had to do something like manually zero the
> complete file to get around this issue, but that's not a good user
> experience.

Better the application zeros the file when it is being initialised
and doesn't have performance constraints rather than forcing the
filesystem to do it in the IO fast path when IO performance and
latency actually matters to the application.

There are production databases that already do this manual zero
initialisation to avoid unwritten extent conversion overhead during
runtime operation. That's because they want FUA writes to work, and
that gives 25% better IO performance over the same O_DSYNC writes
doing allocation and/or unwritten extent conversion after
fallocate() which requires journal flushes with O_DSYNC writes.

Using atomic writes is no different.

> Note that in their case the first 4KB write is non-atomic, but that does not
> change things. They do these 4KB writes in some DB setup phase.

And therein lies the problem.

If you are doing sub-rtextent IO at all, then you are forcing the
filesystem down the path of explicitly using unwritten extents and
requiring O_DSYNC direct IO to do journal flushes in IO completion
context and then performance just goes down hill from them.

The requirement for unwritten extents to track sub-rtextsize written
regions is what you're trying to work around with XFS_BMAPI_ZERO so
that atomic writes will always see "atomic write aligned" allocated
regions.

Do you see the problem here? You've explicitly told the filesystem
that allocation is aligned to 64kB chunks, then because the
filesystem block size is 4kB, it's allowed to track unwritten
regions at 4kB boundaries. Then you do 4kB aligned file IO, which
then changes unwritten extents at 4kB boundaries. Then you do a
overlapping 16kB IO that *requires* 16kB allocation alignment, and
things go BOOM.

Yes, they should go BOOM.

This is a horrible configuration - it is incomaptible with 16kB
aligned and sized atomic IO. Allocation is aligned to 64kB, written
region tracking is aligned to 4kB, and there's nothing to tell the
filesystem that it should be maintaining 16kB "written alignment" so
that 16kB atomic writes can always be issued atomically.

i.e. if we are going to do 16kB aligned atomic IO, then all the
allocation and unwritten tracking needs to be done in 16kB aligned
chunks, not 4kB. That means a 4KB write into an unwritten region or
a hole actually needs to zero the rest of the 16KB range it sits
within.

The direct IO code can do this, but it needs extension of the
unaligned IO serialisation in XFS (the alignment checks in
xfs_file_dio_write()) and the the sub-block zeroing in
iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
allocation size, not the fsblock size) to do this safely.

Regardless of how we do it, all IO concurrency on this file is shot
if we have sub-rtextent sized IOs being done. That is true even with
this patch set - XFS_BMAPI_ZERO is done whilst holding the
XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
zeroing is being done.

IOWs, anything to do with sub-rtextent IO really has to be treated
like sub-fsblock DIO - i.e. exclusive inode access until the
sub-rtextent zeroing has been completed.

> > > > I think we should support IOCB_ATOMIC when the mapping is unwritten --
> > > > the data will land on disk in an untorn fashion, the unwritten extent
> > > > conversion on IO completion is itself atomic, and callers still have to
> > > > set O_DSYNC to persist anything.
> > > 
> > > But does this work for the scenario above?
> > 
> > Probably not, but if we want the mapping to return a single
> > contiguous extent mapping that spans both unwritten and written
> > states, then we should directly code that behaviour for atomic
> > IO and not try to hack around it via XFS_BMAPI_ZERO.
> > 
> > Unwritten extent conversion will already do the right thing in that
> > it will only convert unwritten regions to written in the larger
> > range that is passed to it, but if there are multiple regions that
> > need conversion then the conversion won't be atomic.
> 
> We would need something atomic.
> 
> > 
> > > > Then we can avoid the cost of
> > > > BMAPI_ZERO, because double-writes aren't free.
> > > 
> > > About double-writes not being free, I thought that this was acceptable to
> > > just have this write zero when initially allocating the extent as it should
> > > not add too much overhead in practice, i.e. it's one off.
> > 
> > The whole point about atomic writes is they are a performance
> > optimisation. If the cost of enabling atomic writes is that we
> > double the amount of IO we are doing, then we've lost more
> > performance than we gained by using atomic writes. That doesn't
> > seem desirable....
> 
> But the zero'ing is a one off per extent allocation, right? I would expect
> just an initial overhead when the database is being created/extended.

So why can't the application do that manually like others already do
to enable FUA optimisations for O_DSYNC writes?

FWIW, we probably should look to extend fallocate() to allow
userspace to say "write real zeroes, not fake ones" so the
filesystem can call blkdev_issue_zeroout() after preallocation to
offload the zeroing to the hardware and then clear the unwritten
bits on the preallocated range...

> > > > >    	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
> > > > >    			rblocks, force, &tp);
> > > > >    	if (error)
> > > > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
> > > > >    	if (error)
> > > > >    		goto out_unlock;
> > > > > +	if (flags & IOMAP_ATOMIC) {
> > > > > +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
> > > > > +		unsigned int unit_min, unit_max;
> > > > > +
> > > > > +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> > > > > +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> > > > > +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> > > > > +
> > > > > +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> > > > > +			error = -EINVAL;
> > > > > +			goto out_unlock;
> > > > > +		}
> > > > > +
> > > > > +		if ((offset & mp->m_blockmask) ||
> > > > > +		    (length & mp->m_blockmask)) {
> > > > > +			error = -EINVAL;
> > > > > +			goto out_unlock;
> > > > > +		}
> > 
> > That belongs in the iomap DIO setup code, not here. It's also only
> > checking the data offset/length is filesystem block aligned, not
> > atomic IO aligned, too.
> 
> hmmm... I'm not sure about that. Initially XFS will only support writes
> whose size is a multiple of FS block size, and this is what we are checking
> here, even if it is not obvious.

Which means, initially, iomap only supposed writes that are a
multiple of filesystem block size. regardless, this should be
checked in the submission path, not in the extent mapping callback.

FWIW, we've already established above that iomap needs to handle
rtextsize chunks rather than fs block size for correct zeroing
behaviour for atomic writes, so this probably just needs to go away.

> The idea is that we can first ensure size is a multiple of FS blocksize, and
> then can use br_blockcount directly, below.

Yes, and we can do all these checks on the iomap that we return to
the iomap layer. All this is doing is running the checks on the XFS
imap before it is formatted into the iomap iomap and returned to the
iomap layer. These checks can be done on the returned iomap in the
iomap layer if IOMAP_ATOMIC is set....

> However, the core iomap code does not know FS atomic write min and max per
> inode, so we need some checks here.

So maybe we should pass them down to iomap in the iocb when
IOCB_ATOMIC is set, or reject the IO at the filesytem layer when
checking atomic IO alignment before we pass the IO to the iomap
layer...

-Dave.
John Garry Feb. 7, 2024, 2:13 p.m. UTC | #6
On 07/02/2024 00:06, Dave Chinner wrote:
>>> We really, really don't want to be doing this during allocation
>>> unless we can avoid it. If the filesystem block size is 64kB, we
>>> could be allocating up to 96GB per extent, and that becomes an
>>> uninterruptable write stream inside a transaction context that holds
>>> inode metadata locked.
>> Where does that 96GB figure come from?
> My inability to do math. The actual number is 128GB.
> 
> Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size.
> 	        = 2^21  * fs block size.
> 
> So for a 4kB block size filesystem, that's 8GB max extent length,
> and that's the most we will allocate in a single transaction (i.e.
> one new BMBT record).
> 
> For 64kB block size, we can get 128GB of space allocated in a single
> transaction.

atomic write unit max theoretical upper limit is 
rounddown_power_of_2(2^32 - 1) = 2GB

So this would be what is expected to be the largest extent size 
requested for atomic writes. I am not saying that 2GB is small, but 
certainly much smaller than 128GB.

> 
>>>>> Why do we want to write zeroes to the disk if we're allocating space
>>>>> even if we're not sending an atomic write?
>>>>>
>>>>> (This might want an explanation for why we're doing this at all -- it's
>>>>> to avoid unwritten extent conversion, which defeats hardware untorn
>>>>> writes.)
>>>> It's to handle the scenario where we have a partially written extent, and
>>>> then try to issue an atomic write which covers the complete extent.
>>> When/how would that ever happen with the forcealign bits being set
>>> preventing unaligned allocation and writes?
>> Consider this scenario:
>>
>> # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
>> # mount /dev/sda mnt -o rtdev=/dev/sdb
>> # touch  mnt/file
>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
>> write, 4096B at pos 0
> Please don't write one-off custom test programs to issue IO - please
> use and enhance xfs_io so the test cases can then be put straight
> into fstests without adding yet another "do some minor IO variant"
> test program. This also means you don't need a random assortment of
> other tools.
> 
> i.e.
> 
> # xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file
> 
> Should do an RWF_ATOMIC IO, and
> 
> # xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file
> 
> should do an RWF_ATOMIC|RWF_DSYNC IO...
> 
> 
>> # filefrag -v mnt/file
> xfs_io -c "fiemap" mnt/file

Fine, but I like using something generic for accessing block devices and 
also other FSes. I didn't think that xfs_io can do that.

Anyway, we can look to add atomic write support to xfs_io and any other 
xfs-progs

> 
>> Filesystem type is: 58465342
>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>    ext:     logical_offset:        physical_offset: length:   expected:
>> flags:
>>      0:        0..       0:         24..        24:      1:
>> last,eof
>> mnt/file: 1 extent found
>> # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
>> wrote -1 bytes at pos 0 write_size=16384
>> #
> Whole test as one repeatable command:
> 
> # xfs_io -d -c "truncate 0" -c "chattr +r" \
> 	-c "pwrite -VAD 0 4096" \
> 	-c "fiemap" \
> 	-c "pwrite -VAD 0 16384" \
> 	/mnt/root/file
>> For the 2nd write, which would cover a 16KB extent, the iomap code will iter
>> twice and produce 2x BIOs, which we don't want - that's why it errors there.
> Yes, but I think that's a feature.  You've optimised the filesystem
> layout for IO that is 64kB sized and aligned IO, but your test case
> is mixing 4kB and 16KB IO. The filesystem should be telling you that
> you're doing something that is sub-optimal for it's configuration,
> and refusing to do weird overlapping sub-rtextsize atomic IO is a
> pretty good sign that you've got something wrong.

Then we really end up with a strange behavior for the user. I mean, the 
user may ask - "why did this 16KB atomic write pass and this one fail? 
I'm following all the rules", and then "No one said not to mix write 
sizes or not mix atomic and non-atomic writes, so should be ok. Indeed, 
that earlier 4K write for the same region passed".

Playing devil's advocate here, at least this behavior should be documented.

> 
> The whole reason for rtextsize existing is to optimise the rtextent
> allocation to the typical minimum IO size done to that volume. If
> all your IO is sub-rtextsize size and alignment, then all that has
> been done is forcing the entire rt device IO into a corner it was
> never really intended nor optimised for.

Sure, but just because we are optimized for a certain IO write size 
should not mean that other writes are disallowed or quite problematic.

> 
> Why should we jump through crazy hoops to try to make filesystems
> optimised for large IOs with mismatched, overlapping small atomic
> writes?

As mentioned, typically the atomic writes will be the same size, but we 
may have other writes of smaller size.

> 
>> With the change in this patch, instead we have something like this after the
>> first write:
>>
>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
>> wrote 4096 bytes at pos 0 write_size=4096
>> # filefrag -v mnt/file
>> Filesystem type is: 58465342
>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>    ext:     logical_offset:        physical_offset: length:   expected:
>> flags:
>>      0:        0..       3:         24..        27:      4:
>> last,eof
>> mnt/file: 1 extent found
>> #
>>
>> So the 16KB extent is in written state and the 2nd 16KB write would iter
>> once, producing a single BIO.
> Sure, I know how it works. My point is that it's a terrible way to
> go about allowing that second atomic write to succeed.
I think 'terrible' is a bit too strong a word here. Indeed, you suggest 
to manually zero the file to solve this problem, below, while this code 
change does the same thing automatically.

BTW, there was a request for behavior like in this patch. Please see 
this discussion on the ext4 atomic writes port:

https://lore.kernel.org/linux-ext4/ZXhb0tKFvAge%2FGWf@infradead.org/

So we should have some solution where the kernel automatically takes 
care of this unwritten extent issue.

> 
>>>> In this
>>>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
>>>> ensure that the extent is completely written whenever we allocate it. At
>>>> least that is my idea.
>>> So return an unaligned extent, and then the IOMAP_ATOMIC checks you
>>> add below say "no" and then the application has to do things the
>>> slow, safe way....
>> We have been porting atomic write support to some database apps and they
>> (database developers) have had to do something like manually zero the
>> complete file to get around this issue, but that's not a good user
>> experience.
> Better the application zeros the file when it is being initialised
> and doesn't have performance constraints rather than forcing the
> filesystem to do it in the IO fast path when IO performance and
> latency actually matters to the application.

Can't we do both? I mean, the well-informed user can still pre-zero the 
file just to ensure we aren't doing this zero'ing with the extent 
allocation.

> 
> There are production databases that already do this manual zero
> initialisation to avoid unwritten extent conversion overhead during
> runtime operation. That's because they want FUA writes to work, and
> that gives 25% better IO performance over the same O_DSYNC writes
> doing allocation and/or unwritten extent conversion after
> fallocate() which requires journal flushes with O_DSYNC writes.
> 
> Using atomic writes is no different.

Sure, and as I said, they can do both. The user is already wise enough 
to zero the file for other performance reasons (like FUA writes).

> 
>> Note that in their case the first 4KB write is non-atomic, but that does not
>> change things. They do these 4KB writes in some DB setup phase.

JFYI, these 4K writes were in some compressed mode in the DB setup 
phase, hence the smaller size.

> And therein lies the problem.
> 
> If you are doing sub-rtextent IO at all, then you are forcing the
> filesystem down the path of explicitly using unwritten extents and
> requiring O_DSYNC direct IO to do journal flushes in IO completion
> context and then performance just goes down hill from them.
> 
> The requirement for unwritten extents to track sub-rtextsize written
> regions is what you're trying to work around with XFS_BMAPI_ZERO so
> that atomic writes will always see "atomic write aligned" allocated
> regions.
> 
> Do you see the problem here? You've explicitly told the filesystem
> that allocation is aligned to 64kB chunks, then because the
> filesystem block size is 4kB, it's allowed to track unwritten
> regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> then changes unwritten extents at 4kB boundaries. Then you do a
> overlapping 16kB IO that*requires*  16kB allocation alignment, and
> things go BOOM.
> 
> Yes, they should go BOOM.
> 
> This is a horrible configuration - it is incomaptible with 16kB
> aligned and sized atomic IO. 

Just because the DB may do 16KB atomic writes most of the time should 
not disallow it from any other form of writes.

Indeed at 
https://lore.kernel.org/linux-nvme/ZSR4jeSKlppLWjQy@dread.disaster.area/ 
you wrote "Not every IO to every file needs to be atomic." (sorry for 
quoting you)

So the user can do other regular writes, but you say that they should be 
always writing full extents. This just may not suit some DBs.

> Allocation is aligned to 64kB, written
> region tracking is aligned to 4kB, and there's nothing to tell the
> filesystem that it should be maintaining 16kB "written alignment" so
> that 16kB atomic writes can always be issued atomically.
> 
> i.e. if we are going to do 16kB aligned atomic IO, then all the
> allocation and unwritten tracking needs to be done in 16kB aligned
> chunks, not 4kB. That means a 4KB write into an unwritten region or
> a hole actually needs to zero the rest of the 16KB range it sits
> within.
> 
> The direct IO code can do this, but it needs extension of the
> unaligned IO serialisation in XFS (the alignment checks in
> xfs_file_dio_write()) and the the sub-block zeroing in
> iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
> allocation size, not the fsblock size) to do this safely.
> 
> Regardless of how we do it, all IO concurrency on this file is shot
> if we have sub-rtextent sized IOs being done. That is true even with
> this patch set - XFS_BMAPI_ZERO is done whilst holding the
> XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
> zeroing is being done.
> 
> IOWs, anything to do with sub-rtextent IO really has to be treated
> like sub-fsblock DIO - i.e. exclusive inode access until the
> sub-rtextent zeroing has been completed.

I do understand that this is not perfect that we may have mixed block 
sizes being written, but I don't think that we should disallow it and 
throw an error.

I would actually think that the worst thing is that the user does not 
know this restriction.

> 
>>>>> I think we should support IOCB_ATOMIC when the mapping is unwritten --
>>>>> the data will land on disk in an untorn fashion, the unwritten extent
>>>>> conversion on IO completion is itself atomic, and callers still have to
>>>>> set O_DSYNC to persist anything.
>>>> But does this work for the scenario above?
>>> Probably not, but if we want the mapping to return a single
>>> contiguous extent mapping that spans both unwritten and written
>>> states, then we should directly code that behaviour for atomic
>>> IO and not try to hack around it via XFS_BMAPI_ZERO.
>>>
>>> Unwritten extent conversion will already do the right thing in that
>>> it will only convert unwritten regions to written in the larger
>>> range that is passed to it, but if there are multiple regions that
>>> need conversion then the conversion won't be atomic.
>> We would need something atomic.
>>
>>>>> Then we can avoid the cost of
>>>>> BMAPI_ZERO, because double-writes aren't free.
>>>> About double-writes not being free, I thought that this was acceptable to
>>>> just have this write zero when initially allocating the extent as it should
>>>> not add too much overhead in practice, i.e. it's one off.
>>> The whole point about atomic writes is they are a performance
>>> optimisation. If the cost of enabling atomic writes is that we
>>> double the amount of IO we are doing, then we've lost more
>>> performance than we gained by using atomic writes. That doesn't
>>> seem desirable....
>> But the zero'ing is a one off per extent allocation, right? I would expect
>> just an initial overhead when the database is being created/extended.
> So why can't the application do that manually like others already do
> to enable FUA optimisations for O_DSYNC writes?

Is this even officially documented as advice or a suggestion for users?

> 
> FWIW, we probably should look to extend fallocate() to allow
> userspace to say "write real zeroes, not fake ones" so the
> filesystem can call blkdev_issue_zeroout() after preallocation to
> offload the zeroing to the hardware and then clear the unwritten
> bits on the preallocated range...

ack

> 
>>>>>>     	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>>>>>>     			rblocks, force, &tp);
>>>>>>     	if (error)
>>>>>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>>>>>>     	if (error)
>>>>>>     		goto out_unlock;
>>>>>> +	if (flags & IOMAP_ATOMIC) {
>>>>>> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
>>>>>> +		unsigned int unit_min, unit_max;
>>>>>> +
>>>>>> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>>>>>> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
>>>>>> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
>>>>>> +
>>>>>> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
>>>>>> +			error = -EINVAL;
>>>>>> +			goto out_unlock;
>>>>>> +		}
>>>>>> +
>>>>>> +		if ((offset & mp->m_blockmask) ||
>>>>>> +		    (length & mp->m_blockmask)) {
>>>>>> +			error = -EINVAL;
>>>>>> +			goto out_unlock;
>>>>>> +		}
>>> That belongs in the iomap DIO setup code, not here. It's also only
>>> checking the data offset/length is filesystem block aligned, not
>>> atomic IO aligned, too.
>> hmmm... I'm not sure about that. Initially XFS will only support writes
>> whose size is a multiple of FS block size, and this is what we are checking
>> here, even if it is not obvious.
> Which means, initially, iomap only supposed writes that are a
> multiple of filesystem block size. regardless, this should be
> checked in the submission path, not in the extent mapping callback.
> 
> FWIW, we've already established above that iomap needs to handle
> rtextsize chunks rather than fs block size for correct zeroing
> behaviour for atomic writes, so this probably just needs to go away.

Fine, I think that all this can be moved to iomap core / removed.

> 
>> The idea is that we can first ensure size is a multiple of FS blocksize, and
>> then can use br_blockcount directly, below.
> Yes, and we can do all these checks on the iomap that we return to
> the iomap layer. All this is doing is running the checks on the XFS
> imap before it is formatted into the iomap iomap and returned to the
> iomap layer. These checks can be done on the returned iomap in the
> iomap layer if IOMAP_ATOMIC is set....
> 
>> However, the core iomap code does not know FS atomic write min and max per
>> inode, so we need some checks here.
> So maybe we should pass them down to iomap in the iocb when
> IOCB_ATOMIC is set, or reject the IO at the filesytem layer when
> checking atomic IO alignment before we pass the IO to the iomap
> layer...

Yes, I think that something like this is possible.

As for using kiocb, it's quite a small structure, so I can't imagine 
that it would be welcome to add all this atomic write stuff.

So we could add extra awu min and max args to iomao_dio_rw(), and these 
could be filled in by the calling FS. But that function already has 
enough args.

Alternatively we could add an iomap_ops method to look up awu min and 
max for the inode.

Thanks,
John
Dave Chinner Feb. 9, 2024, 1:40 a.m. UTC | #7
On Wed, Feb 07, 2024 at 02:13:23PM +0000, John Garry wrote:
> On 07/02/2024 00:06, Dave Chinner wrote:
> > > > We really, really don't want to be doing this during allocation
> > > > unless we can avoid it. If the filesystem block size is 64kB, we
> > > > could be allocating up to 96GB per extent, and that becomes an
> > > > uninterruptable write stream inside a transaction context that holds
> > > > inode metadata locked.
> > > Where does that 96GB figure come from?
> > My inability to do math. The actual number is 128GB.
> > 
> > Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size.
> > 	        = 2^21  * fs block size.
> > 
> > So for a 4kB block size filesystem, that's 8GB max extent length,
> > and that's the most we will allocate in a single transaction (i.e.
> > one new BMBT record).
> > 
> > For 64kB block size, we can get 128GB of space allocated in a single
> > transaction.
> 
> atomic write unit max theoretical upper limit is rounddown_power_of_2(2^32 -
> 1) = 2GB
> 
> So this would be what is expected to be the largest extent size requested
> for atomic writes. I am not saying that 2GB is small, but certainly much
> smaller than 128GB.

*cough*

Extent size hints.

I'm a little disappointed that after all these discussions about how
we decouple extent allocation size and alignment from the user IO
size and alignment with things like extent size hints, force align,
etc that you are still thinking that user IO size and alignment
directly drives extent allocation size and alignment....


> > > > > > Why do we want to write zeroes to the disk if we're allocating space
> > > > > > even if we're not sending an atomic write?
> > > > > > 
> > > > > > (This might want an explanation for why we're doing this at all -- it's
> > > > > > to avoid unwritten extent conversion, which defeats hardware untorn
> > > > > > writes.)
> > > > > It's to handle the scenario where we have a partially written extent, and
> > > > > then try to issue an atomic write which covers the complete extent.
> > > > When/how would that ever happen with the forcealign bits being set
> > > > preventing unaligned allocation and writes?
> > > Consider this scenario:
> > > 
> > > # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
> > > # mount /dev/sda mnt -o rtdev=/dev/sdb
> > > # touch  mnt/file
> > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
> > > write, 4096B at pos 0
> > Please don't write one-off custom test programs to issue IO - please
> > use and enhance xfs_io so the test cases can then be put straight
> > into fstests without adding yet another "do some minor IO variant"
> > test program. This also means you don't need a random assortment of
> > other tools.
> > 
> > i.e.
> > 
> > # xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file
> > 
> > Should do an RWF_ATOMIC IO, and
> > 
> > # xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file
> > 
> > should do an RWF_ATOMIC|RWF_DSYNC IO...
> > 
> > 
> > > # filefrag -v mnt/file
> > xfs_io -c "fiemap" mnt/file
> 
> Fine, but I like using something generic for accessing block devices and
> also other FSes. I didn't think that xfs_io can do that.

Yes, it can. We use it extensively in fstests because it works
for any filesystem, not just XFS.

> Anyway, we can look to add atomic write support to xfs_io and any other
> xfs-progs

Please do, then the support is there for developers, users and
fstests without needing to write their own custom test programs.

> > > Filesystem type is: 58465342
> > > File size of mnt/file is 4096 (1 block of 4096 bytes)
> > >    ext:     logical_offset:        physical_offset: length:   expected:
> > > flags:
> > >      0:        0..       0:         24..        24:      1:
> > > last,eof
> > > mnt/file: 1 extent found
> > > # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
> > > wrote -1 bytes at pos 0 write_size=16384
> > > #
> > Whole test as one repeatable command:
> > 
> > # xfs_io -d -c "truncate 0" -c "chattr +r" \
> > 	-c "pwrite -VAD 0 4096" \
> > 	-c "fiemap" \
> > 	-c "pwrite -VAD 0 16384" \
> > 	/mnt/root/file
> > > For the 2nd write, which would cover a 16KB extent, the iomap code will iter
> > > twice and produce 2x BIOs, which we don't want - that's why it errors there.
> > Yes, but I think that's a feature.  You've optimised the filesystem
> > layout for IO that is 64kB sized and aligned IO, but your test case
> > is mixing 4kB and 16KB IO. The filesystem should be telling you that
> > you're doing something that is sub-optimal for it's configuration,
> > and refusing to do weird overlapping sub-rtextsize atomic IO is a
> > pretty good sign that you've got something wrong.
> 
> Then we really end up with a strange behavior for the user. I mean, the user
> may ask - "why did this 16KB atomic write pass and this one fail? I'm
> following all the rules", and then "No one said not to mix write sizes or
> not mix atomic and non-atomic writes, so should be ok. Indeed, that earlier
> 4K write for the same region passed".
> 
> Playing devil's advocate here, at least this behavior should be documented.

That's what man pages are for, yes?

Are you expecting your deployments to be run on highly suboptimal
configurations and so the code needs to be optimised for this
behaviour, or are you expecting them to be run on correctly
configured systems which would never see these issues?


> > The whole reason for rtextsize existing is to optimise the rtextent
> > allocation to the typical minimum IO size done to that volume. If
> > all your IO is sub-rtextsize size and alignment, then all that has
> > been done is forcing the entire rt device IO into a corner it was
> > never really intended nor optimised for.
> 
> Sure, but just because we are optimized for a certain IO write size should
> not mean that other writes are disallowed or quite problematic.

Atomic writes are just "other writes". They are writes that are
*expected to fail* if they cannot be done atomically.

Application writers will quickly learn how to do sane, fast,
reliable atomic write IO if we reject anything that is going to
requires some complex, sub-optimal workaround in the kernel to make
it work. The simplest solution is to -fail the write-, because
userspace *must* be prepared for *any* atomic write to fail.

> > Why should we jump through crazy hoops to try to make filesystems
> > optimised for large IOs with mismatched, overlapping small atomic
> > writes?
> 
> As mentioned, typically the atomic writes will be the same size, but we may
> have other writes of smaller size.

Then we need the tiny write to allocate and zero according to the
maximum sized atomic write bounds. Then we just don't care about
large atomic IO overlapping small IO, because the extent on disk
aligned to the large atomic IO is then always guaranteed to be the
correct size and shape.


> > > With the change in this patch, instead we have something like this after the
> > > first write:
> > > 
> > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
> > > wrote 4096 bytes at pos 0 write_size=4096
> > > # filefrag -v mnt/file
> > > Filesystem type is: 58465342
> > > File size of mnt/file is 4096 (1 block of 4096 bytes)
> > >    ext:     logical_offset:        physical_offset: length:   expected:
> > > flags:
> > >      0:        0..       3:         24..        27:      4:
> > > last,eof
> > > mnt/file: 1 extent found
> > > #
> > > 
> > > So the 16KB extent is in written state and the 2nd 16KB write would iter
> > > once, producing a single BIO.
> > Sure, I know how it works. My point is that it's a terrible way to
> > go about allowing that second atomic write to succeed.
> I think 'terrible' is a bit too strong a word here.

Doing it anything in a way that a user can DOS the entire filesystem
is *terrible*. No ifs, buts or otherwise.

> Indeed, you suggest to
> manually zero the file to solve this problem, below, while this code change
> does the same thing automatically.

Yes, but I also outlined a way that it can be done automatically
without being terrible. There are multiple options here, I outlined
two different approaches that are acceptible.

> > > > > In this
> > > > > scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> > > > > ensure that the extent is completely written whenever we allocate it. At
> > > > > least that is my idea.
> > > > So return an unaligned extent, and then the IOMAP_ATOMIC checks you
> > > > add below say "no" and then the application has to do things the
> > > > slow, safe way....
> > > We have been porting atomic write support to some database apps and they
> > > (database developers) have had to do something like manually zero the
> > > complete file to get around this issue, but that's not a good user
> > > experience.
> > Better the application zeros the file when it is being initialised
> > and doesn't have performance constraints rather than forcing the
> > filesystem to do it in the IO fast path when IO performance and
> > latency actually matters to the application.
> 
> Can't we do both? I mean, the well-informed user can still pre-zero the file
> just to ensure we aren't doing this zero'ing with the extent allocation.

I never said we can't do zeroing. I just said that it's normally
better when the application controls zeroing directly.

> > And therein lies the problem.
> > 
> > If you are doing sub-rtextent IO at all, then you are forcing the
> > filesystem down the path of explicitly using unwritten extents and
> > requiring O_DSYNC direct IO to do journal flushes in IO completion
> > context and then performance just goes down hill from them.
> > 
> > The requirement for unwritten extents to track sub-rtextsize written
> > regions is what you're trying to work around with XFS_BMAPI_ZERO so
> > that atomic writes will always see "atomic write aligned" allocated
> > regions.
> > 
> > Do you see the problem here? You've explicitly told the filesystem
> > that allocation is aligned to 64kB chunks, then because the
> > filesystem block size is 4kB, it's allowed to track unwritten
> > regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> > then changes unwritten extents at 4kB boundaries. Then you do a
> > overlapping 16kB IO that*requires*  16kB allocation alignment, and
> > things go BOOM.
> > 
> > Yes, they should go BOOM.
> > 
> > This is a horrible configuration - it is incomaptible with 16kB
> > aligned and sized atomic IO.
> 
> Just because the DB may do 16KB atomic writes most of the time should not
> disallow it from any other form of writes.

That's not what I said. I said the using sub-rtextsize atomic writes
with single FSB unwritten extent tracking is horrible and
incompatible with doing 16kB atomic writes.

This setup will not work at all well with your patches and should go
BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
has uncoordinated extent allocation and unwritten conversion
granularity.

That's the fundamental design problem with your approach - it allows
unwritten conversion at *minimum IO sizes* and that does not work
with atomic IOs with larger alignment requirements.

The fundamental design principle is this: for maximally sized atomic
writes to always succeed we require every allocation, zeroing and
unwritten conversion operation to use alignments and sizes that are
compatible with the maximum atomic write sizes being used.

i.e. atomic writes need to use max write size granularity for all IO
operations, not filesystem block granularity.

And that also means things like rtextsize and extsize hints need to
match these atomic write requirements, too....

> > Allocation is aligned to 64kB, written
> > region tracking is aligned to 4kB, and there's nothing to tell the
> > filesystem that it should be maintaining 16kB "written alignment" so
> > that 16kB atomic writes can always be issued atomically.
> > 
> > i.e. if we are going to do 16kB aligned atomic IO, then all the
> > allocation and unwritten tracking needs to be done in 16kB aligned
> > chunks, not 4kB. That means a 4KB write into an unwritten region or
> > a hole actually needs to zero the rest of the 16KB range it sits
> > within.
> > 
> > The direct IO code can do this, but it needs extension of the
> > unaligned IO serialisation in XFS (the alignment checks in
> > xfs_file_dio_write()) and the the sub-block zeroing in
> > iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
> > allocation size, not the fsblock size) to do this safely.
> > 
> > Regardless of how we do it, all IO concurrency on this file is shot
> > if we have sub-rtextent sized IOs being done. That is true even with
> > this patch set - XFS_BMAPI_ZERO is done whilst holding the
> > XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
> > zeroing is being done.
> > 
> > IOWs, anything to do with sub-rtextent IO really has to be treated
> > like sub-fsblock DIO - i.e. exclusive inode access until the
> > sub-rtextent zeroing has been completed.
> 
> I do understand that this is not perfect that we may have mixed block sizes
> being written, but I don't think that we should disallow it and throw an
> error.

Ummmm, did you read what you quoted?

The above is an outline of the IO path modifications that will allow
mixed IO sizes to be used with atomic writes without requiring the
XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment
zeroing out to the existing DIO sub-block zeroing, hence ensuring
that we only ever convert unwritten extents on max sized atomic
write boundaries for atomic write enabled inodes.

At no point have I said "no mixed writes". I've said no to the
XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue
that it works around and given you a decent amount of detail on how
to sanely implementing mixed write support that will work (slowly)
with those configurations and IO patterns.

So it's your choice - you can continue to beleive I don't mixed
writes to work at all, or you can go back and try to understand the
IO path changes I've suggested that will allow mixed atomic writes
to work as well as they possibly can....

-Dave.
John Garry Feb. 9, 2024, 12:47 p.m. UTC | #8
>>
>> Playing devil's advocate here, at least this behavior should be documented.
> 
> That's what man pages are for, yes?
> 
> Are you expecting your deployments to be run on highly suboptimal
> configurations and so the code needs to be optimised for this
> behaviour, or are you expecting them to be run on correctly
> configured systems which would never see these issues?

The latter hopefully

> 
> 
>>> The whole reason for rtextsize existing is to optimise the rtextent
>>> allocation to the typical minimum IO size done to that volume. If
>>> all your IO is sub-rtextsize size and alignment, then all that has
>>> been done is forcing the entire rt device IO into a corner it was
>>> never really intended nor optimised for.
>>
>> Sure, but just because we are optimized for a certain IO write size should
>> not mean that other writes are disallowed or quite problematic.
> 
> Atomic writes are just "other writes". They are writes that are
> *expected to fail* if they cannot be done atomically.

Agreed

> 
> Application writers will quickly learn how to do sane, fast,
> reliable atomic write IO if we reject anything that is going to
> requires some complex, sub-optimal workaround in the kernel to make
> it work. The simplest solution is to -fail the write-, because
> userspace *must* be prepared for *any* atomic write to fail.

Sure, but it needs to be such that the application writer at least knows 
why it failed, which so far had not been documented.

> 
>>> Why should we jump through crazy hoops to try to make filesystems
>>> optimised for large IOs with mismatched, overlapping small atomic
>>> writes?
>>
>> As mentioned, typically the atomic writes will be the same size, but we may
>> have other writes of smaller size.
> 
> Then we need the tiny write to allocate and zero according to the
> maximum sized atomic write bounds. Then we just don't care about
> large atomic IO overlapping small IO, because the extent on disk
> aligned to the large atomic IO is then always guaranteed to be the
> correct size and shape.

I think it's worth mentioning that there is currently a separation 
between how we configure the FS extent size for atomic writes and what 
the bdev can actually support in terms of atomic writes.

Setting the rtvol extsize at mkfs time or enabling atomic writes 
FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do 
in terms of atomic writes.

This check is not done as it is not fixed what the bdev can do in terms 
of atomic writes - or, more specifically, what they request_queue 
reports is not be fixed. There are things which can change this. For 
example, a FW update could change all the atomic write capabilities of a 
disk. Or even if we swapped a SCSI disk into another host the atomic 
write limits may change, as the atomic write unit max depends on the 
SCSI HBA DMA limits. Changing BIO_MAX_VECS - which could come from a 
kernel update - could also change what we report as atomic write limit 
in the request queue.

> 
> 
>>>> With the change in this patch, instead we have something like this after the
>>>> first write:
>>>>
>>>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
>>>> wrote 4096 bytes at pos 0 write_size=4096
>>>> # filefrag -v mnt/file
>>>> Filesystem type is: 58465342
>>>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>>>     ext:     logical_offset:        physical_offset: length:   expected:
>>>> flags:
>>>>       0:        0..       3:         24..        27:      4:
>>>> last,eof
>>>> mnt/file: 1 extent found
>>>> #
>>>>
>>>> So the 16KB extent is in written state and the 2nd 16KB write would iter
>>>> once, producing a single BIO.
>>> Sure, I know how it works. My point is that it's a terrible way to
>>> go about allowing that second atomic write to succeed.
>> I think 'terrible' is a bit too strong a word here.
> 
> Doing it anything in a way that a user can DOS the entire filesystem
> is *terrible*. No ifs, buts or otherwise.

Understood

> 
>> Indeed, you suggest to
>> manually zero the file to solve this problem, below, while this code change
>> does the same thing automatically.
> 
> Yes, but I also outlined a way that it can be done automatically
> without being terrible. There are multiple options here, I outlined
> two different approaches that are acceptible.

I think that I need to check these alternate solutions in more detail. 
More below.

> 
>>>>>> In this
>>>>>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
>>>>>> ensure that the extent is completely written whenever we allocate it. At
>>>>>> least that is my idea.
>>>>> So return an unaligned extent, and then the IOMAP_ATOMIC checks you
>>>>> add below say "no" and then the application has to do things the
>>>>> slow, safe way....
>>>> We have been porting atomic write support to some database apps and they
>>>> (database developers) have had to do something like manually zero the
>>>> complete file to get around this issue, but that's not a good user
>>>> experience.
>>> Better the application zeros the file when it is being initialised
>>> and doesn't have performance constraints rather than forcing the
>>> filesystem to do it in the IO fast path when IO performance and
>>> latency actually matters to the application.
>>
>> Can't we do both? I mean, the well-informed user can still pre-zero the file
>> just to ensure we aren't doing this zero'ing with the extent allocation.
> 
> I never said we can't do zeroing. I just said that it's normally
> better when the application controls zeroing directly.

ok

> 
>>> And therein lies the problem.
>>>
>>> If you are doing sub-rtextent IO at all, then you are forcing the
>>> filesystem down the path of explicitly using unwritten extents and
>>> requiring O_DSYNC direct IO to do journal flushes in IO completion
>>> context and then performance just goes down hill from them.
>>>
>>> The requirement for unwritten extents to track sub-rtextsize written
>>> regions is what you're trying to work around with XFS_BMAPI_ZERO so
>>> that atomic writes will always see "atomic write aligned" allocated
>>> regions.
>>>
>>> Do you see the problem here? You've explicitly told the filesystem
>>> that allocation is aligned to 64kB chunks, then because the
>>> filesystem block size is 4kB, it's allowed to track unwritten
>>> regions at 4kB boundaries. Then you do 4kB aligned file IO, which
>>> then changes unwritten extents at 4kB boundaries. Then you do a
>>> overlapping 16kB IO that*requires*  16kB allocation alignment, and
>>> things go BOOM.
>>>
>>> Yes, they should go BOOM.
>>>
>>> This is a horrible configuration - it is incomaptible with 16kB
>>> aligned and sized atomic IO.
>>
>> Just because the DB may do 16KB atomic writes most of the time should not
>> disallow it from any other form of writes.
> 
> That's not what I said. I said the using sub-rtextsize atomic writes
> with single FSB unwritten extent tracking is horrible and
> incompatible with doing 16kB atomic writes.
> 
> This setup will not work at all well with your patches and should go
> BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
> has uncoordinated extent allocation and unwritten conversion
> granularity.
> 
> That's the fundamental design problem with your approach - it allows
> unwritten conversion at *minimum IO sizes* and that does not work
> with atomic IOs with larger alignment requirements.
> 
> The fundamental design principle is this: for maximally sized atomic
> writes to always succeed we require every allocation, zeroing and
> unwritten conversion operation to use alignments and sizes that are
> compatible with the maximum atomic write sizes being used.
> 

That sounds fine.

My question then is how we determine this max atomic write size granularity.

We don't explicitly tell the FS what atomic write size we want for a 
file. Rather we mkfs with some extsize value which should match our 
atomic write maximal value and then tell the FS we want to do atomic 
writes on a file, and if this is accepted then we can query the atomic 
write min and max unit size, and this would be [FS block size, min(bdev 
atomic write limit, rtexsize)].

If rtextsize is 16KB, then we have a good idea that we want 16KB atomic 
writes support. So then we could use rtextsize as this max atomic write 
size. But I am not 100% sure that it your idea (apologies if I am wrong 
- I am sincerely trying to follow your idea), but rather it would be 
min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and 
bdev atomic write limit is 16KB, then there is no much point in dealing 
in 1MB blocks for this unwritten extent conversion alignment. If so, 
then my concern is that the bdev atomic write upper limit is not fixed. 
This can solved, but I would still like to be clear on this max atomic 
write size.

 > i.e. atomic writes need to use max write size granularity for all IO
 > operations, not filesystem block granularity.
> 
> And that also means things like rtextsize and extsize hints need to
> match these atomic write requirements, too....
> 

As above, I am not 100% sure if you mean these to be the atomic write 
maximal value.

>>> Allocation is aligned to 64kB, written
>>> region tracking is aligned to 4kB, and there's nothing to tell the
>>> filesystem that it should be maintaining 16kB "written alignment" so
>>> that 16kB atomic writes can always be issued atomically.

Please note that in my previous example the mkfs rtextsize arg should 
really have been 16KB, and that the intention would have been to enable 
16KB atomic writes. I used 64KB casually as I thought it should be 
possible to support sub-rtextsize atomic writes. The point which I was 
trying to make was that the 16KB atomic write and 4KB regular write 
intermixing was problematic.

>>>
>>> i.e. if we are going to do 16kB aligned atomic IO, then all the
>>> allocation and unwritten tracking needs to be done in 16kB aligned
>>> chunks, not 4kB. That means a 4KB write into an unwritten region or
>>> a hole actually needs to zero the rest of the 16KB range it sits
>>> within.
>>>
>>> The direct IO code can do this, but it needs extension of the
>>> unaligned IO serialisation in XFS (the alignment checks in
>>> xfs_file_dio_write()) and the the sub-block zeroing in
>>> iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
>>> allocation size, not the fsblock size) to do this safely.
>>>
>>> Regardless of how we do it, all IO concurrency on this file is shot
>>> if we have sub-rtextent sized IOs being done. That is true even with
>>> this patch set - XFS_BMAPI_ZERO is done whilst holding the
>>> XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
>>> zeroing is being done.
>>>
>>> IOWs, anything to do with sub-rtextent IO really has to be treated
>>> like sub-fsblock DIO - i.e. exclusive inode access until the
>>> sub-rtextent zeroing has been completed.
>>
>> I do understand that this is not perfect that we may have mixed block sizes
>> being written, but I don't think that we should disallow it and throw an
>> error.
> 
> Ummmm, did you read what you quoted?
> 
> The above is an outline of the IO path modifications that will allow
> mixed IO sizes to be used with atomic writes without requiring the
> XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment
> zeroing out to the existing DIO sub-block zeroing, hence ensuring
> that we only ever convert unwritten extents on max sized atomic
> write boundaries for atomic write enabled inodes.

ok, I get this idea. And, indeed, it does sound better than the 
XFS_BMAPI_ZERO proposal.

> 
> At no point have I said "no mixed writes".

For sure

> I've said no to the
> XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue
> that it works around and given you a decent amount of detail on how
> to sanely implementing mixed write support that will work (slowly)
> with those configurations and IO patterns.
> 
> So it's your choice - you can continue to beleive I don't mixed
> writes to work at all, or you can go back and try to understand the
> IO path changes I've suggested that will allow mixed atomic writes
> to work as well as they possibly can....
> 

Ack

Much appreciated,
John
Darrick J. Wong Feb. 13, 2024, 5:50 p.m. UTC | #9
On Mon, Feb 05, 2024 at 01:36:03PM +0000, John Garry wrote:
> On 02/02/2024 18:47, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
> > > Ensure that when creating a mapping that we adhere to all the atomic
> > > write rules.
> > > 
> > > We check that the mapping covers the complete range of the write to ensure
> > > that we'll be just creating a single mapping.
> > > 
> > > Currently minimum granularity is the FS block size, but it should be
> > > possibly to support lower in future.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > > I am setting this as an RFC as I am not sure on the change in
> > > xfs_iomap_write_direct() - it gives the desired result AFAICS.
> > > 
> > >   fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 41 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 18c8f168b153..758dc1c90a42 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
> > >   		}
> > >   	}
> > > +	if (xfs_inode_atomicwrites(ip))
> > > +		bmapi_flags = XFS_BMAPI_ZERO;
> > 
> > Why do we want to write zeroes to the disk if we're allocating space
> > even if we're not sending an atomic write?
> > 
> > (This might want an explanation for why we're doing this at all -- it's
> > to avoid unwritten extent conversion, which defeats hardware untorn
> > writes.)
> 
> It's to handle the scenario where we have a partially written extent, and
> then try to issue an atomic write which covers the complete extent. In this
> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> ensure that the extent is completely written whenever we allocate it. At
> least that is my idea.
> 
> > 
> > I think we should support IOCB_ATOMIC when the mapping is unwritten --
> > the data will land on disk in an untorn fashion, the unwritten extent
> > conversion on IO completion is itself atomic, and callers still have to
> > set O_DSYNC to persist anything.
> 
> But does this work for the scenario above?
> 
> > Then we can avoid the cost of
> > BMAPI_ZERO, because double-writes aren't free.
> 
> About double-writes not being free, I thought that this was acceptable to
> just have this write zero when initially allocating the extent as it should
> not add too much overhead in practice, i.e. it's one off.
> 
> > 
> > > +
> > >   	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
> > >   			rblocks, force, &tp);
> > >   	if (error)
> > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
> > >   	if (error)
> > >   		goto out_unlock;
> > > +	if (flags & IOMAP_ATOMIC) {
> > > +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
> > > +		unsigned int unit_min, unit_max;
> > > +
> > > +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> > > +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> > > +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> > > +
> > > +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if ((offset & mp->m_blockmask) ||
> > > +		    (length & mp->m_blockmask)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if (imap.br_blockcount == unit_min_fsb ||
> > > +		    imap.br_blockcount == unit_max_fsb) {
> > > +			/* ok if exactly min or max */
> > > +		} else if (imap.br_blockcount < unit_min_fsb ||
> > > +			   imap.br_blockcount > unit_max_fsb) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		} else if (!is_power_of_2(imap.br_blockcount)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if (imap.br_startoff &&
> > > +		    imap.br_startoff & (imap.br_blockcount - 1)) {
> > 
> > Not sure why we care about the file position, it's br_startblock that
> > gets passed into the bio, not br_startoff.
> 
> We just want to ensure that the length of the write is valid w.r.t. to the
> offset within the extent, and br_startoff would be the offset within the
> aligned extent.

Yes, I understand what br_startoff is, but this doesn't help me
understand why this code is necessary.  Let's say you have a device that
supports untorn writes of 16k in length provided the LBA of the write
command is also aligned to 16k, and the fs has 4k blocks.

Userspace issues an 16k untorn write at offset 13k in the file, and gets
this mapping:

[startoff: 13k, startblock: 16k, blockcount: 16k]

Why should this IO be rejected?  The physical space extent satisfies the
alignment requirements of the underlying device, and the logical file
space extent does not need aligning at all.

> > I'm also still not convinced that any of this validation is useful here.
> > The block device stack underneath the filesystem can change at any time
> > without any particular notice to the fs, so the only way to find out if
> > the proposed IO would meet the alignment constraints is to submit_bio
> > and see what happens.
> 
> I am not sure what submit_bio() would do differently. If the block device is
> changing underneath the block layer, then there is where these things need
> to be checked.

Agreed.

> > 
> > (The "one bio per untorn write request" thing in the direct-io.c patch
> > sound sane to me though.)
> > 
> 
> ok
> 
> Thanks,
> John
>
Dave Chinner Feb. 13, 2024, 11:41 p.m. UTC | #10
On Fri, Feb 09, 2024 at 12:47:38PM +0000, John Garry wrote:
> > > > Why should we jump through crazy hoops to try to make filesystems
> > > > optimised for large IOs with mismatched, overlapping small atomic
> > > > writes?
> > > 
> > > As mentioned, typically the atomic writes will be the same size, but we may
> > > have other writes of smaller size.
> > 
> > Then we need the tiny write to allocate and zero according to the
> > maximum sized atomic write bounds. Then we just don't care about
> > large atomic IO overlapping small IO, because the extent on disk
> > aligned to the large atomic IO is then always guaranteed to be the
> > correct size and shape.
> 
> I think it's worth mentioning that there is currently a separation between
> how we configure the FS extent size for atomic writes and what the bdev can
> actually support in terms of atomic writes.

And that's part of what is causing all the issues here - we're
trying to jump though hoops at the fs level to handle cases that
they device doesn't support and vice versa.

> Setting the rtvol extsize at mkfs time or enabling atomic writes
> FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
> terms of atomic writes.

Which is wrong. mkfs.xfs gets physical information about the volume
from the kernel and makes the filesystem accounting to that
information. That's how we do stripe alignment, sector sizing, etc.
Atomic write support and setting up alignment constraints should be
no different.

Yes, mkfs allows the user to override the hardware configsi it
probes, but it also warns when the override is doing something
sub-optimal (like aligning all AG headers to the same disk in a
stripe).

IOWs, mkfs should be pulling this atomic write info from the
hardware and configuring the filesysetm around that information.
That's the target we should be aiming the kernel implementation at
and optimising for - a filesystem that is correctly configured
according to published hardware capability.

Everything else is in the "make it behave correctly, but we don't
care if it's slow" category.

> This check is not done as it is not fixed what the bdev can do in terms of
> atomic writes - or, more specifically, what they request_queue reports is
> not be fixed. There are things which can change this. For example, a FW
> update could change all the atomic write capabilities of a disk. Or even if
> we swapped a SCSI disk into another host the atomic write limits may change,
> as the atomic write unit max depends on the SCSI HBA DMA limits. Changing
> BIO_MAX_VECS - which could come from a kernel update - could also change
> what we report as atomic write limit in the request queue.

If that sort of thing happens, then that's too bad. We already have
these sorts of "do not do if you care about performance"
constraints. e.g. don't do a RAID restripe that changes the
alignment/size of the RAID device (e.g. add a single disk and make
the stripe width wider) because the physical filesystem structure
will no longer be aligned to the underlying hardware. instead, you
have to grow striped volumes with compatible stripes in compatible
sizes to ensure the filesystem remains aligned to the storage...

We don't try to cater for every single possible permutation of
storage hardware configurations - that way lies madness. Design and
optimise for the common case of correctly configured and well
behaved storage, and everything else we just don't care about beyond
"don't corrupt or lose data".

> > > > And therein lies the problem.
> > > > 
> > > > If you are doing sub-rtextent IO at all, then you are forcing the
> > > > filesystem down the path of explicitly using unwritten extents and
> > > > requiring O_DSYNC direct IO to do journal flushes in IO completion
> > > > context and then performance just goes down hill from them.
> > > > 
> > > > The requirement for unwritten extents to track sub-rtextsize written
> > > > regions is what you're trying to work around with XFS_BMAPI_ZERO so
> > > > that atomic writes will always see "atomic write aligned" allocated
> > > > regions.
> > > > 
> > > > Do you see the problem here? You've explicitly told the filesystem
> > > > that allocation is aligned to 64kB chunks, then because the
> > > > filesystem block size is 4kB, it's allowed to track unwritten
> > > > regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> > > > then changes unwritten extents at 4kB boundaries. Then you do a
> > > > overlapping 16kB IO that*requires*  16kB allocation alignment, and
> > > > things go BOOM.
> > > > 
> > > > Yes, they should go BOOM.
> > > > 
> > > > This is a horrible configuration - it is incomaptible with 16kB
> > > > aligned and sized atomic IO.
> > > 
> > > Just because the DB may do 16KB atomic writes most of the time should not
> > > disallow it from any other form of writes.
> > 
> > That's not what I said. I said the using sub-rtextsize atomic writes
> > with single FSB unwritten extent tracking is horrible and
> > incompatible with doing 16kB atomic writes.
> > 
> > This setup will not work at all well with your patches and should go
> > BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
> > has uncoordinated extent allocation and unwritten conversion
> > granularity.
> > 
> > That's the fundamental design problem with your approach - it allows
> > unwritten conversion at *minimum IO sizes* and that does not work
> > with atomic IOs with larger alignment requirements.
> > 
> > The fundamental design principle is this: for maximally sized atomic
> > writes to always succeed we require every allocation, zeroing and
> > unwritten conversion operation to use alignments and sizes that are
> > compatible with the maximum atomic write sizes being used.
> > 
> 
> That sounds fine.
> 
> My question then is how we determine this max atomic write size granularity.
> 
> We don't explicitly tell the FS what atomic write size we want for a file.
> Rather we mkfs with some extsize value which should match our atomic write
> maximal value and then tell the FS we want to do atomic writes on a file,
> and if this is accepted then we can query the atomic write min and max unit
> size, and this would be [FS block size, min(bdev atomic write limit,
> rtexsize)].
> 
> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic
> writes support. So then we could use rtextsize as this max atomic write
> size.

Maybe, but I think continuing to focus on this as 'atomic writes
requires' is wrong.

The filesystem does not carea bout atomic writes. What it cares
about is the allocation constraints that need to be implemented.
That constraint is that all BMBT extent operations need to be
aligned to a specific extent size, not filesystem blocks.

The current extent size hint (and rtextsize) only impact the
-allocation of extents-. They are not directly placing constraints
on the BMBT layout, they are placing constraints on the free space
search that the allocator runs on the BNO/CNT btrees to select an
extent that is then inserted into the BMBT.

The problem is that unwritten extent conversion, truncate, hole
punching, etc also all need to be correctly aligned for files that
are configured to support atomic writes. These operations place
constraints on how the BMBT can modify the existing extent list.

These are different constraints to what rtextsize/extszhint apply,
and that's the fundamental behavioural difference between existing
extent size hint behaviour and the behaviour needed by atomic
writes.

> But I am not 100% sure that it your idea (apologies if I am wrong - I
> am sincerely trying to follow your idea), but rather it would be
> min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev
> atomic write limit is 16KB, then there is no much point in dealing in 1MB
> blocks for this unwritten extent conversion alignment.

Exactly my point - there really is no relationship between rtextsize
and atomic write constraints and that it is a mistake to use
rtextsize as it stands as a placeholder for atomic write
constraints.

> If so, then my
> concern is that the bdev atomic write upper limit is not fixed. This can
> solved, but I would still like to be clear on this max atomic write size.

Right, we haven't clearly defined how XFS is supposed align BMBT
operations in a way that is compatible for atomic write operations.

What the patchset does is try to extend and infer things from
existing allocation alignment constraints, but then not apply those
constraints to critical extent state operations (pure BMBT
modifications) that atomic writes also need constrained to work
correctly and efficiently.

> > i.e. atomic writes need to use max write size granularity for all IO
> > operations, not filesystem block granularity.
> > 
> > And that also means things like rtextsize and extsize hints need to
> > match these atomic write requirements, too....
> 
> As above, I am not 100% sure if you mean these to be the atomic write
> maximal value.

Yes, they either need to be the same as the atomic write max value
or, much better, once a hint has been set, then resultant size is
then aligned up to be compatible with the atomic write constraints.

e.g. set an inode extent size hint of 960kB on a device with 128kB
atomic write capability. If the inode has the atomic write flag set,
then allocations need to round the extent size up from 960kB to 1MB
so that the BMBT extent layout and alignment is compatible with 128kB
atomic writes.

At this point, zeroing, punching, unwritten extent conversion, etc
also needs to be done on aligned 128kB ranges to be comaptible with
atomic writes, rather than filesysetm block boundaries that would
normally be used if just the extent size hint was set.

This is effectively what the original "force align" inode flag bit
did - it told the inode that all BMBT manipulations need to be
extent size hint aligned, not just allocation. It's a generic flag
that implements specific extent manipulation constraints that happen
to be compatible with atomic writes when the right extent size hint
is set.

So at this point, I'm not sure that we should have an "atomic
writes" flag in the inode. We need to tell BMBT modifications
to behave in a particular way - forced alignment - not that atomic
writes are being used in the filesystem....

At this point, the filesystem can do all the extent modification
alignment stuff that atomic writes require without caring if the
block device supports atomic writes or even if the
application is using atomic writes.

This means we can test the BMBT functionality in fstests without
requiring hardware (or emulation) that supports atomic writes - all
we need to do is set the forced align flag, an extent size hint and
go run fsx on it...

-Dave.
John Garry Feb. 14, 2024, 11:06 a.m. UTC | #11
> 
>> Setting the rtvol extsize at mkfs time or enabling atomic writes
>> FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
>> terms of atomic writes.
> 
> Which is wrong. mkfs.xfs gets physical information about the volume
> from the kernel and makes the filesystem accounting to that
> information. That's how we do stripe alignment, sector sizing, etc.
> Atomic write support and setting up alignment constraints should be
> no different.

Yes, I was just looking at adding a mkfs option to format for atomic 
writes, which would check physical information about the volume and 
whether it suits rtextsize and then subsequently also set 
XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

> 
> Yes, mkfs allows the user to override the hardware configsi it
> probes, but it also warns when the override is doing something
> sub-optimal (like aligning all AG headers to the same disk in a
> stripe).
> 
> IOWs, mkfs should be pulling this atomic write info from the
> hardware and configuring the filesysetm around that information.
> That's the target we should be aiming the kernel implementation at
> and optimising for - a filesystem that is correctly configured
> according to published hardware capability.

Right

So, for example, if the atomic writes option is set and the rtextsize 
set by the user is so much larger than what HW can support in terms of 
atomic writes, then we should let the user know about this.

> 
> Everything else is in the "make it behave correctly, but we don't
> care if it's slow" category.
> 
>> This check is not done as it is not fixed what the bdev can do in terms of
>> atomic writes - or, more specifically, what they request_queue reports is
>> not be fixed. There are things which can change this. For example, a FW
>> update could change all the atomic write capabilities of a disk. Or even if
>> we swapped a SCSI disk into another host the atomic write limits may change,
>> as the atomic write unit max depends on the SCSI HBA DMA limits. Changing
>> BIO_MAX_VECS - which could come from a kernel update - could also change
>> what we report as atomic write limit in the request queue.
> 
> If that sort of thing happens, then that's too bad. We already have
> these sorts of "do not do if you care about performance"
> constraints. e.g. don't do a RAID restripe that changes the
> alignment/size of the RAID device (e.g. add a single disk and make
> the stripe width wider) because the physical filesystem structure
> will no longer be aligned to the underlying hardware. instead, you
> have to grow striped volumes with compatible stripes in compatible
> sizes to ensure the filesystem remains aligned to the storage...
> 
> We don't try to cater for every single possible permutation of
> storage hardware configurations - that way lies madness. Design and
> optimise for the common case of correctly configured and well
> behaved storage, and everything else we just don't care about beyond
> "don't corrupt or lose data".

ok

> 
>>>>> And therein lies the problem.
>>>>>

...

>>
>> That sounds fine.
>>
>> My question then is how we determine this max atomic write size granularity.
>>
>> We don't explicitly tell the FS what atomic write size we want for a file.
>> Rather we mkfs with some extsize value which should match our atomic write
>> maximal value and then tell the FS we want to do atomic writes on a file,
>> and if this is accepted then we can query the atomic write min and max unit
>> size, and this would be [FS block size, min(bdev atomic write limit,
>> rtexsize)].
>>
>> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic
>> writes support. So then we could use rtextsize as this max atomic write
>> size.
> 
> Maybe, but I think continuing to focus on this as 'atomic writes
> requires' is wrong.
> 
> The filesystem does not carea bout atomic writes. What it cares
> about is the allocation constraints that need to be implemented.
> That constraint is that all BMBT extent operations need to be
> aligned to a specific extent size, not filesystem blocks.
> 
> The current extent size hint (and rtextsize) only impact the
> -allocation of extents-. They are not directly placing constraints
> on the BMBT layout, they are placing constraints on the free space
> search that the allocator runs on the BNO/CNT btrees to select an
> extent that is then inserted into the BMBT.
> 
> The problem is that unwritten extent conversion, truncate, hole
> punching, etc also all need to be correctly aligned for files that
> are configured to support atomic writes. These operations place
> constraints on how the BMBT can modify the existing extent list.
> 
> These are different constraints to what rtextsize/extszhint apply,
> and that's the fundamental behavioural difference between existing
> extent size hint behaviour and the behaviour needed by atomic
> writes.
> 
>> But I am not 100% sure that it your idea (apologies if I am wrong - I
>> am sincerely trying to follow your idea), but rather it would be
>> min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev
>> atomic write limit is 16KB, then there is no much point in dealing in 1MB
>> blocks for this unwritten extent conversion alignment.
> 
> Exactly my point - there really is no relationship between rtextsize
> and atomic write constraints and that it is a mistake to use
> rtextsize as it stands as a placeholder for atomic write
> constraints.
> 

ok

>> If so, then my
>> concern is that the bdev atomic write upper limit is not fixed. This can
>> solved, but I would still like to be clear on this max atomic write size.
> 
> Right, we haven't clearly defined how XFS is supposed align BMBT
> operations in a way that is compatible for atomic write operations.
> 
> What the patchset does is try to extend and infer things from
> existing allocation alignment constraints, but then not apply those
> constraints to critical extent state operations (pure BMBT
> modifications) that atomic writes also need constrained to work
> correctly and efficiently.

Right. Previously I also did mention that we could explicitly request 
the atomic write size per-inode, but a drawback is that this would 
require an on-disk format change.

> 
>>> i.e. atomic writes need to use max write size granularity for all IO
>>> operations, not filesystem block granularity.
>>>
>>> And that also means things like rtextsize and extsize hints need to
>>> match these atomic write requirements, too....
>>
>> As above, I am not 100% sure if you mean these to be the atomic write
>> maximal value.
> 
> Yes, they either need to be the same as the atomic write max value
> or, much better, once a hint has been set, then resultant size is
> then aligned up to be compatible with the atomic write constraints.
> 
> e.g. set an inode extent size hint of 960kB on a device with 128kB
> atomic write capability. If the inode has the atomic write flag set,
> then allocations need to round the extent size up from 960kB to 1MB
> so that the BMBT extent layout and alignment is compatible with 128kB
> atomic writes.
> 
> At this point, zeroing, punching, unwritten extent conversion, etc
> also needs to be done on aligned 128kB ranges to be comaptible with
> atomic writes, rather than filesysetm block boundaries that would
> normally be used if just the extent size hint was set.
> 
> This is effectively what the original "force align" inode flag bit
> did - it told the inode that all BMBT manipulations need to be
> extent size hint aligned, not just allocation. It's a generic flag
> that implements specific extent manipulation constraints that happen
> to be compatible with atomic writes when the right extent size hint
> is set.
> 
> So at this point, I'm not sure that we should have an "atomic
> writes" flag in the inode. 

Another motivation for this flag is that we can explicitly enable some 
software-based atomic write support for an inode when the backing device 
does not have HW support.

In addition, in this series setting FS_XFLAG_ATOMICWRITES means 
XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something 
similar for other OSes, and for those other OSes it may also mean some 
other special alignment feature enabled. We want a consistent user 
experience.

> We need to tell BMBT modifications
> to behave in a particular way - forced alignment - not that atomic
> writes are being used in the filesystem....
> 
> At this point, the filesystem can do all the extent modification
> alignment stuff that atomic writes require without caring if the
> block device supports atomic writes or even if the
> application is using atomic writes.
> 
> This means we can test the BMBT functionality in fstests without
> requiring hardware (or emulation) that supports atomic writes - all
> we need to do is set the forced align flag, an extent size hint and
> go run fsx on it...
> 

The current idea was that the forcealign feature would be required for 
the second phase for atomic write support - non-rtvol support. Darrick 
did send that series out separately over the New Year's break.

I think that you wanted to progress the following series first:
https://lore.kernel.org/linux-xfs/20231004001943.349265-1-david@fromorbit.com/

Right?

Thanks,
John
John Garry Feb. 14, 2024, 12:13 p.m. UTC | #12
>>>
>>> Not sure why we care about the file position, it's br_startblock that
>>> gets passed into the bio, not br_startoff.
>>
>> We just want to ensure that the length of the write is valid w.r.t. to the
>> offset within the extent, and br_startoff would be the offset within the
>> aligned extent.
> 
> Yes, I understand what br_startoff is, but this doesn't help me
> understand why this code is necessary.  Let's say you have a device that
> supports untorn writes of 16k in length provided the LBA of the write
> command is also aligned to 16k, and the fs has 4k blocks.
> 
> Userspace issues an 16k untorn write at offset 13k in the file, and gets
> this mapping:
> 
> [startoff: 13k, startblock: 16k, blockcount: 16k]
> 
> Why should this IO be rejected? 

It's rejected as it does not follow the rules.

> The physical space extent satisfies the
> alignment requirements of the underlying device, and the logical file
> space extent does not need aligning at all.

Sure. In this case, we can produce a single BIO and the underlying HW 
may be able to handle this atomically.

The point really is that we want a consistent userspace experience. We 
say that the write 'must' be naturally aligned, not 'should' be.

It's not really useful to the user if sometimes a write passes and 
sometimes it fails by chance of how the extents happen to be laid out.

Furthermore, in this case, what should the user do if this write at 13K 
offset fails as the 16K of data straddles 2x extents? They asked for 16K 
written at offset 13K and they want it done atomically - there is 
nothing which the FS can do to help. If they don't really need 16K 
written atomically, then better just do a regular write, or write 
individual chunks atomically.

Thanks,
John
Dave Chinner Feb. 14, 2024, 11:03 p.m. UTC | #13
On Wed, Feb 14, 2024 at 11:06:11AM +0000, John Garry wrote:
> 
> > 
> > > Setting the rtvol extsize at mkfs time or enabling atomic writes
> > > FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
> > > terms of atomic writes.
> > 
> > Which is wrong. mkfs.xfs gets physical information about the volume
> > from the kernel and makes the filesystem accounting to that
> > information. That's how we do stripe alignment, sector sizing, etc.
> > Atomic write support and setting up alignment constraints should be
> > no different.
> 
> Yes, I was just looking at adding a mkfs option to format for atomic writes,
> which would check physical information about the volume and whether it suits
> rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

FWIW, atomic writes need to be implemented in XFS in a way that
isn't specific to the rtdev. There is no reason they cannot be
applied to the data device (via superblock max atomic write size
field or extent size hints and the align flag) so
please don't get hung up on rtextsize as the only thing that atomic
write alignment might apply to.

> > Yes, mkfs allows the user to override the hardware configsi it
> > probes, but it also warns when the override is doing something
> > sub-optimal (like aligning all AG headers to the same disk in a
> > stripe).
> > 
> > IOWs, mkfs should be pulling this atomic write info from the
> > hardware and configuring the filesysetm around that information.
> > That's the target we should be aiming the kernel implementation at
> > and optimising for - a filesystem that is correctly configured
> > according to published hardware capability.
> 
> Right
> 
> So, for example, if the atomic writes option is set and the rtextsize set by
> the user is so much larger than what HW can support in terms of atomic
> writes, then we should let the user know about this.

Well, this is part of the problem I mention above: you're focussing
entirely on the rtdev setup and not the general "atomic writes
require BMBT extent alignment constraints".

So, maybe, yes, we might want to warn that the rtextsize is much
bigger than the atomic write size of that device, but now there's
something else we need to take into account: the rtdev could have a
different atomic write size comxpapred to the data device.

What now?

IOWs, focussing on the rtdev misses key considerations for making
the functionality generic, and we most definitely don't want to have
to rev the on disk format a second time to support atomic writes
for the data device. Hence we likely need two variables for atomic
write sizes in the superblock - one for the rtdev, and one for the
data device. And this then feeds through to Darrick's multi-rtdev
stuff - each rtdev will need to have an attribute that tracks this
information.

Actually, now that I think about it, we may require 3 sizes - I'm in
the process of writing a design doc for an new journal format, and
one of the things I want it to be able to do is use atomic writes to
enable journal writes to be decoupled from device sector sizes. If
we are using atomic writes > sector size for the journal, then we
definitely need to record somewhere in the superblock what the
minimum journal IO size is because it isn't going to be the sector
size anymore...

[...]

> > > If so, then my
> > > concern is that the bdev atomic write upper limit is not fixed. This can
> > > solved, but I would still like to be clear on this max atomic write size.
> > 
> > Right, we haven't clearly defined how XFS is supposed align BMBT
> > operations in a way that is compatible for atomic write operations.
> > 
> > What the patchset does is try to extend and infer things from
> > existing allocation alignment constraints, but then not apply those
> > constraints to critical extent state operations (pure BMBT
> > modifications) that atomic writes also need constrained to work
> > correctly and efficiently.
> 
> Right. Previously I also did mention that we could explicitly request the
> atomic write size per-inode, but a drawback is that this would require an
> on-disk format change.

We're already having to change the on-disk format for this (inode
flag, superblock feature bit), so we really should be trying to make
this generic and properly featured so that it can be used for
anything that requires physical alignment of file data extents, not
just atomic writes...

> > > > i.e. atomic writes need to use max write size granularity for all IO
> > > > operations, not filesystem block granularity.
> > > > 
> > > > And that also means things like rtextsize and extsize hints need to
> > > > match these atomic write requirements, too....
> > > 
> > > As above, I am not 100% sure if you mean these to be the atomic write
> > > maximal value.
> > 
> > Yes, they either need to be the same as the atomic write max value
> > or, much better, once a hint has been set, then resultant size is
> > then aligned up to be compatible with the atomic write constraints.
> > 
> > e.g. set an inode extent size hint of 960kB on a device with 128kB
> > atomic write capability. If the inode has the atomic write flag set,
> > then allocations need to round the extent size up from 960kB to 1MB
> > so that the BMBT extent layout and alignment is compatible with 128kB
> > atomic writes.
> > 
> > At this point, zeroing, punching, unwritten extent conversion, etc
> > also needs to be done on aligned 128kB ranges to be comaptible with
> > atomic writes, rather than filesysetm block boundaries that would
> > normally be used if just the extent size hint was set.
> > 
> > This is effectively what the original "force align" inode flag bit
> > did - it told the inode that all BMBT manipulations need to be
> > extent size hint aligned, not just allocation. It's a generic flag
> > that implements specific extent manipulation constraints that happen
> > to be compatible with atomic writes when the right extent size hint
> > is set.
> > 
> > So at this point, I'm not sure that we should have an "atomic
> > writes" flag in the inode.
> 
> Another motivation for this flag is that we can explicitly enable some
> software-based atomic write support for an inode when the backing device
> does not have HW support.

That's orthogonal to the aligment issue. If the BMBT extents are
always aligned in a way that is compatible with atomic writes, we
don't need and aomtic writes flag to tell the filesystem it should
do an atomic write. That comes from userspace via the IOCB_ATOMIC
flag.

It is the IOCB_ATOMIC that triggers the software fallback when the
hardware doesn't support atomic writes, not an inode flag. All the
filesystem has to do is guarantee all extent manipulations are
correctly aligned and IOCB_ATOMIC can always be executed regardless
of whether it is hardware or software that does it.


> In addition, in this series setting FS_XFLAG_ATOMICWRITES means
> XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
> similar for other OSes, and for those other OSes it may also mean some other
> special alignment feature enabled. We want a consistent user experience.

I don't care about other OSes - they don't implement the
FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
compatibility for the user API.

Fundamentally, atomic writes are *not a property of the filesystem
on-disk format*. They require extent tracking constraints (i.e.
alignment), and that's the property of the filesystem on-disk format
that we need to manipulate here.

Users are not going to care if the setup ioctl for atomic writes
is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They
already know they have to align their IO properly for atomic writes,
so it's not like this is something they will be completely
unfamiliar with.

Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
= max_atomic_write_size as a user interface to set up the inode for 
atomic writes is pretty concise and easy to use. It also isn't
specific to atomic writes, and so this fucntionality can be used for
anything that needs aligned extent manipulation for performant
functionality.

> > to behave in a particular way - forced alignment - not that atomic
> > writes are being used in the filesystem....
> > 
> > At this point, the filesystem can do all the extent modification
> > alignment stuff that atomic writes require without caring if the
> > block device supports atomic writes or even if the
> > application is using atomic writes.
> > 
> > This means we can test the BMBT functionality in fstests without
> > requiring hardware (or emulation) that supports atomic writes - all
> > we need to do is set the forced align flag, an extent size hint and
> > go run fsx on it...
> > 
> 
> The current idea was that the forcealign feature would be required for the
> second phase for atomic write support - non-rtvol support. Darrick did send
> that series out separately over the New Year's break.

This is the wrong approach: this needs to be integrated into the
same patchset so we can review the changes for atomic writes as a
whole, not as two separate, independent on-disk format changes. The
on-disk format change that atomic writes need is aligned BMBT extent
manipulations, regardless of whether we are targeting the rtdev or
the datadev, and trying to separate them is leading you down
entirely the wrong path.

-Dave.
John Garry Feb. 15, 2024, 9:53 a.m. UTC | #14
>>
>> Yes, I was just looking at adding a mkfs option to format for atomic writes,
>> which would check physical information about the volume and whether it suits
>> rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.
> 
> FWIW, atomic writes need to be implemented in XFS in a way that
> isn't specific to the rtdev. There is no reason they cannot be
> applied to the data device (via superblock max atomic write size
> field or extent size hints and the align flag) so
> please don't get hung up on rtextsize as the only thing that atomic
> write alignment might apply to.

Sure

> 
>>> Yes, mkfs allows the user to override the hardware configsi it
>>> probes, but it also warns when the override is doing something
>>> sub-optimal (like aligning all AG headers to the same disk in a
>>> stripe).
>>>
>>> IOWs, mkfs should be pulling this atomic write info from the
>>> hardware and configuring the filesysetm around that information.
>>> That's the target we should be aiming the kernel implementation at
>>> and optimising for - a filesystem that is correctly configured
>>> according to published hardware capability.
>>
>> Right
>>
>> So, for example, if the atomic writes option is set and the rtextsize set by
>> the user is so much larger than what HW can support in terms of atomic
>> writes, then we should let the user know about this.
> 
> Well, this is part of the problem I mention above: you're focussing
> entirely on the rtdev setup and not the general "atomic writes
> require BMBT extent alignment constraints".

I'm really just saying what I was considering based on this series only.

> 
> So, maybe, yes, we might want to warn that the rtextsize is much
> bigger than the atomic write size of that device, but now there's
> something else we need to take into account: the rtdev could have a
> different atomic write size comxpapred to the data device.
> 
> What now?
> 
> IOWs, focussing on the rtdev misses key considerations for making
> the functionality generic, and we most definitely don't want to have
> to rev the on disk format a second time to support atomic writes
> for the data device. Hence we likely need two variables for atomic
> write sizes in the superblock - one for the rtdev, and one for the
> data device. And this then feeds through to Darrick's multi-rtdev
> stuff - each rtdev will need to have an attribute that tracks this
> information.

ok


>>>
>>> What the patchset does is try to extend and infer things from
>>> existing allocation alignment constraints, but then not apply those
>>> constraints to critical extent state operations (pure BMBT
>>> modifications) that atomic writes also need constrained to work
>>> correctly and efficiently.
>>
>> Right. Previously I also did mention that we could explicitly request the
>> atomic write size per-inode, but a drawback is that this would require an
>> on-disk format change.
> 
> We're already having to change the on-disk format for this (inode
> flag, superblock feature bit), so we really should be trying to make
> this generic and properly featured so that it can be used for
> anything that requires physical alignment of file data extents, not
> just atomic writes...

ok

...

>> Another motivation for this flag is that we can explicitly enable some
>> software-based atomic write support for an inode when the backing device
>> does not have HW support.
> 
> That's orthogonal to the aligment issue. If the BMBT extents are
> always aligned in a way that is compatible with atomic writes, we
> don't need and aomtic writes flag to tell the filesystem it should
> do an atomic write. 

Any instruction to do an atomic write should be encoded in the userspace 
write operation. Or maybe the file open operation in future - I still 
get questions about O_ATOMIC.

> That comes from userspace via the IOCB_ATOMIC
> flag.
> 
> It is the IOCB_ATOMIC that triggers the software fallback when the
> hardware doesn't support atomic writes, not an inode flag. 

To me, any such fallback seems like something which we should be 
explicitly enabling.

> All the
> filesystem has to do is guarantee all extent manipulations are
> correctly aligned and IOCB_ATOMIC can always be executed regardless
> of whether it is hardware or software that does it.
> 
> 
>> In addition, in this series setting FS_XFLAG_ATOMICWRITES means
>> XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
>> similar for other OSes, and for those other OSes it may also mean some other
>> special alignment feature enabled. We want a consistent user experience.
> 
> I don't care about other OSes - they don't implement the
> FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
> compatibility for the user API.

Other FSes need to support FS_IOC_FSSETXATTR for atomic writes. Or at 
least should support it....

> 
> Fundamentally, atomic writes are *not a property of the filesystem
> on-disk format*. They require extent tracking constraints (i.e.
> alignment), and that's the property of the filesystem on-disk format
> that we need to manipulate here.
> 
> Users are not going to care if the setup ioctl for atomic writes
> is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They
> already know they have to align their IO properly for atomic writes,
> so it's not like this is something they will be completely
> unfamiliar with.
> 
> Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
> = max_atomic_write_size as a user interface to set up the inode for
> atomic writes is pretty concise and easy to use. It also isn't
> specific to atomic writes, and so this fucntionality can be used for
> anything that needs aligned extent manipulation for performant
> functionality.

This is where there seems to be a difference between how you would like 
atomic writes to be enabled and how Christoph would, judging by this:
https://lore.kernel.org/linux-fsdevel/20240110091929.GA31003@lst.de/

I think that if the FS and the userspace util can between them figure 
out what to do, then that is ok. This is something like what I proposed 
previously:

xfs_io -c "atomic-writes 64K" mnt/file

where the userspace util would use for the FS_IOC_FSSETXATTR ioctl:

FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ 
fsx.fsx_extsize

There is a question on the purpose of the FS_XFLAG_ATOMIC_WRITES flag, 
but, to me, it does seem useful for future feature support.

We could just have FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_EXTSIZE w/ 
fsx.fsx_extsize, and the kernel auto-enables FS_XFLAG_ALIGN_EXTENTS, but 
the other way seems better


> 
>>> to behave in a particular way - forced alignment - not that atomic
>>> writes are being used in the filesystem....
>>>
>>> At this point, the filesystem can do all the extent modification
>>> alignment stuff that atomic writes require without caring if the
>>> block device supports atomic writes or even if the
>>> application is using atomic writes.
>>>
>>> This means we can test the BMBT functionality in fstests without
>>> requiring hardware (or emulation) that supports atomic writes - all
>>> we need to do is set the forced align flag, an extent size hint and
>>> go run fsx on it...
>>>
>>
>> The current idea was that the forcealign feature would be required for the
>> second phase for atomic write support - non-rtvol support. Darrick did send
>> that series out separately over the New Year's break.
> 
> This is the wrong approach: this needs to be integrated into the
> same patchset so we can review the changes for atomic writes as a
> whole, not as two separate, independent on-disk format changes. The
> on-disk format change that atomic writes need is aligned BMBT extent
> manipulations, regardless of whether we are targeting the rtdev or
> the datadev, and trying to separate them is leading you down
> entirely the wrong path.
> 

ok, fine.

BTW, going back to the original discussion on the extent zeroing and 
your idea to do this in the iomap dio path, my impression is that we 
require changes like this:

- in iomap_dio_bio_iter(), we need to zero out the extent according to 
extsize (and not just FS block size)
- xfs_dio_write_end_io() -> xfs_iomap_write_unwritten() also needs to 
consider this extent being written, and not assume a FS block
- per-inode locking similar to what is done in 
xfs_file_dio_write_unaligned() - I need to check that further, though, 
as I am not yet sure on how we decide to use this exclusive lock or not.

Does this sound sane?

Thanks,
John
diff mbox series

Patch

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..758dc1c90a42 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -289,6 +289,9 @@  xfs_iomap_write_direct(
 		}
 	}
 
+	if (xfs_inode_atomicwrites(ip))
+		bmapi_flags = XFS_BMAPI_ZERO;
+
 	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
 			rblocks, force, &tp);
 	if (error)
@@ -812,6 +815,44 @@  xfs_direct_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	if (flags & IOMAP_ATOMIC) {
+		xfs_filblks_t unit_min_fsb, unit_max_fsb;
+		unsigned int unit_min, unit_max;
+
+		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
+		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
+
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+
+		if ((offset & mp->m_blockmask) ||
+		    (length & mp->m_blockmask)) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+
+		if (imap.br_blockcount == unit_min_fsb ||
+		    imap.br_blockcount == unit_max_fsb) {
+			/* ok if exactly min or max */
+		} else if (imap.br_blockcount < unit_min_fsb ||
+			   imap.br_blockcount > unit_max_fsb) {
+			error = -EINVAL;
+			goto out_unlock;
+		} else if (!is_power_of_2(imap.br_blockcount)) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+
+		if (imap.br_startoff &&
+		    imap.br_startoff & (imap.br_blockcount - 1)) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)