[v4,00/14] forcealign for xfs

Message ID	20240813163638.3751939-1-john.g.garry@oracle.com (mailing list archive)
Headers	show Received: from mx0a-00069f02.pphosted.com (mx0a-00069f02.pphosted.com [205.220.165.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BBA1F1A2558; Tue, 13 Aug 2024 16:37:01 +0000 (UTC) From: John Garry <john.g.garry@oracle.com> To: chandan.babu@oracle.com, djwong@kernel.org, dchinner@redhat.com, hch@lst.de Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, catherine.hoang@oracle.com, martin.petersen@oracle.com, John Garry <john.g.garry@oracle.com> Subject: [PATCH v4 00/14] forcealign for xfs Date: Tue, 13 Aug 2024 16:36:24 +0000 Message-Id: <20240813163638.3751939-1-john.g.garry@oracle.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk MIME-Version: 1.0
Series	forcealign for xfs \| expand [v4,00/14] forcealign for xfs [v4,01/14] xfs: only allow minlen allocations when near ENOSPC [v4,02/14] xfs: always tail align maxlen allocations [v4,03/14] xfs: simplify extent allocation alignment [v4,04/14] xfs: make EOF allocation simpler [v4,05/14] xfs: introduce forced allocation alignment [v4,06/14] xfs: align args->minlen for forced allocation alignment [v4,07/14] xfs: Introduce FORCEALIGN inode flag [v4,08/14] xfs: Update xfs_inode_alloc_unitsize() for forcealign [v4,09/14] xfs: Update xfs_setattr_size() for forcealign [v4,10/14] xfs: Do not free EOF blocks for forcealign [v4,11/14] xfs: Only free full extents for forcealign [v4,12/14] xfs: Unmap blocks according to forcealign [v4,13/14] xfs: Don't revert allocated offset for forcealign [v4,14/14] xfs: Enable file data forcealign feature

John Garry Aug. 13, 2024, 4:36 p.m. UTC

This series is being spun off the block atomic writes for xfs series at
[0].

That series got too big.

The actual forcealign patches are roughly the same in this series.

Why forcealign?
In some scenarios to may be required to guarantee extent alignment and
granularity.

For example, for atomic writes, the maximum atomic write unit size would
be limited at the extent alignment and granularity, guaranteeing that an
atomic write would not span data present in multiple extents.

forcealign may be useful as a performance tuning optimization in other
scenarios.

I decided not to support forcealign for RT devices here. Initially I
thought that it would be quite simple of implement. However, I discovered
through much testing and subsequent debug that this was not true, so I
decided to defer support to later.

Early development xfsprogs support is at:
https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/

Differences to v3:
- Add more RB tags (thanks)
- Change round-in/out API to up/down
- Change unmap blocks patch to use alloc_fsb and rework helper
  (Darrick)

Differences to v2:
- Add rounding to alloc unit helpers
- Update xfs_setattr_size()
- Disallow mount for forcealign and reflink
- Remove forcealign and RT/reflink inode checks
- Relocate setting of XFS_ALLOC_FORCEALIGN

Differences to v1:
- Add Darricks RB tags (thanks)
- Disallow mount for forcealign and RT
- Disallow cp --reflink from forcealign inode
- Comments improvements (Darrick)
- Coding style improvements (Darrick)
- Fix xfs_inode_alloc_unitsize() (Darrick)

Baseline:
7bf888fa26e8 (tag: xfs-6.11-fixes-1, xfs/xfs-6.11-fixes, xfs/for-next)
xfs: convert comma to semicolon

[0] https://lore.kernel.org/linux-xfs/20240607143919.2622319-1-john.g.garry@oracle.com/
[1] https://lore.kernel.org/linux-block/20240620125359.2684798-1-john.g.garry@oracle.com/

Darrick J. Wong (2):
  xfs: Introduce FORCEALIGN inode flag
  xfs: Enable file data forcealign feature

Dave Chinner (6):
  xfs: only allow minlen allocations when near ENOSPC
  xfs: always tail align maxlen allocations
  xfs: simplify extent allocation alignment
  xfs: make EOF allocation simpler
  xfs: introduce forced allocation alignment
  xfs: align args->minlen for forced allocation alignment

John Garry (6):
  xfs: Update xfs_inode_alloc_unitsize() for forcealign
  xfs: Update xfs_setattr_size() for forcealign
  xfs: Do not free EOF blocks for forcealign
  xfs: Only free full extents for forcealign
  xfs: Unmap blocks according to forcealign
  xfs: Don't revert allocated offset for forcealign

 fs/xfs/libxfs/xfs_alloc.c      |  33 ++--
 fs/xfs/libxfs/xfs_alloc.h      |   3 +-
 fs/xfs/libxfs/xfs_bmap.c       | 320 ++++++++++++++++++---------------
 fs/xfs/libxfs/xfs_format.h     |   9 +-
 fs/xfs/libxfs/xfs_ialloc.c     |  12 +-
 fs/xfs/libxfs/xfs_inode_buf.c  |  46 +++++
 fs/xfs/libxfs/xfs_inode_buf.h  |   3 +
 fs/xfs/libxfs/xfs_inode_util.c |  14 ++
 fs/xfs/libxfs/xfs_sb.c         |   2 +
 fs/xfs/xfs_bmap_util.c         |  14 +-
 fs/xfs/xfs_inode.c             |  41 ++++-
 fs/xfs/xfs_inode.h             |  16 ++
 fs/xfs/xfs_ioctl.c             |  46 ++++-
 fs/xfs/xfs_iops.c              |   4 +-
 fs/xfs/xfs_mount.h             |   2 +
 fs/xfs/xfs_reflink.c           |   5 +-
 fs/xfs/xfs_super.c             |  18 ++
 fs/xfs/xfs_trace.h             |   8 +-
 include/uapi/linux/fs.h        |   2 +
 19 files changed, 405 insertions(+), 193 deletions(-)

Ritesh Harjani (IBM) Sept. 4, 2024, 6:14 p.m. UTC | #1

John Garry <john.g.garry@oracle.com> writes:

> This series is being spun off the block atomic writes for xfs series at
> [0].
>
> That series got too big.
>
> The actual forcealign patches are roughly the same in this series.
>
> Why forcealign?
> In some scenarios to may be required to guarantee extent alignment and
> granularity.
>
> For example, for atomic writes, the maximum atomic write unit size would
> be limited at the extent alignment and granularity, guaranteeing that an
> atomic write would not span data present in multiple extents.
>
> forcealign may be useful as a performance tuning optimization in other
> scenarios.
>
> I decided not to support forcealign for RT devices here. Initially I
> thought that it would be quite simple of implement. However, I discovered
> through much testing and subsequent debug that this was not true, so I
> decided to defer support to later.
>
> Early development xfsprogs support is at:
> https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/
>

Hi John,

Thanks for your continued work on atomic write.
I went over the XFS patch series and this is my understanding + some queries. Could you please help with these.

1. As I understand XFS untorn atomic write support is built on top of FORCEALIGN feature (which this series is adding) which in turn uses extsize hint feature underneath.
   Now extsize hint mainly controls the alignment of both "physical start" & "logical start" offset and extent length, correct?
   This is done using args->alignment for start aand args->prod/mode variables for extent length. Correct?

   - If say we are not able to allocate an aligned physical start? Then since extsize is just a hint we go ahead with whatever best available extent is right?
   - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct?

2. If say there is an append write i.e. the allocation is needed to be done at EOF. Then we try for an exact bno (from eof block) and aligned extent length, right?
   i.e. xfs_bmap_btalloc_filestreams() -> xfs_bmap_btalloc_at_eof(ap, args);
   If it is not available then we try for nearby bno xfs_alloc_vextent_near_bno(args, target) and similar...

3. It is the FORCEALIGN feature which _mandates_ both allocation (by using extsize hint) and de-allocation to happen _only_ in extsize chunks.
   i.e. forcealign mandates -
   - the logical and physical start offset should be aligned as per args->alignment
   - extent length be aligned as per args->prod/mod.
     If above two cannot be satisfied then return -ENOSPC.

   - Does the unmapping of extents also only happens in extsize chunks (with forcealign)?
     If the start or end of the extent which needs unmapping is unaligned then we convert that extent to unwritten and skip, is it? (__xfs_bunmapi())
     This is a bit unclear to me. Maybe I need to look more deeper into the __xfs_bunmapi() while loop.

My knowledge about this is still limited so please ignore any silly questions.

-ritesh

Dave Chinner Sept. 4, 2024, 11:20 p.m. UTC | #2

On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
> > This series is being spun off the block atomic writes for xfs
> > series at [0].
> >
> > That series got too big.
> >
> > The actual forcealign patches are roughly the same in this
> > series.
> >
> > Why forcealign?  In some scenarios to may be required to
> > guarantee extent alignment and granularity.
> >
> > For example, for atomic writes, the maximum atomic write unit
> > size would be limited at the extent alignment and granularity,
> > guaranteeing that an atomic write would not span data present in
> > multiple extents.
> >
> > forcealign may be useful as a performance tuning optimization in
> > other scenarios.
> >
> > I decided not to support forcealign for RT devices here.
> > Initially I thought that it would be quite simple of implement.
> > However, I discovered through much testing and subsequent debug
> > that this was not true, so I decided to defer support to
> > later.
> >
> > Early development xfsprogs support is at:
> > https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/
> >
> 
> Hi John,
> 
> Thanks for your continued work on atomic write.  I went over the
> XFS patch series and this is my understanding + some queries.
> Could you please help with these.

Hi Ritesh - to make it easier for everyone to read and reply to you
emails, can you please word wrap the text at 72 columns?

> 1. As I understand XFS untorn atomic write support is built on top
> of FORCEALIGN feature (which this series is adding) which in turn
> uses extsize hint feature underneath.

Yes.

>    Now extsize hint mainly controls the alignment of both
>    "physical start" & "logical start" offset and extent length,
>    correct?

Yes.

>    This is done using args->alignment for start aand
>    args->prod/mode variables for extent length. Correct?

Yes.

>    - If say we are not able to allocate an aligned physical start?
>    Then since extsize is just a hint we go ahead with whatever
>    best available extent is right?

No. The definition of "forced alignment" is that we guarantee
aligned allocation to the extent size hint. i.e the extent size hint
is not a hint anymore - it defines the alignment we are guaranteeing
allocation will achieve.

hence if we can't align the extent to the alignment provided, we
fail the alignment.

>    - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct?

No. See the use of xfs_inode_alloc_unitsize() in all the places
where we free space. Forced alignment extends this function to
return the extent size, not the block size.

> 2. If say there is an append write i.e. the allocation is needed
> to be done at EOF. Then we try for an exact bno (from eof block)
> and aligned extent length, right?

Yes. This works because the previous extent is exactly aligned,
hence a contiguous allocation will continue to be correctly aligned
due to the forced alignment constraints.

>    i.e. xfs_bmap_btalloc_filestreams() ->
>    xfs_bmap_btalloc_at_eof(ap, args); If it is not available then
>    we try for nearby bno xfs_alloc_vextent_near_bno(args, target)
>    and similar...

yes, that's just the normal aligned allocation fallback path when
exact allocation fails.

> 3. It is the FORCEALIGN feature which _mandates_ both allocation
> (by using extsize hint) and de-allocation to happen _only_ in
> extsize chunks.
>
>    i.e. forcealign mandates -
>    - the logical and physical start offset should be aligned as
>    per args->alignment
>    - extent length be aligned as per args->prod/mod.
>      If above two cannot be satisfied then return -ENOSPC.

Yes.

> 
>    - Does the unmapping of extents also only happens in extsize
>    chunks (with forcealign)?

Yes, via use of xfs_inode_alloc_unitsize() in the high level code
aligning the fsbno ranges to be unmapped.

Remember, force align requires both logical file offset and
physical block number to be correctly aligned, so unmap alignment
has to be set up correctly at file offset level before we even know
what extents underly the file range we need to unmap....

>      If the start or end of the extent which needs unmapping is
>      unaligned then we convert that extent to unwritten and skip,
>      is it? (__xfs_bunmapi())

The high level code should be aligning the start and end of the
file range to be removed via xfs_inode_alloc_unitsize(). Hence 
the low level __xfs_bunmapi() code shouldn't ever be encountering
unaligned unmaps on force-aligned inodes.

-Dave.

Ritesh Harjani (IBM) Sept. 5, 2024, 3:56 a.m. UTC | #3

Thanks Dave for quick response. 

Dave Chinner <david@fromorbit.com> writes:

> On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>> > This series is being spun off the block atomic writes for xfs
>> > series at [0].
>> >
>> > That series got too big.
>> >
>> > The actual forcealign patches are roughly the same in this
>> > series.
>> >
>> > Why forcealign?  In some scenarios to may be required to
>> > guarantee extent alignment and granularity.
>> >
>> > For example, for atomic writes, the maximum atomic write unit
>> > size would be limited at the extent alignment and granularity,
>> > guaranteeing that an atomic write would not span data present in
>> > multiple extents.
>> >
>> > forcealign may be useful as a performance tuning optimization in
>> > other scenarios.
>> >
>> > I decided not to support forcealign for RT devices here.
>> > Initially I thought that it would be quite simple of implement.
>> > However, I discovered through much testing and subsequent debug
>> > that this was not true, so I decided to defer support to
>> > later.
>> >
>> > Early development xfsprogs support is at:
>> > https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/
>> >
>> 
>> Hi John,
>> 
>> Thanks for your continued work on atomic write.  I went over the
>> XFS patch series and this is my understanding + some queries.
>> Could you please help with these.
>
> Hi Ritesh - to make it easier for everyone to read and reply to you
> emails, can you please word wrap the text at 72 columns?
>

argh! Sorry about that. I had formed my queries in a separate notes
application and copy pasted it here. Hopefully this time it will be ok.

>> 1. As I understand XFS untorn atomic write support is built on top
>> of FORCEALIGN feature (which this series is adding) which in turn
>> uses extsize hint feature underneath.
>
> Yes.
>
>>    Now extsize hint mainly controls the alignment of both
>>    "physical start" & "logical start" offset and extent length,
>>    correct?
>
> Yes.
>
>>    This is done using args->alignment for start aand
>>    args->prod/mode variables for extent length. Correct?
>
> Yes.
>
>>    - If say we are not able to allocate an aligned physical start?
>>    Then since extsize is just a hint we go ahead with whatever
>>    best available extent is right?
>
> No. The definition of "forced alignment" is that we guarantee
> aligned allocation to the extent size hint. i.e the extent size hint
> is not a hint anymore - it defines the alignment we are guaranteeing
> allocation will achieve.
>
> hence if we can't align the extent to the alignment provided, we
> fail the alignment.
>
>>    - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct?
>
> No. See the use of xfs_inode_alloc_unitsize() in all the places
> where we free space. Forced alignment extends this function to
> return the extent size, not the block size.
>

Sorry for not being explicit. For queries in point 1. above, I was
referring to extent size hint feature w/o FORCEALIGN. But I got the
gist from your response. Thanks!

>> 2. If say there is an append write i.e. the allocation is needed
>> to be done at EOF. Then we try for an exact bno (from eof block)
>> and aligned extent length, right?
>
> Yes. This works because the previous extent is exactly aligned,
> hence a contiguous allocation will continue to be correctly aligned
> due to the forced alignment constraints.
>
>>    i.e. xfs_bmap_btalloc_filestreams() ->
>>    xfs_bmap_btalloc_at_eof(ap, args); If it is not available then
>>    we try for nearby bno xfs_alloc_vextent_near_bno(args, target)
>>    and similar...
>
> yes, that's just the normal aligned allocation fallback path when
> exact allocation fails.
>
>> 3. It is the FORCEALIGN feature which _mandates_ both allocation
>> (by using extsize hint) and de-allocation to happen _only_ in
>> extsize chunks.
>>
>>    i.e. forcealign mandates -
>>    - the logical and physical start offset should be aligned as
>>    per args->alignment
>>    - extent length be aligned as per args->prod/mod.
>>      If above two cannot be satisfied then return -ENOSPC.
>
> Yes.
>
>> 
>>    - Does the unmapping of extents also only happens in extsize
>>    chunks (with forcealign)?
>
> Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> aligning the fsbno ranges to be unmapped.
>
> Remember, force align requires both logical file offset and
> physical block number to be correctly aligned,

This is where I would like to double confirm it again. Even the
extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
physical start and logical start file offset and length right?

(Or does extsize hint only restricts alignment to logical start file
offset + length and not the physical start?)


Also it looks like there is no difference with ATOMIC_WRITE AND
FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
adding additional natural alignment restrictions on pos and len). 
So why maintain 2 separate on disk inode flags for FORCEALIGN AND
ATOMIC_WRITE?
- Do you foresee FORCEALIGN to be also used at other places w/o
ATOMIC_WRITE where feature differentiation between the two on an
inode is required?

- Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
& XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?

- But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
changes within XFS filesystem to support atomic writes, right? 
Is it something to just prevent users from destroying their own data
by not allowing a rw mount from an older kernel where users could do
unaligned writes to files marked for atomic writes?
Or is there any other reasoning to prevent XFS filesystem from becoming
inconsistent if an older kernel does a rw mount here.



> so unmap alignment
> has to be set up correctly at file offset level before we even know
> what extents underly the file range we need to unmap....
>
>>      If the start or end of the extent which needs unmapping is
>>      unaligned then we convert that extent to unwritten and skip,
>>      is it? (__xfs_bunmapi())
>
> The high level code should be aligning the start and end of the
> file range to be removed via xfs_inode_alloc_unitsize(). Hence 
> the low level __xfs_bunmapi() code shouldn't ever be encountering
> unaligned unmaps on force-aligned inodes.
>

Yes, but isn't this code snippet trying to handle a case when it finds an
unaligned extent during unmap? And what we are essentially trying to 
do here is leave the unwritten extent as is and if the found extent is
written then convert to unwritten and skip it (goto nodelete). This
means with forcealign if we encounter an unaligned extent then the file
will have this space reserved as is with extent marked unwritten. 

Is this understanding correct?

__xfs_bunmapi(...) 
{
    unsigned int alloc_fsb = xfs_inode_alloc_fsbsize(ip);
    <...>
    while (end != (xfs_fileoff_t)-1 && end >= start &
          (nexts == 0 || extno < nexts)) {
          <...>

          if (alloc_fsb == 1 || (flags & XFS_BMAPI_REMAP))
                goto delete;

         mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
                         del.br_startblock + del.br_blockcount);
         if (mod) {
                 /*
                  * Not aligned to allocation unit on the end.
                  * The extent could have been split into written
                  * and unwritten pieces, or we could just be
                  * unmapping part of it.  But we can't really
                  * get rid of part of an extent.
                  */
                 if (del.br_state == XFS_EXT_UNWRITTEN) {
                         /*
                          * This piece is unwritten, or we're not
                          * using unwritten extents.  Skip over it.
                          */
                         ASSERT((flags & XFS_BMAPI_REMAP) || end >= mod);
                         end -= mod > del.br_blockcount ?
                                 del.br_blockcount : mod;
                         if (end < got.br_startoff &&
                             !xfs_iext_prev_extent(ifp, &icur, &got)) {
                                 done = true;
                                 break;
                         }
                         continue;
                 }
                 /*
                  * It's written, turn it unwritten.
                  * This is better than zeroing it.
                  */
                 ASSERT(del.br_state == XFS_EXT_NORM);
                 ASSERT(tp->t_blk_res > 0);
                 /*
                  * If this spans an extent boundary, chop it back to
                  * the start of the one we end at.
                  */
                 if (del.br_blockcount > mod) {
                         del.br_startoff += del.br_blockcount - mod;
                         del.br_startblock += del.br_blockcount - mod;
                         del.br_blockcount = mod;
                 }
                 del.br_state = XFS_EXT_UNWRITTEN;
                 error = xfs_bmap_add_extent_unwritten_real(tp, ip,
                                 whichfork, &icur, &cur, &del,
                                 &logflags);
                 if (error)
                         goto error0;
                 goto nodelete;
         }

         mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
                                 del.br_startblock);
         if (mod) {
            // handle it for unaligned start block
            <...>
         }
    }
}

> -Dave.


Thanks a lot for answering the queries.

-ritesh

Dave Chinner Sept. 5, 2024, 6:33 a.m. UTC | #4

On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
> Dave Chinner <david@fromorbit.com> writes:
> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
> >> (by using extsize hint) and de-allocation to happen _only_ in
> >> extsize chunks.
> >>
> >>    i.e. forcealign mandates -
> >>    - the logical and physical start offset should be aligned as
> >>    per args->alignment
> >>    - extent length be aligned as per args->prod/mod.
> >>      If above two cannot be satisfied then return -ENOSPC.
> >
> > Yes.
> >
> >> 
> >>    - Does the unmapping of extents also only happens in extsize
> >>    chunks (with forcealign)?
> >
> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> > aligning the fsbno ranges to be unmapped.
> >
> > Remember, force align requires both logical file offset and
> > physical block number to be correctly aligned,
> 
> This is where I would like to double confirm it again. Even the
> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
> physical start and logical start file offset and length right?

No.

> (Or does extsize hint only restricts alignment to logical start file
> offset + length and not the physical start?)

Neither.

extsize hint by itself (i.e. existing behaviour) has no alignment
effect at all. All it affects is -size- of the extent. i.e. once
the extent start is chosen, extent size hints will trim the length
of the extent to a multiple of the extent size hint. Alignment is
not considered at all.

> Also it looks like there is no difference with ATOMIC_WRITE AND
> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
> adding additional natural alignment restrictions on pos and len). 

Atomic write requires additional hardware support, and it restricts
the valid sizes of extent size hints that can be set. Only atomic
writes can be done on files marked as configured for atomic writes;
force alignment can be done on any file...

> So why maintain 2 separate on disk inode flags for FORCEALIGN AND
> ATOMIC_WRITE?

the atomic write flag indicates that a file has been set up
correctly for atomic writes to be able to issues reliably. force
alignment doesn't guarantee that - it's just a mechanism that tells
the allocator to behave a specific way.

> - Do you foresee FORCEALIGN to be also used at other places w/o
> ATOMIC_WRITE where feature differentiation between the two on an
> inode is required?

The already exist. For example, reliably allocating huge page
mappings on DAX filesystems requires 2MB forced alignment. 

> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?

Same as above.

> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
> changes within XFS filesystem to support atomic writes, right? 

Because if you downgrade the kernel to something that doesn't
support atomic writes, then non-atomic sized/aligned data can be
written to the file and/or torn writes can occur.

Worse, extent size hints that don't match the underlying hardware
support could be set up for inodes, and when the kernel is upgraded
again then atomic writes will fail on inodes that have atomic write
flags set on them....

> Is it something to just prevent users from destroying their own data
> by not allowing a rw mount from an older kernel where users could do
> unaligned writes to files marked for atomic writes?
> Or is there any other reasoning to prevent XFS filesystem from becoming
> inconsistent if an older kernel does a rw mount here.

The older kernel does not know what the unknown inode flag means
(i.e. atomic writes) and so, by definition, we cannot allow it to
modify metadata or file data because it may not modify it in the
correct way for that flag being set on the inode.

Kernels that don't understand feature flags need to treat the
filesystem as read-only, no matter how trivial the feature addition
might seem.

> > so unmap alignment
> > has to be set up correctly at file offset level before we even know
> > what extents underly the file range we need to unmap....
> >
> >>      If the start or end of the extent which needs unmapping is
> >>      unaligned then we convert that extent to unwritten and skip,
> >>      is it? (__xfs_bunmapi())
> >
> > The high level code should be aligning the start and end of the
> > file range to be removed via xfs_inode_alloc_unitsize(). Hence 
> > the low level __xfs_bunmapi() code shouldn't ever be encountering
> > unaligned unmaps on force-aligned inodes.
> >
> 
> Yes, but isn't this code snippet trying to handle a case when it finds an
> unaligned extent during unmap?

It does exactly that.

> And what we are essentially trying to 
> do here is leave the unwritten extent as is and if the found extent is
> written then convert to unwritten and skip it (goto nodelete). This
> means with forcealign if we encounter an unaligned extent then the file
> will have this space reserved as is with extent marked unwritten. 
> 
> Is this understanding correct?

Yes, except for the fact that this code is not triggered by force
alignment.

This code is preexisting realtime file functionality. It is used
when the rtextent size is larger than a single filesytem block.

In these configurations, we do allocation in rtextsize units, but we
track written/unwritten extents in the BMBT on filesystem block
granularity. Hence we can have unaligned written/unwritten extent
boundaries, and if we aren't unmapping a whole rtextent then we
simply mark the unused portion of it as unwritten in the BMBT to
indicate it contains zeroes.

IOWs, existing RT files have different IO alignment and
written/unwritten extent conversion behaviour to the new forced
alignment feature. The implementation code is shared in many places,
but the higher level forced alignment functionality ensures there
are never unaligned unwritten extents created or unaligned
unmappings asked for. Hence this code does not trigger for the
forced alignment cases.

i.e. we have multiple seperate allocation alignment behaviours that
share an implementation, but they do not all behave exactly the same
way or provide the same alignment guarantees....

-Dave.

John Garry Sept. 5, 2024, 10:15 a.m. UTC | #5

> 
>> 1. As I understand XFS untorn atomic write support is built on top
>> of FORCEALIGN feature (which this series is adding) which in turn
>> uses extsize hint feature underneath.
> 
> Yes.
> 
>>     Now extsize hint mainly controls the alignment of both
>>     "physical start" & "logical start" offset and extent length,
>>     correct?
> 
> Yes.

To be clear, only for atomic writes do we require physical block alignment.

> 
>>     This is done using args->alignment for start aand
>>     args->prod/mode variables for extent length. Correct?
> 
> Yes.
> 
>>     - If say we are not able to allocate an aligned physical start?
>>     Then since extsize is just a hint we go ahead with whatever
>>     best available extent is right?

---

> 
>>
>>     - Does the unmapping of extents also only happens in extsize
>>     chunks (with forcealign)?
> 
> Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> aligning the fsbno ranges to be unmapped.
> 
> Remember, force align requires both logical file offset and
> physical block number to be correctly aligned, so unmap alignment
> has to be set up correctly at file offset level before we even know
> what extents underly the file range we need to unmap....
> 
>>       If the start or end of the extent which needs unmapping is
>>       unaligned then we convert that extent to unwritten and skip,
>>       is it? (__xfs_bunmapi())
> 
> The high level code should be aligning the start and end of the
> file range to be removed via xfs_inode_alloc_unitsize(). 

Is that the case for something like truncate? There we just say what is 
the end block which we want to truncate to in 
xfs_itruncate_extents_flags(new_size)  -> 
xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc 
unit aligned.

> Hence
> the low level __xfs_bunmapi() code shouldn't ever be encountering
> unaligned unmaps on force-aligned inodes.

Dave Chinner Sept. 5, 2024, 9:47 p.m. UTC | #6

On Thu, Sep 05, 2024 at 11:15:41AM +0100, John Garry wrote:
> > >     - Does the unmapping of extents also only happens in extsize
> > >     chunks (with forcealign)?
> > 
> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> > aligning the fsbno ranges to be unmapped.
> > 
> > Remember, force align requires both logical file offset and
> > physical block number to be correctly aligned, so unmap alignment
> > has to be set up correctly at file offset level before we even know
> > what extents underly the file range we need to unmap....
> > 
> > >       If the start or end of the extent which needs unmapping is
> > >       unaligned then we convert that extent to unwritten and skip,
> > >       is it? (__xfs_bunmapi())
> > 
> > The high level code should be aligning the start and end of the
> > file range to be removed via xfs_inode_alloc_unitsize().
> 
> Is that the case for something like truncate? There we just say what is the
> end block which we want to truncate to in
> xfs_itruncate_extents_flags(new_size)  ->
> xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit
> aligned.

Ah, I thought we had that alignment in xfs_itruncate_extents_flags()
already, but if we don't then that's a bug that needs to be fixed.

We change the space reservation in xfs-setattr_size() for this case
(patch 9) but then don't do any alignment there - it relies on
xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
removal alignment w.r.t. the new EOF.

i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
page cache removal so the user doesn't see old data beyond EOF,
whilst xfs_itruncate_extents_flags() is supposed to take care of the
extent removal and the details of that operation (e.g. alignment).

Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
into account for the post-eof block removal, but doesn't change
xfs_free_eofblocks() at all. i.e  it also relies on
xfs_itruncate_extents_flags() to do the right thing for force
aligned inodes.

In this case, we are removing post-eof speculative preallocation
that that has been allocated by delalloc conversion during
writeback.  These post-eof extents will already be unwritten extents
because delalloc conversion uses unwritten extents to avoid
stale data exposure if we crash between allocation and the data
being written to the extents. Hence there should be no extents to
convert to unwritten in the majority of cases here.

The only case where we might get written extents beyond EOF is if
the file has been truncated down, but in that case we don't really
care because truncate should have already taken care of post-eof
extent alignment for us. xfs_can_free_eofblocks() will see this
extent alignment and so we'll skip xfs_free_eofblocks() in this case
altogether....

Hence xfs_free_eofblocks() should never need to convert a partial
unaligned extent range to unwritten when force-align is enabled
because the post-eof extents should already be unwritten. We also
want to leave the inode in the most optimal state for future
extension, which means we want the post-eof extent to be correctly
aligned.

Hence there are multiple reasons that xfs_itruncate_extents_flags()
should be aligning the post-EOF block it is starting the unmapping
at for force aligned allocation contexts. And in doing so, we remove
the weird corner case where we can have an unaligned extent state
boundary at EOF for atomic writes....

-Dave.

John Garry Sept. 6, 2024, 2:31 p.m. UTC | #7

On 05/09/2024 22:47, Dave Chinner wrote:
>>>>   If the start or end of the extent which needs unmapping is
>>>>        unaligned then we convert that extent to unwritten and skip,
>>>>        is it? (__xfs_bunmapi())
>>> The high level code should be aligning the start and end of the
>>> file range to be removed via xfs_inode_alloc_unitsize().
>> Is that the case for something like truncate? There we just say what is the
>> end block which we want to truncate to in
>> xfs_itruncate_extents_flags(new_size)  ->
>> xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit
>> aligned.
> Ah, I thought we had that alignment in xfs_itruncate_extents_flags()
> already, but if we don't then that's a bug that needs to be fixed.

AFAICS, forcealign behaviour is same as RT, so then this would be a 
mainline bug, right?

 > > We change the space reservation in xfs-setattr_size() for this case
> (patch 9) but then don't do any alignment there - it relies on
> xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
> removal alignment w.r.t. the new EOF.
> 
> i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
> page cache removal so the user doesn't see old data beyond EOF,
> whilst xfs_itruncate_extents_flags() is supposed to take care of the
> extent removal and the details of that operation (e.g. alignment).

So we should roundup the unmap block to the alloc unit, correct? I have 
my doubts about that, and thought that xfs_bunmapi_range() takes care of 
any alignment handling.

> 
> Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
> into account for the post-eof block removal, but doesn't change
> xfs_free_eofblocks() at all. i.e  it also relies on
> xfs_itruncate_extents_flags() to do the right thing for force
> aligned inodes.

What state should the blocks post-EOF blocks be? A simple example of 
partially truncating an alloc unit is:

$xfs_io -c "extsize" mnt/file
[16384] mnt/file


$xfs_bmap -vvp mnt/file
mnt/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..20479]:      192..20671        0 (192..20671)     20480 000000


$truncate -s 10461184 mnt/file # 10M - 6FSB

$xfs_bmap -vvp mnt/file
mnt/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
    1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
  FLAG Values:
     0010000 Unwritten preallocated extent

Is that incorrect state?

> 
> In this case, we are removing post-eof speculative preallocation
> that that has been allocated by delalloc conversion during
> writeback.  These post-eof extents will already be unwritten extents
> because delalloc conversion uses unwritten extents to avoid
> stale data exposure if we crash between allocation and the data
> being written to the extents. Hence there should be no extents to
> convert to unwritten in the majority of cases here.
> 
> The only case where we might get written extents beyond EOF is if
> the file has been truncated down, but in that case we don't really
> care because truncate should have already taken care of post-eof
> extent alignment for us. xfs_can_free_eofblocks() will see this
> extent alignment and so we'll skip xfs_free_eofblocks() in this case
> altogether....
> 
> Hence xfs_free_eofblocks() should never need to convert a partial
> unaligned extent range to unwritten when force-align is enabled
> because the post-eof extents should already be unwritten. We also
> want to leave the inode in the most optimal state for future
> extension, which means we want the post-eof extent to be correctly
> aligned.
> 
> Hence there are multiple reasons that xfs_itruncate_extents_flags()
> should be aligning the post-EOF block it is starting the unmapping
> at for force aligned allocation contexts. And in doing so, we remove
> the weird corner case where we can have an unaligned extent state
> boundary at EOF for atomic writes....

Yeah, I don't think that sub-alloc unit extent zeroing would help us 
there, as we not be dealing with a new extent (for zeroing to occur).

Thanks,
John

Dave Chinner Sept. 8, 2024, 10:49 p.m. UTC | #8

On Fri, Sep 06, 2024 at 03:31:43PM +0100, John Garry wrote:
> On 05/09/2024 22:47, Dave Chinner wrote:
> > > > >   If the start or end of the extent which needs unmapping is
> > > > >        unaligned then we convert that extent to unwritten and skip,
> > > > >        is it? (__xfs_bunmapi())
> > > > The high level code should be aligning the start and end of the
> > > > file range to be removed via xfs_inode_alloc_unitsize().
> > > Is that the case for something like truncate? There we just say what is the
> > > end block which we want to truncate to in
> > > xfs_itruncate_extents_flags(new_size)  ->
> > > xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit
> > > aligned.
> > Ah, I thought we had that alignment in xfs_itruncate_extents_flags()
> > already, but if we don't then that's a bug that needs to be fixed.
> 
> AFAICS, forcealign behaviour is same as RT, so then this would be a mainline
> bug, right?

No, I don't think so. I think this is a case where forcealign and RT
behaviours differ, primarily because RT doesn't actually care about
file offset -> physical offset alignment and can do unaligned IO
whenever it wants. Hence having an unaligned written->unwritten
extent state transition doesnt' affect anything for rt files...

> 
> > > We change the space reservation in xfs-setattr_size() for this case
> > (patch 9) but then don't do any alignment there - it relies on
> > xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
> > removal alignment w.r.t. the new EOF.
> > 
> > i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
> > page cache removal so the user doesn't see old data beyond EOF,
> > whilst xfs_itruncate_extents_flags() is supposed to take care of the
> > extent removal and the details of that operation (e.g. alignment).
> 
> So we should roundup the unmap block to the alloc unit, correct? I have my
> doubts about that, and thought that xfs_bunmapi_range() takes care of any
> alignment handling.

The start should round up, yes. And, no, xfs_bunmapi_range() isn't
specifically "taking care" of alignment. It's just marking a partial
alloc unit range as unwritten, which means we now have -unaligned
extents- in the BMBT. 

> 
> > 
> > Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
> > into account for the post-eof block removal, but doesn't change
> > xfs_free_eofblocks() at all. i.e  it also relies on
> > xfs_itruncate_extents_flags() to do the right thing for force
> > aligned inodes.
> 
> What state should the blocks post-EOF blocks be? A simple example of
> partially truncating an alloc unit is:
> 
> $xfs_io -c "extsize" mnt/file
> [16384] mnt/file
> 
> 
> $xfs_bmap -vvp mnt/file
> mnt/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>    0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
> 
> 
> $truncate -s 10461184 mnt/file # 10M - 6FSB
> 
> $xfs_bmap -vvp mnt/file
> mnt/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>    0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
>    1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
>  FLAG Values:
>     0010000 Unwritten preallocated extent
> 
> Is that incorrect state?

Think about it: what happens if you now truncate it back up to 10MB
(i.e. aligned length) and then do an aligned atomic write on it.

First: What happens when you truncate up?

......

Yes, iomap_zero_range() will see the unwritten extent and skip it.
i.e. The unwritten extent stays as an unwritten extent, it's now
within EOF. That written->unwritten extent boundary is not on an
aligned file offset.

Second: What happens when you do a correctly aligned atomic write
that spans this range now?

......

Iomap only maps a single extent at a time, so it will only map the
written range from the start of the IO (aligned) to the start of the
unwritten extent (unaligned).  Hence the atomic write will be
rejected because we can't do the atomic write to such an unaligned
extent.

That's not a bug in the atomic write path - this failure occurs
because of the truncate behaviour doing post-eof unwritten extent
conversion....

Yes, I agree that the entire -physical- extent is still correctly
aligned on disk so you could argue that the unwritten conversion
that xfs_bunmapi_range is doing is valid forced alignment behaviour.
However, the fact is that breaking the aligned physical extent into
two unaligned contiguous extents in different states in the BMBT
means that they are treated as two seperate unaligned extents, not
one contiguous aligned physical extent.

This extent state discontiunity is results in breaking physical IO
across the extent state boundary. Hence such an unaligned extent
state change violates the physical IO alignment guarantees that
forced alignment is supposed to provide atomic writes...

This is the reason why operations like hole punching round to extent
size for forced alignment at the highest level. i.e. We cannot allow
xfs_bunmapi_range() to "take care" of alignment handling because
managing partial extent unmappings with sub-alloc-unit unwritten
extents does not work for forced alignment.

Again, this is different to the traditional RT file behaviour - it
can use unwritten extents for sub-alloc-unit alignment unmaps
because the RT device can align file offset to any physical offset,
and issue unaligned sector sized IO without any restrictions. Forced
alignment does not have this freedom, and when we extend forced
alignment to RT files, it will not have the freedom to use
unwritten extents for sub-alloc-unit unmapping, either.

-Dave.

John Garry Sept. 9, 2024, 4:18 p.m. UTC | #9

>>
>> AFAICS, forcealign behaviour is same as RT, so then this would be a mainline
>> bug, right?
> 
> No, I don't think so. I think this is a case where forcealign and RT
> behaviours differ, primarily because RT doesn't actually care about
> file offset -> physical offset alignment and can do unaligned IO
> whenever it wants. Hence having an unaligned written->unwritten
> extent state transition doesnt' affect anything for rt files...

ok

> 
>>
>>>> We change the space reservation in xfs-setattr_size() for this case
>>> (patch 9) but then don't do any alignment there - it relies on
>>> xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
>>> removal alignment w.r.t. the new EOF.
>>>
>>> i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
>>> page cache removal so the user doesn't see old data beyond EOF,
>>> whilst xfs_itruncate_extents_flags() is supposed to take care of the
>>> extent removal and the details of that operation (e.g. alignment).
>>
>> So we should roundup the unmap block to the alloc unit, correct? I have my
>> doubts about that, and thought that xfs_bunmapi_range() takes care of any
>> alignment handling.
> 
> The start should round up, yes. And, no, xfs_bunmapi_range() isn't
> specifically "taking care" of alignment. It's just marking a partial
> alloc unit range as unwritten, which means we now have -unaligned
> extents- in the BMBT.

Yes, I have noticed this previously.

> 
>>
>>>
>>> Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
>>> into account for the post-eof block removal, but doesn't change
>>> xfs_free_eofblocks() at all. i.e  it also relies on
>>> xfs_itruncate_extents_flags() to do the right thing for force
>>> aligned inodes.
>>
>> What state should the blocks post-EOF blocks be? A simple example of
>> partially truncating an alloc unit is:
>>
>> $xfs_io -c "extsize" mnt/file
>> [16384] mnt/file
>>
>>
>> $xfs_bmap -vvp mnt/file
>> mnt/file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
>>
>>
>> $truncate -s 10461184 mnt/file # 10M - 6FSB
>>
>> $xfs_bmap -vvp mnt/file
>> mnt/file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
>>     1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
>>   FLAG Values:
>>      0010000 Unwritten preallocated extent
>>
>> Is that incorrect state?
> 
> Think about it: what happens if you now truncate it back up to 10MB
> (i.e. aligned length) and then do an aligned atomic write on it.
> 
> First: What happens when you truncate up?
> 
> ......
> 
> Yes, iomap_zero_range() will see the unwritten extent and skip it.
> i.e. The unwritten extent stays as an unwritten extent, it's now
> within EOF. That written->unwritten extent boundary is not on an
> aligned file offset.

Right

> 
> Second: What happens when you do a correctly aligned atomic write
> that spans this range now?
> 
> ......
> 
> Iomap only maps a single extent at a time, so it will only map the
> written range from the start of the IO (aligned) to the start of the
> unwritten extent (unaligned).  Hence the atomic write will be
> rejected because we can't do the atomic write to such an unaligned
> extent.

It was being considered to change this handling for atomic writes. More 
below at *.

> 
> That's not a bug in the atomic write path - this failure occurs
> because of the truncate behaviour doing post-eof unwritten extent
> conversion....
> 
> Yes, I agree that the entire -physical- extent is still correctly
> aligned on disk so you could argue that the unwritten conversion
> that xfs_bunmapi_range is doing is valid forced alignment behaviour.
> However, the fact is that breaking the aligned physical extent into
> two unaligned contiguous extents in different states in the BMBT
> means that they are treated as two seperate unaligned extents, not
> one contiguous aligned physical extent.

Right, this is problematic.

* I guess that you had not been following the recent discussion on this 
topic in the latest xfs atomic writes series @ 
https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/ 
and also mentioned earlier in 
https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/

There I dropped the sub-alloc unit zeroing. The concept to iter for a 
single bio seems sane, but as Darrick mentioned, we have issue of 
non-atomically committing all the extent conversions.

FWIW, this is how I think that the change according to Darrick's idea 
would look:

---->8----

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 72c981e3dc92..ec76d6a8750d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -244,7 +244,8 @@ xfs_iomap_write_direct(
  	xfs_fileoff_t		count_fsb,
  	unsigned int		flags,
  	struct xfs_bmbt_irec	*imap,
-	u64			*seq)
+	u64			*seq,
+	bool			zeroing)
  {
  	struct xfs_mount	*mp = ip->i_mount;
  	struct xfs_trans	*tp;
@@ -285,7 +286,7 @@ xfs_iomap_write_direct(
  	 * the reserve block pool for bmbt block allocation if there is no space
  	 * left but we need to do unwritten extent conversion.
  	 */
+	if (zeroing)
+		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
	if (flags & IOMAP_DAX) {
  		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
  		if (imap->br_state == XFS_EXT_UNWRITTEN) {
  			force = true;
@@ -780,6 +781,11 @@ xfs_direct_write_iomap_begin(
  	u16			iomap_flags = 0;
  	unsigned int		lockmode;
  	u64			seq;
+	bool			atomic = flags & IOMAP_ATOMIC;
+	struct xfs_bmbt_irec	imap2[XFS_BMAP_MAX_NMAP];
+	xfs_fileoff_t		_offset_fsb = xfs_inode_rounddown_alloc_unit(ip, 
offset_fsb);
+	xfs_fileoff_t		_end_fsb = xfs_inode_roundup_alloc_unit(ip, end_fsb);
+	int loop_index;

  	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));

@@ -843,6 +849,9 @@ xfs_direct_write_iomap_begin(
  	if (imap_needs_alloc(inode, flags, &imap, nimaps))
  		goto allocate_blocks;

+	if (atomic && !imap_spans_range(&imap, offset_fsb, end_fsb))
+		goto out_atomic;
+
  	/*
  	 * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
  	 * a single map so that we avoid partial IO failures due to the rest of
@@ -897,7 +906,7 @@ xfs_direct_write_iomap_begin(
  	xfs_iunlock(ip, lockmode);

  	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
-			flags, &imap, &seq);
+			flags, &imap, &seq, false);
  	if (error)
  		return error;

@@ -918,6 +927,28 @@ xfs_direct_write_iomap_begin(
  	xfs_iunlock(ip, lockmode);
  	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);

+out_atomic:
+	nimaps = XFS_BMAP_MAX_NMAP;
+
+	error = xfs_bmapi_read(ip, _offset_fsb, _end_fsb - _offset_fsb, &imap2[0],
+			       &nimaps, 0);
+	for (loop_index = 0; loop_index < nimaps; loop_index++) {
+		struct xfs_bmbt_irec *_imap2 = &imap2[loop_index];
+
+		if (_imap2->br_state == XFS_EXT_UNWRITTEN) {
+
+			xfs_iunlock(ip, lockmode);
+
+			error = xfs_iomap_write_direct(ip, _imap2->br_startoff, 
_imap2->br_blockcount,
+					flags, &imap, &seq, true);
+			if (error)
+				return error;
+			goto relock;
+		}
+	}
+	/* We should not reach this, but TODO better handling */
+	error = -EINVAL;
+
  out_unlock:
  	if (lockmode)
  		xfs_iunlock(ip, lockmode);

-----8<----

But I have my doubts about using XFS_BMAPI_ZERO, even if it is just for 
pre-zeroing unwritten parts of the alloc unit for an atomic write.

> 
> This extent state discontiunity is results in breaking physical IO
> across the extent state boundary. Hence such an unaligned extent
> state change violates the physical IO alignment guarantees that
> forced alignment is supposed to provide atomic writes...

Yes

> 
> This is the reason why operations like hole punching round to extent
> size for forced alignment at the highest level. i.e. We cannot allow
> xfs_bunmapi_range() to "take care" of alignment handling because
> managing partial extent unmappings with sub-alloc-unit unwritten
> extents does not work for forced alignment.
> 
> Again, this is different to the traditional RT file behaviour - it
> can use unwritten extents for sub-alloc-unit alignment unmaps
> because the RT device can align file offset to any physical offset,
> and issue unaligned sector sized IO without any restrictions. Forced
> alignment does not have this freedom, and when we extend forced
> alignment to RT files, it will not have the freedom to use
> unwritten extents for sub-alloc-unit unmapping, either.
> 
So how do you think that we should actually implement 
xfs_itruncate_extents_flags() properly for forcealign? Would it simply 
be like:

--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
                 WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
                 return 0;
         }
+	if (xfs_inode_has_forcealign(ip))
+	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip, 
first_unmap_block);
         error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,

Thanks,
John

Ritesh Harjani (IBM) Sept. 10, 2024, 2:51 a.m. UTC | #10

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
>> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
>> >> (by using extsize hint) and de-allocation to happen _only_ in
>> >> extsize chunks.
>> >>
>> >>    i.e. forcealign mandates -
>> >>    - the logical and physical start offset should be aligned as
>> >>    per args->alignment
>> >>    - extent length be aligned as per args->prod/mod.
>> >>      If above two cannot be satisfied then return -ENOSPC.
>> >
>> > Yes.
>> >
>> >> 
>> >>    - Does the unmapping of extents also only happens in extsize
>> >>    chunks (with forcealign)?
>> >
>> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
>> > aligning the fsbno ranges to be unmapped.
>> >
>> > Remember, force align requires both logical file offset and
>> > physical block number to be correctly aligned,
>> 
>> This is where I would like to double confirm it again. Even the
>> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
>> physical start and logical start file offset and length right?
>
> No.
>
>> (Or does extsize hint only restricts alignment to logical start file
>> offset + length and not the physical start?)
>
> Neither.
>

Yes, thanks for the correction. Indeed extsize hint does not take care
of the physical start alignment at all.

> extsize hint by itself (i.e. existing behaviour) has no alignment
> effect at all. All it affects is -size- of the extent. i.e. once
> the extent start is chosen, extent size hints will trim the length
> of the extent to a multiple of the extent size hint. Alignment is
> not considered at all.
>

Please correct me I wrong here... but XFS considers aligning the logical
start and the length of the allocated extent (for extsize) as per below
code right? 

i.e.
1) xfs_direct_write_iomap_begin()
{
    <...>
    if (offset + length > XFS_ISIZE(ip))
		end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb);
                  => xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align);
                     return aligned_end_fsb
}

2) xfs_bmap_compute_alignments()
{
    <...>
    	else if (ap->datatype & XFS_ALLOC_USERDATA)
		     align = xfs_get_extsz_hint(ap->ip);

        if (align) {
            if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
                        ap->eof, 0, ap->conv, &ap->offset,
                        &ap->length))
                ASSERT(0);
            ASSERT(ap->length);

            args->prod = align;
            div_u64_rem(ap->offset, args->prod, &args->mod);
            if (args->mod)
                args->mod = args->prod - args->mod;
        }
        <...>
}

So args->prod and args->mod... aren't they use to align the logical
start and the length of the extent?


However, I do notice that when the file is closed XFS trims the length
allocated beyond EOF boundary (for extsize but not for forcealign from
the new forcealign series) i.e.

xfs_file_release() -> xfs_release() -> xfs_free_eofblocks()

I guess that is because xfs_can_free_eofblocks() does not consider
alignment for extsize in this function 

xfs_can_free_eofblocks()
{
<...>
	end_fsb = xfs_inode_roundup_alloc_unit(ip,
			XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip)));
<...>
}




>> Also it looks like there is no difference with ATOMIC_WRITE AND
>> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
>> adding additional natural alignment restrictions on pos and len). 
>
> Atomic write requires additional hardware support, and it restricts
> the valid sizes of extent size hints that can be set. Only atomic
> writes can be done on files marked as configured for atomic writes;
> force alignment can be done on any file...
>
>> So why maintain 2 separate on disk inode flags for FORCEALIGN AND
>> ATOMIC_WRITE?
>
> the atomic write flag indicates that a file has been set up
> correctly for atomic writes to be able to issues reliably. force
> alignment doesn't guarantee that - it's just a mechanism that tells
> the allocator to behave a specific way.
>
>> - Do you foresee FORCEALIGN to be also used at other places w/o
>> ATOMIC_WRITE where feature differentiation between the two on an
>> inode is required?
>
> The already exist. For example, reliably allocating huge page
> mappings on DAX filesystems requires 2MB forced alignment. 
>
>> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
>> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?
>
> Same as above.
>
>> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
>> changes within XFS filesystem to support atomic writes, right? 
>
> Because if you downgrade the kernel to something that doesn't
> support atomic writes, then non-atomic sized/aligned data can be
> written to the file and/or torn writes can occur.
>
> Worse, extent size hints that don't match the underlying hardware
> support could be set up for inodes, and when the kernel is upgraded
> again then atomic writes will fail on inodes that have atomic write
> flags set on them....
>
>> Is it something to just prevent users from destroying their own data
>> by not allowing a rw mount from an older kernel where users could do
>> unaligned writes to files marked for atomic writes?
>> Or is there any other reasoning to prevent XFS filesystem from becoming
>> inconsistent if an older kernel does a rw mount here.
>
> The older kernel does not know what the unknown inode flag means
> (i.e. atomic writes) and so, by definition, we cannot allow it to
> modify metadata or file data because it may not modify it in the
> correct way for that flag being set on the inode.
>
> Kernels that don't understand feature flags need to treat the
> filesystem as read-only, no matter how trivial the feature addition
> might seem.
>
>> > so unmap alignment
>> > has to be set up correctly at file offset level before we even know
>> > what extents underly the file range we need to unmap....
>> >
>> >>      If the start or end of the extent which needs unmapping is
>> >>      unaligned then we convert that extent to unwritten and skip,
>> >>      is it? (__xfs_bunmapi())
>> >
>> > The high level code should be aligning the start and end of the
>> > file range to be removed via xfs_inode_alloc_unitsize(). Hence 
>> > the low level __xfs_bunmapi() code shouldn't ever be encountering
>> > unaligned unmaps on force-aligned inodes.
>> >
>> 
>> Yes, but isn't this code snippet trying to handle a case when it finds an
>> unaligned extent during unmap?
>
> It does exactly that.
>
>> And what we are essentially trying to 
>> do here is leave the unwritten extent as is and if the found extent is
>> written then convert to unwritten and skip it (goto nodelete). This
>> means with forcealign if we encounter an unaligned extent then the file
>> will have this space reserved as is with extent marked unwritten. 
>> 
>> Is this understanding correct?
>
> Yes, except for the fact that this code is not triggered by force
> alignment.
>
> This code is preexisting realtime file functionality. It is used
> when the rtextent size is larger than a single filesytem block.
>
> In these configurations, we do allocation in rtextsize units, but we
> track written/unwritten extents in the BMBT on filesystem block
> granularity. Hence we can have unaligned written/unwritten extent
> boundaries, and if we aren't unmapping a whole rtextent then we
> simply mark the unused portion of it as unwritten in the BMBT to
> indicate it contains zeroes.
>
> IOWs, existing RT files have different IO alignment and
> written/unwritten extent conversion behaviour to the new forced
> alignment feature. The implementation code is shared in many places,
> but the higher level forced alignment functionality ensures there
> are never unaligned unwritten extents created or unaligned
> unmappings asked for. Hence this code does not trigger for the
> forced alignment cases.
>
> i.e. we have multiple seperate allocation alignment behaviours that
> share an implementation, but they do not all behave exactly the same
> way or provide the same alignment guarantees....
>

Thanks for taking time and explaining this. 

-ritesh

Ritesh Harjani (IBM) Sept. 10, 2024, 12:33 p.m. UTC | #11

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
>> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
>> >> (by using extsize hint) and de-allocation to happen _only_ in
>> >> extsize chunks.
>> >>
>> >>    i.e. forcealign mandates -
>> >>    - the logical and physical start offset should be aligned as
>> >>    per args->alignment
>> >>    - extent length be aligned as per args->prod/mod.
>> >>      If above two cannot be satisfied then return -ENOSPC.
>> >
>> > Yes.
>> >
>> >> 
>> >>    - Does the unmapping of extents also only happens in extsize
>> >>    chunks (with forcealign)?
>> >
>> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
>> > aligning the fsbno ranges to be unmapped.
>> >
>> > Remember, force align requires both logical file offset and
>> > physical block number to be correctly aligned,
>> 
>> This is where I would like to double confirm it again. Even the
>> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
>> physical start and logical start file offset and length right?
>
> No.
>
>> (Or does extsize hint only restricts alignment to logical start file
>> offset + length and not the physical start?)
>
> Neither.
>
> extsize hint by itself (i.e. existing behaviour) has no alignment
> effect at all. All it affects is -size- of the extent. i.e. once
> the extent start is chosen, extent size hints will trim the length
> of the extent to a multiple of the extent size hint. Alignment is
> not considered at all.
>
>> Also it looks like there is no difference with ATOMIC_WRITE AND
>> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
>> adding additional natural alignment restrictions on pos and len). 
>
> Atomic write requires additional hardware support, and it restricts
> the valid sizes of extent size hints that can be set. Only atomic
> writes can be done on files marked as configured for atomic writes;
> force alignment can be done on any file...
>
>> So why maintain 2 separate on disk inode flags for FORCEALIGN AND
>> ATOMIC_WRITE?
>
> the atomic write flag indicates that a file has been set up
> correctly for atomic writes to be able to issues reliably. force
> alignment doesn't guarantee that - it's just a mechanism that tells
> the allocator to behave a specific way.
>
>> - Do you foresee FORCEALIGN to be also used at other places w/o
>> ATOMIC_WRITE where feature differentiation between the two on an
>> inode is required?
>
> The already exist. For example, reliably allocating huge page
> mappings on DAX filesystems requires 2MB forced alignment. 
>
>> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
>> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?
>
> Same as above.
>
>> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
>> changes within XFS filesystem to support atomic writes, right? 
>
> Because if you downgrade the kernel to something that doesn't
> support atomic writes, then non-atomic sized/aligned data can be
> written to the file and/or torn writes can occur.
>
> Worse, extent size hints that don't match the underlying hardware
> support could be set up for inodes, and when the kernel is upgraded
> again then atomic writes will fail on inodes that have atomic write
> flags set on them....
>
>> Is it something to just prevent users from destroying their own data
>> by not allowing a rw mount from an older kernel where users could do
>> unaligned writes to files marked for atomic writes?
>> Or is there any other reasoning to prevent XFS filesystem from becoming
>> inconsistent if an older kernel does a rw mount here.
>
> The older kernel does not know what the unknown inode flag means
> (i.e. atomic writes) and so, by definition, we cannot allow it to
> modify metadata or file data because it may not modify it in the
> correct way for that flag being set on the inode.
>
> Kernels that don't understand feature flags need to treat the
> filesystem as read-only, no matter how trivial the feature addition
> might seem.
>

1. Will it require a fresh formatting of filesystem with mkfs.xfs for
enabling atomic writes (/forcealign) on XFS?
  a. Is that because reflink is not support with atomic writes
  (/forcealign) today?

As I understand for setting forcealign attr on any inode it checks for
whether xfs_has_forcealign(mp). That means forcealign can _only_ be
enabled during mkfs time and it also needs reflink to be disabled with
-m reflink=0. Right?

-ritesh

Dave Chinner Sept. 16, 2024, 5:25 a.m. UTC | #12

On Mon, Sep 09, 2024 at 05:18:43PM +0100, John Garry wrote:
> > > > Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
> > > > into account for the post-eof block removal, but doesn't change
> > > > xfs_free_eofblocks() at all. i.e  it also relies on
> > > > xfs_itruncate_extents_flags() to do the right thing for force
> > > > aligned inodes.
> > > 
> > > What state should the blocks post-EOF blocks be? A simple example of
> > > partially truncating an alloc unit is:
> > > 
> > > $xfs_io -c "extsize" mnt/file
> > > [16384] mnt/file
> > > 
> > > 
> > > $xfs_bmap -vvp mnt/file
> > > mnt/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >     0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
> > > 
> > > 
> > > $truncate -s 10461184 mnt/file # 10M - 6FSB
> > > 
> > > $xfs_bmap -vvp mnt/file
> > > mnt/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >     0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
> > >     1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
> > >   FLAG Values:
> > >      0010000 Unwritten preallocated extent
> > > 
> > > Is that incorrect state?
> > 
> > Think about it: what happens if you now truncate it back up to 10MB
> > (i.e. aligned length) and then do an aligned atomic write on it.
> > 
> > First: What happens when you truncate up?
> > 
> > ......
> > 
> > Yes, iomap_zero_range() will see the unwritten extent and skip it.
> > i.e. The unwritten extent stays as an unwritten extent, it's now
> > within EOF. That written->unwritten extent boundary is not on an
> > aligned file offset.
> 
> Right
> 
> > 
> > Second: What happens when you do a correctly aligned atomic write
> > that spans this range now?
> > 
> > ......
> > 
> > Iomap only maps a single extent at a time, so it will only map the
> > written range from the start of the IO (aligned) to the start of the
> > unwritten extent (unaligned).  Hence the atomic write will be
> > rejected because we can't do the atomic write to such an unaligned
> > extent.
> 
> It was being considered to change this handling for atomic writes. More
> below at *.

I don't think that this is something specific to atomic writes -
forced alignment means -alignment is guaranteed- regardless of what
ends up using it.

Yes, we can track unwritten extents on an -unaligned- boundary, but
that doesn't mean that we should allow it when we are trying to
guarantee logical and physical alignment of the file offset and
extent boundaries. i.e. The definition of forced alignment behaviour
is that all file offsets and extents in the file are aligned to the
same alignment.

I don't see an exception that allows for unaligned unwritten
extents in that definition.


> > That's not a bug in the atomic write path - this failure occurs
> > because of the truncate behaviour doing post-eof unwritten extent
> > conversion....
> > 
> > Yes, I agree that the entire -physical- extent is still correctly
> > aligned on disk so you could argue that the unwritten conversion
> > that xfs_bunmapi_range is doing is valid forced alignment behaviour.
> > However, the fact is that breaking the aligned physical extent into
> > two unaligned contiguous extents in different states in the BMBT
> > means that they are treated as two seperate unaligned extents, not
> > one contiguous aligned physical extent.
> 
> Right, this is problematic.
> 
> * I guess that you had not been following the recent discussion on this
> topic in the latest xfs atomic writes series @ https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/
> and also mentioned earlier in
> https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/
> 
> There I dropped the sub-alloc unit zeroing. The concept to iter for a single
> bio seems sane, but as Darrick mentioned, we have issue of non-atomically
> committing all the extent conversions.

Yes, I understand these problems exist.  My entire point is that the
forced alignment implemention should never allow such unaligned
extent patterns to be created in the first place. If we avoid
creating such situations in the first place, then we never have to
care about about unaligned unwritten extent conversion breaking
atomic IO.

FWIW, I also understand things are different if we are doing 128kB
atomic writes on 16kB force aligned files. However, in this
situation we are treating the 128kB atomic IO as eight individual
16kB atomic IOs that are physically contiguous. Hence in this
situation it doesn't matter if we have a mix of 16kB aligned
written/unwritten/hole extents as each 16kB chunks is independent of
the others.

What matters is that each indivudal 16kB chunk shows either the old
data or the new data - we are not guaranteeing that the entire 128kB
write is atomic. Hence in this situation we can both submit and
process each 16kB shunk as independent IOs with independent IO
compeltion transactions. All that matters is that we don't signal
completion to userspace until all the IO is complete, and we already
do that for fragmented DIO writes...

> > Again, this is different to the traditional RT file behaviour - it
> > can use unwritten extents for sub-alloc-unit alignment unmaps
> > because the RT device can align file offset to any physical offset,
> > and issue unaligned sector sized IO without any restrictions. Forced
> > alignment does not have this freedom, and when we extend forced
> > alignment to RT files, it will not have the freedom to use
> > unwritten extents for sub-alloc-unit unmapping, either.
> > 
> So how do you think that we should actually implement
> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
> like:
> 
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
>                 WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
>                 return 0;
>         }
> +	if (xfs_inode_has_forcealign(ip))
> +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
> first_unmap_block);
>         error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,

Yes, it would be something like that, except it would have to be
done before first_unmap_block is verified.

-Dave.

Dave Chinner Sept. 16, 2024, 6:33 a.m. UTC | #13

On Tue, Sep 10, 2024 at 08:21:55AM +0530, Ritesh Harjani wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
> >> Dave Chinner <david@fromorbit.com> writes:
> >> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
> >> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
> >> >> (by using extsize hint) and de-allocation to happen _only_ in
> >> >> extsize chunks.
> >> >>
> >> >>    i.e. forcealign mandates -
> >> >>    - the logical and physical start offset should be aligned as
> >> >>    per args->alignment
> >> >>    - extent length be aligned as per args->prod/mod.
> >> >>      If above two cannot be satisfied then return -ENOSPC.
> >> >
> >> > Yes.
> >> >
> >> >> 
> >> >>    - Does the unmapping of extents also only happens in extsize
> >> >>    chunks (with forcealign)?
> >> >
> >> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> >> > aligning the fsbno ranges to be unmapped.
> >> >
> >> > Remember, force align requires both logical file offset and
> >> > physical block number to be correctly aligned,
> >> 
> >> This is where I would like to double confirm it again. Even the
> >> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
> >> physical start and logical start file offset and length right?
> >
> > No.
> >
> >> (Or does extsize hint only restricts alignment to logical start file
> >> offset + length and not the physical start?)
> >
> > Neither.
> >
> 
> Yes, thanks for the correction. Indeed extsize hint does not take care
> of the physical start alignment at all.
> 
> > extsize hint by itself (i.e. existing behaviour) has no alignment
> > effect at all. All it affects is -size- of the extent. i.e. once
> > the extent start is chosen, extent size hints will trim the length
> > of the extent to a multiple of the extent size hint. Alignment is
> > not considered at all.
> >
> 
> Please correct me I wrong here... but XFS considers aligning the logical
> start and the length of the allocated extent (for extsize) as per below
> code right? 

Sorry, I was talking about physical alignment, not logical file
offset alignment. The logical file offset alignment that is done
for extent size hints is much more convoluted and dependent on
certain preconditions existing for it to function as forced
alignment/atomic writes require.

> 
> i.e.
> 1) xfs_direct_write_iomap_begin()
> {
>     <...>
>     if (offset + length > XFS_ISIZE(ip))
> 		end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb);
>                   => xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align);
>                      return aligned_end_fsb
> }

That's calculating the file offset of the end of the extent for an
extending write. It's not really an alignment - it's simply
calculating the file offset the allocation needs to cover to allow
for aligned allocation. This length needs to be fed into the
transaction reservation (i.e. ENOSPC checks) before we start the
allocation, so we have to have some idea of the extent size we are
going to allocate here...

> 2) xfs_bmap_compute_alignments()
> {
>     <...>
>     	else if (ap->datatype & XFS_ALLOC_USERDATA)
> 		     align = xfs_get_extsz_hint(ap->ip);
> 
>         if (align) {
>             if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>                         ap->eof, 0, ap->conv, &ap->offset,
>                         &ap->length))
>                 ASSERT(0);
>             ASSERT(ap->length);
> 
>             args->prod = align;
>             div_u64_rem(ap->offset, args->prod, &args->mod);
>             if (args->mod)
>                 args->mod = args->prod - args->mod;
>         }
>         <...>
> }
> 
> So args->prod and args->mod... aren't they use to align the logical
> start and the length of the extent?

Nope. They are only used way down in xfs_alloc_fix_len(), which
trims the length of the selected *physical* extent to the required
length.

Look further up - ap->offset is the logical file offset the
allocation needs to cover.  Logical alignment of the offset (i.e.
determining where in the file the physical extent will be placed) is
done in xfs_bmap_extsize_align(). As i said above, it's not purely
an extent size alignment calculation....

> However, I do notice that when the file is closed XFS trims the length
> allocated beyond EOF boundary (for extsize but not for forcealign from
> the new forcealign series) i.e.
> 
> xfs_file_release() -> xfs_release() -> xfs_free_eofblocks()
> 
> I guess that is because xfs_can_free_eofblocks() does not consider
> alignment for extsize in this function 

Of course - who wants large chunks of space allocated beyond EOF
when you are never going to write to the file again?

i.e. If you have large extsize hints then the post-eof tail can
consume a -lot- of space that won't otherwise get freed. This can
lead to rapid, unexpected ENOSPC, and it's not clear to users what
the cause is.

Hence we don't care if extsz is set on the inode or not when we
decide to remove post-eof blocks - reclaiming the unused space is
much more important that an occasional unaligned or small extent.

Forcealign changes that equation, but if you choose forcealign you
are doing it for a specific reason and likely not applying it to the
entire filesystem.....

-Dave.

Dave Chinner Sept. 16, 2024, 7:03 a.m. UTC | #14

On Tue, Sep 10, 2024 at 06:03:12PM +0530, Ritesh Harjani wrote:
> >> Is it something to just prevent users from destroying their own data
> >> by not allowing a rw mount from an older kernel where users could do
> >> unaligned writes to files marked for atomic writes?
> >> Or is there any other reasoning to prevent XFS filesystem from becoming
> >> inconsistent if an older kernel does a rw mount here.
> >
> > The older kernel does not know what the unknown inode flag means
> > (i.e. atomic writes) and so, by definition, we cannot allow it to
> > modify metadata or file data because it may not modify it in the
> > correct way for that flag being set on the inode.
> >
> > Kernels that don't understand feature flags need to treat the
> > filesystem as read-only, no matter how trivial the feature addition
> > might seem.
> >
> 
> 1. Will it require a fresh formatting of filesystem with mkfs.xfs for
> enabling atomic writes (/forcealign) on XFS?

Initially, yes.

>   a. Is that because reflink is not support with atomic writes
>   (/forcealign) today?

It's much more complex than that.

e.g. How does force-align and COW interact, especially w.r.t.
sub-alloc unit overwrites, cowextsz based preallocation and
unwritten extents in the COW fork?

> As I understand for setting forcealign attr on any inode it checks for
> whether xfs_has_forcealign(mp). That means forcealign can _only_ be
> enabled during mkfs time and it also needs reflink to be disabled with
> -m reflink=0. Right?

forcealign doesn't need to be completely turned off when reflink is
enabled and/or vice versa. Both can co-exist in the filesytsem at
the same time, but the current implementation does not allow
forcealign and reflink to be used on the same inode at the same
time.

It was decided that the best way to handle the lack of reflink
support initially was to make the two feature bits incompatible at
mount time. Hence we currently have to make a non-reflink filesystem
to test forcealign based functionality.

However, doing it this way means that when we fix the implementation
to support reflink and forcealign together, we just remove the mount
time check and all existing reflink filesystems can be immediately
upgraded to support forcealign.

OTOH, we can't do this with atomic writes. Atomic writes require
some mkfs help because they require explicit physical alignment of
the filesystem to the underlying storage. Hence we'll eventually end
up with atomic writes needing to be enabled at mkfs time, but force
align will be an upgradeable feature flag.

-Dave.

John Garry Sept. 16, 2024, 9:44 a.m. UTC | #15

>> * I guess that you had not been following the recent discussion on this
>> topic in the latest xfs atomic writes series @ https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOEV0ciu8$
>> and also mentioned earlier in
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOiiEnYSk$
>>
>> There I dropped the sub-alloc unit zeroing. The concept to iter for a single
>> bio seems sane, but as Darrick mentioned, we have issue of non-atomically
>> committing all the extent conversions.
> 
> Yes, I understand these problems exist.  My entire point is that the
> forced alignment implemention should never allow such unaligned
> extent patterns to be created in the first place. If we avoid
> creating such situations in the first place, then we never have to
> care about about unaligned unwritten extent conversion breaking
> atomic IO.

OK, but what about this situation with non-EOF unaligned extents:

# xfs_io -c "lsattr -v" mnt/file
[extsize, has-xattr, force-align] mnt/file
# xfs_io -c "extsize" mnt/file
[65536] mnt/file
#
# xfs_io  -d -c "pwrite 64k 64k" mnt/file
# xfs_io  -d -c "pwrite 8k 8k" mnt/file
# xfs_bmap -vvp  mnt/file
mnt/file:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..15]:         384..399          0 (384..399)          16 010000
   1: [16..31]:        400..415          0 (400..415)          16 000000
   2: [32..127]:       416..511          0 (416..511)          96 010000
   3: [128..255]:      256..383          0 (256..383)         128 000000
FLAG Values:
    0010000 Unwritten preallocated extent

Here we have unaligned extents wrt extsize.

The sub-alloc unit zeroing would solve that - is that what you would 
still advocate (to solve that issue)?

> 
> FWIW, I also understand things are different if we are doing 128kB
> atomic writes on 16kB force aligned files. However, in this
> situation we are treating the 128kB atomic IO as eight individual
> 16kB atomic IOs that are physically contiguous.

Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes.

> Hence in this
> situation it doesn't matter if we have a mix of 16kB aligned
> written/unwritten/hole extents as each 16kB chunks is independent of
> the others.

Sure

> 
> What matters is that each indivudal 16kB chunk shows either the old
> data or the new data - we are not guaranteeing that the entire 128kB
> write is atomic. Hence in this situation we can both submit and
> process each 16kB shunk as independent IOs with independent IO
> compeltion transactions. All that matters is that we don't signal
> completion to userspace until all the IO is complete, and we already
> do that for fragmented DIO writes...
> 
>>> Again, this is different to the traditional RT file behaviour - it
>>> can use unwritten extents for sub-alloc-unit alignment unmaps
>>> because the RT device can align file offset to any physical offset,
>>> and issue unaligned sector sized IO without any restrictions. Forced
>>> alignment does not have this freedom, and when we extend forced
>>> alignment to RT files, it will not have the freedom to use
>>> unwritten extents for sub-alloc-unit unmapping, either.
>>>
>> So how do you think that we should actually implement
>> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
>> like:
>>
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
>>                  WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
>>                  return 0;
>>          }
>> +	if (xfs_inode_has_forcealign(ip))
>> +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
>> first_unmap_block);
>>          error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
> 
> Yes, it would be something like that, except it would have to be
> done before first_unmap_block is verified.
> 

ok, and are you still of the opinion that this does not apply to rtvol?

Thanks,
John

John Garry Sept. 16, 2024, 10:24 a.m. UTC | #16

On 16/09/2024 08:03, Dave Chinner wrote:
> OTOH, we can't do this with atomic writes. Atomic writes require
> some mkfs help because they require explicit physical alignment of
> the filesystem to the underlying storage. 

If we are enabling atomic writes at mkfs time, then we can ensure agsize 
% extsize == 0. That provides the physical alignment guarantee. It also 
makes sense to ensure extsize is a power-of-2.

However, extsize is re-configurble per inode. So, for an inode enabled 
for atomic writes, we must still ensure agsize % new extsize == 0 (and 
also new extsize is a power-of-2)

> Hence we'll eventually end
> up with atomic writes needing to be enabled at mkfs time, but force
> align will be an upgradeable feature flag.

Could atomic writes also be an upgradeable feature? We just need to 
ensure that agsize % extsize == 0 for an inode enabled for atomic 
writes. Valid extsize values may be quite limited, though, depending on 
the value of agsize.

Thanks,
John

Darrick J. Wong Sept. 17, 2024, 8:54 p.m. UTC | #17

On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> On 16/09/2024 08:03, Dave Chinner wrote:
> > OTOH, we can't do this with atomic writes. Atomic writes require
> > some mkfs help because they require explicit physical alignment of
> > the filesystem to the underlying storage.

Forcealign requires agsize%extsize==0, it's atomicwrites that adds the
requirement that extsize be a power of 2...

> If we are enabling atomic writes at mkfs time, then we can ensure agsize %
> extsize == 0. That provides the physical alignment guarantee. It also makes
> sense to ensure extsize is a power-of-2.
> 
> However, extsize is re-configurble per inode. So, for an inode enabled for
> atomic writes, we must still ensure agsize % new extsize == 0 (and also new
> extsize is a power-of-2)

...so yes.

> > Hence we'll eventually end
> > up with atomic writes needing to be enabled at mkfs time, but force
> > align will be an upgradeable feature flag.
> 
> Could atomic writes also be an upgradeable feature? We just need to ensure
> that agsize % extsize == 0 for an inode enabled for atomic writes. Valid
> extsize values may be quite limited, though, depending on the value of
> agsize.

I don't see why forcealign and atomicwrites can't be added to the sb
featureset after the fact, though you're right that callers of xfs_io
chattr might be hard pressed to find an fsx_extsize value that fits.

--D

> Thanks,
> John
> 
>

Dave Chinner Sept. 17, 2024, 10:12 p.m. UTC | #18

On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> On 16/09/2024 08:03, Dave Chinner wrote:
> > OTOH, we can't do this with atomic writes. Atomic writes require
> > some mkfs help because they require explicit physical alignment of
> > the filesystem to the underlying storage.
> 
> If we are enabling atomic writes at mkfs time, then we can ensure agsize %
> extsize == 0. That provides the physical alignment guarantee. It also makes
> sense to ensure extsize is a power-of-2.

No, mkfs does not want to align anything to "extsize". It needs to
align the filesystem geometry to be compatible with the underlying
block device atomic write alignment parameters.

We just don't care if extsize is not an exact multiple of agsize.
As long as extsize is aligned to the atomic write boundaries and the
start of the AG is aligned to atomic write boundaries, we can
allocate hardware aligned extsize sized extents from the AG.

AGs are always going to contain lots of non-aligned, randomly sized
extents for other stuff like metadata and unaligned file data.
Aligned allocation is all about finding extsized aligned free space
within the AG and has nothing to do with the size of the AG itself.

> However, extsize is re-configurble per inode. So, for an inode enabled for
> atomic writes, we must still ensure agsize % new extsize == 0 (and also new
> extsize is a power-of-2)

Ensuring that the extsize is aligned to the hardware atomic write
limits is a kernel runtime check when enabling atomic writes on an
inode.

In this case, we do not care what the AG size is - it is completely
irrelevant to these per-inode runtime checks because mkfs has
already guaranteed that the AG is correctly aligned to the
underlying hardware. That means is extsize is also aligned to the
underlying hardware, physical extent layout is guaranteed to be
compatible with the hardware constraints for atomic writes...

> > Hence we'll eventually end
> > up with atomic writes needing to be enabled at mkfs time, but force
> > align will be an upgradeable feature flag.
> 
> Could atomic writes also be an upgradeable feature? We just need to ensure
> that agsize % extsize == 0 for an inode enabled for atomic writes.

To turn the superblock feature bit on, we have to check the AGs are
correctly aligned to the *underlying hardware*. If they aren't
correctly aligned (and there is a good chance they will not be)
then we can't enable atomic writes at all. The only way to change
this is to physically move AGs around in the block device (i.e. via
xfs_expand tool I proposed).

i.e. the mkfs dependency on having the AGs aligned to the underlying
atomic write capabilities of the block device never goes away, even
if we want to make the feature dynamically enabled.

IOWs, yes, an existing filesystem -could- be upgradeable, but there
is no guarantee that is will be.

Quite frankly, we aren't going to see block devices that filesystems
already exist on suddenly sprout support for atomic writes mid-life.
Hence if mkfs detects atomic write support in the underlying device,
it should *always* modify the geometry to be compatible with atomic
writes and enable atomic write support.

Yes, that means the "incompat with reflink" issue needs to be fixed
before we take atomic writes out of experimental (i.e. we consistently
apply the same "full support" criteria we applied to DAX).

Hence by the time atomic writes are a fully supported feature, we're
going to be able to enable them by default at mkfs time for any
hardware that supports them...

> Valid
> extsize values may be quite limited, though, depending on the value of
> agsize.

No. The only limit agsize puts on extsize is that a single aligned
extent can't be larger than half the AG size. Forced alignment and
atomic writes don't change that.

-Dave.

Dave Chinner Sept. 17, 2024, 10:27 p.m. UTC | #19

On Mon, Sep 16, 2024 at 10:44:38AM +0100, John Garry wrote:
> 
> > > * I guess that you had not been following the recent discussion on this
> > > topic in the latest xfs atomic writes series @ https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOEV0ciu8$
> > > and also mentioned earlier in
> > > https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOiiEnYSk$
> > > 
> > > There I dropped the sub-alloc unit zeroing. The concept to iter for a single
> > > bio seems sane, but as Darrick mentioned, we have issue of non-atomically
> > > committing all the extent conversions.
> > 
> > Yes, I understand these problems exist.  My entire point is that the
> > forced alignment implemention should never allow such unaligned
> > extent patterns to be created in the first place. If we avoid
> > creating such situations in the first place, then we never have to
> > care about about unaligned unwritten extent conversion breaking
> > atomic IO.
> 
> OK, but what about this situation with non-EOF unaligned extents:
> 
> # xfs_io -c "lsattr -v" mnt/file
> [extsize, has-xattr, force-align] mnt/file
> # xfs_io -c "extsize" mnt/file
> [65536] mnt/file
> #
> # xfs_io  -d -c "pwrite 64k 64k" mnt/file
> # xfs_io  -d -c "pwrite 8k 8k" mnt/file
> # xfs_bmap -vvp  mnt/file
> mnt/file:
> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..15]:         384..399          0 (384..399)          16 010000
>   1: [16..31]:        400..415          0 (400..415)          16 000000
>   2: [32..127]:       416..511          0 (416..511)          96 010000
>   3: [128..255]:      256..383          0 (256..383)         128 000000
> FLAG Values:
>    0010000 Unwritten preallocated extent
> 
> Here we have unaligned extents wrt extsize.
> 
> The sub-alloc unit zeroing would solve that - is that what you would still
> advocate (to solve that issue)?

Yes, I thought that was already implemented for force-align with the
DIO code via the extsize zero-around changes in the iomap code. Why
isn't that zero-around code ensuring the correct extent layout here?

> > FWIW, I also understand things are different if we are doing 128kB
> > atomic writes on 16kB force aligned files. However, in this
> > situation we are treating the 128kB atomic IO as eight individual
> > 16kB atomic IOs that are physically contiguous.
> 
> Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes.

Right, but the eventual goal (given the statx parameters) is to be
able to do 8x16kB sequential atomic writes as a single 128kB IO, yes?

> > > > Again, this is different to the traditional RT file behaviour - it
> > > > can use unwritten extents for sub-alloc-unit alignment unmaps
> > > > because the RT device can align file offset to any physical offset,
> > > > and issue unaligned sector sized IO without any restrictions. Forced
> > > > alignment does not have this freedom, and when we extend forced
> > > > alignment to RT files, it will not have the freedom to use
> > > > unwritten extents for sub-alloc-unit unmapping, either.
> > > > 
> > > So how do you think that we should actually implement
> > > xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
> > > like:
> > > 
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
> > >                  WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
> > >                  return 0;
> > >          }
> > > +	if (xfs_inode_has_forcealign(ip))
> > > +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
> > > first_unmap_block);
> > >          error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
> > 
> > Yes, it would be something like that, except it would have to be
> > done before first_unmap_block is verified.
> > 
> 
> ok, and are you still of the opinion that this does not apply to rtvol?

The rtvol is *not* force-aligned. It -may- have some aligned
allocation requirements that are similar (i.e. sb_rextsize > 1 fsb)
but it does *not* force-align extents, written or unwritten.

The moment we add force-align support to RT files (as is the plan),
then the force-aligned inodes on the rtvol will need to behave as
force aligned inodes, not "rtvol" inodes.

-Dave.

Dave Chinner Sept. 17, 2024, 11:34 p.m. UTC | #20

On Tue, Sep 17, 2024 at 01:54:20PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> > On 16/09/2024 08:03, Dave Chinner wrote:
> > > OTOH, we can't do this with atomic writes. Atomic writes require
> > > some mkfs help because they require explicit physical alignment of
> > > the filesystem to the underlying storage.
> 
> Forcealign requires agsize%extsize==0,

No it doesn't.

AG size is irrelevant when aligning extents - all
that matters is that we can find a free extent that can be trimmed
to the alignment defined by extsize. 

> it's atomicwrites that adds the
> requirement that extsize be a power of 2...

Only by explicit implementation constraint.

Atomic writes do not require power of two extsize - they only
require correctly aligned physical extents.

e.g an 8kB atomic write is always guaranteed to succeed if
we have an extsize of 24kB for laying out the physical extents
because a 24kB physical extent is always 8kB aligned and is an exact
multiple of 8kB. This meets the requirements for 8kB atomic writes to
always succeed, and hence there is no fundamental requirement for
extsize to be a power of 2.

We have *chosen* to simplify the implementation by only allowing
a single aligned atomic write to be issued at a time. This means
the alignment and size of atomic writes is always the minimum size
the hardware advertises, and that is (at present) always a power of
2.

Hence the "extsize needs to be a power of 2" comes from the
constraints exposed from the block device configuration (i.e.
minimum atomic write unit), not from a filesystem design or
implementation constraint.

At the filesystem level, we have further simplified things by only
allowing extsize = atomic write size. Hence the initial
implementation ends up only support power of 2 extsize values. This
is not a hard design or implementation constraint, however.

.....

hmmmmm.

.....

In writing this I've had further thoughts on force-align and the
sub-alloc-unit unwritten extent issue we've been discussing here.
i.e. I've stopped and considered the existing design constraints
given what I wrote above and considered what is needed for
supporting large extsize for small atomic writes.

I think we need to support large extsize with small atomic write
size for two reasons:

1. extsize is still going to be needed for preventing excessive
fragmentation with atomic writes. It's the small DIO write workloads
that see lots of fragmentation, and applications using atomic writes
are essentially being forced down the path of being small DIO write
workloads.

2. we can allow force-align w/o atomic writes behaviour to match the
existing rtvol sb_rextsize > 1 fsb behaviour without impacting
atomic write behaviour. (i.e. less behavioural differences, more
common code, better performance, etc).

To do this (and I think we do want to do this), then we have to do
two things:

1. force-align needs to add a "unwritten align" inode parameter to
allow sub-extsize unwritten extent boundaries to exist in the BMBT.
(i.e.  similar to how rt files w/ sb_rextsize > 1 fsb currently
behave.)

This is purely an in-memory value - for pure "force-align" we can
set it 1 fsb and then the behaviour will match existing RT
behaviour.  We can abstract this behaviour by replacing the hard
coded 1 block alignment for unwritten conversion with an "unwritten
align" value which would initially be set to 1.

We can also abstract this code away from being "rt specific" and
make it entirely dependent on "alloc-unit" configuration. This means
rt, force-align and atomic write will all be running the same code,
which makes testing a lot easier..

2. inodes with atomic write enabled must set the unwritten align
value to the atomic write size exposed by the hardware, and the
extsize must be an exact integer multiple of the unwritten align
size.

The initial settings of unwritten align == extsize gives the current
behaviour of all allocation and extent conversion being aligned to
atomic write constraints.

The separation of unwritten conversion from the extsize then allows
allows the situation I described above with 8kB atomic writes
and 24kB extsize. Because unwritten conversion is aligned to atomic
wriet boundaries, we can use sub-alloc-unit unwritten extents
without violating atomic write boundaries.

This would allow us to use extsize for atomic writes in the same
manner we use it for now - enable large contiguous allocations to
prevent fragmentation when doing lots of concurrent small
"immediate write" operations across many files.

I think this can all be added on top of the existing patchset - it's
not really a fundamental change to any of it. It's a little bit more
abstraction and unification, but it enables a lot more flexibility
for optimising atomic write functionality in the future.

Thoughts?

-Dave.

John Garry Sept. 18, 2024, 7:59 a.m. UTC | #21

On 17/09/2024 23:12, Dave Chinner wrote:
> On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
>> On 16/09/2024 08:03, Dave Chinner wrote:
>>> OTOH, we can't do this with atomic writes. Atomic writes require
>>> some mkfs help because they require explicit physical alignment of
>>> the filesystem to the underlying storage.
>>
>> If we are enabling atomic writes at mkfs time, then we can ensure agsize %
>> extsize == 0. That provides the physical alignment guarantee. It also makes
>> sense to ensure extsize is a power-of-2.
> 
> No, mkfs does not want to align anything to "extsize". It needs to
> align the filesystem geometry to be compatible with the underlying
> block device atomic write alignment parameters.
> 
> We just don't care if extsize is not an exact multiple of agsize.
> As long as extsize is aligned to the atomic write boundaries and the
> start of the AG is aligned to atomic write boundaries, we can
> allocate hardware aligned extsize sized extents from the AG.
> 
> AGs are always going to contain lots of non-aligned, randomly sized
> extents for other stuff like metadata and unaligned file data.
> Aligned allocation is all about finding extsized aligned free space
> within the AG and has nothing to do with the size of the AG itself.

Fine, we can go the way of aligning the agsize to the atomic write unit 
max for mkfs.

> 
>> However, extsize is re-configurble per inode. So, for an inode enabled for
>> atomic writes, we must still ensure agsize % new extsize == 0 (and also new
>> extsize is a power-of-2)
> 
> Ensuring that the extsize is aligned to the hardware atomic write
> limits is a kernel runtime check when enabling atomic writes on an
> inode.
> 
> In this case, we do not care what the AG size is - it is completely
> irrelevant to these per-inode runtime checks because mkfs has
> already guaranteed that the AG is correctly aligned to the
> underlying hardware. That means is extsize is also aligned to the
> underlying hardware, physical extent layout is guaranteed to be
> compatible with the hardware constraints for atomic writes...

Sure, we would just need to enforce that extsize is a power-of-2 then.

> 
>>> Hence we'll eventually end
>>> up with atomic writes needing to be enabled at mkfs time, but force
>>> align will be an upgradeable feature flag.
>>
>> Could atomic writes also be an upgradeable feature? We just need to ensure
>> that agsize % extsize == 0 for an inode enabled for atomic writes.
> 
> To turn the superblock feature bit on, we have to check the AGs are
> correctly aligned to the *underlying hardware*. If they aren't
> correctly aligned (and there is a good chance they will not be)
> then we can't enable atomic writes at all. The only way to change
> this is to physically move AGs around in the block device (i.e. via
> xfs_expand tool I proposed).
 > > i.e. the mkfs dependency on having the AGs aligned to the underlying
> atomic write capabilities of the block device never goes away, even
> if we want to make the feature dynamically enabled.
> 
> IOWs, yes, an existing filesystem -could- be upgradeable, but there
> is no guarantee that is will be.
> 
> Quite frankly, we aren't going to see block devices that filesystems
> already exist on suddenly sprout support for atomic writes mid-life.

I would not be so sure. Some SCSI devices used in production which I 
know implicitly write 32KB atomically. And we would like to use them for 
atomic writes. 32KB is small and I guess that there is a small chance of 
pre-existing AGs not being 32KB aligned. I would need to check if there 
is even a min alignment for AGs...

> Hence if mkfs detects atomic write support in the underlying device,
> it should *always* modify the geometry to be compatible with atomic
> writes and enable atomic write support.

The current solution is to enable via commandline.

> 
> Yes, that means the "incompat with reflink" issue needs to be fixed
> before we take atomic writes out of experimental (i.e. we consistently
> apply the same "full support" criteria we applied to DAX).

In the meantime, if mkfs auto-enables atomic writes (when the HW 
supports), what will it do to reflink feature (in terms of enabling)?

> 
> Hence by the time atomic writes are a fully supported feature, we're
> going to be able to enable them by default at mkfs time for any
> hardware that supports them...
> 
>> Valid
>> extsize values may be quite limited, though, depending on the value of
>> agsize.
> 
> No. The only limit agsize puts on extsize is that a single aligned
> extent can't be larger than half the AG size. Forced alignment and
> atomic writes don't change that.
> 

ok

Thanks,
John

John Garry Sept. 18, 2024, 10:12 a.m. UTC | #22

On 17/09/2024 23:27, Dave Chinner wrote:
>> # xfs_bmap -vvp  mnt/file
>> mnt/file:
>> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>    0: [0..15]:         384..399          0 (384..399)          16 010000
>>    1: [16..31]:        400..415          0 (400..415)          16 000000
>>    2: [32..127]:       416..511          0 (416..511)          96 010000
>>    3: [128..255]:      256..383          0 (256..383)         128 000000
>> FLAG Values:
>>     0010000 Unwritten preallocated extent
>>
>> Here we have unaligned extents wrt extsize.
>>
>> The sub-alloc unit zeroing would solve that - is that what you would still
>> advocate (to solve that issue)?
> Yes, I thought that was already implemented for force-align with the
> DIO code via the extsize zero-around changes in the iomap code. Why
> isn't that zero-around code ensuring the correct extent layout here?

I just have not included the extsize zero-around changes here. They were 
just grouped with the atomic writes support, as they were added 
specifically for the atomic writes support. Indeed - to me at least - it 
is strange that the DIO code changes are required for XFS forcealign 
implementation. And, even if we use extsize zero-around changes for DIO 
path, what about buffered IO?

BTW, I still have concern with this extsize zero-around change which I 
was making:

xfs_iomap_write_unwritten()
{
	unsigned int rounding;

	/* when converting anything unwritten, we must be spanning an 	alloc 
unit, so round up/down */
	if (rounding > 1) {
		offset_fsb = rounddown(rounding);
		count_fsb = roundup(rounding);
	}

	...
	do {
		xfs_bmapi_write();
		...
		xfs_trans_commit();
	} while ();
}

As mentioned elsewhere, it's a bit of a bodge (to do this rounding).

> 
>>> FWIW, I also understand things are different if we are doing 128kB
>>> atomic writes on 16kB force aligned files. However, in this
>>> situation we are treating the 128kB atomic IO as eight individual
>>> 16kB atomic IOs that are physically contiguous.
>> Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes.
> Right, but the eventual goal (given the statx parameters) is to be
> able to do 8x16kB sequential atomic writes as a single 128kB IO, yes?

No, if atomic write unit max is 16KB, then userspace can only issue a 
single 16KB atomic write.

However, some things to consider:
a. the block layer may merge those 16KB atomic writes
b. userspace may also merge 16KB atomic writes and issue a larger atomic 
write (if atomic write unit max is > 16KB)

I had been wondering if there is any value in a lib for helping with b.

> 
>>>>> Again, this is different to the traditional RT file behaviour - it
>>>>> can use unwritten extents for sub-alloc-unit alignment unmaps
>>>>> because the RT device can align file offset to any physical offset,
>>>>> and issue unaligned sector sized IO without any restrictions. Forced
>>>>> alignment does not have this freedom, and when we extend forced
>>>>> alignment to RT files, it will not have the freedom to use
>>>>> unwritten extents for sub-alloc-unit unmapping, either.
>>>>>
>>>> So how do you think that we should actually implement
>>>> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
>>>> like:
>>>>
>>>> --- a/fs/xfs/xfs_inode.c
>>>> +++ b/fs/xfs/xfs_inode.c
>>>> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
>>>>                   WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
>>>>                   return 0;
>>>>           }
>>>> +	if (xfs_inode_has_forcealign(ip))
>>>> +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
>>>> first_unmap_block);
>>>>           error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
>>> Yes, it would be something like that, except it would have to be
>>> done before first_unmap_block is verified.
>>>
>> ok, and are you still of the opinion that this does not apply to rtvol?
> The rtvol is*not* force-aligned. It -may- have some aligned
> allocation requirements that are similar (i.e. sb_rextsize > 1 fsb)
> but it does*not* force-align extents, written or unwritten.
> 
> The moment we add force-align support to RT files (as is the plan),
> then the force-aligned inodes on the rtvol will need to behave as
> force aligned inodes, not "rtvol" inodes.

ok, fine

Thanks,
John

Dave Chinner Sept. 23, 2024, 2:57 a.m. UTC | #23

On Wed, Sep 18, 2024 at 08:59:41AM +0100, John Garry wrote:
> On 17/09/2024 23:12, Dave Chinner wrote:
> > On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> > > > Hence we'll eventually end
> > > > up with atomic writes needing to be enabled at mkfs time, but force
> > > > align will be an upgradeable feature flag.
> > > 
> > > Could atomic writes also be an upgradeable feature? We just need to ensure
> > > that agsize % extsize == 0 for an inode enabled for atomic writes.
> > 
> > To turn the superblock feature bit on, we have to check the AGs are
> > correctly aligned to the *underlying hardware*. If they aren't
> > correctly aligned (and there is a good chance they will not be)
> > then we can't enable atomic writes at all. The only way to change
> > this is to physically move AGs around in the block device (i.e. via
> > xfs_expand tool I proposed).
> > > i.e. the mkfs dependency on having the AGs aligned to the underlying
> > atomic write capabilities of the block device never goes away, even
> > if we want to make the feature dynamically enabled.
> > 
> > IOWs, yes, an existing filesystem -could- be upgradeable, but there
> > is no guarantee that is will be.
> > 
> > Quite frankly, we aren't going to see block devices that filesystems
> > already exist on suddenly sprout support for atomic writes mid-life.
> 
> I would not be so sure. Some SCSI devices used in production which I know
> implicitly write 32KB atomically. And we would like to use them for atomic
> writes.

Ok, but that's not going to be widespread. Very little storage
hardware out there supports atomic writes - the vast majority of
deployments will be new hardware that will have mkfs run on it.

A better argument for dynamic upgrade is turning on atomic writes
on reflink enabled filesystems once the kernel implementation has
been updates to allow the two features to co-exist.

> 32KB is small and I guess that there is a small chance of
> pre-existing AGs not being 32KB aligned. I would need to check if there is
> even a min alignment for AGs...

There is no default alignment for AGs unless there is a stripe unit
set. Then it will align AGs to the stripe unit. There is also no
guarantee that stripe units are aligned to powers of two or atomic
write units...

> > Hence if mkfs detects atomic write support in the underlying device,
> > it should *always* modify the geometry to be compatible with atomic
> > writes and enable atomic write support.
> 
> The current solution is to enable via commandline.

Yes, that's the current proposal.  What I'm saying is that this
isn't a future proof solution, nor how we want this functionality to
work in the future.

We should be looking at the block device capabilities (like we do for
stripe unit, etc) and then *do the right thing automatically*. If
the block device advertises atomic write support, then we should
automatically align the filesystem to atomic write constraints, even
if atomic writes can not be immediately enabled (because reflink).

I'm trying to describe how we want things to work once atomic write
support is ubiquitous. It needs to be simple for users and admins,
and it should work (or be reliably upgradeable) out of the box on
all new hardware that supports this functionality.

> > Yes, that means the "incompat with reflink" issue needs to be fixed
> > before we take atomic writes out of experimental (i.e. we consistently
> > apply the same "full support" criteria we applied to DAX).
> 
> In the meantime, if mkfs auto-enables atomic writes (when the HW supports),
> what will it do to reflink feature (in terms of enabling)?

I didn't say we should always "auto-enable atomic writes".

I said if the hardware is atomic write capable, then mkfs should
always *align the filesystem* to atomic write constraints.  A kernel
upgrade will eventually allow reflink and atomic writes to co-exist,
but only if the filesystem is correctly aligned to the hardware
constrains for atomic writes. We need to ensure we leave that
upgrade path open....

.... and only once we have full support can we make "mkfs
auto-enable atomic writes".

-Dave.

Christoph Hellwig Sept. 23, 2024, 3:33 a.m. UTC | #24

On Mon, Sep 23, 2024 at 12:57:32PM +1000, Dave Chinner wrote:
> Ok, but that's not going to be widespread. Very little storage
> hardware out there supports atomic writes - the vast majority of
> deployments will be new hardware that will have mkfs run on it.

Just about every enterprise NVMe SSD supports atomic write size
larger than a single LBA, because it is completely natural fallout
from FTL deѕign.  That beeing said to support those SSDs a block
size of 16 or 32k would be a lot more natural than all the forcealign
madness.

John Garry Sept. 23, 2024, 8 a.m. UTC | #25

On 23/09/2024 03:57, Dave Chinner wrote:
>> In the meantime, if mkfs auto-enables atomic writes (when the HW supports),
>> what will it do to reflink feature (in terms of enabling)?
> I didn't say we should always "auto-enable atomic writes".
> 
> I said if the hardware is atomic write capable, then mkfs should
> always*align the filesystem* to atomic write constraints.  A kernel
> upgrade will eventually allow reflink and atomic writes to co-exist,
> but only if the filesystem is correctly aligned to the hardware
> constrains for atomic writes. We need to ensure we leave that
> upgrade path open....
> 
> .... and only once we have full support can we make "mkfs
> auto-enable atomic writes".

ok, fine. The current maximum value of atomic write unit max is 512KB 
(assuming 4K PAGE_SIZE and 512B sector size), so that should not be too 
needlessly inefficient for laying out the AGs. However, for 16KB+ 
PAGE_SIZE, that value could naturally be larger. However having HW which 
supports such large atomics would be very unlikely.

Thanks,
John

John Garry Sept. 23, 2024, 8:16 a.m. UTC | #26

On 23/09/2024 04:33, Christoph Hellwig wrote:
> On Mon, Sep 23, 2024 at 12:57:32PM +1000, Dave Chinner wrote:
>> Ok, but that's not going to be widespread. Very little storage
>> hardware out there supports atomic writes - the vast majority of
>> deployments will be new hardware that will have mkfs run on it.
> 
> Just about every enterprise NVMe SSD supports atomic write size
> larger than a single LBA, because it is completely natural fallout
> from FTL deѕign.  That beeing said to support those SSDs a block
> size of 16 or 32k would be a lot more natural than all the forcealign
> madness.
> 

Outside the block allocator changes, most changes for forcealign are 
just refactoring the RT big alloc unit checks. So - as you have said 
previously - this so-called madness is already there. How can the sanity 
be improved?

To me, yes, there are so many "if (RT)" checks and special cases in the 
code, which makes a maintenance headache.

Christoph Hellwig Sept. 23, 2024, 12:07 p.m. UTC | #27

On Mon, Sep 23, 2024 at 09:16:22AM +0100, John Garry wrote:
> Outside the block allocator changes, most changes for forcealign are just 
> refactoring the RT big alloc unit checks. So - as you have said previously 
> - this so-called madness is already there. How can the sanity be improved?

As a first step by not making it worse, and that not only means not
spreading the rtextent stuff further, but more importantly not introducing
additional complexities by requiring to be able to write over the
written/unwritten boundaries created by either rtextentsize > 1 or
the forcealign stuff if you actually want atomic writes.

> To me, yes, there are so many "if (RT)" checks and special cases in the 
> code, which makes a maintenance headache.

Replacing them with a different condition doesn't really make that
any better.

John Garry Sept. 23, 2024, 12:33 p.m. UTC | #28

On 23/09/2024 13:07, Christoph Hellwig wrote:
> On Mon, Sep 23, 2024 at 09:16:22AM +0100, John Garry wrote:
>> Outside the block allocator changes, most changes for forcealign are just
>> refactoring the RT big alloc unit checks. So - as you have said previously
>> - this so-called madness is already there. How can the sanity be improved?
> 
> As a first step by not making it worse, and that not only means not
> spreading the rtextent stuff further,

I assume that refactoring rtextent into "big alloc unit" is spreading 
(rtextent stuff), right? If so, what other solution? CoW?

> but more importantly not introducing
> additional complexities by requiring to be able to write over the
> written/unwritten boundaries created by either rtextentsize > 1 or
> the forcealign stuff if you actually want atomic writes.

The very original solution required a single mapping and in written 
state for atomic writes. Reverting to that would save a lot of hassle in 
the kernel. It just means that the user needs to manually pre-zero.

> 
>> To me, yes, there are so many "if (RT)" checks and special cases in the
>> code, which makes a maintenance headache.
> 
> Replacing them with a different condition doesn't really make that
> any better.

I am just saying that the rtextent stuff is not nice, but it is not 
going away. I suppose a tiny perk is that "big alloc unit" checks are 
better than "if (rt)" checks, as it makes the condition slightly more 
obvious.

Christoph Hellwig Sept. 24, 2024, 6:17 a.m. UTC | #29

On Mon, Sep 23, 2024 at 01:33:12PM +0100, John Garry wrote:
>> As a first step by not making it worse, and that not only means not
>> spreading the rtextent stuff further,
>
> I assume that refactoring rtextent into "big alloc unit" is spreading 
> (rtextent stuff), right? If so, what other solution? CoW?

Well, if you look at the force align series you'd agree that it
spreads the thing out into the btree allocator.  Or do I misread it?

>
>> but more importantly not introducing
>> additional complexities by requiring to be able to write over the
>> written/unwritten boundaries created by either rtextentsize > 1 or
>> the forcealign stuff if you actually want atomic writes.
>
> The very original solution required a single mapping and in written state 
> for atomic writes. Reverting to that would save a lot of hassle in the 
> kernel. It just means that the user needs to manually pre-zero.

What atomic I/O sizes do your users require?  Would they fit into
a large sector size now supported by XFS (i.e. 32k for now).

John Garry Sept. 24, 2024, 9:48 a.m. UTC | #30

On 24/09/2024 07:17, Christoph Hellwig wrote:
> On Mon, Sep 23, 2024 at 01:33:12PM +0100, John Garry wrote:
>>> As a first step by not making it worse, and that not only means not
>>> spreading the rtextent stuff further,
>>
>> I assume that refactoring rtextent into "big alloc unit" is spreading
>> (rtextent stuff), right? If so, what other solution? CoW?
> 
> Well, if you look at the force align series you'd agree that it
> spreads the thing out into the btree allocator.  Or do I misread it?

Yes, there are more changes than just refactoring "big alloc unit" 
stuff. There are btree allocator changes.

About those btree allocator changes, strictly speaking there are just a 
couple of changes to provide forcealign support - the rest are prep 
patches. And those forcealign changes build on pre-existing allocator 
features, like extent alignment and length specifiers.

> 
>>
>>> but more importantly not introducing
>>> additional complexities by requiring to be able to write over the
>>> written/unwritten boundaries created by either rtextentsize > 1 or
>>> the forcealign stuff if you actually want atomic writes.
>>
>> The very original solution required a single mapping and in written state
>> for atomic writes. Reverting to that would save a lot of hassle in the
>> kernel. It just means that the user needs to manually pre-zero.
> 
> What atomic I/O sizes do your users require?  Would they fit into
> a large sector size now supported by XFS (i.e. 32k for now).
> 

It could be used, but then we have 16KB filesystem block size, which 
some just may not want. And we just don't want 16KB sector size, but I 
don't think that we require that if we use RWF_ATOMIC.

[v4,00/14] forcealign for xfs

Message

Comments