
[RFCv1,0/1] EXT4 support of multi-fsblock atomic write with bigalloc

Message ID cover.1742699765.git.ritesh.list@gmail.com (mailing list archive)

Message

Ritesh Harjani (IBM) March 23, 2025, 7:02 a.m. UTC
This is an RFC patch ahead of LSFMM to preview what multi-fsblock atomic write
support with bigalloc looks like. There is scope for improvement in the
implementation, but this shows the general idea of the design. More details
are provided in the actual patch. There are still TODOs and more testing is
needed. But with the iomap limitation of single-fsblock atomic writes now
lifted, the patch has definitely started to look better.

This is based on the vfs.all tree [1] for 6.15, which now has the necessary
iomap changes required for bigalloc support in ext4.

TODOs:
1. Add better testcases to test atomic write support with bigalloc.
2. Discuss the approach of keeping the jbd2 txn open while zeroing the short
   underlying unwritten extents or short holes to create a single mapped-type
   extent mapping. This should anyway be a non-performance-critical path.
3. We use ext4_map_blocks() in a loop instead of modifying the block allocator.
   Again, since it's a non-performance-sensitive path, hopefully that should be
   ok? Otherwise one can argue why take and release
   EXT4_I(inode)->i_data_sem multiple times. We won't take & release any group
   lock for this, since we know that with bigalloc the cluster is anyway
   available to us.
4. Once we start supporting files/inodes marked with an atomic-writes
   attribute, maybe we can add optimizations like zeroing out the entire
   underlying cluster when someone forcefully wants to fzero or fpunch an
   underlying disk block, to keep the mapped extent intact.
5. Stress testing of this is still pending via fsx and xfstests.

Reviews are appreciated.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.all&id=4f76518956c037517a4e4b120186075d3afb8266

Ritesh Harjani (IBM) (1):
  ext4: Add atomic write support for bigalloc

 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

--
2.48.1

Comments

Ojaswin Mujoo March 25, 2025, 11:42 a.m. UTC | #1
On Sun, Mar 23, 2025 at 12:32:18PM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 supports bigalloc feature which allows the FS to work in size of
> clusters (group of blocks) rather than individual blocks. This patch
> adds atomic write support for bigalloc so that systems with bs = ps can
> also create FS using -
>     mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
> 
> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
> adjust ext4's atomic write unit max value to cluster size. This can then support
> atomic write of size anywhere between [blocksize, clustersize].
> 
> We first query the underlying region of the requested range by calling
> ext4_map_blocks() call. Here are the various cases which we then handle
> for block allocation depending upon the underlying mapping type:
> 1. If the underlying region for the entire requested range is a mapped extent,
>    then we don't call ext4_map_blocks() to allocate anything. We don't need to
>    even start the jbd2 txn in this case.
> 2. For an append write case, we create a mapped extent.
> 3. If the underlying region is entirely a hole, then we create an unwritten
>    extent for the requested range.
> 4. If the underlying region is a large unwritten extent, then we split the
>    extent into 2 unwritten extents of the required size.
> 5. If the underlying region has any type of mixed mapping, then we call
>    ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
>    within the requested range. This then provides a single mapped extent type
>    mapping for the requested range.
> 
> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
> flag only when the underlying extent mapping of the requested range is
> not entirely a hole, an unwritten extent, or a fully mapped extent. That
> is, if the underlying region contains a mix of hole(s), unwritten
> extent(s), and mapped extent(s), we use this loop to ensure that all the
> short mappings are zeroed out. This guarantees that the entire requested
> range becomes a single, uniformly mapped extent. It is ok to do so
> because we know this is being done on a bigalloc enabled filesystem
> where the block bitmap represents the entire cluster unit.

Hi Ritesh, thanks for the patch. The approach looks good to me, just
adding a few comments below.
> 
> Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/super.c |  8 +++--
>  2 files changed, 93 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d04d8a7f12e7..0096a597ad04 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>     iomap->addr = IOMAP_NULL_ADDR;
>   }
>  }
> +/*
> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
> + * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
> + * mappings. This function is only called when the bigalloc is enabled, so we
> + * know that the allocated physical extent start is always aligned properly.
> + *
> + * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
> + * requested range does not have a single mapping type (Hole, Mapped, or
> + * Unwritten) throughout. In that case we will loop over the requested range to
> + * allocate and zero out the unwritten / holes in between, to get a single
> + * mapped extent from [m_lblk, m_len]. This case is mostly non-performance
> + * critical path, so it should be ok to loop using ext4_map_blocks() with
> + * appropriate flags to allocate & zero the underlying short holes/unwritten
> + * extents within the requested range.
> + */
> +static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
> +         struct ext4_map_blocks *map)
> +{
> + ext4_lblk_t m_lblk = map->m_lblk;
> + unsigned int m_len = map->m_len;
> + unsigned int mapped_len = 0, flags = 0;
> + u8 blkbits = inode->i_blkbits;
> + int ret;
> +
> + WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
> +
> + ret = ext4_map_blocks(handle, inode, map, 0);
> + if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
> +   flags = EXT4_GET_BLOCKS_CREATE;
> + else if ((ret == 0 && map->m_len >= m_len) ||
> +   (ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
> +   flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> + else
> +   flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +
> + do {
> +   ret = ext4_map_blocks(handle, inode, map, flags);

With the multiple calls to map block for converting the extents, I don't
think the transaction reservation would be enough anymore, since in the
worst case we could be converting at least (max atomic write size /
blocksize) extents. We need to account for that as well.

> +   if (ret < 0)
> +     return ret;
> +   mapped_len += map->m_len;
> +   map->m_lblk += map->m_len;
> +   map->m_len = m_len - mapped_len;
> + } while (mapped_len < m_len);

> +
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> +
> + /*
> +  * We might have done some work in above loop. Let's ensure we query the
> +  * start of the physical extent, based on the origin m_lblk and m_len
> +  * and also ensure we were able to allocate the required range for doing
> +  * atomic write.
> +  */
> + ret = ext4_map_blocks(handle, inode, map, 0);

 Here, we are calling ext4_map_blocks() 3 times unnecessarily even if a
 single complete mapping is found. I think a better approach would be to
 just go for the map_blocks and then decide if we want to split. Also,
 factor out a function to do the zero out. So, something like:

  if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
          flags = EXT4_GET_BLOCKS_CREATE;
  else
          flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

  ret = ext4_map_blocks(handle, inode, map, flags);

  if (map->m_len < m_len) {
          map->m_len = m_len;

          /* do the zero out */
          ext4_zero_mixed_mappings(handle, inode, map);
          ext4_map_blocks(handle, inode, map, 0);

          WARN_ON(!(map->m_flags & EXT4_MAP_MAPPED) || map->m_len < m_len);
  }

 I think this covers the 5 cases you mentioned in the commit message, if
 I'm not missing anything. Also, this way we avoid the duplication for
 the non-zero-out cases, and the zero-out function can then be reused in
 case we want to do the same for forcealign atomic writes in the future.

Regards,
ojaswin

> + if (ret != m_len) {
> +   ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
> +       m_lblk, m_len);
> +   return -EINVAL;
> + }
> + return mapped_len;
> +}
> 
>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>           unsigned int flags)
> @@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>   else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>     m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> 
> - ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
> +   ret = ext4_map_blocks_atomic(handle, inode, map);
> + else
> +   ret = ext4_map_blocks(handle, inode, map, m_flags);
> 
>   /*
>    * We cannot fill holes in indirect tree based inodes as that could
> @@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   int ret;
>   struct ext4_map_blocks map;
>   u8 blkbits = inode->i_blkbits;
> + unsigned int m_len_orig;
> 
>   if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>     return -EINVAL;
> @@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   map.m_lblk = offset >> blkbits;
>   map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>         EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> + m_len_orig = map.m_len;
> 
>   if (flags & IOMAP_WRITE) {
>     /*
> @@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>      */
>     if (offset + length <= i_size_read(inode)) {
>       ret = ext4_map_blocks(NULL, inode, &map, 0);
> -     if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> -       goto out;
> +     /*
> +      * For atomic writes the entire requested length should
> +      * be mapped.
> +      */
> +     if (map.m_flags & EXT4_MAP_MAPPED) {
> +       if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> +          (flags & IOMAP_ATOMIC && ret >= m_len_orig))
> +         goto out;
> +     }
> +     map.m_len = m_len_orig;
>     }
>     ret = ext4_iomap_alloc(inode, &map, flags);
>   } else {
> @@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>    */
>   map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
> 
> + /*
> +  * Before returning to iomap, let's ensure the allocated mapping
> +  * covers the entire requested length for atomic writes.
> +  */
> + if (flags & IOMAP_ATOMIC) {
> +   if (map.m_len < (length >> blkbits)) {
> +     WARN_ON(1);
> +     return -EINVAL;
> +   }
> + }
>   ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> 
>   return 0;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a50e5c31b937..cbb24d535d59 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
>  /*
>   * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
>   * @sb: super block
> - * TODO: Later add support for bigalloc
>   */
>  static void ext4_atomic_write_init(struct super_block *sb)
>  {
>   struct ext4_sb_info *sbi = EXT4_SB(sb);
>   struct block_device *bdev = sb->s_bdev;
> + unsigned int blkbits = sb->s_blocksize_bits;
> + unsigned int clustersize = sb->s_blocksize;
> 
>   if (!bdev_can_atomic_write(bdev))
>     return;
> @@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
>   if (!ext4_has_feature_extents(sb))
>     return;
> 
> + if (ext4_has_feature_bigalloc(sb))
> +   clustersize = 1U << (sbi->s_cluster_bits + blkbits);
> +
>   sbi->s_awu_min = max(sb->s_blocksize,
>             bdev_atomic_write_unit_min_bytes(bdev));
> - sbi->s_awu_max = min(sb->s_blocksize,
> + sbi->s_awu_max = min(clustersize,
>             bdev_atomic_write_unit_max_bytes(bdev));
>   if (sbi->s_awu_min && sbi->s_awu_max &&
>       sbi->s_awu_min <= sbi->s_awu_max) {
> --
> 2.48.1
>