
[RFCv1,0/1] EXT4 support of multi-fsblock atomic write with bigalloc

Message ID cover.1742699765.git.ritesh.list@gmail.com (mailing list archive)

Message

Ritesh Harjani (IBM) March 23, 2025, 7:02 a.m. UTC
This is an RFC patch ahead of LSFMM to preview what multi-fsblock atomic write
support with bigalloc looks like. There is scope for improvement in the
implementation, but this shows the general idea of the design. More details
are provided in the actual patch. There are still TODOs and more testing is
needed. But with the iomap limitation of single-fsblock atomic writes now
lifted, the patch has definitely started to look better.

This is based on the vfs.all tree [1] for 6.15, which now has the necessary
iomap changes required for bigalloc support in ext4.

TODOs:
1. Add better testcases to test atomic write support with bigalloc.
2. Discuss the approach of keeping the jbd2 txn open while zeroing the short
   underlying unwritten extents or short holes to create a single mapped-type
   extent mapping. This should anyway be a non-performance-critical path.
3. We use ext4_map_blocks() in a loop instead of modifying the block allocator.
   Again, since it's a non-performance-sensitive path, hopefully that should be
   ok? Otherwise one can argue why take and release
   EXT4_I(inode)->i_data_sem multiple times. We won't take & release any group
   lock for this, since we know that with bigalloc the cluster is anyway
   available to us.
4. Once we start supporting files/inodes marked with an atomic-writes
   attribute, maybe we can add optimizations like zeroing out the entire
   underlying cluster when someone forcefully wants to fzero or fpunch an
   underlying disk block, to keep the mapped extent intact.
5. Stress testing of this is still pending via fsx and xfstests.

Reviews are appreciated.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.all&id=4f76518956c037517a4e4b120186075d3afb8266

Ritesh Harjani (IBM) (1):
  ext4: Add atomic write support for bigalloc

 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

--
2.48.1

Comments

Ojaswin Mujoo March 25, 2025, 11:42 a.m. UTC | #1
On Sun, Mar 23, 2025 at 12:32:18PM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 supports bigalloc feature which allows the FS to work in size of
> clusters (group of blocks) rather than individual blocks. This patch
> adds atomic write support for bigalloc so that systems with bs = ps can
> also create FS using -
>     mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
> 
> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
> adjust ext4's atomic write unit max value to cluster size. This can then support
> atomic write of size anywhere between [blocksize, clustersize].
> 
> We first query the underlying region of the requested range by calling
> ext4_map_blocks() call. Here are the various cases which we then handle
> for block allocation depending upon the underlying mapping type:
> 1. If the underlying region for the entire requested range is a mapped extent,
>    then we don't call ext4_map_blocks() to allocate anything. We don't need to
>    even start the jbd2 txn in this case.
> 2. For an append write case, we create a mapped extent.
> 3. If the underlying region is entirely a hole, then we create an unwritten
>    extent for the requested range.
> 4. If the underlying region is a large unwritten extent, then we split the
>    extent into 2 unwritten extents of the required size.
> 5. If the underlying region has any type of mixed mapping, then we call
>    ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
>    within the requested range. This then provides a single mapped extent type
>    mapping for the requested range.
> 
> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
> flag only when the underlying extent mapping of the requested range is
> not entirely a hole, an unwritten extent, or a fully mapped extent. That
> is, if the underlying region contains a mix of hole(s), unwritten
> extent(s), and mapped extent(s), we use this loop to ensure that all the
> short mappings are zeroed out. This guarantees that the entire requested
> range becomes a single, uniformly mapped extent. It is ok to do so
> because we know this is being done on a bigalloc enabled filesystem
> where the block bitmap represents the entire cluster unit.

Hi Ritesh, thanks for the patch. The approach looks good to me, just
adding a few comments below.
> 
> Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/super.c |  8 +++--
>  2 files changed, 93 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d04d8a7f12e7..0096a597ad04 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>     iomap->addr = IOMAP_NULL_ADDR;
>   }
>  }
> +/*
> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
> + * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
> + * mappings. This function is only called when the bigalloc is enabled, so we
> + * know that the allocated physical extent start is always aligned properly.
> + *
> + * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
> + * requested range does not have a single mapping type (Hole, Mapped, or
> + * Unwritten) throughout. In that case we will loop over the requested range to
> + * allocate and zero out the unwritten / holes in between, to get a single
> + * mapped extent from [m_lblk, m_len]. This case is mostly non-performance
> + * critical path, so it should be ok to loop using ext4_map_blocks() with
> + * appropriate flags to allocate & zero the underlying short holes/unwritten
> + * extents within the requested range.
> + */
> +static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
> +         struct ext4_map_blocks *map)
> +{
> + ext4_lblk_t m_lblk = map->m_lblk;
> + unsigned int m_len = map->m_len;
> + unsigned int mapped_len = 0, flags = 0;
> + u8 blkbits = inode->i_blkbits;
> + int ret;
> +
> + WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
> +
> + ret = ext4_map_blocks(handle, inode, map, 0);
> + if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
> +   flags = EXT4_GET_BLOCKS_CREATE;
> + else if ((ret == 0 && map->m_len >= m_len) ||
> +   (ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
> +   flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> + else
> +   flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +
> + do {
> +   ret = ext4_map_blocks(handle, inode, map, flags);

With the multiple calls to map block for converting the extents, I don't
think the transaction reservation would be enough anymore, since in the
worst case we could be converting at least (max atomic write size /
blocksize) extents. We need to account for that as well.

> +   if (ret < 0)
> +     return ret;
> +   mapped_len += map->m_len;
> +   map->m_lblk += map->m_len;
> +   map->m_len = m_len - mapped_len;
> + } while (mapped_len < m_len);

> +
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> +
> + /*
> +  * We might have done some work in above loop. Let's ensure we query the
> +  * start of the physical extent, based on the origin m_lblk and m_len
> +  * and also ensure we were able to allocate the required range for doing
> +  * atomic write.
> +  */
> + ret = ext4_map_blocks(handle, inode, map, 0);

 Here, we are calling ext4_map_blocks() 3 times unnecessarily even if a
 single complete mapping is found. I think a better approach would be to
 just go for the map_blocks and then decide if we want to split. Also,
 factor out a function to do the zero out. So, something like:

  if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
          flags = EXT4_GET_BLOCKS_CREATE;
  else
          flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

  ret = ext4_map_blocks(handle, inode, map, flags);

  if (map->m_len < m_len) {
          map->m_len = m_len;

          /* do the zero out */
          ext4_zero_mixed_mappings(handle, inode, map);
          ext4_map_blocks(handle, inode, map, 0);

          WARN_ON(!(map->m_flags & EXT4_MAP_MAPPED) || map->m_len < m_len);
  }

 I think this covers the 5 cases you mentioned in the commit message, if
 I'm not missing anything. Also, this way we avoid the duplication for
 the non-zero-out cases, and the zero-out function can then be reused in
 case we want to do the same for forcealign atomic writes in the future.

Regards,
ojaswin

> + if (ret != m_len) {
> +   ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
> +       m_lblk, m_len);
> +   return -EINVAL;
> + }
> + return mapped_len;
> +}
> 
>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>           unsigned int flags)
> @@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>   else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>     m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> 
> - ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
> +   ret = ext4_map_blocks_atomic(handle, inode, map);
> + else
> +   ret = ext4_map_blocks(handle, inode, map, m_flags);
> 
>   /*
>    * We cannot fill holes in indirect tree based inodes as that could
> @@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   int ret;
>   struct ext4_map_blocks map;
>   u8 blkbits = inode->i_blkbits;
> + unsigned int m_len_orig;
> 
>   if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>     return -EINVAL;
> @@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   map.m_lblk = offset >> blkbits;
>   map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>         EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> + m_len_orig = map.m_len;
> 
>   if (flags & IOMAP_WRITE) {
>     /*
> @@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>      */
>     if (offset + length <= i_size_read(inode)) {
>       ret = ext4_map_blocks(NULL, inode, &map, 0);
> -     if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> -       goto out;
> +     /*
> +      * For atomic writes the entire requested length should
> +      * be mapped.
> +      */
> +     if (map.m_flags & EXT4_MAP_MAPPED) {
> +       if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> +          (flags & IOMAP_ATOMIC && ret >= m_len_orig))
> +         goto out;
> +     }
> +     map.m_len = m_len_orig;
>     }
>     ret = ext4_iomap_alloc(inode, &map, flags);
>   } else {
> @@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>    */
>   map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
> 
> + /*
> +  * Before returning to iomap, let's ensure the allocated mapping
> +  * covers the entire requested length for atomic writes.
> +  */
> + if (flags & IOMAP_ATOMIC) {
> +   if (map.m_len < (length >> blkbits)) {
> +     WARN_ON(1);
> +     return -EINVAL;
> +   }
> + }
>   ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> 
>   return 0;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a50e5c31b937..cbb24d535d59 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
>  /*
>   * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
>   * @sb: super block
> - * TODO: Later add support for bigalloc
>   */
>  static void ext4_atomic_write_init(struct super_block *sb)
>  {
>   struct ext4_sb_info *sbi = EXT4_SB(sb);
>   struct block_device *bdev = sb->s_bdev;
> + unsigned int blkbits = sb->s_blocksize_bits;
> + unsigned int clustersize = sb->s_blocksize;
> 
>   if (!bdev_can_atomic_write(bdev))
>     return;
> @@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
>   if (!ext4_has_feature_extents(sb))
>     return;
> 
> + if (ext4_has_feature_bigalloc(sb))
> +   clustersize = 1U << (sbi->s_cluster_bits + blkbits);
> +
>   sbi->s_awu_min = max(sb->s_blocksize,
>             bdev_atomic_write_unit_min_bytes(bdev));
> - sbi->s_awu_max = min(sb->s_blocksize,
> + sbi->s_awu_max = min(clustersize,
>             bdev_atomic_write_unit_max_bytes(bdev));
>   if (sbi->s_awu_min && sbi->s_awu_max &&
>       sbi->s_awu_min <= sbi->s_awu_max) {
> --
> 2.48.1
>