Message ID | cover.1726034272.git.ojaswin@linux.ibm.com (mailing list archive) |
---|---|
Series | ext4: Implement support for extsize hints |
On 11/09/2024 10:01, Ojaswin Mujoo wrote:
> This patchset implements extsize hint feature for ext4. Posting this RFC to get
> some early review comments on the design and implementation bits. This feature
> is similar to what we have in XFS too with some differences.
>
> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
> handling layer to do aligned allocations. We use allocation criteria 0
> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
> we try to align the logical start (m_lblk) and length (m_len) of the allocation
> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
> sure that we get the aligned physical start (m_pblk) as well. So in this way
> extsize can make sure that lblk, len and pblk all are aligned for the allocated
> extent w.r.t extsize.
>
> Note that extsize feature is just a hinting mechanism to ext4 multi-block
> allocator. That means that if we are unable to get an aligned allocation for
> some reason, then we drop this flag and continue with unaligned allocation to
> serve the request. However when we will add atomic/untorn writes support, then
> we will enforce the aligned allocation and can return -ENOSPC if aligned
> allocation was not successful.

A few questions/confirmations:
- You have no intention of adding an equivalent of forcealign, right?
- Would you also plan on using FS_IOC_FS(GET/SET)XATTR interface for
  enabling atomic writes on a per-inode basis?
- Can extsize be set at mkfs time?
- Is there any userspace support for this series available?
- how would/could extsize interact with bigalloc?

>
> Comparison with XFS extsize feature -
> =====================================
> 1. extsize in XFS is a hint for aligning only the logical start and the length
>    of the allocation v/s extsize on ext4 make sure the physical start of the
>    extent gets aligned as well.

note that forcealign with extsize aligns AG block also

only for atomic writes do we enforce the AG block is aligned to physical
block

>
> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>    hint. That means on XFS for eof allocations (with extsize hint) only logical
>    start gets aligned. However extsize hint in ext4 for eof allocation is not
>    supported in this version of the series.
>
> 3. XFS allows extsize to be set on file with no extents but delayed data.
>    However, ext4 doesn't allow that for simplicity. The user is expected to set
>    it on a file before changing its i_size.
>
> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>    primarily would like to support atomic writes with extsize.
>
> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>    inode field as it was simple and most flexible, since there might be more
>    features like atomic/untorn writes coming in future.
>
> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>    The same has been kept for EXT4 as well.
>
> Some TODOs:
> ===========
> 1. EOF allocations support can be added and can be kept similar to XFS

Note that EOF alignment for forcealign may change - it needs to be
discussed further.

Thanks,
John

>
> Rest of the design details can be found in the individual commit messages.
>
> Thoughts and suggestions are welcome!
>
> Ojaswin Mujoo (5):
>   ext4: add aligned allocation hint in mballoc
>   ext4: allow inode preallocation for aligned alloc
>   ext4: Support for extsize hint using FS_IOC_FS(GET/SET)XATTR
>   ext4: pass lblk and len explicitly to ext4_split_extent*()
>   ext4: Add extsize hint support
>
>  fs/ext4/ext4.h              |  12 +-
>  fs/ext4/ext4_jbd2.h         |  15 ++
>  fs/ext4/extents.c           | 224 ++++++++++++++----
>  fs/ext4/inode.c             | 442 +++++++++++++++++++++++++++++++++---
>  fs/ext4/ioctl.c             | 119 ++++++++++
>  fs/ext4/mballoc.c           | 126 ++++++++--
>  fs/ext4/super.c             |   1 +
>  include/trace/events/ext4.h |   2 +
>  8 files changed, 841 insertions(+), 100 deletions(-)
>
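
The alignment and fallback behaviour described in the cover letter can be sketched in a few lines of userspace C. This is only an illustration under stated assumptions: `struct alloc_req`, `extsize_align()`, `alloc_with_extsize()` and the toy allocator callbacks are hypothetical names, not code from the series, and the physical start (m_pblk) that CR_POWER2_ALIGNED takes care of is not modelled at all.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request type; the fields mirror m_lblk and m_len. */
struct alloc_req {
	uint32_t lblk;	/* logical start, in filesystem blocks */
	uint32_t len;	/* length, in filesystem blocks */
};

/* Widen [lblk, lblk + len) so that start and length are multiples of extsize. */
static struct alloc_req extsize_align(struct alloc_req req, uint32_t extsize)
{
	uint32_t start, end;

	/* The series only accepts power-of-2 extsize values. */
	assert(extsize != 0 && (extsize & (extsize - 1)) == 0);

	start = req.lblk & ~(extsize - 1);				/* round start down */
	end = (req.lblk + req.len + extsize - 1) & ~(extsize - 1);	/* round end up */

	return (struct alloc_req){ .lblk = start, .len = end - start };
}

/* Toy allocator callback: returns true if an aligned extent could be found. */
typedef bool (*try_aligned_fn)(struct alloc_req req);

static bool always_aligned(struct alloc_req req) { (void)req; return true; }
static bool never_aligned(struct alloc_req req) { (void)req; return false; }

/*
 * Hint semantics: try the aligned request first and fall back to the
 * original request. With must_align set (the planned atomic write case)
 * the fallback is refused and -ENOSPC is returned instead.
 */
static int alloc_with_extsize(struct alloc_req req, uint32_t extsize,
			      bool must_align, try_aligned_fn try_aligned,
			      struct alloc_req *out)
{
	struct alloc_req want = extsize_align(req, extsize);

	if (try_aligned(want)) {
		*out = want;
		return 0;
	}
	if (must_align)
		return -ENOSPC;
	*out = req;	/* hint dropped: serve the unaligned request */
	return 0;
}
```

With an extsize of 4 blocks, a request for blocks 5..6 is widened to blocks 4..7. If the toy allocator cannot satisfy that, the plain hint path falls back to the original 5..6 request, while the must_align path fails with -ENOSPC.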
John Garry <john.g.garry@oracle.com> writes:

> On 11/09/2024 10:01, Ojaswin Mujoo wrote:
>> This patchset implements extsize hint feature for ext4. Posting this RFC to get
>> some early review comments on the design and implementation bits. This feature
>> is similar to what we have in XFS too with some differences.
>>
>> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
>> handling layer to do aligned allocations. We use allocation criteria 0
>> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
>> we try to align the logical start (m_lblk) and length (m_len) of the allocation
>> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
>> sure that we get the aligned physical start (m_pblk) as well. So in this way
>> extsize can make sure that lblk, len and pblk all are aligned for the allocated
>> extent w.r.t extsize.
>>
>> Note that extsize feature is just a hinting mechanism to ext4 multi-block
>> allocator. That means that if we are unable to get an aligned allocation for
>> some reason, then we drop this flag and continue with unaligned allocation to
>> serve the request. However when we will add atomic/untorn writes support, then
>> we will enforce the aligned allocation and can return -ENOSPC if aligned
>> allocation was not successful.
>
> A few questions/confirmations:
> - You have no intention of adding an equivalent of forcealign, right?

extsize is just a hinting mechanism that too only for __allocation__
path. But for atomic writes we do require some form of forcealign (like
how we have in XFS). So we could either call this directly as atomic
write feature or may as well call this forcealign feature and make
atomic writes depend upon it, like how XFS is doing it.

I still haven't understood if there is/will be a user specifically for
forcealign other than atomic writes.

Since you asked, I am more curious to know if there is some more context
to your question?

>
> - Would you also plan on using FS_IOC_FS(GET/SET)XATTR interface for
>   enabling atomic writes on a per-inode basis?

Yes, that interface should indeed be kept same for EXT4 too.

>
> - Can extsize be set at mkfs time?

Good point. For now in this series, extsize can only be set using the
same ioctl on a per inode basis.

IIUC, XFS supports doing both right. We can do this on a per-inode basis
during ioctl or it also supports setting this during mkfs.xfs time.
(maybe xfsprogs only allows setting this at mkfs time for rtvolumes for now)

So if this is set during mkfs.xfs time and then by default all inodes will
have this extsize attribute value set right?

BTW, this brings me to another question that I had asked here too [1].
1. For XFS, atomic writes can only be enabled with a fresh mkfs.xfs -d
   atomic-writes=1 right?
2. For atomic writes to be enabled, we need all 3 features to be
   enabled during mkfs.xfs time itself right?
   i.e.
   "mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1"

[1]: https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/

>
> - Is there any userspace support for this series available?

Make sense to maybe provide a userspace support link too.
For now, a quick hack would be to just allow setting extsize hint for
other filesystems as well in xfs_io.

diff --git a/io/open.c b/io/open.c
index 15850b55..6407b7e8 100644
--- a/io/open.c
+++ b/io/open.c
@@ -980,7 +980,7 @@ open_init(void)
        extsize_cmd.args = _("[-D | -R] [extsize]");
        extsize_cmd.argmin = 0;
        extsize_cmd.argmax = -1;
-       extsize_cmd.flags = CMD_NOMAP_OK;
+       extsize_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
        extsize_cmd.oneline =
                _("get/set preferred extent size (in bytes) for the open file");
        extsize_cmd.help = extsize_help;

<e.g>
/dev/loop6 on /mnt1/test type ext4 (rw,relatime)

root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -fc "extsize" /mnt1/test/f1
[0] /mnt1/test/f1
root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -c "extsize 16384" /mnt1/test/f1
root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -c "extsize" /mnt1/test/f1
[16384] /mnt1/test/f1

>
> - how would/could extsize interact with bigalloc?
>

As of now it is kept disabled with bigalloc.

+       if (sbi->s_cluster_ratio > 1) {
+               msg = "Can't use extsize hint with bigalloc";
+               err = -EINVAL;
+               goto error;
+       }

>>
>> Comparison with XFS extsize feature -
>> =====================================
>> 1. extsize in XFS is a hint for aligning only the logical start and the length
>>    of the allocation v/s extsize on ext4 make sure the physical start of the
>>    extent gets aligned as well.
>
> note that forcealign with extsize aligns AG block also

Can you expand that on a bit. You mean during mkfs.xfs time we ensure
agblock boundaries are extsize aligned?

>
> only for atomic writes do we enforce the AG block is aligned to physical
> block
>

If you could expand that a bit please? You meant during mkfs.xfs
time for atomic writes we ensure ag block start boundaries are extsize aligned?

>>
>> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>>    hint. That means on XFS for eof allocations (with extsize hint) only logical
>>    start gets aligned. However extsize hint in ext4 for eof allocation is not
>>    supported in this version of the series.
>>
>> 3. XFS allows extsize to be set on file with no extents but delayed data.
>>    However, ext4 doesn't allow that for simplicity. The user is expected to set
>>    it on a file before changing its i_size.
>>
>> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>>    primarily would like to support atomic writes with extsize.
>>
>> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>>    inode field as it was simple and most flexible, since there might be more
>>    features like atomic/untorn writes coming in future.
>>
>> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>>    The same has been kept for EXT4 as well.
>>
>> Some TODOs:
>> ===========
>> 1. EOF allocations support can be added and can be kept similar to XFS
>
> Note that EOF alignment for forcealign may change - it needs to be
> discussed further.

Sure, thanks for pointing that out.
I guess you are referring to mainly the truncate related EOF alignment change
required with forcealign for XFS.

-ritesh
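
For readers without a patched xfs_io, the same hint can be set directly through the FS_IOC_FSGETXATTR / FS_IOC_FSSETXATTR ioctls that the series hooks into. The following is a minimal sketch, assuming a Linux system with <linux/fs.h>; the helper names `fsxattr_set_extsize()` and `set_extsize_hint()` are hypothetical, and whether a given filesystem accepts the call depends on the kernel in use.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fill in the fields that carry an extent size hint for a regular file. */
static void fsxattr_set_extsize(struct fsxattr *fa, unsigned int extsize_bytes)
{
	assert(fa != NULL);
	fa->fsx_xflags |= FS_XFLAG_EXTSIZE;
	fa->fsx_extsize = extsize_bytes;
}

/*
 * Read-modify-write the inode attributes of path.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_extsize_hint(const char *path, unsigned int extsize_bytes)
{
	struct fsxattr fa;
	int fd, ret, saved_errno;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	/* Fetch the current attributes so unrelated flags are preserved. */
	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fa);
	if (ret == 0) {
		fsxattr_set_extsize(&fa, extsize_bytes);
		ret = ioctl(fd, FS_IOC_FSSETXATTR, &fa);
	}

	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);
	errno = saved_errno;
	return ret ? -1 : 0;
}
```

Calling `set_extsize_hint("/mnt1/test/f1", 16384)` would be the rough equivalent of the `extsize 16384` command in the session above.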
On 13/09/2024 11:54, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry@oracle.com> writes:
>
>> On 11/09/2024 10:01, Ojaswin Mujoo wrote:
>>> This patchset implements extsize hint feature for ext4. Posting this RFC to get
>>> some early review comments on the design and implementation bits. This feature
>>> is similar to what we have in XFS too with some differences.
>>>
>>> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
>>> handling layer to do aligned allocations. We use allocation criteria 0
>>> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
>>> we try to align the logical start (m_lblk) and length (m_len) of the allocation
>>> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
>>> sure that we get the aligned physical start (m_pblk) as well. So in this way
>>> extsize can make sure that lblk, len and pblk all are aligned for the allocated
>>> extent w.r.t extsize.
>>>
>>> Note that extsize feature is just a hinting mechanism to ext4 multi-block
>>> allocator. That means that if we are unable to get an aligned allocation for
>>> some reason, then we drop this flag and continue with unaligned allocation to
>>> serve the request. However when we will add atomic/untorn writes support, then
>>> we will enforce the aligned allocation and can return -ENOSPC if aligned
>>> allocation was not successful.
>>
>> A few questions/confirmations:
>> - You have no intention of adding an equivalent of forcealign, right?
>
> extsize is just a hinting mechanism that too only for __allocation__
> path. But for atomic writes we do require some form of forcealign (like
> how we have in XFS). So we could either call this directly as atomic
> write feature or may as well call this forcealign feature and make
> atomic writes depend upon it, like how XFS is doing it.
>
> I still haven't understood if there is/will be a user specifically for
> forcealign other than atomic writes.
>
> Since you asked, I am more curious to know if there is some more context
> to your question?

As Darrick mentioned at the following, forcealign could be used for DAX:
https://lore.kernel.org/linux-xfs/170404855884.1770028.10371509002317647981.stgit@frogsfrogsfrogs/

>
>>
>> - Would you also plan on using FS_IOC_FS(GET/SET)XATTR interface for
>>   enabling atomic writes on a per-inode basis?
>
> Yes, that interface should indeed be kept same for EXT4 too.
>
>>
>> - Can extsize be set at mkfs time?
>
> Good point. For now in this series, extsize can only be set using the
> same ioctl on a per inode basis.
>
> IIUC, XFS supports doing both right. We can do this on a per-inode basis
> during ioctl or it also supports setting this during mkfs.xfs time.

Right

> (maybe xfsprogs only allows setting this at mkfs time for rtvolumes for now)

extsize hint can already be set at mkfs time for both rtvol and !rtvol
today.

>
> So if this is set during mkfs.xfs time and then by default all inodes will
> have this extsize attribute value set right?

Right

But there is still the option to set this later with xfs_io -c "extsize"
per-inode.

>
> BTW, this brings me to another question that I had asked here too [1].
> 1. For XFS, atomic writes can only be enabled with a fresh mkfs.xfs -d
>    atomic-writes=1 right?

Correct

Setting atomic-writes=1 enables the feature in the SB

> 2. For atomic writes to be enabled, we need all 3 features to be
>    enabled during mkfs.xfs time itself right?

Right, that is how it is currently done. But you could easily set
extsize=4K at mkfs time so that not all inodes have a 16KB extsize, as
in the example below. In this case, certain atomic write inodes could
have their extsize increased to 16KB.

> i.e.
> "mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1"
>
> [1]: https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/
>
>>
>> - Is there any userspace support for this series available?
>
> Make sense to maybe provide a userspace support link too.
> For now, a quick hack would be to just allow setting extsize hint for
> other filesystems as well in xfs_io.
>
> diff --git a/io/open.c b/io/open.c
> index 15850b55..6407b7e8 100644
> --- a/io/open.c
> +++ b/io/open.c
> @@ -980,7 +980,7 @@ open_init(void)
>         extsize_cmd.args = _("[-D | -R] [extsize]");
>         extsize_cmd.argmin = 0;
>         extsize_cmd.argmax = -1;
> -       extsize_cmd.flags = CMD_NOMAP_OK;
> +       extsize_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
>         extsize_cmd.oneline =
>                 _("get/set preferred extent size (in bytes) for the open file");
>         extsize_cmd.help = extsize_help;
>
> <e.g>
> /dev/loop6 on /mnt1/test type ext4 (rw,relatime)
>
> root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -fc "extsize" /mnt1/test/f1
> [0] /mnt1/test/f1
> root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -c "extsize 16384" /mnt1/test/f1
> root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -c "extsize" /mnt1/test/f1
> [16384] /mnt1/test/f1

ok

>
>
>>
>> - how would/could extsize interact with bigalloc?
>>
>
> As of now it is kept disabled with bigalloc.
>
> +       if (sbi->s_cluster_ratio > 1) {
> +               msg = "Can't use extsize hint with bigalloc";
> +               err = -EINVAL;
> +               goto error;
> +       }
>
>
>>>
>>> Comparison with XFS extsize feature -
>>> =====================================
>>> 1. extsize in XFS is a hint for aligning only the logical start and the length
>>>    of the allocation v/s extsize on ext4 make sure the physical start of the
>>>    extent gets aligned as well.
>>
>> note that forcealign with extsize aligns AG block also
>
> Can you expand that on a bit. You mean during mkfs.xfs time we ensure
> agblock boundaries are extsize aligned?

Yes, see align_ag_geometry() at
https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/

>
>>
>> only for atomic writes do we enforce the AG block is aligned to physical
>> block
>>
>
> If you could expand that a bit please? You meant during mkfs.xfs
> time for atomic writes we ensure ag block start boundaries are extsize aligned?

We do this for forcealign with the extsize value supplied at mkfs time.

There are 2x things to consider about this:
- mkfs-specified extsize need not necessarily be a power-of-2
- even if this mkfs-specified extsize is a power-of-2, attempting to
  increase extsize for an inode enabled for atomic writes may be
  restricted, as the new extsize may not align with the AG count.

For example, extsize was 64KB and AG count = 16400 FSB (1025 * 64KB),
then we cannot enable an inode for atomic writes with extsize = 128KB, as
the disk block would not be aligned with the AG block.

>
>
>>>
>>> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>>>    hint. That means on XFS for eof allocations (with extsize hint) only logical
>>>    start gets aligned. However extsize hint in ext4 for eof allocation is not
>>>    supported in this version of the series.
>>>
>>> 3. XFS allows extsize to be set on file with no extents but delayed data.
>>>    However, ext4 doesn't allow that for simplicity. The user is expected to set
>>>    it on a file before changing its i_size.
>>>
>>> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>>>    primarily would like to support atomic writes with extsize.
>>>
>>> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>>>    inode field as it was simple and most flexible, since there might be more
>>>    features like atomic/untorn writes coming in future.
>>>
>>> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>>>    The same has been kept for EXT4 as well.
>>>
>>> Some TODOs:
>>> ===========
>>> 1. EOF allocations support can be added and can be kept similar to XFS
>>
>> Note that EOF alignment for forcealign may change - it needs to be
>> discussed further.
>
> Sure, thanks for pointing that out.
> I guess you are referring to mainly the truncate related EOF alignment change
> required with forcealign for XFS.
>

Thanks,
John
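
John's 64KB versus 128KB example can be checked with simple arithmetic. The sketch below assumes 4KB filesystem blocks and reads "AG count" as the AG size in filesystem blocks; both are assumptions, and the function names are hypothetical rather than taken from xfsprogs. The disk block calculation is a deliberately simplified linear layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FSB_BYTES 4096u	/* assumption: 4KB filesystem blocks */

/* True if every AG start lands on an extsize boundary on disk. */
static bool ag_size_fits_extsize(uint32_t ag_blocks, uint32_t extsize_bytes)
{
	uint32_t extsize_blocks;

	assert(extsize_bytes != 0 && extsize_bytes % FSB_BYTES == 0);
	extsize_blocks = extsize_bytes / FSB_BYTES;
	return ag_blocks % extsize_blocks == 0;
}

/* Simplified linear disk block number of block agbno inside AG agno. */
static uint64_t disk_block(uint32_t agno, uint32_t agbno, uint32_t ag_blocks)
{
	return (uint64_t)agno * ag_blocks + agbno;
}
```

With an AG size of 16400 blocks, 64KB (16 blocks) divides the AG size evenly, so an extent aligned within any AG is also aligned on disk. 128KB (32 blocks) does not: block 0 of AG 1 sits at disk block 16400, which is 16 blocks past a 128KB boundary.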
On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote:
> This patchset implements extsize hint feature for ext4. Posting this RFC to get
> some early review comments on the design and implementation bits. This feature
> is similar to what we have in XFS too with some differences.
>
> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
> handling layer to do aligned allocations. We use allocation criteria 0
> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
> we try to align the logical start (m_lblk) and length (m_len) of the allocation
> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
> sure that we get the aligned physical start (m_pblk) as well. So in this way
> extsize can make sure that lblk, len and pblk all are aligned for the allocated
> extent w.r.t extsize.
>
> Note that extsize feature is just a hinting mechanism to ext4 multi-block
> allocator. That means that if we are unable to get an aligned allocation for
> some reason, then we drop this flag and continue with unaligned allocation to
> serve the request. However when we will add atomic/untorn writes support, then
> we will enforce the aligned allocation and can return -ENOSPC if aligned
> allocation was not successful.
>
> Comparison with XFS extsize feature -
> =====================================
> 1. extsize in XFS is a hint for aligning only the logical start and the length
>    of the allocation v/s extsize on ext4 make sure the physical start of the
>    extent gets aligned as well.

What happens when you can't align the physical start of the extent?
It fails the allocation with ENOSPC?

For XFS, the existing extent size behaviour is a hint, and so we
ignore the hint if we cannot perform the allocation with the
suggested alignment. i.e. We should not fail an allocation with an
extent size hint until we are actually very near ENOSPC.

With the new force-align feature, the physical alignment within an
AG gets aligned to the extent size. In this case, if we can't find
an aligned free extent to allocate, we fail the allocation (ENOSPC).
Hence with forced alignment, we can have ENOSPC occur when there are
large amounts of free space available in the filesystem.

This is almost certainly what most people -don't want-, but it is a
requirement for atomic writes. To make matters worse, this behaviour
will almost certainly get worse as filesystem ages and free space
slowly fragments over time.

IOWs, by making the ext4 extsize have forced alignment semantics by
default, it means users will see ENOSPC a lot more frequently and
in situations where it is most definitely not expected.

We also have to keep in mind that there are applications out there
that set and use extent size hints, and so enabling extsize in ext4
will result in those applications silently starting to use them. If
ext4 supporting extsize hints drastically changes the behaviour of
the filesystem then that is going to cause significant unexpected
regressions for users as they upgrade kernels and filesystems.

Hence I strongly suggest that ext4 implements extent size hints in
the same way that XFS does. i.e. unless forced alignment has been
enabled for the inode, extsize is just a hint that gets discarded if
aligned allocation does not succeed.

Behaviour such as extent size hinting *should* be the same across
all filesystems that provide this functionality. This makes using
extent size hints much easier for users, admins and application
developers. The last thing I want to hear is application devs tell
me at conferences that "we don't use extent size hints anymore
because ext4..."

> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>    hint. That means on XFS for eof allocations (with extsize hint) only logical
>    start gets aligned.

I'm not sure I understand what you are saying here.
XFS does extsize alignment of both the start and end of post-eof
extents the same as it does for extents within EOF. For example:

# xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec)
foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..7]:          256504..256511    0 (256504..256511)       8 000000
   1: [8..31]:         256512..256535    0 (256512..256535)      24 010000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent

There's a 4k written extent at 0, and a 12k unwritten extent
beyond EOF at 4k. I.e. we have an extent of 16kB as the hint
required that is correctly aligned beyond EOF.

If I then write another 4k at 20k (beyond both EOF and the unwritten
extent beyond EOF):

# xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec)
wrote 4096/4096 bytes at offset 20480
4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec)
foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..7]:          180000..180007    0 (180000..180007)       8 000000
   1: [8..39]:         180008..180039    0 (180008..180039)      32 010000
   2: [40..47]:        180040..180047    0 (180040..180047)       8 000000
   3: [48..63]:        180048..180063    0 (180048..180063)      16 010000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent

You can see we did contiguous allocation of another 16kB at offset
16kB, and then wrote to 20k for 4kB.. i.e. the new extent was
correctly aligned at both sides as the extsize hint says it should
be....

> However extsize hint in ext4 for eof allocation is not
> supported in this version of the series.

If you can't do extsize aligned allocations for EOF extension, then
how do applications use atomic writes to atomically extend the file?

> 3. XFS allows extsize to be set on file with no extents but delayed data.

It does?

<looks>

Yep, it doesn't check ip->i_delayed_blks is zero when changing
extsize.

I think that's simply a bug, not intended behaviour, because
delalloc will not have reserved space for the extsize hint rounding
needed when writeback occurs. Can you send a patch to add this
check?

> However, ext4 doesn't allow that for simplicity. The user is expected to set
> it on a file before changing its i_size.

We don't actually care about i_size in XFS - the determining factor
is whether there are extents allocated on disk. i.e. we can truncate
up and then set the extent size hint because there are no extents
allocated even though the size is non-zero.

There are almost certainly applications out there that change extent
size after truncating to a non-zero size, so this needs to work on
ext4 the same way it does on XFS. Otherwise people are going to
complain that their applications suddenly stop working properly on
ext4....

> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>    primarily would like to support atomic writes with extsize.

Yes, ext4 can make that restriction if desired.

Keep in mind that the XFS atomic write support is still evolving,
and I think the way we are using extent size hints isn't fully
solidified yet.

Indeed, I think that we can allow non-power-of-2 extent sizes for
atomic writes, because integer multiples of the atomic write unit
will still ensure that physical extents are properly aligned for
atomic writes to succeed. e.g. 24kB extent size is compatible with
8kB atomic write sizes.

To make that work efficiently unwritten extent boundaries need to be
maintained at atomic write alignments (8kB), not extent size
alignment (24kB), but other than that I don't think anything else is
needed....

This is desirable because it will allow extent size hints to remain
usable for their original purposes even with atomic writes on XFS.
i.e. fragmentation minimisation for small random DIO write workloads
(exactly the sort of IO you'd consider using atomic writes for!),
alignment of extents to [non-power-of-2] RAID stripe geometry, etc.

> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>    inode field as it was simple and most flexible, since there might be more
>    features like atomic/untorn writes coming in future.

Does that mean you can query and set it through the user xattr
interfaces? If so, how do you enforce the values users set are
correct?

> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>    The same has been kept for EXT4 as well.

That's an internal XFS implementation detail that you don't need to
replicate. Historically speaking, we didn't use unwritten extents
for delayed allocation and so we couldn't do within-EOF extsize
unaligned writes without adding special additional zero-around code to
ensure that we never exposed stale data to userspace from the extra
allocation that the data write did not cover.

We now use unwritten extents for delalloc conversion, so this stale
data exposure issue no longer exists. We should really switch this
code back to using delalloc because it is much faster and less
fragmentation prone than direct extsize allocation....

-Dave.
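
The rounding visible in Dave's bmap output, and his point about non-power-of-2 hints, can be reproduced with a small calculation. This is a sketch only: `extsize_cover()` and `extsize_fits_atomic_unit()` are hypothetical helper names, offsets are in bytes, and the code uses division rather than bit masks so that a non-power-of-2 extsize such as 24kB also works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte range [start, end) with an exclusive end. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/* Smallest extsize-aligned range that covers a write of len bytes at offset. */
static struct byte_range extsize_cover(uint64_t offset, uint64_t len,
				       uint64_t extsize)
{
	struct byte_range r;

	assert(extsize != 0 && len != 0);
	r.start = offset / extsize * extsize;				/* round down */
	r.end = (offset + len + extsize - 1) / extsize * extsize;	/* round up */
	return r;
}

/* An extsize is usable for atomic writes if it is a multiple of the write unit. */
static bool extsize_fits_atomic_unit(uint64_t extsize, uint64_t atomic_unit)
{
	assert(atomic_unit != 0);
	return extsize % atomic_unit == 0;
}
```

With a 16k hint, the 4k write at offset 0 covers bytes 0 to 16k and the 4k write at 20k covers bytes 16k to 32k, which matches the [0..31] and [32..63] ranges (in 512-byte units) that the extents in the two examples add up to. A 24kB extsize passes the check against an 8kB atomic write unit but fails against a 16kB one.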
On Wed, Sep 18, 2024 at 07:54:27PM +1000, Dave Chinner wrote:
> On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote:
> > This patchset implements extsize hint feature for ext4. Posting this RFC to get
> > some early review comments on the design and implementation bits. This feature
> > is similar to what we have in XFS too with some differences.
> >
> > extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
> > handling layer to do aligned allocations. We use allocation criteria 0
> > (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
> > we try to align the logical start (m_lblk) and length (m_len) of the allocation
> > to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
> > sure that we get the aligned physical start (m_pblk) as well. So in this way
> > extsize can make sure that lblk, len and pblk all are aligned for the allocated
> > extent w.r.t extsize.
> >
> > Note that extsize feature is just a hinting mechanism to ext4 multi-block
> > allocator. That means that if we are unable to get an aligned allocation for
> > some reason, then we drop this flag and continue with unaligned allocation to
> > serve the request. However when we will add atomic/untorn writes support, then
> > we will enforce the aligned allocation and can return -ENOSPC if aligned
> > allocation was not successful.
> >
> > Comparison with XFS extsize feature -
> > =====================================
> > 1. extsize in XFS is a hint for aligning only the logical start and the length
> >    of the allocation v/s extsize on ext4 make sure the physical start of the
> >    extent gets aligned as well.
>
> What happens when you can't align the physical start of the extent?
> It fails the allocation with ENOSPC?

Hi Dave, thanks for the review.

No, ext4 treats it as a hint as well and we fall back to nonaligned
allocation.

>
> For XFS, the existing extent size behaviour is a hint, and so we
> ignore the hint if we cannot perform the allocation with the
> suggested alignment. i.e. We should not fail an allocation with an
> extent size hint until we are actually very near ENOSPC.
>
> With the new force-align feature, the physical alignment within an
> AG gets aligned to the extent size. In this case, if we can't find
> an aligned free extent to allocate, we fail the allocation (ENOSPC).
> Hence with forced alignment, we can have ENOSPC occur when there are
> large amounts of free space available in the filesystem.
>
> This is almost certainly what most people -don't want-, but it is a
> requirement for atomic writes. To make matters worse, this behaviour
> will almost certainly get worse as filesystem ages and free space
> slowly fragments over time.
>
> IOWs, by making the ext4 extsize have forced alignment semantics by
> default, it means users will see ENOSPC a lot more frequently and
> in situations where it is most definitely not expected.
>
> We also have to keep in mind that there are applications out there
> that set and use extent size hints, and so enabling extsize in ext4
> will result in those applications silently starting to use them. If
> ext4 supporting extsize hints drastically changes the behaviour of
> the filesystem then that is going to cause significant unexpected
> regressions for users as they upgrade kernels and filesystems.
>
> Hence I strongly suggest that ext4 implements extent size hints in
> the same way that XFS does. i.e. unless forced alignment has been
> enabled for the inode, extsize is just a hint that gets discarded if
> aligned allocation does not succeed.
>
> Behaviour such as extent size hinting *should* be the same across
> all filesystems that provide this functionality. This makes using
> extent size hints much easier for users, admins and application
> developers. The last thing I want to hear is application devs tell
> me at conferences that "we don't use extent size hints anymore
> because ext4..."

Yes, makes sense :)

Nothing to worry here tho as ext4 also treats the extsize value as a
hint exactly like XFS. We have tried to keep the behavior as similar
to XFS as possible for the exact reasons you mentioned.

And yes, we do plan to add a forcealign (or similar) feature for ext4 as
well for atomic writes which would change the hint to a mandate.

>
> > 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
> >    hint. That means on XFS for eof allocations (with extsize hint) only logical
> >    start gets aligned.
>
> I'm not sure I understand what you are saying here. XFS does extsize
> alignment of both the start and end of post-eof extents the same as
> it does for extents within EOF. For example:
>
> # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec)
> foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
>    0: [0..7]:          256504..256511    0 (256504..256511)       8 000000
>    1: [8..31]:         256512..256535    0 (256512..256535)      24 010000
>  FLAG Values:
>     0100000 Shared extent
>     0010000 Unwritten preallocated extent
>
> There's a 4k written extent at 0, and a 12k unwritten extent
> beyond EOF at 4k. I.e. we have an extent of 16kB as the hint
> required that is correctly aligned beyond EOF.
> > If I then write another 4k at 20k (beyond both EOF and the unwritten > extent beyond EOF: > > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo > wrote 4096/4096 bytes at offset 0 > 4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec) > wrote 4096/4096 bytes at offset 20480 > 4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec) > foo: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..7]: 180000..180007 0 (180000..180007) 8 000000 > 1: [8..39]: 180008..180039 0 (180008..180039) 32 010000 > 2: [40..47]: 180040..180047 0 (180040..180047) 8 000000 > 3: [48..63]: 180048..180063 0 (180048..180063) 16 010000 > FLAG Values: > 0100000 Shared extent > 0010000 Unwritten preallocated extent > > You can see we did contiguous allocation of another 16kB at offset > 16kB, and then wrote to 20k for 4kB.. i.e. the new extent was > correctly aligned at both sides as the extsize hint says it should > be.... Sorry for the confusion Dave. What was meant is that XFS would indeed respect extsize hint for EOF allocations but if we close the file, since we trim the blocks past EOF upon close, we would only see that the lstart is aligned but the end would not. For example: $ xfs_io -c "open -dft foo" -c "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" -c "close" -c "open foo" -c "bmap -vvp" wrote 4096/4096 bytes at offset 0 4 KiB, 1 ops; 0.0003 sec (9.864 MiB/sec and 2525.2525 ops/sec) foo: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..7]: 384..391 0 (384..391) 8 000000 1: [8..31]: 392..415 0 (392..415) 24 010000 FLAG Values: 0010000 Unwritten preallocated extent foo: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..7]: 384..391 0 (384..391) 8 000000 > > > However extsize hint in ext4 for eof allocation is not > > supported in this version of the series. 
> > If you can't do extsize aligned allocations for EOF extension, then > how do applications use atomic writes to atomically extend the file? > In this particular RFC we can't and we'll have to go via the 'set extsize hint and then truncate' route. But we do plan to add this in the next revision. > > 3. XFS allows extsize to be set on file with no extents but delayed data. > > It does? > > <looks> > > Yep, it doesn't check ip->i_delayed_blks is zero when changing > extsize. > > I think that's simply a bug, not intended behaviour, because > delalloc will not have reserved space for the extsize hint rounding > needed when writeback occurs. Can you send a patch to add this > check? Got it, sure I can send a patch for this. > > > However, ext4 doesn't allow that for simplicity. The user is expected to set > > it on a file before changing its i_size. > > We don't actually care about i_size in XFS - the determining factor > is whether there are extents allocated on disk. i.e. we can truncate > up and then set the extent size hint because there are no extents > allocated even though the size is non-zero. That's right Dave, my intention was also to just make sure that before setting extsize: 1. We don't have delayed allocs in flight 2. We don't have blocks allocated on disk So ideally truncate followed by extsize set should have worked. And in that sense, you are right that using i_size (or i_disksize in ext4) is not correct. I will fix this behavior in the next revision, thanks. > > There are almost certainly applications out there that change extent > size after truncating to a non-zero size, so this needs to work on > ext4 the same way it does on XFS. Otherwise people are going to > complain that their applications suddenly stop working properly on > ext4.... > > > 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we > > primarily would like to support atomic writes with extsize. > > Yes, ext4 can make that restriction if desired. 
> > Keep in mind that the XFS atomic write support is still evolving, > and I think the way we are using extent size hints isn't fully > solidified yet. > > Indeed, I think that we can allow non-power-of-2 extent sizes for > atomic writes, because integer multiples of the atomic write unit > will still ensure that physical extents are properly aligned for > atomic writes to succeed. e.g. 24kB extent size is compatible with > 8kB atomic write sizes. > > To make that work efficiently unwritten extent boundaries need to be > maintained at atomic write alignments (8kB), not extent size > alignment (24kB), but other than that I don't think anything else is > needed.... > > This is desirable because it will allow extent size hints to remain > usable for their original purposes even with atomic writes on XFS. > i.e. fragmentation minimisation for small random DIO write workloads > (exactly the sort of IO you'd consider using atomic writes for!), > alignment of extents to [non-power-of-2] RAID stripe geometry, etc. Got it, I agree that extsize doesn't **have** to be power-of-2 but for this revision we have kept it that way because getting power-of-2 aligned blocks needs almost no changes in the ext4 allocator. However, it shouldn't be a problem to support non-power-of-2 blocks. We already have some aligned allocation logic for the RAID use case which can be leveraged. > > > 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an > > inode field as it was simple and most flexible, since there might be more > > features like atomic/untorn writes coming in future. > > Does that mean you can query and set it through the user xattr > interfaces? If so, how do you enforce the values users set are > correct? AFAICU, ext4 (and XFS) doesn't provide a handler for system xattrs, so it's not possible for a user to get/set it from the xattr interface. They'd have to go through the ioctl. 
$ getfattr -n system.extsize test
test: system.extsize: Operation not supported
That being said, if in future we for some reason want to add a handler for system xattrs to ext4, we'll have to be mindful of making sure users can't get or set extsize through the xattr interfaces. > > > 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint. > > The same has been kept for EXT4 as well. > > That's an internal XFS implementation detail that you don't need to > replicate. Historically speaking, we didn't use unwritten extents > for delayed allocation and so we couldn't do within-EOF extsize > unaligned writes without adding special additional zero-around code to > ensure that we never exposed stale data to userspace from the extra > allocation that the data write did not cover. > > We now use unwritten extents for delalloc conversion, so this stale > data exposure issue no longer exists. We should really switch this > code back to using delalloc because it is much faster and less > fragmentation prone than direct extsize allocation.... Thanks for the context Dave, I didn't implement it this time around as I wanted to be sure what challenges XFS faced and ext4 will face while trying extsize with delalloc. I think this clears things up and this can be added in the next revisions. > > -Dave. > -- > Dave Chinner > david@fromorbit.com
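For anyone following the alignment arithmetic in the exchange above, here is a rough sketch in Python. It is only a model of the rounding being discussed, not code from XFS or ext4. The helper names (extsize_aligned_range, to_sectors, hint_fits_atomic_unit) are made up for illustration, and the 512-byte unit is assumed from the xfs_io bmap output quoted above.

```python
KiB = 1024
SECTOR = 512  # assumption: bmap ranges above are in 512-byte units


def extsize_aligned_range(offset, length, extsize):
    """Byte range [start, end) that covers the write, rounded out to extsize."""
    start = (offset // extsize) * extsize
    end = -(-(offset + length) // extsize) * extsize  # round up
    return start, end


def to_sectors(start, end):
    """Convert a byte range [start, end) to an inclusive sector range."""
    return start // SECTOR, end // SECTOR - 1


def hint_fits_atomic_unit(extsize, atomic_unit):
    """True if extsize is an integer multiple of the atomic write unit."""
    return extsize % atomic_unit == 0


# pwrite 0 4k with a 16k hint: sectors 0..31 (8 written + 24 unwritten)
print(to_sectors(*extsize_aligned_range(0, 4 * KiB, 16 * KiB)))
# pwrite 20k 4k with a 16k hint: sectors 32..63
print(to_sectors(*extsize_aligned_range(20 * KiB, 4 * KiB, 16 * KiB)))
# 24kB extent size with an 8kB atomic write unit, as in the example above
print(hint_fits_atomic_unit(24 * KiB, 8 * KiB))
```

The two sector ranges line up with the 16kB-aligned regions visible in the bmap listings, and the last check is the "integer multiple" condition mentioned for non-power-of-2 extent sizes.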
On Thu, Sep 19, 2024 at 12:43:17PM +0530, Ojaswin Mujoo wrote: > On Wed, Sep 18, 2024 at 07:54:27PM +1000, Dave Chinner wrote: > > On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote: > > Behaviour such as extent size hinting *should* be the same across > > all filesystems that provide this functionality. This makes using > > extent size hints much easier for users, admins and application > > developers. The last thing I want to hear is application devs tell > > me at conferences that "we don't use extent size hints anymore > > because ext4..." > > Yes, makes sense :) > > Nothing to worry here tho as ext4 also treats the extsize value as a > hint exactly like XFS. We have tried to keep the behavior as similar > to XFS as possible for the exact reasons you mentioned. It is worth explicitly stating this (i.e. all the behaviours that are the same) in the design documentation rather than just the corner cases where it is different. It was certainly not clear how failures were treated. > And yes, we do plan to add a forcealign (or similar) feature for ext4 as > well for atomic writes which would change the hint to a mandate Ok. That should be stated, too. FWIW, it would be a good idea to document this all in the kernel documentation itself, so there is a guideline for other filesystems to implement the same behaviour. e.g. in Documentation/filesystems/extent-size-hints.rst > > > 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize > > > hint. That means on XFS for eof allocations (with extsize hint) only logical > > > start gets aligned. > > > > I'm not sure I understand what you are saying here. XFS does extsize > > alignment of both the start and end of post-eof extents the same as > > it does for extents within EOF. 
For example: > > > > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo > > wrote 4096/4096 bytes at offset 0 > > 4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec) > > foo: > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > 0: [0..7]: 256504..256511 0 (256504..256511) 8 000000 > > 1: [8..31]: 256512..256535 0 (256512..256535) 24 010000 > > FLAG Values: > > 0100000 Shared extent > > 0010000 Unwritten preallocated extent > > > > There's a 4k written extent at 0, and a 12k unwritten extent > > beyond EOF at 4k. I.e. we have an extent of 16kB as the hint > > required that is correctly aligned beyond EOF. > > > > If I then write another 4k at 20k (beyond both EOF and the unwritten > > extent beyond EOF: > > > > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo > > wrote 4096/4096 bytes at offset 0 > > 4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec) > > wrote 4096/4096 bytes at offset 20480 > > 4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec) > > foo: > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > 0: [0..7]: 180000..180007 0 (180000..180007) 8 000000 > > 1: [8..39]: 180008..180039 0 (180008..180039) 32 010000 > > 2: [40..47]: 180040..180047 0 (180040..180047) 8 000000 > > 3: [48..63]: 180048..180063 0 (180048..180063) 16 010000 > > FLAG Values: > > 0100000 Shared extent > > 0010000 Unwritten preallocated extent > > > > You can see we did contiguous allocation of another 16kB at offset > > 16kB, and then wrote to 20k for 4kB.. i.e. the new extent was > > correctly aligned at both sides as the extsize hint says it should > > be.... > > Sorry for the confusion Dave. What was meant is that XFS would indeed > respect extsize hint for EOF allocations but if we close the file, since > we trim the blocks past EOF upon close, we would only see that the > lstart is aligned but the end would not. 
Right, but that is desired behaviour, especially when extsize is large. i.e. when the file is closed it is an indication that the file will not be written again, so we don't need to keep post-eof blocks around for fragmentation prevention reasons. Removing post-EOF extents on close prevents large extsize hints from consuming lots of unused space on files that are never going to be written to again(*). That's user visible, and because it can cause premature ENOSPC, users will report this excessive space usage behaviour as a bug (and they are right). Hence removing post-eof extents on file close when extent size hints are in use comes under the guise of Good Behaviour To Have. (*) think about how much space is wasted if you clone a kernel git tree under a 1MB extent size hint directory. All those tiny header files now take up 1MB of space on disk.... Keep in mind that when the file is opened for write again, the extent size hint still gets applied to the new extents. If the extending write starts beyond the EOF extsize range, then the new extent after the hole at EOF will be fully extsize aligned, as expected. If the new write is exactly extending the file, then the new extents will not be extsize aligned - the start will be at the EOF block, and they will be extsize -length-. IOWs, the extent size is maintained, just the logical alignment is not exactly extsize aligned. This could be considered a bug, but it's never been an issue for anyone because, in XFS, physical extent alignment is separate (and maintained regardless of logical alignment) for extent size hint based allocations. Adding force-align will prevent this behaviour from occurring, as post-eof trimming will be done to extsize alignment, not to the EOF block. Hence open/close/open will not affect logical or physical alignment of force-align extents (and hence won't affect atomic writes). -Dave.
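To put rough numbers on the trimming behaviour described above, here is a small Python sketch. It is a simplified model under stated assumptions, not the XFS implementation: the function names are hypothetical, the file is assumed to be written from offset 0, and the 2 KiB file size and 4 KiB block size are illustrative values that do not come from the thread.

```python
KiB = 1024
MiB = 1024 * KiB


def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return -(-value // unit) * unit


def allocated_before_close(file_size, extsize):
    """Space held while the file is open: the hint rounds the allocation up."""
    return round_up(file_size, extsize)


def eof_trim_point(file_size, block_size, extsize, force_align=False):
    """Offset past which post-EOF blocks are dropped on close.

    Plain hint: trim back to the EOF block.
    force-align (as described above): trim to extsize alignment instead.
    """
    unit = extsize if force_align else block_size
    return round_up(file_size, unit)


# A tiny header file under a 1MB extent size hint directory
size, block, hint = 2 * KiB, 4 * KiB, 1 * MiB
held = allocated_before_close(size, hint)
kept = eof_trim_point(size, block, hint)
print(held, kept, held - kept)

# The 4k write with a 16k hint from the xfs_io example
print(eof_trim_point(4 * KiB, 4 * KiB, 16 * KiB))
print(eof_trim_point(4 * KiB, 4 * KiB, 16 * KiB, force_align=True))
```

In this model the plain hint keeps only the EOF block after close, which is why the reopened bmap shows a single 4k extent, while the force-align variant keeps the full extsize-aligned range so a later extending write stays aligned.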
On Fri, Sep 20, 2024 at 08:34:14AM +1000, Dave Chinner wrote: > On Thu, Sep 19, 2024 at 12:43:17PM +0530, Ojaswin Mujoo wrote: > > On Wed, Sep 18, 2024 at 07:54:27PM +1000, Dave Chinner wrote: > > > On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote: > > > Behaviour such as extent size hinting *should* be the same across > > > all filesystems that provide this functionality. This makes using > > > extent size hints much easier for users, admins and application > > > developers. The last thing I want to hear is application devs tell > > > me at conferences that "we don't use extent size hints anymore > > > because ext4..." > > > > Yes, makes sense :) > > > > Nothing to worry here tho as ext4 also treats the extsize value as a > > hint exactly like XFS. We have tried to keep the behavior as similar > > to XFS as possible for the exact reasons you mentioned. > > It is worth explicitly stating this (i.e. all the behaviours that > are the same) in the design documentation rather than just the > corner cases where it is different. It was certainly not clear how > failures were treated. Got it Dave, I did mention it in the actual commit 5/5 but I agree. I will update the cover letter to be more clear about the design in future revisions. > > > And yes, we do plan to add a forcealign (or similar) feature for ext4 as > > well for atomic writes which would change the hint to a mandate > > Ok. That should be stated, too. > > FWIW, it would be a good idea to document this all in the kernel > documentation itself, so there is a guideline for other filesystems > to implement the same behaviour. e.g. in > Documentation/filesystems/extent-size-hints.rst Okay makes sense, I can look into this as a next step. > > > > > 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize > > > > hint. That means on XFS for eof allocations (with extsize hint) only logical > > > > start gets aligned. > > > > > > I'm not sure I understand what you are saying here. 
XFS does extsize > > > alignment of both the start and end of post-eof extents the same as > > > it does for extents within EOF. For example: > > > > > > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo > > > wrote 4096/4096 bytes at offset 0 > > > 4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec) > > > foo: > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > 0: [0..7]: 256504..256511 0 (256504..256511) 8 000000 > > > 1: [8..31]: 256512..256535 0 (256512..256535) 24 010000 > > > FLAG Values: > > > 0100000 Shared extent > > > 0010000 Unwritten preallocated extent > > > > > > There's a 4k written extent at 0, and a 12k unwritten extent > > > beyond EOF at 4k. I.e. we have an extent of 16kB as the hint > > > required that is correctly aligned beyond EOF. > > > > > > If I then write another 4k at 20k (beyond both EOF and the unwritten > > > extent beyond EOF: > > > > > > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo > > > wrote 4096/4096 bytes at offset 0 > > > 4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec) > > > wrote 4096/4096 bytes at offset 20480 > > > 4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec) > > > foo: > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > 0: [0..7]: 180000..180007 0 (180000..180007) 8 000000 > > > 1: [8..39]: 180008..180039 0 (180008..180039) 32 010000 > > > 2: [40..47]: 180040..180047 0 (180040..180047) 8 000000 > > > 3: [48..63]: 180048..180063 0 (180048..180063) 16 010000 > > > FLAG Values: > > > 0100000 Shared extent > > > 0010000 Unwritten preallocated extent > > > > > > You can see we did contiguous allocation of another 16kB at offset > > > 16kB, and then wrote to 20k for 4kB.. i.e. the new extent was > > > correctly aligned at both sides as the extsize hint says it should > > > be.... > > > > Sorry for the confusion Dave. 
What was meant is that XFS would indeed > > respect extsize hint for EOF allocations but if we close the file, since > > we trim the blocks past EOF upon close, we would only see that the > > lstart is aligned but the end would not. > > Right, but that is desired behaviour, especially when extsize is > large. i.e. when the file is closed it is an indication that the > file will not be written again, so we don't need to keep post-eof > blocks around for fragmentation prevention reasons. > > Removing post-EOF extents on close prevents large extsize hints from > consuming lots of unused space on files that are never going to be > written to again(*). That's user visible, and because it can cause > premature ENOSPC, users will report this excessive space usage > behaviour as a bug (and they are right). Hence removing post-eof > extents on file close when extent size hints are in use comes under > the guise of Good Behaviour To Have. > > (*) think about how much space is wasted if you clone a kernel git > tree under a 1MB extent size hint directory. All those tiny header > files now take up 1MB of space on disk.... > > Keep in mind that when the file is opened for write again, the > extent size hint still gets applied to the new extents. If the > extending write starts beyond the EOF extsize range, then the new > extent after the hole at EOF will be fully extsize aligned, as > expected. > > If the new write is exactly extending the file, then the new extents > will not be extsize aligned - the start will be at the EOF block, > and they will be extsize -length-. IOWs, the extent size is > maintained, just the logical alignment is not exactly extsize > aligned. This could be considered a bug, but it's never been an > issue for anyone because, in XFS, physical extent alignment is > separate (and maintained regardless of logical alignment) for extent > size hint based allocations. 
> > Adding force-align will prevent this behaviour from occurring, as > post-eof trimming will be done to extsize alignment, not to the EOF > block. Hence open/close/open will not affect logical or physical > alignment of force-align extents (and hence won't affect atomic > writes). Thanks for the context, I will try to keep this behavior similar to XFS once we implement the EOF support for extsize hints in next revision. Regards, Ojaswin > > -Dave. > -- > Dave Chinner > david@fromorbit.com