Message ID: 20240124142645.9334-6-john.g.garry@oracle.com (mailing list archive)
State: Deferred, archived
Series: block atomic writes for XFS
On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote: > Ensure that when creating a mapping that we adhere to all the atomic > write rules. > > We check that the mapping covers the complete range of the write to ensure > that we'll be just creating a single mapping. > > Currently minimum granularity is the FS block size, but it should be > possibly to support lower in future. > > Signed-off-by: John Garry <john.g.garry@oracle.com> > --- > I am setting this as an RFC as I am not sure on the change in > xfs_iomap_write_direct() - it gives the desired result AFAICS. > > fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 41 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 18c8f168b153..758dc1c90a42 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -289,6 +289,9 @@ xfs_iomap_write_direct( > } > } > > + if (xfs_inode_atomicwrites(ip)) > + bmapi_flags = XFS_BMAPI_ZERO; Why do we want to write zeroes to the disk if we're allocating space even if we're not sending an atomic write? (This might want an explanation for why we're doing this at all -- it's to avoid unwritten extent conversion, which defeats hardware untorn writes.) I think we should support IOCB_ATOMIC when the mapping is unwritten -- the data will land on disk in an untorn fashion, the unwritten extent conversion on IO completion is itself atomic, and callers still have to set O_DSYNC to persist anything. Then we can avoid the cost of BMAPI_ZERO, because double-writes aren't free. > + > error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks, > rblocks, force, &tp); > if (error) > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin( > if (error) > goto out_unlock; > > + if (flags & IOMAP_ATOMIC) { > + xfs_filblks_t unit_min_fsb, unit_max_fsb; > + unsigned int unit_min, unit_max; > + > + xfs_get_atomic_write_attr(ip, &unit_min, &unit_max); > + unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min); > + unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max); > + > + if (!imap_spans_range(&imap, offset_fsb, end_fsb)) { > + error = -EINVAL; > + goto out_unlock; > + } > + > + if ((offset & mp->m_blockmask) || > + (length & mp->m_blockmask)) { > + error = -EINVAL; > + goto out_unlock; > + } > + > + if (imap.br_blockcount == unit_min_fsb || > + imap.br_blockcount == unit_max_fsb) { > + /* ok if exactly min or max */ > + } else if (imap.br_blockcount < unit_min_fsb || > + imap.br_blockcount > unit_max_fsb) { > + error = -EINVAL; > + goto out_unlock; > + } else if (!is_power_of_2(imap.br_blockcount)) { > + error = -EINVAL; > + goto out_unlock; > + } > + > + if (imap.br_startoff && > + imap.br_startoff & (imap.br_blockcount - 1)) { Not sure why we care about the file position, it's br_startblock that gets passed into the bio, not br_startoff. I'm also still not convinced that any of this validation is useful here. The block device stack underneath the filesystem can change at any time without any particular notice to the fs, so the only way to find out if the proposed IO would meet the alignment constraints is to submit_bio and see what happens. (The "one bio per untorn write request" thing in the direct-io.c patch sounds sane to me though.) --D > + error = -EINVAL; > + goto out_unlock; > + } > + } > + > if (imap_needs_cow(ip, flags, &imap, nimaps)) { > error = -EAGAIN; > if (flags & IOMAP_NOWAIT) > -- > 2.31.1 > >
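To ground the discussion, here is a minimal sketch of what a userspace test like the test-pwritev2 tool mentioned later in this thread might look like: an O_DIRECT | O_DSYNC write submitted with the RWF_ATOMIC flag this series proposes. The flag value below is an assumption taken from the proposal, not a released kernel ABI.

/* atomic-write.c: issue one untorn direct write; a sketch, not the actual tool */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* assumed value from the series */
#endif

int main(int argc, char **argv)
{
	struct iovec iov = { .iov_len = 4096 };
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	/* O_DIRECT requires an aligned buffer */
	if (posix_memalign(&iov.iov_base, 4096, 4096))
		return 1;
	memset(iov.iov_base, 0xab, 4096);
	fd = open(argv[1], O_WRONLY | O_DIRECT | O_DSYNC);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* one 4096B write at offset 0; the kernel fails it if it cannot be untorn */
	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
		perror("pwritev2");
	close(fd);
	return 0;
}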
On 02/02/2024 18:47, Darrick J. Wong wrote: > On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote: >> Ensure that when creating a mapping that we adhere to all the atomic >> write rules. >> >> We check that the mapping covers the complete range of the write to ensure >> that we'll be just creating a single mapping. >> >> Currently minimum granularity is the FS block size, but it should be >> possibly to support lower in future. >> >> Signed-off-by: John Garry <john.g.garry@oracle.com> >> --- >> I am setting this as an RFC as I am not sure on the change in >> xfs_iomap_write_direct() - it gives the desired result AFAICS. >> >> fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 41 insertions(+) >> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >> index 18c8f168b153..758dc1c90a42 100644 >> --- a/fs/xfs/xfs_iomap.c >> +++ b/fs/xfs/xfs_iomap.c >> @@ -289,6 +289,9 @@ xfs_iomap_write_direct( >> } >> } >> >> + if (xfs_inode_atomicwrites(ip)) >> + bmapi_flags = XFS_BMAPI_ZERO; > > Why do we want to write zeroes to the disk if we're allocating space > even if we're not sending an atomic write? > > (This might want an explanation for why we're doing this at all -- it's > to avoid unwritten extent conversion, which defeats hardware untorn > writes.) It's to handle the scenario where we have a partially written extent, and then try to issue an atomic write which covers the complete extent. In this scenario, the iomap code will issue 2x IOs, which is unacceptable. So we ensure that the extent is completely written whenever we allocate it. At least that is my idea. > > I think we should support IOCB_ATOMIC when the mapping is unwritten -- > the data will land on disk in an untorn fashion, the unwritten extent > conversion on IO completion is itself atomic, and callers still have to > set O_DSYNC to persist anything. But does this work for the scenario above? > Then we can avoid the cost of > BMAPI_ZERO, because double-writes aren't free. About double-writes not being free, I thought that this was acceptable to just have this write zero when initially allocating the extent as it should not add too much overhead in practice, i.e. it's one off. > >> + >> error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks, >> rblocks, force, &tp); >> if (error) >> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin( >> if (error) >> goto out_unlock; >> >> + if (flags & IOMAP_ATOMIC) { >> + xfs_filblks_t unit_min_fsb, unit_max_fsb; >> + unsigned int unit_min, unit_max; >> + >> + xfs_get_atomic_write_attr(ip, &unit_min, &unit_max); >> + unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min); >> + unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max); >> + >> + if (!imap_spans_range(&imap, offset_fsb, end_fsb)) { >> + error = -EINVAL; >> + goto out_unlock; >> + } >> + >> + if ((offset & mp->m_blockmask) || >> + (length & mp->m_blockmask)) { >> + error = -EINVAL; >> + goto out_unlock; >> + } >> + >> + if (imap.br_blockcount == unit_min_fsb || >> + imap.br_blockcount == unit_max_fsb) { >> + /* ok if exactly min or max */ >> + } else if (imap.br_blockcount < unit_min_fsb || >> + imap.br_blockcount > unit_max_fsb) { >> + error = -EINVAL; >> + goto out_unlock; >> + } else if (!is_power_of_2(imap.br_blockcount)) { >> + error = -EINVAL; >> + goto out_unlock; >> + } >> + >> + if (imap.br_startoff && >> + imap.br_startoff & (imap.br_blockcount - 1)) { > > Not sure why we care about the file position, it's br_startblock that > gets passed into the bio, not br_startoff. 
We just want to ensure that the length of the write is valid w.r.t. the offset within the extent, and br_startoff would be the offset within the aligned extent. > > I'm also still not convinced that any of this validation is useful here. > The block device stack underneath the filesystem can change at any time > without any particular notice to the fs, so the only way to find out if > the proposed IO would meet the alignment constraints is to submit_bio > and see what happens. I am not sure what submit_bio() would do differently. If the block device is changing underneath the block layer, then that is where these things need to be checked. > > (The "one bio per untorn write request" thing in the direct-io.c patch > sounds sane to me though.) > ok Thanks, John
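To make the mapping checks being debated here easier to reason about, the hunk's logic can be restated as a standalone C function over plain integers -- a userspace model of the patch, not kernel code. All quantities are in filesystem blocks, and the byte-granularity blockmask checks are assumed to have passed already.

#include <stdbool.h>
#include <stdint.h>

static bool is_power_of_2(uint64_t n)
{
	return n && !(n & (n - 1));
}

/* br_startoff/br_blockcount mirror the xfs_bmbt_irec fields in the patch */
static bool atomic_mapping_ok(uint64_t br_startoff, uint64_t br_blockcount,
			      uint64_t offset_fsb, uint64_t end_fsb,
			      uint64_t unit_min_fsb, uint64_t unit_max_fsb)
{
	/* the single mapping must span the whole write: one mapping, one bio */
	if (br_startoff > offset_fsb || br_startoff + br_blockcount < end_fsb)
		return false;
	/* extent size must be exactly min or max, or a power of 2 in between */
	if (br_blockcount != unit_min_fsb && br_blockcount != unit_max_fsb) {
		if (br_blockcount < unit_min_fsb ||
		    br_blockcount > unit_max_fsb)
			return false;
		if (!is_power_of_2(br_blockcount))
			return false;
	}
	/* extent file offset must be naturally aligned to the extent size */
	if (br_startoff && (br_startoff & (br_blockcount - 1)))
		return false;
	return true;
}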
On Mon, Feb 05, 2024 at 01:36:03PM +0000, John Garry wrote: > On 02/02/2024 18:47, Darrick J. Wong wrote: > > On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote: > > > Ensure that when creating a mapping that we adhere to all the atomic > > > write rules. > > > > > > We check that the mapping covers the complete range of the write to ensure > > > that we'll be just creating a single mapping. > > > > > > Currently minimum granularity is the FS block size, but it should be > > > possibly to support lower in future. > > > > > > Signed-off-by: John Garry <john.g.garry@oracle.com> > > > --- > > > I am setting this as an RFC as I am not sure on the change in > > > xfs_iomap_write_direct() - it gives the desired result AFAICS. > > > > > > fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 41 insertions(+) > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > index 18c8f168b153..758dc1c90a42 100644 > > > --- a/fs/xfs/xfs_iomap.c > > > +++ b/fs/xfs/xfs_iomap.c > > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct( > > > } > > > } > > > + if (xfs_inode_atomicwrites(ip)) > > > + bmapi_flags = XFS_BMAPI_ZERO; We really, really don't want to be doing this during allocation unless we can avoid it. If the filesystem block size is 64kB, we could be allocating up to 96GB per extent, and that becomes an uninterruptable write stream inside a transaction context that holds inode metadata locked. IOWs, if the inode is already dirty, this data zeroing effectively pins the tail of the journal until the data writes complete, and hence can potentially stall the entire filesystem for that length of time. Historical note: XFS_BMAPI_ZERO was introduced for DAX where unwritten extents are not used for initial allocation because the direct zeroing overhead is typically much lower than unwritten extent conversion overhead. It was not intended as a general purpose "zero data at allocation time" solution primarily because of how easy it would be to DOS the storage with a single, unkillable fallocate() call on slow storage. > > Why do we want to write zeroes to the disk if we're allocating space > > even if we're not sending an atomic write? > > > > (This might want an explanation for why we're doing this at all -- it's > > to avoid unwritten extent conversion, which defeats hardware untorn > > writes.) > > It's to handle the scenario where we have a partially written extent, and > then try to issue an atomic write which covers the complete extent. When/how would that ever happen with the forcealign bits being set preventing unaligned allocation and writes? > In this > scenario, the iomap code will issue 2x IOs, which is unacceptable. So we > ensure that the extent is completely written whenever we allocate it. At > least that is my idea. So return an unaligned extent, and then the IOMAP_ATOMIC checks you add below say "no" and then the application has to do things the slow, safe way.... > > I think we should support IOCB_ATOMIC when the mapping is unwritten -- > > the data will land on disk in an untorn fashion, the unwritten extent > > conversion on IO completion is itself atomic, and callers still have to > > set O_DSYNC to persist anything. > > But does this work for the scenario above? Probably not, but if we want the mapping to return a single contiguous extent mapping that spans both unwritten and written states, then we should directly code that behaviour for atomic IO and not try to hack around it via XFS_BMAPI_ZERO. 
Unwritten extent conversion will already do the right thing in that it will only convert unwritten regions to written in the larger range that is passed to it, but if there are multiple regions that need conversion then the conversion won't be atomic. > > Then we can avoid the cost of > > BMAPI_ZERO, because double-writes aren't free. > > About double-writes not being free, I thought that this was acceptable to > just have this write zero when initially allocating the extent as it should > not add too much overhead in practice, i.e. it's one off. The whole point about atomic writes is they are a performance optimisation. If the cost of enabling atomic writes is that we double the amount of IO we are doing, then we've lost more performance than we gained by using atomic writes. That doesn't seem desirable.... > > > > > > + > > > error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks, > > > rblocks, force, &tp); > > > if (error) > > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin( > > > if (error) > > > goto out_unlock; > > > + if (flags & IOMAP_ATOMIC) { > > > + xfs_filblks_t unit_min_fsb, unit_max_fsb; > > > + unsigned int unit_min, unit_max; > > > + > > > + xfs_get_atomic_write_attr(ip, &unit_min, &unit_max); > > > + unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min); > > > + unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max); > > > + > > > + if (!imap_spans_range(&imap, offset_fsb, end_fsb)) { > > > + error = -EINVAL; > > > + goto out_unlock; > > > + } > > > + > > > + if ((offset & mp->m_blockmask) || > > > + (length & mp->m_blockmask)) { > > > + error = -EINVAL; > > > + goto out_unlock; > > > + } That belongs in the iomap DIO setup code, not here. It's also only checking the data offset/length is filesystem block aligned, not atomic IO aligned, too. > > > + > > > + if (imap.br_blockcount == unit_min_fsb || > > > + imap.br_blockcount == unit_max_fsb) { > > > + /* ok if exactly min or max */ Why? Exact sizing doesn't imply alignment is correct. > > > + } else if (imap.br_blockcount < unit_min_fsb || > > > + imap.br_blockcount > unit_max_fsb) { > > > + error = -EINVAL; > > > + goto out_unlock; Why do this after an exact check? > > > + } else if (!is_power_of_2(imap.br_blockcount)) { > > > + error = -EINVAL; > > > + goto out_unlock; Why does this matter? If the extent mapping spans a range larger than was asked for, who cares what size it is as the infrastructure is only going to do IO for the sub-range in the mapping the user asked for.... > > > + } > > > + > > > + if (imap.br_startoff && > > > + imap.br_startoff & (imap.br_blockcount - 1)) { > > > > Not sure why we care about the file position, it's br_startblock that > > gets passed into the bio, not br_startoff. > > We just want to ensure that the length of the write is valid w.r.t. to the > offset within the extent, and br_startoff would be the offset within the > aligned extent. I'm not sure why the filesystem extent mapping code needs to care about IOMAP_ATOMIC like this - the extent allocation behaviour is determined by the inode forcealign flag, not IOMAP_ATOMIC. Everything else we have to do is just mapping the offset/len that was passed to it from the iomap DIO layer. As long as we allocate with correct alignment and return a mapping that spans the start offset of the requested range, we've done our job here. Actually determining if the mapping returned for IO is suitable for the type of IO we are doing (i.e. IOMAP_ATOMIC) is the responsibility of the iomap infrastructure. 
The same checks will have to be done for every filesystem that implements atomic writes, so these checks belong in the generic code, not the filesystem mapping callouts. -Dave
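The generic-code argument raises the question of how per-inode limits reach the layers (and the applications) that need them. On the userspace side, the wider series exposes them through statx; a sketch of querying them follows. The mask and field names are taken from the proposal and should be treated as assumptions, and this needs libc/kernel headers that carry the series' additions.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct statx stx;

	if (argc != 2)
		return 1;
	/* STATX_WRITE_ATOMIC and the stx_atomic_write_* fields are the
	 * series' proposed additions to the statx ABI */
	if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx) < 0) {
		perror("statx");
		return 1;
	}
	printf("atomic write unit min %u max %u\n",
	       stx.stx_atomic_write_unit_min, stx.stx_atomic_write_unit_max);
	return 0;
}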
Hi Dave, >>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >>>> index 18c8f168b153..758dc1c90a42 100644 >>>> --- a/fs/xfs/xfs_iomap.c >>>> +++ b/fs/xfs/xfs_iomap.c >>>> @@ -289,6 +289,9 @@ xfs_iomap_write_direct( >>>> } >>>> } >>>> + if (xfs_inode_atomicwrites(ip)) >>>> + bmapi_flags = XFS_BMAPI_ZERO; > > We really, really don't want to be doing this during allocation > unless we can avoid it. If the filesystem block size is 64kB, we > could be allocating up to 96GB per extent, and that becomes an > uninterruptable write stream inside a transaction context that holds > inode metadata locked. Where does that 96GB figure come from? > > IOWs, if the inode is already dirty, this data zeroing effectively > pins the tail of the journal until the data writes complete, and > hence can potentially stall the entire filesystem for that length of > time. > > Historical note: XFS_BMAPI_ZERO was introduced for DAX where > unwritten extents are not used for initial allocation because the > direct zeroing overhead is typically much lower than unwritten > extent conversion overhead. It was not intended as a general > purpose "zero data at allocation time" solution primarily because of > how easy it would be to DOS the storage with a single, unkillable > fallocate() call on slow storage. Understood > >>> Why do we want to write zeroes to the disk if we're allocating space >>> even if we're not sending an atomic write? >>> >>> (This might want an explanation for why we're doing this at all -- it's >>> to avoid unwritten extent conversion, which defeats hardware untorn >>> writes.) >> >> It's to handle the scenario where we have a partially written extent, and >> then try to issue an atomic write which covers the complete extent. > > When/how would that ever happen with the forcealign bits being set > preventing unaligned allocation and writes? Consider this scenario: # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda # mount /dev/sda mnt -o rtdev=/dev/sdb # touch mnt/file # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic write, 4096B at pos 0 # filefrag -v mnt/file Filesystem type is: 58465342 File size of mnt/file is 4096 (1 block of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 24.. 24: 1: last,eof mnt/file: 1 extent found # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file wrote -1 bytes at pos 0 write_size=16384 # For the 2nd write, which would cover a 16KB extent, the iomap code will iter twice and produce 2x BIOs, which we don't want - that's why it errors there. With the change in this patch, instead we have something like this after the first write: # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file wrote 4096 bytes at pos 0 write_size=4096 # filefrag -v mnt/file Filesystem type is: 58465342 File size of mnt/file is 4096 (1 block of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 24.. 27: 4: last,eof mnt/file: 1 extent found # So the 16KB extent is in written state and the 2nd 16KB write would iter once, producing a single BIO. > >> In this >> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we >> ensure that the extent is completely written whenever we allocate it. At >> least that is my idea. > > So return an unaligned extent, and then the IOMAP_ATOMIC checks you > add below say "no" and then the application has to do things the > slow, safe way.... 
We have been porting atomic write support to some database apps and they (database developers) have had to do something like manually zero the complete file to get around this issue, but that's not a good user experience. Note that in their case the first 4KB write is non-atomic, but that does not change things. They do these 4KB writes in some DB setup phase. > >>> I think we should support IOCB_ATOMIC when the mapping is unwritten -- >>> the data will land on disk in an untorn fashion, the unwritten extent >>> conversion on IO completion is itself atomic, and callers still have to >>> set O_DSYNC to persist anything. >> >> But does this work for the scenario above? > > Probably not, but if we want the mapping to return a single > contiguous extent mapping that spans both unwritten and written > states, then we should directly code that behaviour for atomic > IO and not try to hack around it via XFS_BMAPI_ZERO. > > Unwritten extent conversion will already do the right thing in that > it will only convert unwritten regions to written in the larger > range that is passed to it, but if there are multiple regions that > need conversion then the conversion won't be atomic. We would need something atomic. > >>> Then we can avoid the cost of >>> BMAPI_ZERO, because double-writes aren't free. >> >> About double-writes not being free, I thought that this was acceptable to >> just have this write zero when initially allocating the extent as it should >> not add too much overhead in practice, i.e. it's one off. > > The whole point about atomic writes is they are a performance > optimisation. If the cost of enabling atomic writes is that we > double the amount of IO we are doing, then we've lost more > performance than we gained by using atomic writes. That doesn't > seem desirable.... But the zero'ing is a one off per extent allocation, right? I would expect just an initial overhead when the database is being created/extended. Anyway, I did mark this as an RFC for this same reason. > >> >>> >>>> + >>>> error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks, >>>> rblocks, force, &tp); >>>> if (error) >>>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin( >>>> if (error) >>>> goto out_unlock; >>>> + if (flags & IOMAP_ATOMIC) { >>>> + xfs_filblks_t unit_min_fsb, unit_max_fsb; >>>> + unsigned int unit_min, unit_max; >>>> + >>>> + xfs_get_atomic_write_attr(ip, &unit_min, &unit_max); >>>> + unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min); >>>> + unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max); >>>> + >>>> + if (!imap_spans_range(&imap, offset_fsb, end_fsb)) { >>>> + error = -EINVAL; >>>> + goto out_unlock; >>>> + } >>>> + >>>> + if ((offset & mp->m_blockmask) || >>>> + (length & mp->m_blockmask)) { >>>> + error = -EINVAL; >>>> + goto out_unlock; >>>> + } > > That belongs in the iomap DIO setup code, not here. It's also only > checking the data offset/length is filesystem block aligned, not > atomic IO aligned, too. hmmm... I'm not sure about that. Initially XFS will only support writes whose size is a multiple of FS block size, and this is what we are checking here, even if it is not obvious. The idea is that we can first ensure size is a multiple of FS blocksize, and then can use br_blockcount directly, below. > >>>> + >>>> + if (imap.br_blockcount == unit_min_fsb || >>>> + imap.br_blockcount == unit_max_fsb) { >>>> + /* ok if exactly min or max */ > > Why? Exact sizing doesn't imply alignment is correct. We're not checking alignment specifically, but just checking that the size is ok. 
> >>>> + } else if (imap.br_blockcount < unit_min_fsb || >>>> + imap.br_blockcount > unit_max_fsb) { >>>> + error = -EINVAL; >>>> + goto out_unlock; > > Why do this after an exact check? And this is a continuation of the size check. > >>>> + } else if (!is_power_of_2(imap.br_blockcount)) { >>>> + error = -EINVAL; >>>> + goto out_unlock; > > Why does this matter? If the extent mapping spans a range larger > than was asked for, who cares what size it is as the infrastructure > is only going to do IO for the sub-range in the mapping the user > asked for.... ok, so where would be a better check for power-of-2 write length? In iomap DIO code? I was thinking of doing that, but not so happy with sparse checks. > >>>> + } >>>> + >>>> + if (imap.br_startoff && >>>> + imap.br_startoff & (imap.br_blockcount - 1)) { >>> >>> Not sure why we care about the file position, it's br_startblock that >>> gets passed into the bio, not br_startoff. >> >> We just want to ensure that the length of the write is valid w.r.t. to the >> offset within the extent, and br_startoff would be the offset within the >> aligned extent. > > I'm not sure why the filesystem extent mapping code needs to care > about IOMAP_ATOMIC like this - the extent allocation behaviour is > determined by the inode forcealign flag, not IOMAP_ATOMIC. > Everything else we have to do is just mapping the offset/len that > was passed to it from the iomap DIO layer. As long as we allocate > with correct alignment and return a mapping that spans the start > offset of the requested range, we've done our job here. > > Actually determining if the mapping returned for IO is suitable for > the type of IO we are doing (i.e. IOMAP_ATOMIC) is the > responsibility of the iomap infrastructure. The same checks will > have to be done for every filesystem that implements atomic writes, > so these checks belong in the generic code, not the filesystem > mapping callouts. We can move some of these checks to the core iomap code. However, the core iomap code does not know FS atomic write min and max per inode, so we need some checks here. Thanks, John
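One possible shape for the hoisted checks John is asking about: validate the IO itself in the DIO submission path, before any extent mapping happens. A sketch using plain integers, with the rules as proposed by the series -- power-of-2 length within the unit bounds, naturally aligned offset:

#include <stdbool.h>
#include <stdint.h>

static bool is_power_of_2(uint64_t n)
{
	return n && !(n & (n - 1));
}

static bool atomic_write_io_ok(uint64_t pos, uint64_t len,
			       uint32_t unit_min, uint32_t unit_max)
{
	if (len < unit_min || len > unit_max)
		return false;
	if (!is_power_of_2(len))
		return false;
	/* the write must be naturally aligned to its own length */
	if (pos & (len - 1))
		return false;
	return true;
}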
On Tue, Feb 06, 2024 at 09:53:11AM +0000, John Garry wrote: > Hi Dave, > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > > > index 18c8f168b153..758dc1c90a42 100644 > > > > > --- a/fs/xfs/xfs_iomap.c > > > > > +++ b/fs/xfs/xfs_iomap.c > > > > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct( > > > > > } > > > > > } > > > > > + if (xfs_inode_atomicwrites(ip)) > > > > > + bmapi_flags = XFS_BMAPI_ZERO; > > > > We really, really don't want to be doing this during allocation > > unless we can avoid it. If the filesystem block size is 64kB, we > > could be allocating up to 96GB per extent, and that becomes an > > uninterruptable write stream inside a transaction context that holds > > inode metadata locked. > > Where does that 96GB figure come from? My inability to do math. The actual number is 128GB. Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size. = 2^21 * fs block size. So for a 4kB block size filesystem, that's 8GB max extent length, and that's the most we will allocate in a single transaction (i.e. one new BMBT record). For 64kB block size, we can get 128GB of space allocated in a single transaction. > > > > Why do we want to write zeroes to the disk if we're allocating space > > > > even if we're not sending an atomic write? > > > > > > > > (This might want an explanation for why we're doing this at all -- it's > > > > to avoid unwritten extent conversion, which defeats hardware untorn > > > > writes.) > > > > > > It's to handle the scenario where we have a partially written extent, and > > > then try to issue an atomic write which covers the complete extent. > > > > When/how would that ever happen with the forcealign bits being set > > > > preventing unaligned allocation and writes? > > Consider this scenario: > > > > # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda > > # mount /dev/sda mnt -o rtdev=/dev/sdb > > # touch mnt/file > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic > > write, 4096B at pos 0 Please don't write one-off custom test programs to issue IO - please use and enhance xfs_io so the test cases can then be put straight into fstests without adding yet another "do some minor IO variant" test program. This also means you don't need a random assortment of other tools. i.e. # xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file Should do an RWF_ATOMIC IO, and # xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file should do an RWF_ATOMIC|RWF_DSYNC IO... > # filefrag -v mnt/file xfs_io -c "fiemap" mnt/file > Filesystem type is: 58465342 > File size of mnt/file is 4096 (1 block of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: > flags: > 0: 0.. 0: 24.. 24: 1: > last,eof > mnt/file: 1 extent found > # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file > wrote -1 bytes at pos 0 write_size=16384 > # Whole test as one repeatable command: # xfs_io -d -c "truncate 0" -c "chattr +r" \ -c "pwrite -VAD 0 4096" \ -c "fiemap" \ -c "pwrite -VAD 0 16384" \ /mnt/root/file > > For the 2nd write, which would cover a 16KB extent, the iomap code will iter > twice and produce 2x BIOs, which we don't want - that's why it errors there. Yes, but I think that's a feature. You've optimised the filesystem layout for IO that is 64kB sized and aligned IO, but your test case is mixing 4kB and 16KB IO. The filesystem should be telling you that you're doing something that is sub-optimal for its configuration, and refusing to do weird overlapping sub-rtextsize atomic IO is a pretty good sign that you've got something wrong. 
The whole reason for rtextsize existing is to optimise the rtextent allocation to the typical minimum IO size done to that volume. If all your IO is sub-rtextsize size and alignment, then all that has been done is forcing the entire rt device IO into a corner it was never really intended nor optimised for. Why should we jump through crazy hoops to try to make filesystems optimised for large IOs with mismatched, overlapping small atomic writes? > With the change in this patch, instead we have something like this after the > first write: > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file > wrote 4096 bytes at pos 0 write_size=4096 > # filefrag -v mnt/file > Filesystem type is: 58465342 > File size of mnt/file is 4096 (1 block of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: > flags: > 0: 0.. 3: 24.. 27: 4: > last,eof > mnt/file: 1 extent found > # > > So the 16KB extent is in written state and the 2nd 16KB write would iter > once, producing a single BIO. Sure, I know how it works. My point is that it's a terrible way to go about allowing that second atomic write to succeed. > > > In this > > > scenario, the iomap code will issue 2x IOs, which is unacceptable. So we > > > ensure that the extent is completely written whenever we allocate it. At > > > least that is my idea. > > > > So return an unaligned extent, and then the IOMAP_ATOMIC checks you > > add below say "no" and then the application has to do things the > > slow, safe way.... > > We have been porting atomic write support to some database apps and they > (database developers) have had to do something like manually zero the > complete file to get around this issue, but that's not a good user > experience. Better the application zeros the file when it is being initialised and doesn't have performance constraints rather than forcing the filesystem to do it in the IO fast path when IO performance and latency actually matters to the application. There are production databases that already do this manual zero initialisation to avoid unwritten extent conversion overhead during runtime operation. That's because they want FUA writes to work, and that gives 25% better IO performance over the same O_DSYNC writes doing allocation and/or unwritten extent conversion after fallocate() which requires journal flushes with O_DSYNC writes. Using atomic writes is no different. > Note that in their case the first 4KB write is non-atomic, but that does not > change things. They do these 4KB writes in some DB setup phase. And therein lies the problem. If you are doing sub-rtextent IO at all, then you are forcing the filesystem down the path of explicitly using unwritten extents and requiring O_DSYNC direct IO to do journal flushes in IO completion context and then performance just goes downhill from there. The requirement for unwritten extents to track sub-rtextsize written regions is what you're trying to work around with XFS_BMAPI_ZERO so that atomic writes will always see "atomic write aligned" allocated regions. Do you see the problem here? You've explicitly told the filesystem that allocation is aligned to 64kB chunks, then because the filesystem block size is 4kB, it's allowed to track unwritten regions at 4kB boundaries. Then you do 4kB aligned file IO, which then changes unwritten extents at 4kB boundaries. Then you do an overlapping 16kB IO that *requires* 16kB allocation alignment, and things go BOOM. Yes, they should go BOOM. This is a horrible configuration - it is incompatible with 16kB aligned and sized atomic IO. 
Allocation is aligned to 64kB, written region tracking is aligned to 4kB, and there's nothing to tell the filesystem that it should be maintaining 16kB "written alignment" so that 16kB atomic writes can always be issued atomically. i.e. if we are going to do 16kB aligned atomic IO, then all the allocation and unwritten tracking needs to be done in 16kB aligned chunks, not 4kB. That means a 4KB write into an unwritten region or a hole actually needs to zero the rest of the 16KB range it sits within. The direct IO code can do this, but it needs extension of the unaligned IO serialisation in XFS (the alignment checks in xfs_file_dio_write()) and the sub-block zeroing in iomap_dio_bio_iter() (the need_zeroing padding has to span the fs allocation size, not the fsblock size) to do this safely. Regardless of how we do it, all IO concurrency on this file is shot if we have sub-rtextent sized IOs being done. That is true even with this patch set - XFS_BMAPI_ZERO is done whilst holding the XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the zeroing is being done. IOWs, anything to do with sub-rtextent IO really has to be treated like sub-fsblock DIO - i.e. exclusive inode access until the sub-rtextent zeroing has been completed. > > > > I think we should support IOCB_ATOMIC when the mapping is unwritten -- > > > > the data will land on disk in an untorn fashion, the unwritten extent > > > > conversion on IO completion is itself atomic, and callers still have to > > > > set O_DSYNC to persist anything. > > > > > > But does this work for the scenario above? > > > > Probably not, but if we want the mapping to return a single > > contiguous extent mapping that spans both unwritten and written > > states, then we should directly code that behaviour for atomic > > IO and not try to hack around it via XFS_BMAPI_ZERO. > > > > Unwritten extent conversion will already do the right thing in that > > it will only convert unwritten regions to written in the larger > > range that is passed to it, but if there are multiple regions that > > need conversion then the conversion won't be atomic. > > We would need something atomic. > > > > > > > Then we can avoid the cost of > > > > BMAPI_ZERO, because double-writes aren't free. > > > > > > About double-writes not being free, I thought that this was acceptable to > > > just have this write zero when initially allocating the extent as it should > > > not add too much overhead in practice, i.e. it's one off. > > > > The whole point about atomic writes is they are a performance > > optimisation. If the cost of enabling atomic writes is that we > > double the amount of IO we are doing, then we've lost more > > performance than we gained by using atomic writes. That doesn't > > seem desirable.... > > But the zero'ing is a one off per extent allocation, right? I would expect > just an initial overhead when the database is being created/extended. So why can't the application do that manually like others already do to enable FUA optimisations for O_DSYNC writes? FWIW, we probably should look to extend fallocate() to allow userspace to say "write real zeroes, not fake ones" so the filesystem can call blkdev_issue_zeroout() after preallocation to offload the zeroing to the hardware and then clear the unwritten bits on the preallocated range... 
> > > > > error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks, > > > > > rblocks, force, &tp); > > > > > if (error) > > > > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin( > > > > > if (error) > > > > > goto out_unlock; > > > > > + if (flags & IOMAP_ATOMIC) { > > > > > + xfs_filblks_t unit_min_fsb, unit_max_fsb; > > > > > + unsigned int unit_min, unit_max; > > > > > + > > > > > + xfs_get_atomic_write_attr(ip, &unit_min, &unit_max); > > > > > + unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min); > > > > > + unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max); > > > > > + > > > > > + if (!imap_spans_range(&imap, offset_fsb, end_fsb)) { > > > > > + error = -EINVAL; > > > > > + goto out_unlock; > > > > > + } > > > > > + > > > > > + if ((offset & mp->m_blockmask) || > > > > > + (length & mp->m_blockmask)) { > > > > > + error = -EINVAL; > > > > > + goto out_unlock; > > > > > + } > > > > That belongs in the iomap DIO setup code, not here. It's also only > > checking the data offset/length is filesystem block aligned, not > > atomic IO aligned, too. > > hmmm... I'm not sure about that. Initially XFS will only support writes > whose size is a multiple of FS block size, and this is what we are checking > here, even if it is not obvious. Which means, initially, iomap only supported writes that are a multiple of filesystem block size. Regardless, this should be checked in the submission path, not in the extent mapping callback. FWIW, we've already established above that iomap needs to handle rtextsize chunks rather than fs block size for correct zeroing behaviour for atomic writes, so this probably just needs to go away. > The idea is that we can first ensure size is a multiple of FS blocksize, and > then can use br_blockcount directly, below. Yes, and we can do all these checks on the iomap that we return to the iomap layer. All this is doing is running the checks on the XFS imap before it is formatted into the iomap and returned to the iomap layer. These checks can be done on the returned iomap in the iomap layer if IOMAP_ATOMIC is set.... > However, the core iomap code does not know FS atomic write min and max per > inode, so we need some checks here. So maybe we should pass them down to iomap in the iocb when IOCB_ATOMIC is set, or reject the IO at the filesystem layer when checking atomic IO alignment before we pass the IO to the iomap layer... -Dave.
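The manual zero-initialisation Dave describes production databases doing today reduces to a few lines of userspace C: preallocate, overwrite with real zeroes so no unwritten extents remain, then persist. A minimal sketch; a real setup tool would likely use direct IO and handle interruption.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int prezero_file(int fd, off_t size)
{
	char buf[65536];
	off_t pos = 0;

	memset(buf, 0, sizeof(buf));
	/* allocate the blocks up front, then overwrite them with zeroes */
	if (posix_fallocate(fd, 0, size))
		return -1;
	while (pos < size) {
		size_t n = sizeof(buf);
		ssize_t w;

		if ((off_t)n > size - pos)
			n = size - pos;
		w = pwrite(fd, buf, n, pos);
		if (w < 0)
			return -1;
		pos += w;
	}
	return fsync(fd);
}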
On 07/02/2024 00:06, Dave Chinner wrote: >>> We really, really don't want to be doing this during allocation >>> unless we can avoid it. If the filesystem block size is 64kB, we >>> could be allocating up to 96GB per extent, and that becomes an >>> uninterruptable write stream inside a transaction context that holds >>> inode metadata locked. >> Where does that 96GB figure come from? > My inability to do math. The actual number is 128GB. > > Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size. > = 2^21 * fs block size. > > So for a 4kB block size filesystem, that's 8GB max extent length, > and that's the most we will allocate in a single transaction (i.e. > one new BMBT record). > > For 64kB block size, we can get 128GB of space allocated in a single > transaction. atomic write unit max theoretical upper limit is rounddown_power_of_2(2^32 - 1) = 2GB So this would be what is expected to be the largest extent size requested for atomic writes. I am not saying that 2GB is small, but certainly much smaller than 128GB. > >>>>> Why do we want to write zeroes to the disk if we're allocating space >>>>> even if we're not sending an atomic write? >>>>> >>>>> (This might want an explanation for why we're doing this at all -- it's >>>>> to avoid unwritten extent conversion, which defeats hardware untorn >>>>> writes.) >>>> It's to handle the scenario where we have a partially written extent, and >>>> then try to issue an atomic write which covers the complete extent. >>> When/how would that ever happen with the forcealign bits being set >>> preventing unaligned allocation and writes? >> Consider this scenario: >> >> # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda >> # mount /dev/sda mnt -o rtdev=/dev/sdb >> # touch mnt/file >> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic >> write, 4096B at pos 0 > Please don't write one-off custom test programs to issue IO - please > use and enhance xfs_io so the test cases can then be put straight > into fstests without adding yet another "do some minor IO variant" > test program. This also means you don't need a random assortment of > other tools. > > i.e. > > # xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file > > Should do an RWF_ATOMIC IO, and > > # xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file > > should do an RWF_ATOMIC|RWF_DSYNC IO... > > >> # filefrag -v mnt/file > xfs_io -c "fiemap" mnt/file Fine, but I like using something generic for accessing block devices and also other FSes. I didn't think that xfs_io can do that. Anyway, we can look to add atomic write support to xfs_io and any other xfs-progs > >> Filesystem type is: 58465342 >> File size of mnt/file is 4096 (1 block of 4096 bytes) >> ext: logical_offset: physical_offset: length: expected: >> flags: >> 0: 0.. 0: 24.. 24: 1: >> last,eof >> mnt/file: 1 extent found >> # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file >> wrote -1 bytes at pos 0 write_size=16384 >> # > Whole test as one repeatable command: > > # xfs_io -d -c "truncate 0" -c "chattr +r" \ > -c "pwrite -VAD 0 4096" \ > -c "fiemap" \ > -c "pwrite -VAD 0 16384" \ > /mnt/root/file >> For the 2nd write, which would cover a 16KB extent, the iomap code will iter >> twice and produce 2x BIOs, which we don't want - that's why it errors there. > Yes, but I think that's a feature. You've optimised the filesystem > layout for IO that is 64kB sized and aligned IO, but your test case > is mixing 4kB and 16KB IO. 
The filesystem should be telling you that > you're doing something that is sub-optimal for its configuration, > and refusing to do weird overlapping sub-rtextsize atomic IO is a > pretty good sign that you've got something wrong. Then we really end up with a strange behavior for the user. I mean, the user may ask - "why did this 16KB atomic write pass and this one fail? I'm following all the rules", and then "No one said not to mix write sizes or not mix atomic and non-atomic writes, so should be ok. Indeed, that earlier 4K write for the same region passed". Playing devil's advocate here, at least this behavior should be documented. > > The whole reason for rtextsize existing is to optimise the rtextent > allocation to the typical minimum IO size done to that volume. If > all your IO is sub-rtextsize size and alignment, then all that has > been done is forcing the entire rt device IO into a corner it was > never really intended nor optimised for. Sure, but just because we are optimized for a certain IO write size should not mean that other writes are disallowed or quite problematic. > > Why should we jump through crazy hoops to try to make filesystems > optimised for large IOs with mismatched, overlapping small atomic > writes? As mentioned, typically the atomic writes will be the same size, but we may have other writes of smaller size. > >> With the change in this patch, instead we have something like this after the >> first write: >> >> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file >> wrote 4096 bytes at pos 0 write_size=4096 >> # filefrag -v mnt/file >> Filesystem type is: 58465342 >> File size of mnt/file is 4096 (1 block of 4096 bytes) >> ext: logical_offset: physical_offset: length: expected: >> flags: >> 0: 0.. 3: 24.. 27: 4: >> last,eof >> mnt/file: 1 extent found >> # >> >> So the 16KB extent is in written state and the 2nd 16KB write would iter >> once, producing a single BIO. > Sure, I know how it works. My point is that it's a terrible way to > go about allowing that second atomic write to succeed. I think 'terrible' is a bit too strong a word here. Indeed, you suggest to manually zero the file to solve this problem, below, while this code change does the same thing automatically. BTW, there was a request for behavior like in this patch. Please see this discussion on the ext4 atomic writes port: https://lore.kernel.org/linux-ext4/ZXhb0tKFvAge%2FGWf@infradead.org/ So we should have some solution where the kernel automatically takes care of this unwritten extent issue. > >>>> In this >>>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we >>>> ensure that the extent is completely written whenever we allocate it. At >>>> least that is my idea. >>> So return an unaligned extent, and then the IOMAP_ATOMIC checks you >>> add below say "no" and then the application has to do things the >>> slow, safe way.... >> We have been porting atomic write support to some database apps and they >> (database developers) have had to do something like manually zero the >> complete file to get around this issue, but that's not a good user >> experience. > Better the application zeros the file when it is being initialised > and doesn't have performance constraints rather than forcing the > filesystem to do it in the IO fast path when IO performance and > latency actually matters to the application. Can't we do both? I mean, the well-informed user can still pre-zero the file just to ensure we aren't doing this zero'ing with the extent allocation. 
> > There are production databases that already do this manual zero > initialisation to avoid unwritten extent conversion overhead during > runtime operation. That's because they want FUA writes to work, and > that gives 25% better IO performance over the same O_DSYNC writes > doing allocation and/or unwritten extent conversion after > fallocate() which requires journal flushes with O_DSYNC writes. > > Using atomic writes is no different. Sure, and as I said, they can do both. The user is already wise enough to zero the file for other performance reasons (like FUA writes). > >> Note that in their case the first 4KB write is non-atomic, but that does not >> change things. They do these 4KB writes in some DB setup phase. JFYI, these 4K writes were in some compressed mode in the DB setup phase, hence the smaller size. > And therein lies the problem. > > If you are doing sub-rtextent IO at all, then you are forcing the > filesystem down the path of explicitly using unwritten extents and > requiring O_DSYNC direct IO to do journal flushes in IO completion > context and then performance just goes downhill from there. > > The requirement for unwritten extents to track sub-rtextsize written > regions is what you're trying to work around with XFS_BMAPI_ZERO so > that atomic writes will always see "atomic write aligned" allocated > regions. > > Do you see the problem here? You've explicitly told the filesystem > that allocation is aligned to 64kB chunks, then because the > filesystem block size is 4kB, it's allowed to track unwritten > regions at 4kB boundaries. Then you do 4kB aligned file IO, which > then changes unwritten extents at 4kB boundaries. Then you do an > overlapping 16kB IO that *requires* 16kB allocation alignment, and > things go BOOM. > > Yes, they should go BOOM. > > This is a horrible configuration - it is incompatible with 16kB > aligned and sized atomic IO. Just because the DB may do 16KB atomic writes most of the time should not disallow it from any other form of writes. Indeed at https://lore.kernel.org/linux-nvme/ZSR4jeSKlppLWjQy@dread.disaster.area/ you wrote "Not every IO to every file needs to be atomic." (sorry for quoting you) So the user can do other regular writes, but you say that they should be always writing full extents. This just may not suit some DBs. > Allocation is aligned to 64kB, written > region tracking is aligned to 4kB, and there's nothing to tell the > filesystem that it should be maintaining 16kB "written alignment" so > that 16kB atomic writes can always be issued atomically. > > i.e. if we are going to do 16kB aligned atomic IO, then all the > allocation and unwritten tracking needs to be done in 16kB aligned > chunks, not 4kB. That means a 4KB write into an unwritten region or > a hole actually needs to zero the rest of the 16KB range it sits > within. > > The direct IO code can do this, but it needs extension of the > unaligned IO serialisation in XFS (the alignment checks in > xfs_file_dio_write()) and the sub-block zeroing in > iomap_dio_bio_iter() (the need_zeroing padding has to span the fs > allocation size, not the fsblock size) to do this safely. > > Regardless of how we do it, all IO concurrency on this file is shot > if we have sub-rtextent sized IOs being done. That is true even with > this patch set - XFS_BMAPI_ZERO is done whilst holding the > XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the > zeroing is being done. > > IOWs, anything to do with sub-rtextent IO really has to be treated > like sub-fsblock DIO - i.e. 
exclusive inode access until the > sub-rtextent zeroing has been completed. I do understand that this is not perfect in that we may have mixed block sizes being written, but I don't think that we should disallow it and throw an error. I would actually think that the worst thing is that the user does not know this restriction. > >>>>> I think we should support IOCB_ATOMIC when the mapping is unwritten -- >>>>> the data will land on disk in an untorn fashion, the unwritten extent >>>>> conversion on IO completion is itself atomic, and callers still have to >>>>> set O_DSYNC to persist anything. >>>> But does this work for the scenario above? >>> Probably not, but if we want the mapping to return a single >>> contiguous extent mapping that spans both unwritten and written >>> states, then we should directly code that behaviour for atomic >>> IO and not try to hack around it via XFS_BMAPI_ZERO. >>> >>> Unwritten extent conversion will already do the right thing in that >>> it will only convert unwritten regions to written in the larger >>> range that is passed to it, but if there are multiple regions that >>> need conversion then the conversion won't be atomic. >> We would need something atomic. >> >>>>> Then we can avoid the cost of >>>>> BMAPI_ZERO, because double-writes aren't free. >>>> About double-writes not being free, I thought that this was acceptable to >>>> just have this write zero when initially allocating the extent as it should >>>> not add too much overhead in practice, i.e. it's one off. >>> The whole point about atomic writes is they are a performance >>> optimisation. If the cost of enabling atomic writes is that we >>> double the amount of IO we are doing, then we've lost more >>> performance than we gained by using atomic writes. That doesn't >>> seem desirable.... >> But the zero'ing is a one off per extent allocation, right? I would expect >> just an initial overhead when the database is being created/extended. > So why can't the application do that manually like others already do > to enable FUA optimisations for O_DSYNC writes? Is this even officially documented as advice or a suggestion for users? > > FWIW, we probably should look to extend fallocate() to allow > userspace to say "write real zeroes, not fake ones" so the > filesystem can call blkdev_issue_zeroout() after preallocation to > offload the zeroing to the hardware and then clear the unwritten > bits on the preallocated range... ack > >>>>>> error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks, >>>>>> rblocks, force, &tp); >>>>>> if (error) >>>>>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin( >>>>>> if (error) >>>>>> goto out_unlock; >>>>>> + if (flags & IOMAP_ATOMIC) { >>>>>> + xfs_filblks_t unit_min_fsb, unit_max_fsb; >>>>>> + unsigned int unit_min, unit_max; >>>>>> + >>>>>> + xfs_get_atomic_write_attr(ip, &unit_min, &unit_max); >>>>>> + unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min); >>>>>> + unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max); >>>>>> + >>>>>> + if (!imap_spans_range(&imap, offset_fsb, end_fsb)) { >>>>>> + error = -EINVAL; >>>>>> + goto out_unlock; >>>>>> + } >>>>>> + >>>>>> + if ((offset & mp->m_blockmask) || >>>>>> + (length & mp->m_blockmask)) { >>>>>> + error = -EINVAL; >>>>>> + goto out_unlock; >>>>>> + } >>> That belongs in the iomap DIO setup code, not here. It's also only >>> checking the data offset/length is filesystem block aligned, not >>> atomic IO aligned, too. >> hmmm... I'm not sure about that. 
Initially XFS will only support writes >> whose size is a multiple of FS block size, and this is what we are checking >> here, even if it is not obvious. > Which means, initially, iomap only supported writes that are a > multiple of filesystem block size. Regardless, this should be > checked in the submission path, not in the extent mapping callback. > > FWIW, we've already established above that iomap needs to handle > rtextsize chunks rather than fs block size for correct zeroing > behaviour for atomic writes, so this probably just needs to go away. Fine, I think that all this can be moved to iomap core / removed. > >> The idea is that we can first ensure size is a multiple of FS blocksize, and >> then can use br_blockcount directly, below. > Yes, and we can do all these checks on the iomap that we return to > the iomap layer. All this is doing is running the checks on the XFS > imap before it is formatted into the iomap and returned to the > iomap layer. These checks can be done on the returned iomap in the > iomap layer if IOMAP_ATOMIC is set.... > >> However, the core iomap code does not know FS atomic write min and max per >> inode, so we need some checks here. > So maybe we should pass them down to iomap in the iocb when > IOCB_ATOMIC is set, or reject the IO at the filesystem layer when > checking atomic IO alignment before we pass the IO to the iomap > layer... Yes, I think that something like this is possible. As for using kiocb, it's quite a small structure, so I can't imagine that it would be welcome to add all this atomic write stuff. So we could add extra awu min and max args to iomap_dio_rw(), and these could be filled in by the calling FS. But that function already has enough args. Alternatively we could add an iomap_ops method to look up awu min and max for the inode. Thanks, John
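To make John's two alternatives concrete, here is a rough sketch of their shapes. Every name below is invented for illustration; neither interface exists in this form.

#include <sys/types.h>

struct atomic_write_limits {
	unsigned int unit_min;	/* bytes */
	unsigned int unit_max;	/* bytes */
};

/* Option 1: thread the bounds through the DIO entry point as extra args */
ssize_t iomap_dio_rw_atomic(/* ...the existing iomap_dio_rw() args... */
			    const struct atomic_write_limits *lim);

/* Option 2: a per-filesystem callout the iomap DIO layer invokes itself */
struct iomap_atomic_ops {
	void (*get_atomic_write_attr)(const void *inode,
				      struct atomic_write_limits *lim);
};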
On Wed, Feb 07, 2024 at 02:13:23PM +0000, John Garry wrote: > On 07/02/2024 00:06, Dave Chinner wrote: > > > > We really, really don't want to be doing this during allocation > > > > unless we can avoid it. If the filesystem block size is 64kB, we > > > > could be allocating up to 96GB per extent, and that becomes an > > > > uninterruptable write stream inside a transaction context that holds > > > > inode metadata locked. > > > Where does that 96GB figure come from? > > My inability to do math. The actual number is 128GB. > > > > Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size. > > = 2^21 * fs block size. > > > > So for a 4kB block size filesystem, that's 8GB max extent length, > > and that's the most we will allocate in a single transaction (i.e. > > one new BMBT record). > > > > For 64kB block size, we can get 128GB of space allocated in a single > > transaction. > > atomic write unit max theoretical upper limit is rounddown_power_of_2(2^32 - > 1) = 2GB > > So this would be what is expected to be the largest extent size requested > for atomic writes. I am not saying that 2GB is small, but certainly much > smaller than 128GB. *cough* Extent size hints. I'm a little disappointed that after all these discussions about how we decouple extent allocation size and alignment from the user IO size and alignment with things like extent size hints, force align, etc that you are still thinking that user IO size and alignment directly drives extent allocation size and alignment.... > > > > > > Why do we want to write zeroes to the disk if we're allocating space > > > > > > even if we're not sending an atomic write? > > > > > > > > > > > > (This might want an explanation for why we're doing this at all -- it's > > > > > > to avoid unwritten extent conversion, which defeats hardware untorn > > > > > > writes.) > > > > > It's to handle the scenario where we have a partially written extent, and > > > > > then try to issue an atomic write which covers the complete extent. > > > > When/how would that ever happen with the forcealign bits being set > > > > preventing unaligned allocation and writes? > > > Consider this scenario: > > > > > > # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda > > > # mount /dev/sda mnt -o rtdev=/dev/sdb > > > # touch mnt/file > > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic > > > write, 4096B at pos 0 > > Please don't write one-off custom test programs to issue IO - please > > use and enhance xfs_io so the test cases can then be put straight > > into fstests without adding yet another "do some minor IO variant" > > test program. This also means you don't need a random assortment of > > other tools. > > > > i.e. > > > > # xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file > > > > Should do an RWF_ATOMIC IO, and > > > > # xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file > > > > should do an RWF_ATOMIC|RWF_DSYNC IO... > > > > > > > # filefrag -v mnt/file > > xfs_io -c "fiemap" mnt/file > > Fine, but I like using something generic for accessing block devices and > also other FSes. I didn't think that xfs_io can do that. Yes, it can. We use it extensively in fstests because it works for any filesystem, not just XFS. > Anyway, we can look to add atomic write support to xfs_io and any other > xfs-progs Please do, then the support is there for developers, users and fstests without needing to write their own custom test programs. 
> > > Filesystem type is: 58465342 > > > File size of mnt/file is 4096 (1 block of 4096 bytes) > > > ext: logical_offset: physical_offset: length: expected: > > > flags: > > > 0: 0.. 0: 24.. 24: 1: > > > last,eof > > > mnt/file: 1 extent found > > > # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file > > > wrote -1 bytes at pos 0 write_size=16384 > > > # > > Whole test as one repeatable command: > > > > # xfs_io -d -c "truncate 0" -c "chattr +r" \ > > -c "pwrite -VAD 0 4096" \ > > -c "fiemap" \ > > -c "pwrite -VAD 0 16384" \ > > /mnt/root/file > > > For the 2nd write, which would cover a 16KB extent, the iomap code will iter > > > twice and produce 2x BIOs, which we don't want - that's why it errors there. > > Yes, but I think that's a feature. You've optimised the filesystem > > layout for IO that is 64kB sized and aligned IO, but your test case > > is mixing 4kB and 16KB IO. The filesystem should be telling you that > > you're doing something that is sub-optimal for it's configuration, > > and refusing to do weird overlapping sub-rtextsize atomic IO is a > > pretty good sign that you've got something wrong. > > Then we really end up with a strange behavior for the user. I mean, the user > may ask - "why did this 16KB atomic write pass and this one fail? I'm > following all the rules", and then "No one said not to mix write sizes or > not mix atomic and non-atomic writes, so should be ok. Indeed, that earlier > 4K write for the same region passed". > > Playing devil's advocate here, at least this behavior should be documented. That's what man pages are for, yes? Are you expecting your deployments to be run on highly suboptimal configurations and so the code needs to be optimised for this behaviour, or are you expecting them to be run on correctly configured systems which would never see these issues? > > The whole reason for rtextsize existing is to optimise the rtextent > > allocation to the typical minimum IO size done to that volume. If > > all your IO is sub-rtextsize size and alignment, then all that has > > been done is forcing the entire rt device IO into a corner it was > > never really intended nor optimised for. > > Sure, but just because we are optimized for a certain IO write size should > not mean that other writes are disallowed or quite problematic. Atomic writes are just "other writes". They are writes that are *expected to fail* if they cannot be done atomically. Application writers will quickly learn how to do sane, fast, reliable atomic write IO if we reject anything that is going to requires some complex, sub-optimal workaround in the kernel to make it work. The simplest solution is to -fail the write-, because userspace *must* be prepared for *any* atomic write to fail. > > Why should we jump through crazy hoops to try to make filesystems > > optimised for large IOs with mismatched, overlapping small atomic > > writes? > > As mentioned, typically the atomic writes will be the same size, but we may > have other writes of smaller size. Then we need the tiny write to allocate and zero according to the maximum sized atomic write bounds. Then we just don't care about large atomic IO overlapping small IO, because the extent on disk aligned to the large atomic IO is then always guaranteed to be the correct size and shape. 
> > > With the change in this patch, instead we have something like this after the first write:
> > >
> > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
> > > wrote 4096 bytes at pos 0 write_size=4096
> > > # filefrag -v mnt/file
> > > Filesystem type is: 58465342
> > > File size of mnt/file is 4096 (1 block of 4096 bytes)
> > >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> > >    0:        0..       3:         24..        27:      4:             last,eof
> > > mnt/file: 1 extent found
> > > #
> > >
> > > So the 16KB extent is in written state and the 2nd 16KB write would iter once, producing a single BIO.
> > Sure, I know how it works. My point is that it's a terrible way to go about allowing that second atomic write to succeed.
> I think 'terrible' is a bit too strong a word here.

Doing anything in a way that a user can DOS the entire filesystem is *terrible*. No ifs, buts or otherwise.

> Indeed, you suggest manually zeroing the file to solve this problem, below, while this code change does the same thing automatically.

Yes, but I also outlined a way that it can be done automatically without being terrible. There are multiple options here; I outlined two different approaches that are acceptable.

> > > > > In this scenario, the iomap code will issue 2x IOs, which is unacceptable. So we ensure that the extent is completely written whenever we allocate it. At least that is my idea.
> > > > So return an unaligned extent, and then the IOMAP_ATOMIC checks you add below say "no" and then the application has to do things the slow, safe way....
> > > We have been porting atomic write support to some database apps and they (database developers) have had to do something like manually zero the complete file to get around this issue, but that's not a good user experience.
> > Better the application zeros the file when it is being initialised and doesn't have performance constraints rather than forcing the filesystem to do it in the IO fast path when IO performance and latency actually matters to the application.
>
> Can't we do both? I mean, the well-informed user can still pre-zero the file just to ensure we aren't doing this zeroing with the extent allocation.

I never said we can't do zeroing. I just said that it's normally better when the application controls zeroing directly.

> > And therein lies the problem.
> >
> > If you are doing sub-rtextent IO at all, then you are forcing the filesystem down the path of explicitly using unwritten extents and requiring O_DSYNC direct IO to do journal flushes in IO completion context, and then performance just goes downhill from there.
> >
> > The requirement for unwritten extents to track sub-rtextsize written regions is what you're trying to work around with XFS_BMAPI_ZERO so that atomic writes will always see "atomic write aligned" allocated regions.
> >
> > Do you see the problem here? You've explicitly told the filesystem that allocation is aligned to 64kB chunks, then because the filesystem block size is 4kB, it's allowed to track unwritten regions at 4kB boundaries. Then you do 4kB aligned file IO, which then changes unwritten extents at 4kB boundaries. Then you do an overlapping 16kB IO that *requires* 16kB allocation alignment, and things go BOOM.
> >
> > Yes, they should go BOOM.
> >
> > This is a horrible configuration - it is incompatible with 16kB aligned and sized atomic IO.
> Just because the DB may do 16KB atomic writes most of the time should not disallow it from any other form of writes.

That's not what I said. I said that using sub-rtextsize atomic writes with single FSB unwritten extent tracking is horrible and incompatible with doing 16kB atomic writes.

This setup will not work at all well with your patches and should go BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup has uncoordinated extent allocation and unwritten conversion granularity.

That's the fundamental design problem with your approach - it allows unwritten conversion at *minimum IO sizes* and that does not work with atomic IOs with larger alignment requirements.

The fundamental design principle is this: for maximally sized atomic writes to always succeed, we require every allocation, zeroing and unwritten conversion operation to use alignments and sizes that are compatible with the maximum atomic write sizes being used.

i.e. atomic writes need to use max write size granularity for all IO operations, not filesystem block granularity.

And that also means things like rtextsize and extsize hints need to match these atomic write requirements, too....

> > Allocation is aligned to 64kB, written region tracking is aligned to 4kB, and there's nothing to tell the filesystem that it should be maintaining 16kB "written alignment" so that 16kB atomic writes can always be issued atomically.
> >
> > i.e. if we are going to do 16kB aligned atomic IO, then all the allocation and unwritten tracking needs to be done in 16kB aligned chunks, not 4kB. That means a 4KB write into an unwritten region or a hole actually needs to zero the rest of the 16KB range it sits within.
> >
> > The direct IO code can do this, but it needs extension of the unaligned IO serialisation in XFS (the alignment checks in xfs_file_dio_write()) and the sub-block zeroing in iomap_dio_bio_iter() (the need_zeroing padding has to span the fs allocation size, not the fsblock size) to do this safely.
> >
> > Regardless of how we do it, all IO concurrency on this file is shot if we have sub-rtextent sized IOs being done. That is true even with this patch set - XFS_BMAPI_ZERO is done whilst holding the XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the zeroing is being done.
> >
> > IOWs, anything to do with sub-rtextent IO really has to be treated like sub-fsblock DIO - i.e. exclusive inode access until the sub-rtextent zeroing has been completed.
>
> I do understand that this is not perfect in that we may have mixed block sizes being written, but I don't think that we should disallow it and throw an error.

Ummmm, did you read what you quoted?

The above is an outline of the IO path modifications that will allow mixed IO sizes to be used with atomic writes without requiring the XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment zeroing out to the existing DIO sub-block zeroing, hence ensuring that we only ever convert unwritten extents on max sized atomic write boundaries for atomic write enabled inodes.

At no point have I said "no mixed writes". I've said no to the XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue that it works around and given you a decent amount of detail on how to sanely implement mixed write support that will work (slowly) with those configurations and IO patterns.
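Put another way, every BMBT-modifying path would share one alignment rule. A sketch of that invariant (the helper names are hypothetical, not from this series):

/*
 * Sketch of the design principle above: every allocation, zeroing
 * and unwritten conversion range must sit on max atomic write
 * boundaries. xfs_inode_awu_max_fsb() is a hypothetical helper
 * returning the max atomic write size in filesystem blocks.
 */
static bool
xfs_atomic_range_aligned(
	struct xfs_inode	*ip,
	xfs_fileoff_t		offset_fsb,
	xfs_filblks_t		count_fsb)
{
	xfs_extlen_t		awu_max_fsb = xfs_inode_awu_max_fsb(ip);

	return !(offset_fsb % awu_max_fsb) && !(count_fsb % awu_max_fsb);
}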
So it's your choice - you can continue to believe I don't want mixed writes to work at all, or you can go back and try to understand the IO path changes I've suggested that will allow mixed atomic writes to work as well as they possibly can....

-Dave.
>> Playing devil's advocate here, at least this behavior should be documented.
>
> That's what man pages are for, yes?
>
> Are you expecting your deployments to be run on highly suboptimal configurations and so the code needs to be optimised for this behaviour, or are you expecting them to be run on correctly configured systems which would never see these issues?

The latter, hopefully.

>>> The whole reason for rtextsize existing is to optimise the rtextent allocation to the typical minimum IO size done to that volume. If all your IO is sub-rtextsize size and alignment, then all that has been done is forcing the entire rt device IO into a corner it was never really intended nor optimised for.
>>
>> Sure, but just because we are optimized for a certain IO write size should not mean that other writes are disallowed or quite problematic.
>
> Atomic writes are just "other writes". They are writes that are *expected to fail* if they cannot be done atomically.

Agreed

> Application writers will quickly learn how to do sane, fast, reliable atomic write IO if we reject anything that is going to require some complex, sub-optimal workaround in the kernel to make it work. The simplest solution is to -fail the write-, because userspace *must* be prepared for *any* atomic write to fail.

Sure, but it needs to be such that the application writer at least knows why it failed, which so far has not been documented.

>>> Why should we jump through crazy hoops to try to make filesystems optimised for large IOs work with mismatched, overlapping small atomic writes?
>>
>> As mentioned, typically the atomic writes will be the same size, but we may have other writes of smaller size.
>
> Then we need the tiny write to allocate and zero according to the maximum sized atomic write bounds. Then we just don't care about large atomic IO overlapping small IO, because the extent on disk aligned to the large atomic IO is then always guaranteed to be the correct size and shape.

I think it's worth mentioning that there is currently a separation between how we configure the FS extent size for atomic writes and what the bdev can actually support in terms of atomic writes.

Setting the rtvol extsize at mkfs time or enabling atomic writes with FS_XFLAG_ATOMICWRITES doesn't check what the underlying bdev can do in terms of atomic writes.

This check is not done as what the bdev can do in terms of atomic writes is not fixed - or, more specifically, what the request_queue reports is not fixed. There are things which can change this. For example, a FW update could change all the atomic write capabilities of a disk. Or even if we swapped a SCSI disk into another host, the atomic write limits may change, as the atomic write unit max depends on the SCSI HBA DMA limits. Changing BIO_MAX_VECS - which could come from a kernel update - could also change what we report as the atomic write limit in the request queue.

>>>> With the change in this patch, instead we have something like this after the first write:
>>>>
>>>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
>>>> wrote 4096 bytes at pos 0 write_size=4096
>>>> # filefrag -v mnt/file
>>>> Filesystem type is: 58465342
>>>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>>>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>>>>    0:        0..       3:         24..        27:      4:             last,eof
>>>> mnt/file: 1 extent found
>>>> #
>>>>
>>>> So the 16KB extent is in written state and the 2nd 16KB write would iter once, producing a single BIO.
>>> Sure, I know how it works. My point is that it's a terrible way to go about allowing that second atomic write to succeed.
>> I think 'terrible' is a bit too strong a word here.
>
> Doing anything in a way that a user can DOS the entire filesystem is *terrible*. No ifs, buts or otherwise.

Understood

>> Indeed, you suggest manually zeroing the file to solve this problem, below, while this code change does the same thing automatically.
>
> Yes, but I also outlined a way that it can be done automatically without being terrible. There are multiple options here; I outlined two different approaches that are acceptable.

I think that I need to check these alternate solutions in more detail. More below.

>>>>>> In this scenario, the iomap code will issue 2x IOs, which is unacceptable. So we ensure that the extent is completely written whenever we allocate it. At least that is my idea.
>>>>> So return an unaligned extent, and then the IOMAP_ATOMIC checks you add below say "no" and then the application has to do things the slow, safe way....
>>>> We have been porting atomic write support to some database apps and they (database developers) have had to do something like manually zero the complete file to get around this issue, but that's not a good user experience.
>>> Better the application zeros the file when it is being initialised and doesn't have performance constraints rather than forcing the filesystem to do it in the IO fast path when IO performance and latency actually matters to the application.
>>
>> Can't we do both? I mean, the well-informed user can still pre-zero the file just to ensure we aren't doing this zeroing with the extent allocation.
>
> I never said we can't do zeroing. I just said that it's normally better when the application controls zeroing directly.

ok

>>> And therein lies the problem.
>>>
>>> If you are doing sub-rtextent IO at all, then you are forcing the filesystem down the path of explicitly using unwritten extents and requiring O_DSYNC direct IO to do journal flushes in IO completion context, and then performance just goes downhill from there.
>>>
>>> The requirement for unwritten extents to track sub-rtextsize written regions is what you're trying to work around with XFS_BMAPI_ZERO so that atomic writes will always see "atomic write aligned" allocated regions.
>>>
>>> Do you see the problem here? You've explicitly told the filesystem that allocation is aligned to 64kB chunks, then because the filesystem block size is 4kB, it's allowed to track unwritten regions at 4kB boundaries. Then you do 4kB aligned file IO, which then changes unwritten extents at 4kB boundaries. Then you do an overlapping 16kB IO that *requires* 16kB allocation alignment, and things go BOOM.
>>>
>>> Yes, they should go BOOM.
>>>
>>> This is a horrible configuration - it is incompatible with 16kB aligned and sized atomic IO.
>>
>> Just because the DB may do 16KB atomic writes most of the time should not disallow it from any other form of writes.
>
> That's not what I said. I said that using sub-rtextsize atomic writes with single FSB unwritten extent tracking is horrible and incompatible with doing 16kB atomic writes.
>
> This setup will not work at all well with your patches and should go BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup has uncoordinated extent allocation and unwritten conversion granularity.
>
> That's the fundamental design problem with your approach - it allows unwritten conversion at *minimum IO sizes* and that does not work with atomic IOs with larger alignment requirements.
>
> The fundamental design principle is this: for maximally sized atomic writes to always succeed, we require every allocation, zeroing and unwritten conversion operation to use alignments and sizes that are compatible with the maximum atomic write sizes being used.

That sounds fine.

My question then is how we determine this max atomic write size granularity.

We don't explicitly tell the FS what atomic write size we want for a file. Rather we mkfs with some extsize value which should match our atomic write maximal value and then tell the FS we want to do atomic writes on a file, and if this is accepted then we can query the atomic write min and max unit size, and this would be [FS block size, min(bdev atomic write limit, rtextsize)].

If rtextsize is 16KB, then we have a good idea that we want 16KB atomic write support. So then we could use rtextsize as this max atomic write size.

But I am not 100% sure that that is your idea (apologies if I am wrong - I am sincerely trying to follow your idea), but rather it would be min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and the bdev atomic write limit is 16KB, then there is not much point in dealing in 1MB blocks for this unwritten extent conversion alignment. If so, then my concern is that the bdev atomic write upper limit is not fixed. This can be solved, but I would still like to be clear on this max atomic write size.

> i.e. atomic writes need to use max write size granularity for all IO operations, not filesystem block granularity.
>
> And that also means things like rtextsize and extsize hints need to match these atomic write requirements, too....

As above, I am not 100% sure if you mean these to be the atomic write maximal value.

>>> Allocation is aligned to 64kB, written region tracking is aligned to 4kB, and there's nothing to tell the filesystem that it should be maintaining 16kB "written alignment" so that 16kB atomic writes can always be issued atomically.

Please note that in my previous example the mkfs rtextsize arg should really have been 16KB, and the intention would have been to enable 16KB atomic writes. I used 64KB casually as I thought it should be possible to support sub-rtextsize atomic writes. The point which I was trying to make was that the 16KB atomic write and 4KB regular write intermixing was problematic.

>>> i.e. if we are going to do 16kB aligned atomic IO, then all the allocation and unwritten tracking needs to be done in 16kB aligned chunks, not 4kB. That means a 4KB write into an unwritten region or a hole actually needs to zero the rest of the 16KB range it sits within.
>>>
>>> The direct IO code can do this, but it needs extension of the unaligned IO serialisation in XFS (the alignment checks in xfs_file_dio_write()) and the sub-block zeroing in iomap_dio_bio_iter() (the need_zeroing padding has to span the fs allocation size, not the fsblock size) to do this safely.
>>>
>>> Regardless of how we do it, all IO concurrency on this file is shot if we have sub-rtextent sized IOs being done. That is true even with this patch set - XFS_BMAPI_ZERO is done whilst holding the XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the zeroing is being done.
>>>
>>> IOWs, anything to do with sub-rtextent IO really has to be treated like sub-fsblock DIO - i.e. exclusive inode access until the sub-rtextent zeroing has been completed.
>>
>> I do understand that this is not perfect in that we may have mixed block sizes being written, but I don't think that we should disallow it and throw an error.
>
> Ummmm, did you read what you quoted?
>
> The above is an outline of the IO path modifications that will allow mixed IO sizes to be used with atomic writes without requiring the XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment zeroing out to the existing DIO sub-block zeroing, hence ensuring that we only ever convert unwritten extents on max sized atomic write boundaries for atomic write enabled inodes.

ok, I get this idea. And, indeed, it does sound better than the XFS_BMAPI_ZERO proposal.

> At no point have I said "no mixed writes".

For sure

> I've said no to the XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue that it works around and given you a decent amount of detail on how to sanely implement mixed write support that will work (slowly) with those configurations and IO patterns.
>
> So it's your choice - you can continue to believe I don't want mixed writes to work at all, or you can go back and try to understand the IO path changes I've suggested that will allow mixed atomic writes to work as well as they possibly can....

Ack. Much appreciated,
John
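Since the limits can move under the filesystem as John describes, an application would query them at runtime rather than baking them in; a sketch using the statx interface proposed earlier in this series (STATX_WRITE_ATOMIC and the stx_atomic_write_unit_* fields are from those patches, not yet upstream):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Sketch: query the current atomic write limits for a file. The
 * STATX_WRITE_ATOMIC mask and stx_atomic_write_unit_min/max fields
 * are proposed by this patch series.
 */
int main(int argc, char **argv)
{
	struct statx stx;

	if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx) < 0) {
		perror("statx");
		return 1;
	}
	printf("atomic write unit min/max: %u/%u bytes\n",
	       stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max);
	return 0;
}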
On Mon, Feb 05, 2024 at 01:36:03PM +0000, John Garry wrote:
> On 02/02/2024 18:47, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
> > > Ensure that when creating a mapping that we adhere to all the atomic write rules.
> > >
> > > We check that the mapping covers the complete range of the write to ensure that we'll be just creating a single mapping.
> > >
> > > Currently minimum granularity is the FS block size, but it should be possible to support lower in future.
> > >
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > > I am setting this as an RFC as I am not sure on the change in xfs_iomap_write_direct() - it gives the desired result AFAICS.
> > >
> > >  fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 41 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 18c8f168b153..758dc1c90a42 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
> > >  		}
> > >  	}
> > > +	if (xfs_inode_atomicwrites(ip))
> > > +		bmapi_flags = XFS_BMAPI_ZERO;
> >
> > Why do we want to write zeroes to the disk if we're allocating space even if we're not sending an atomic write?
> >
> > (This might want an explanation for why we're doing this at all -- it's to avoid unwritten extent conversion, which defeats hardware untorn writes.)
>
> It's to handle the scenario where we have a partially written extent, and then try to issue an atomic write which covers the complete extent. In this scenario, the iomap code will issue 2x IOs, which is unacceptable. So we ensure that the extent is completely written whenever we allocate it. At least that is my idea.
>
> > I think we should support IOCB_ATOMIC when the mapping is unwritten -- the data will land on disk in an untorn fashion, the unwritten extent conversion on IO completion is itself atomic, and callers still have to set O_DSYNC to persist anything.
>
> But does this work for the scenario above?
>
> > Then we can avoid the cost of BMAPI_ZERO, because double-writes aren't free.
>
> About double-writes not being free, I thought that this was acceptable to just have this zeroing write when initially allocating the extent, as it should not add too much overhead in practice, i.e. it's a one-off.
>
> > > +
> > >  	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
> > >  			rblocks, force, &tp);
> > >  	if (error)
> > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
> > >  	if (error)
> > >  		goto out_unlock;
> > > +	if (flags & IOMAP_ATOMIC) {
> > > +		xfs_filblks_t	unit_min_fsb, unit_max_fsb;
> > > +		unsigned int	unit_min, unit_max;
> > > +
> > > +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> > > +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> > > +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> > > +
> > > +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if ((offset & mp->m_blockmask) ||
> > > +		    (length & mp->m_blockmask)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if (imap.br_blockcount == unit_min_fsb ||
> > > +		    imap.br_blockcount == unit_max_fsb) {
> > > +			/* ok if exactly min or max */
> > > +		} else if (imap.br_blockcount < unit_min_fsb ||
> > > +			   imap.br_blockcount > unit_max_fsb) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		} else if (!is_power_of_2(imap.br_blockcount)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if (imap.br_startoff &&
> > > +		    imap.br_startoff & (imap.br_blockcount - 1)) {
> >
> > Not sure why we care about the file position, it's br_startblock that gets passed into the bio, not br_startoff.
>
> We just want to ensure that the length of the write is valid w.r.t. the offset within the extent, and br_startoff would be the offset within the aligned extent.

Yes, I understand what br_startoff is, but this doesn't help me understand why this code is necessary. Let's say you have a device that supports untorn writes of 16k in length provided the LBA of the write command is also aligned to 16k, and the fs has 4k blocks.

Userspace issues a 16k untorn write at offset 13k in the file, and gets this mapping:

[startoff: 13k, startblock: 16k, blockcount: 16k]

Why should this IO be rejected? The physical space extent satisfies the alignment requirements of the underlying device, and the logical file space extent does not need aligning at all.

> > I'm also still not convinced that any of this validation is useful here. The block device stack underneath the filesystem can change at any time without any particular notice to the fs, so the only way to find out if the proposed IO would meet the alignment constraints is to submit_bio and see what happens.
>
> I am not sure what submit_bio() would do differently. If the block device is changing underneath the block layer, then that is where these things need to be checked.

Agreed.

> > (The "one bio per untorn write request" thing in the direct-io.c patch sounds sane to me though.)
>
> ok
>
> Thanks,
> John
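To make Darrick's example concrete: a check against the device constraint would look only at the physical mapping. A sketch of what such a device-alignment check might look like (a hypothetical helper, not the patch's validation code):

/*
 * Sketch: the hardware only sees the physical LBA, so it is
 * br_startblock, not br_startoff, that has to satisfy the device's
 * alignment constraint. awu_max is the device atomic write unit max
 * in bytes.
 */
static bool
xfs_imap_device_aligned(
	struct xfs_mount	*mp,
	struct xfs_bmbt_irec	*imap,
	unsigned int		awu_max)
{
	xfs_extlen_t		unit_fsb = XFS_B_TO_FSBT(mp, awu_max);

	return !(imap->br_startblock % unit_fsb) &&
	       !(imap->br_blockcount % unit_fsb);
}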
On Fri, Feb 09, 2024 at 12:47:38PM +0000, John Garry wrote:
> > > > Why should we jump through crazy hoops to try to make filesystems optimised for large IOs work with mismatched, overlapping small atomic writes?
> > >
> > > As mentioned, typically the atomic writes will be the same size, but we may have other writes of smaller size.
> >
> > Then we need the tiny write to allocate and zero according to the maximum sized atomic write bounds. Then we just don't care about large atomic IO overlapping small IO, because the extent on disk aligned to the large atomic IO is then always guaranteed to be the correct size and shape.
>
> I think it's worth mentioning that there is currently a separation between how we configure the FS extent size for atomic writes and what the bdev can actually support in terms of atomic writes.

And that's part of what is causing all the issues here - we're trying to jump through hoops at the fs level to handle cases that the device doesn't support and vice versa.

> Setting the rtvol extsize at mkfs time or enabling atomic writes with FS_XFLAG_ATOMICWRITES doesn't check what the underlying bdev can do in terms of atomic writes.

Which is wrong. mkfs.xfs gets physical information about the volume from the kernel and builds the filesystem according to that information. That's how we do stripe alignment, sector sizing, etc. Atomic write support and setting up alignment constraints should be no different.

Yes, mkfs allows the user to override the hardware configs it probes, but it also warns when the override is doing something sub-optimal (like aligning all AG headers to the same disk in a stripe).

IOWs, mkfs should be pulling this atomic write info from the hardware and configuring the filesystem around that information. That's the target we should be aiming the kernel implementation at and optimising for - a filesystem that is correctly configured according to published hardware capability.

Everything else is in the "make it behave correctly, but we don't care if it's slow" category.

> This check is not done as what the bdev can do in terms of atomic writes is not fixed - or, more specifically, what the request_queue reports is not fixed. There are things which can change this. For example, a FW update could change all the atomic write capabilities of a disk. Or even if we swapped a SCSI disk into another host, the atomic write limits may change, as the atomic write unit max depends on the SCSI HBA DMA limits. Changing BIO_MAX_VECS - which could come from a kernel update - could also change what we report as the atomic write limit in the request queue.

If that sort of thing happens, then that's too bad. We already have these sorts of "do not do if you care about performance" constraints. e.g. don't do a RAID restripe that changes the alignment/size of the RAID device (e.g. add a single disk and make the stripe width wider) because the physical filesystem structure will no longer be aligned to the underlying hardware. Instead, you have to grow striped volumes with compatible stripes in compatible sizes to ensure the filesystem remains aligned to the storage...

We don't try to cater for every single possible permutation of storage hardware configurations - that way lies madness. Design and optimise for the common case of correctly configured and well behaved storage, and everything else we just don't care about beyond "don't corrupt or lose data".

> > > > And therein lies the problem.
> > > >
> > > > If you are doing sub-rtextent IO at all, then you are forcing the filesystem down the path of explicitly using unwritten extents and requiring O_DSYNC direct IO to do journal flushes in IO completion context, and then performance just goes downhill from there.
> > > >
> > > > The requirement for unwritten extents to track sub-rtextsize written regions is what you're trying to work around with XFS_BMAPI_ZERO so that atomic writes will always see "atomic write aligned" allocated regions.
> > > >
> > > > Do you see the problem here? You've explicitly told the filesystem that allocation is aligned to 64kB chunks, then because the filesystem block size is 4kB, it's allowed to track unwritten regions at 4kB boundaries. Then you do 4kB aligned file IO, which then changes unwritten extents at 4kB boundaries. Then you do an overlapping 16kB IO that *requires* 16kB allocation alignment, and things go BOOM.
> > > >
> > > > Yes, they should go BOOM.
> > > >
> > > > This is a horrible configuration - it is incompatible with 16kB aligned and sized atomic IO.
> > >
> > > Just because the DB may do 16KB atomic writes most of the time should not disallow it from any other form of writes.
> >
> > That's not what I said. I said that using sub-rtextsize atomic writes with single FSB unwritten extent tracking is horrible and incompatible with doing 16kB atomic writes.
> >
> > This setup will not work at all well with your patches and should go BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup has uncoordinated extent allocation and unwritten conversion granularity.
> >
> > That's the fundamental design problem with your approach - it allows unwritten conversion at *minimum IO sizes* and that does not work with atomic IOs with larger alignment requirements.
> >
> > The fundamental design principle is this: for maximally sized atomic writes to always succeed, we require every allocation, zeroing and unwritten conversion operation to use alignments and sizes that are compatible with the maximum atomic write sizes being used.
>
> That sounds fine.
>
> My question then is how we determine this max atomic write size granularity.
>
> We don't explicitly tell the FS what atomic write size we want for a file. Rather we mkfs with some extsize value which should match our atomic write maximal value and then tell the FS we want to do atomic writes on a file, and if this is accepted then we can query the atomic write min and max unit size, and this would be [FS block size, min(bdev atomic write limit, rtextsize)].
>
> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic write support. So then we could use rtextsize as this max atomic write size.

Maybe, but I think continuing to focus on this as 'what atomic writes require' is wrong.

The filesystem does not care about atomic writes. What it cares about is the allocation constraints that need to be implemented. That constraint is that all BMBT extent operations need to be aligned to a specific extent size, not filesystem blocks.

The current extent size hint (and rtextsize) only impact the -allocation of extents-. They are not directly placing constraints on the BMBT layout, they are placing constraints on the free space search that the allocator runs on the BNO/CNT btrees to select an extent that is then inserted into the BMBT.

The problem is that unwritten extent conversion, truncate, hole punching, etc also all need to be correctly aligned for files that are configured to support atomic writes. These operations place constraints on how the BMBT can modify the existing extent list.

These are different constraints from those that rtextsize/extszhint apply, and that's the fundamental behavioural difference between existing extent size hint behaviour and the behaviour needed by atomic writes.

> But I am not 100% sure that that is your idea (apologies if I am wrong - I am sincerely trying to follow your idea), but rather it would be min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and the bdev atomic write limit is 16KB, then there is not much point in dealing in 1MB blocks for this unwritten extent conversion alignment.

Exactly my point - there really is no relationship between rtextsize and atomic write constraints, and it is a mistake to use rtextsize as it stands as a placeholder for atomic write constraints.

> If so, then my concern is that the bdev atomic write upper limit is not fixed. This can be solved, but I would still like to be clear on this max atomic write size.

Right, we haven't clearly defined how XFS is supposed to align BMBT operations in a way that is compatible for atomic write operations.

What the patchset does is try to extend and infer things from existing allocation alignment constraints, but then not apply those constraints to critical extent state operations (pure BMBT modifications) that atomic writes also need constrained to work correctly and efficiently.

> > i.e. atomic writes need to use max write size granularity for all IO operations, not filesystem block granularity.
> >
> > And that also means things like rtextsize and extsize hints need to match these atomic write requirements, too....
>
> As above, I am not 100% sure if you mean these to be the atomic write maximal value.

Yes, they either need to be the same as the atomic write max value or, much better, once a hint has been set, the resultant size is then aligned up to be compatible with the atomic write constraints.

e.g. set an inode extent size hint of 960kB on a device with 128kB atomic write capability. If the inode has the atomic write flag set, then allocations need to round the extent size up from 960kB to 1MB so that the BMBT extent layout and alignment is compatible with 128kB atomic writes.

At this point, zeroing, punching, unwritten extent conversion, etc also needs to be done on aligned 128kB ranges to be compatible with atomic writes, rather than the filesystem block boundaries that would normally be used if just the extent size hint was set.

This is effectively what the original "force align" inode flag bit did - it told the inode that all BMBT manipulations need to be extent size hint aligned, not just allocation. It's a generic flag that implements specific extent manipulation constraints that happen to be compatible with atomic writes when the right extent size hint is set.

So at this point, I'm not sure that we should have an "atomic writes" flag in the inode. We need to tell BMBT modifications to behave in a particular way - forced alignment - not that atomic writes are being used in the filesystem....

At this point, the filesystem can do all the extent modification alignment stuff that atomic writes require without caring if the block device supports atomic writes or even if the application is using atomic writes.
This means we can test the BMBT functionality in fstests without requiring hardware (or emulation) that supports atomic writes - all we need to do is set the forced align flag, an extent size hint and go run fsx on it... -Dave.
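Dave's 960kB example above reduces to a single rounding step at allocation time; a sketch (the helper is hypothetical and units are filesystem blocks):

/*
 * Sketch of the rounding described above: align a user-supplied
 * extent size hint up to the atomic write unit, e.g. a 960kB hint
 * with a 128kB atomic write unit becomes 1MB.
 */
static xfs_extlen_t
xfs_align_extsize_hint(
	xfs_extlen_t	extsize_fsb,	/* from the extent size hint */
	xfs_extlen_t	awu_max_fsb)	/* atomic write unit max */
{
	return roundup(extsize_fsb, awu_max_fsb);
}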
>> Setting the rtvol extsize at mkfs time or enabling atomic writes with FS_XFLAG_ATOMICWRITES doesn't check what the underlying bdev can do in terms of atomic writes.
>
> Which is wrong. mkfs.xfs gets physical information about the volume from the kernel and builds the filesystem according to that information. That's how we do stripe alignment, sector sizing, etc. Atomic write support and setting up alignment constraints should be no different.

Yes, I was just looking at adding a mkfs option to format for atomic writes, which would check physical information about the volume and whether it suits rtextsize, and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

> Yes, mkfs allows the user to override the hardware configs it probes, but it also warns when the override is doing something sub-optimal (like aligning all AG headers to the same disk in a stripe).
>
> IOWs, mkfs should be pulling this atomic write info from the hardware and configuring the filesystem around that information. That's the target we should be aiming the kernel implementation at and optimising for - a filesystem that is correctly configured according to published hardware capability.

Right

So, for example, if the atomic writes option is set and the rtextsize set by the user is so much larger than what HW can support in terms of atomic writes, then we should let the user know about this.

> Everything else is in the "make it behave correctly, but we don't care if it's slow" category.
>
>> This check is not done as what the bdev can do in terms of atomic writes is not fixed - or, more specifically, what the request_queue reports is not fixed. There are things which can change this. For example, a FW update could change all the atomic write capabilities of a disk. Or even if we swapped a SCSI disk into another host, the atomic write limits may change, as the atomic write unit max depends on the SCSI HBA DMA limits. Changing BIO_MAX_VECS - which could come from a kernel update - could also change what we report as the atomic write limit in the request queue.
>
> If that sort of thing happens, then that's too bad. We already have these sorts of "do not do if you care about performance" constraints. e.g. don't do a RAID restripe that changes the alignment/size of the RAID device (e.g. add a single disk and make the stripe width wider) because the physical filesystem structure will no longer be aligned to the underlying hardware. Instead, you have to grow striped volumes with compatible stripes in compatible sizes to ensure the filesystem remains aligned to the storage...
>
> We don't try to cater for every single possible permutation of storage hardware configurations - that way lies madness. Design and optimise for the common case of correctly configured and well behaved storage, and everything else we just don't care about beyond "don't corrupt or lose data".

ok

>>>>> And therein lies the problem.
>>>>> ...
>>
>> That sounds fine.
>>
>> My question then is how we determine this max atomic write size granularity.
>>
>> We don't explicitly tell the FS what atomic write size we want for a file. Rather we mkfs with some extsize value which should match our atomic write maximal value and then tell the FS we want to do atomic writes on a file, and if this is accepted then we can query the atomic write min and max unit size, and this would be [FS block size, min(bdev atomic write limit, rtextsize)].
>>
>> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic write support. So then we could use rtextsize as this max atomic write size.
>
> Maybe, but I think continuing to focus on this as 'what atomic writes require' is wrong.
>
> The filesystem does not care about atomic writes. What it cares about is the allocation constraints that need to be implemented. That constraint is that all BMBT extent operations need to be aligned to a specific extent size, not filesystem blocks.
>
> The current extent size hint (and rtextsize) only impact the -allocation of extents-. They are not directly placing constraints on the BMBT layout, they are placing constraints on the free space search that the allocator runs on the BNO/CNT btrees to select an extent that is then inserted into the BMBT.
>
> The problem is that unwritten extent conversion, truncate, hole punching, etc also all need to be correctly aligned for files that are configured to support atomic writes. These operations place constraints on how the BMBT can modify the existing extent list.
>
> These are different constraints from those that rtextsize/extszhint apply, and that's the fundamental behavioural difference between existing extent size hint behaviour and the behaviour needed by atomic writes.
>
>> But I am not 100% sure that that is your idea (apologies if I am wrong - I am sincerely trying to follow your idea), but rather it would be min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and the bdev atomic write limit is 16KB, then there is not much point in dealing in 1MB blocks for this unwritten extent conversion alignment.
>
> Exactly my point - there really is no relationship between rtextsize and atomic write constraints, and it is a mistake to use rtextsize as it stands as a placeholder for atomic write constraints.

ok

>> If so, then my concern is that the bdev atomic write upper limit is not fixed. This can be solved, but I would still like to be clear on this max atomic write size.
>
> Right, we haven't clearly defined how XFS is supposed to align BMBT operations in a way that is compatible for atomic write operations.
>
> What the patchset does is try to extend and infer things from existing allocation alignment constraints, but then not apply those constraints to critical extent state operations (pure BMBT modifications) that atomic writes also need constrained to work correctly and efficiently.

Right. Previously I also did mention that we could explicitly request the atomic write size per-inode, but a drawback is that this would require an on-disk format change.

>>> i.e. atomic writes need to use max write size granularity for all IO operations, not filesystem block granularity.
>>>
>>> And that also means things like rtextsize and extsize hints need to match these atomic write requirements, too....
>>
>> As above, I am not 100% sure if you mean these to be the atomic write maximal value.
>
> Yes, they either need to be the same as the atomic write max value or, much better, once a hint has been set, the resultant size is then aligned up to be compatible with the atomic write constraints.
>
> e.g. set an inode extent size hint of 960kB on a device with 128kB atomic write capability. If the inode has the atomic write flag set, then allocations need to round the extent size up from 960kB to 1MB so that the BMBT extent layout and alignment is compatible with 128kB atomic writes.
>
> At this point, zeroing, punching, unwritten extent conversion, etc also needs to be done on aligned 128kB ranges to be compatible with atomic writes, rather than the filesystem block boundaries that would normally be used if just the extent size hint was set.
>
> This is effectively what the original "force align" inode flag bit did - it told the inode that all BMBT manipulations need to be extent size hint aligned, not just allocation. It's a generic flag that implements specific extent manipulation constraints that happen to be compatible with atomic writes when the right extent size hint is set.
>
> So at this point, I'm not sure that we should have an "atomic writes" flag in the inode.

Another motivation for this flag is that we can explicitly enable some software-based atomic write support for an inode when the backing device does not have HW support.

In addition, in this series setting FS_XFLAG_ATOMICWRITES means XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something similar for other OSes, and for those other OSes it may also mean some other special alignment feature enabled. We want a consistent user experience.

> We need to tell BMBT modifications to behave in a particular way - forced alignment - not that atomic writes are being used in the filesystem....
>
> At this point, the filesystem can do all the extent modification alignment stuff that atomic writes require without caring if the block device supports atomic writes or even if the application is using atomic writes.
>
> This means we can test the BMBT functionality in fstests without requiring hardware (or emulation) that supports atomic writes - all we need to do is set the forced align flag, an extent size hint and go run fsx on it...

The current idea was that the forcealign feature would be required for the second phase of atomic write support - non-rtvol support. Darrick did send that series out separately over the New Year's break.

I think that you wanted to progress the following series first: https://lore.kernel.org/linux-xfs/20231004001943.349265-1-david@fromorbit.com/

Right?

Thanks,
John
>>> Not sure why we care about the file position, it's br_startblock that gets passed into the bio, not br_startoff.
>>
>> We just want to ensure that the length of the write is valid w.r.t. the offset within the extent, and br_startoff would be the offset within the aligned extent.
>
> Yes, I understand what br_startoff is, but this doesn't help me understand why this code is necessary. Let's say you have a device that supports untorn writes of 16k in length provided the LBA of the write command is also aligned to 16k, and the fs has 4k blocks.
>
> Userspace issues a 16k untorn write at offset 13k in the file, and gets this mapping:
>
> [startoff: 13k, startblock: 16k, blockcount: 16k]
>
> Why should this IO be rejected?

It's rejected as it does not follow the rules.

> The physical space extent satisfies the alignment requirements of the underlying device, and the logical file space extent does not need aligning at all.

Sure. In this case, we can produce a single BIO and the underlying HW may be able to handle this atomically. The point really is that we want a consistent userspace experience. We say that the write 'must' be naturally aligned, not 'should' be. It's not really useful to the user if sometimes a write passes and sometimes it fails by chance of how the extents happen to be laid out.

Furthermore, in this case, what should the user do if this write at 13K offset fails because the 16K of data straddles 2x extents? They asked for 16K written at offset 13K and they want it done atomically - there is nothing which the FS can do to help. If they don't really need 16K written atomically, then better to just do a regular write, or write individual chunks atomically.

Thanks,
John
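The rule John describes is visible directly at the syscall boundary; a sketch using the RWF_ATOMIC flag this series adds (error handling omitted; the 13K case is the one expected to fail):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>

/*
 * Sketch: a 16K atomic write must be naturally aligned, so the write
 * at offset 16K can be honoured while the one at offset 13K cannot.
 * RWF_ATOMIC is the per-IO flag proposed by this series.
 */
int main(void)
{
	int fd = open("mnt/file", O_RDWR | O_DIRECT);
	struct iovec iov = { .iov_len = 16384 };

	posix_memalign(&iov.iov_base, 16384, 16384);

	pwritev2(fd, &iov, 1, 16384, RWF_ATOMIC);	/* aligned: ok */
	pwritev2(fd, &iov, 1, 13312, RWF_ATOMIC);	/* 13K offset: fails */
	return 0;
}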
On Wed, Feb 14, 2024 at 11:06:11AM +0000, John Garry wrote:
> > > Setting the rtvol extsize at mkfs time or enabling atomic writes with FS_XFLAG_ATOMICWRITES doesn't check what the underlying bdev can do in terms of atomic writes.
> >
> > Which is wrong. mkfs.xfs gets physical information about the volume from the kernel and builds the filesystem according to that information. That's how we do stripe alignment, sector sizing, etc. Atomic write support and setting up alignment constraints should be no different.
>
> Yes, I was just looking at adding a mkfs option to format for atomic writes, which would check physical information about the volume and whether it suits rtextsize, and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

FWIW, atomic writes need to be implemented in XFS in a way that isn't specific to the rtdev. There is no reason they cannot be applied to the data device (via a superblock max atomic write size field or extent size hints and the align flag), so please don't get hung up on rtextsize as the only thing that atomic write alignment might apply to.

> > Yes, mkfs allows the user to override the hardware configs it probes, but it also warns when the override is doing something sub-optimal (like aligning all AG headers to the same disk in a stripe).
> >
> > IOWs, mkfs should be pulling this atomic write info from the hardware and configuring the filesystem around that information. That's the target we should be aiming the kernel implementation at and optimising for - a filesystem that is correctly configured according to published hardware capability.
>
> Right
>
> So, for example, if the atomic writes option is set and the rtextsize set by the user is so much larger than what HW can support in terms of atomic writes, then we should let the user know about this.

Well, this is part of the problem I mention above: you're focussing entirely on the rtdev setup and not the general "atomic writes require BMBT extent alignment constraints".

So, maybe, yes, we might want to warn that the rtextsize is much bigger than the atomic write size of that device, but now there's something else we need to take into account: the rtdev could have a different atomic write size compared to the data device.

What now?

IOWs, focussing on the rtdev misses key considerations for making the functionality generic, and we most definitely don't want to have to rev the on-disk format a second time to support atomic writes for the data device. Hence we likely need two variables for atomic write sizes in the superblock - one for the rtdev, and one for the data device. And this then feeds through to Darrick's multi-rtdev stuff - each rtdev will need to have an attribute that tracks this information.

Actually, now that I think about it, we may require 3 sizes - I'm in the process of writing a design doc for a new journal format, and one of the things I want it to be able to do is use atomic writes to enable journal writes to be decoupled from device sector sizes. If we are using atomic writes > sector size for the journal, then we definitely need to record somewhere in the superblock what the minimum journal IO size is, because it isn't going to be the sector size anymore...

[...]

> > > If so, then my concern is that the bdev atomic write upper limit is not fixed. This can be solved, but I would still like to be clear on this max atomic write size.
> >
> > Right, we haven't clearly defined how XFS is supposed to align BMBT operations in a way that is compatible for atomic write operations.
> >
> > What the patchset does is try to extend and infer things from existing allocation alignment constraints, but then not apply those constraints to critical extent state operations (pure BMBT modifications) that atomic writes also need constrained to work correctly and efficiently.
>
> Right. Previously I also did mention that we could explicitly request the atomic write size per-inode, but a drawback is that this would require an on-disk format change.

We're already having to change the on-disk format for this (inode flag, superblock feature bit), so we really should be trying to make this generic and properly featured so that it can be used for anything that requires physical alignment of file data extents, not just atomic writes...

> > > > i.e. atomic writes need to use max write size granularity for all IO operations, not filesystem block granularity.
> > > >
> > > > And that also means things like rtextsize and extsize hints need to match these atomic write requirements, too....
> > >
> > > As above, I am not 100% sure if you mean these to be the atomic write maximal value.
> >
> > Yes, they either need to be the same as the atomic write max value or, much better, once a hint has been set, the resultant size is then aligned up to be compatible with the atomic write constraints.
> >
> > e.g. set an inode extent size hint of 960kB on a device with 128kB atomic write capability. If the inode has the atomic write flag set, then allocations need to round the extent size up from 960kB to 1MB so that the BMBT extent layout and alignment is compatible with 128kB atomic writes.
> >
> > At this point, zeroing, punching, unwritten extent conversion, etc also needs to be done on aligned 128kB ranges to be compatible with atomic writes, rather than the filesystem block boundaries that would normally be used if just the extent size hint was set.
> >
> > This is effectively what the original "force align" inode flag bit did - it told the inode that all BMBT manipulations need to be extent size hint aligned, not just allocation. It's a generic flag that implements specific extent manipulation constraints that happen to be compatible with atomic writes when the right extent size hint is set.
> >
> > So at this point, I'm not sure that we should have an "atomic writes" flag in the inode.
>
> Another motivation for this flag is that we can explicitly enable some software-based atomic write support for an inode when the backing device does not have HW support.

That's orthogonal to the alignment issue. If the BMBT extents are always aligned in a way that is compatible with atomic writes, we don't need an atomic writes flag to tell the filesystem it should do an atomic write. That comes from userspace via the IOCB_ATOMIC flag.

It is IOCB_ATOMIC that triggers the software fallback when the hardware doesn't support atomic writes, not an inode flag. All the filesystem has to do is guarantee all extent manipulations are correctly aligned and IOCB_ATOMIC can always be executed regardless of whether it is hardware or software that does it.
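A sketch of that model - the atomicity request rides on the IO, and the hardware/software decision is made per write (everything here other than IOCB_ATOMIC is an illustrative name, not this series' code):

/*
 * Sketch only: IOCB_ATOMIC selects the behaviour per IO; no inode
 * flag is consulted for the atomicity decision itself.
 */
static ssize_t
xfs_atomic_write_dispatch(
	struct kiocb		*iocb,
	struct iov_iter		*from)
{
	if (!(iocb->ki_flags & IOCB_ATOMIC))
		return xfs_file_dio_write(iocb, from);

	if (xfs_bdev_supports_atomic(iocb))
		/* hardware path: one untorn bio */
		return xfs_file_dio_write(iocb, from);

	/* software fallback, still requested per IO */
	return xfs_atomic_write_sw_fallback(iocb, from);
}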
> In addition, in this series setting FS_XFLAG_ATOMICWRITES means XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something similar for other OSes, and for those other OSes it may also mean some other special alignment feature enabled. We want a consistent user experience.

I don't care about other OSes - they don't implement the FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS compatibility for the user API.

Fundamentally, atomic writes are *not a property of the filesystem on-disk format*. They require extent tracking constraints (i.e. alignment), and that's the property of the filesystem on-disk format that we need to manipulate here.

Users are not going to care if the setup ioctl for atomic writes is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They already know they have to align their IO properly for atomic writes, so it's not like this is something they will be completely unfamiliar with.

Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize = max_atomic_write_size as a user interface to set up the inode for atomic writes is pretty concise and easy to use. It also isn't specific to atomic writes, and so this functionality can be used for anything that needs aligned extent manipulation for performant functionality.

> > to behave in a particular way - forced alignment - not that atomic writes are being used in the filesystem....
> >
> > At this point, the filesystem can do all the extent modification alignment stuff that atomic writes require without caring if the block device supports atomic writes or even if the application is using atomic writes.
> >
> > This means we can test the BMBT functionality in fstests without requiring hardware (or emulation) that supports atomic writes - all we need to do is set the forced align flag, an extent size hint and go run fsx on it...
>
> The current idea was that the forcealign feature would be required for the second phase of atomic write support - non-rtvol support. Darrick did send that series out separately over the New Year's break.

This is the wrong approach: this needs to be integrated into the same patchset so we can review the changes for atomic writes as a whole, not as two separate, independent on-disk format changes. The on-disk format change that atomic writes need is aligned BMBT extent manipulations, regardless of whether we are targeting the rtdev or the datadev, and trying to separate them is leading you down entirely the wrong path.

-Dave.
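From userspace, Dave's proposed interface would look roughly like this (FS_XFLAG_ALIGN_EXTENTS is a proposed flag from this discussion; FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR, struct fsxattr and FS_XFLAG_EXTSIZE already exist):

#include <linux/fs.h>
#include <sys/ioctl.h>

/*
 * Sketch: configure an inode for atomic writes via the existing
 * fsxattr interface. FS_XFLAG_ALIGN_EXTENTS is the proposed flag;
 * max_awu would come from querying the device/filesystem limits.
 */
static int setup_atomic_writes(int fd, unsigned int max_awu)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;

	fsx.fsx_xflags |= FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = max_awu;	/* e.g. 128 * 1024 */

	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}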
>> >> Yes, I was just looking at adding a mkfs option to format for atomic writes, >> which would check physical information about the volume and whether it suits >> rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES. > > FWIW, atomic writes need to be implemented in XFS in a way that > isn't specific to the rtdev. There is no reason they cannot be > applied to the data device (via superblock max atomic write size > field or extent size hints and the align flag) so > please don't get hung up on rtextsize as the only thing that atomic > write alignment might apply to. Sure > >>> Yes, mkfs allows the user to override the hardware configsi it >>> probes, but it also warns when the override is doing something >>> sub-optimal (like aligning all AG headers to the same disk in a >>> stripe). >>> >>> IOWs, mkfs should be pulling this atomic write info from the >>> hardware and configuring the filesysetm around that information. >>> That's the target we should be aiming the kernel implementation at >>> and optimising for - a filesystem that is correctly configured >>> according to published hardware capability. >> >> Right >> >> So, for example, if the atomic writes option is set and the rtextsize set by >> the user is so much larger than what HW can support in terms of atomic >> writes, then we should let the user know about this. > > Well, this is part of the problem I mention above: you're focussing > entirely on the rtdev setup and not the general "atomic writes > require BMBT extent alignment constraints". I'm really just saying what I was considering based on this series only. > > So, maybe, yes, we might want to warn that the rtextsize is much > bigger than the atomic write size of that device, but now there's > something else we need to take into account: the rtdev could have a > different atomic write size comxpapred to the data device. > > What now? > > IOWs, focussing on the rtdev misses key considerations for making > the functionality generic, and we most definitely don't want to have > to rev the on disk format a second time to support atomic writes > for the data device. Hence we likely need two variables for atomic > write sizes in the superblock - one for the rtdev, and one for the > data device. And this then feeds through to Darrick's multi-rtdev > stuff - each rtdev will need to have an attribute that tracks this > information. ok >>> >>> What the patchset does is try to extend and infer things from >>> existing allocation alignment constraints, but then not apply those >>> constraints to critical extent state operations (pure BMBT >>> modifications) that atomic writes also need constrained to work >>> correctly and efficiently. >> >> Right. Previously I also did mention that we could explicitly request the >> atomic write size per-inode, but a drawback is that this would require an >> on-disk format change. > > We're already having to change the on-disk format for this (inode > flag, superblock feature bit), so we really should be trying to make > this generic and properly featured so that it can be used for > anything that requires physical alignment of file data extents, not > just atomic writes... ok ... >> Another motivation for this flag is that we can explicitly enable some >> software-based atomic write support for an inode when the backing device >> does not have HW support. > > That's orthogonal to the aligment issue. 
>>>
>>> What the patchset does is try to extend and infer things from
>>> existing allocation alignment constraints, but then not apply those
>>> constraints to critical extent state operations (pure BMBT
>>> modifications) that atomic writes also need constrained to work
>>> correctly and efficiently.
>>
>> Right. Previously I also did mention that we could explicitly request the
>> atomic write size per-inode, but a drawback is that this would require an
>> on-disk format change.
>
> We're already having to change the on-disk format for this (inode
> flag, superblock feature bit), so we really should be trying to make
> this generic and properly featured so that it can be used for
> anything that requires physical alignment of file data extents, not
> just atomic writes...

ok

...

>> Another motivation for this flag is that we can explicitly enable some
>> software-based atomic write support for an inode when the backing device
>> does not have HW support.
>
> That's orthogonal to the alignment issue. If the BMBT extents are
> always aligned in a way that is compatible with atomic writes, we
> don't need an atomic writes flag to tell the filesystem it should
> do an atomic write.

Any instruction to do an atomic write should be encoded in the
userspace write operation. Or maybe the file open operation in future -
I still get questions about O_ATOMIC.

> That comes from userspace via the IOCB_ATOMIC
> flag.
>
> It is the IOCB_ATOMIC that triggers the software fallback when the
> hardware doesn't support atomic writes, not an inode flag.

To me, any such fallback seems like something which we should be
explicitly enabling.

> All the
> filesystem has to do is guarantee all extent manipulations are
> correctly aligned and IOCB_ATOMIC can always be executed regardless
> of whether it is hardware or software that does it.
>
>> In addition, in this series setting FS_XFLAG_ATOMICWRITES means
>> XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
>> similar for other OSes, and for those other OSes it may also mean some other
>> special alignment feature enabled. We want a consistent user experience.
>
> I don't care about other OSes - they don't implement the
> FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
> compatibility for the user API.

Other FSes need to support FS_IOC_FSSETXATTR for atomic writes. Or at
least should support it....

>
> Fundamentally, atomic writes are *not a property of the filesystem
> on-disk format*. They require extent tracking constraints (i.e.
> alignment), and that's the property of the filesystem on-disk format
> that we need to manipulate here.
>
> Users are not going to care if the setup ioctl for atomic writes
> is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They
> already know they have to align their IO properly for atomic writes,
> so it's not like this is something they will be completely
> unfamiliar with.
>
> Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
> = max_atomic_write_size as a user interface to set up the inode for
> atomic writes is pretty concise and easy to use. It also isn't
> specific to atomic writes, and so this functionality can be used for
> anything that needs aligned extent manipulation for performance.

This is where there seems to be a difference between how you would like
atomic writes to be enabled and how Christoph would, judging by this:
https://lore.kernel.org/linux-fsdevel/20240110091929.GA31003@lst.de/

I think that if the FS and the userspace util can between them figure
out what to do, then that is ok.

This is something like what I proposed previously:

xfs_io -c "atomic-writes 64K" mnt/file

where the userspace util would use for the FS_IOC_FSSETXATTR ioctl:
FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/
fsx.fsx_extsize

There is a question on the purpose of the FS_XFLAG_ATOMIC_WRITES flag,
but, to me, it does seem useful for future feature support. We could just
have FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize, and
the kernel auto-enables FS_XFLAG_ALIGN_EXTENTS, but the other way seems
better.
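Tying the above together, the per-IO trigger (IOCB_ATOMIC) would come
from a write submitted with the RWF_ATOMIC flag proposed in this
series. A sketch of the userspace side, with the flag value assumed
from the series rather than a settled ABI:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* assumed value, from this series */
#endif

/*
 * Submit one untorn write. @len and @off must conform to the
 * advertised atomic write unit min/max and alignment constraints,
 * otherwise the kernel fails the IO.
 */
static ssize_t atomic_pwrite(int fd, const void *buf, size_t len,
			     off_t off)
{
	struct iovec iov = {
		.iov_base = (void *)buf,
		.iov_len = len,
	};

	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}

Note there is no per-inode knob anywhere in the IO path itself - which
is the crux of the disagreement above about what FS_XFLAG_ATOMIC_WRITES
is actually for.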
>
>>> to behave in a particular way - forced alignment - not that atomic
>>> writes are being used in the filesystem....
>>>
>>> At this point, the filesystem can do all the extent modification
>>> alignment stuff that atomic writes require without caring if the
>>> block device supports atomic writes or even if the
>>> application is using atomic writes.
>>>
>>> This means we can test the BMBT functionality in fstests without
>>> requiring hardware (or emulation) that supports atomic writes - all
>>> we need to do is set the forced align flag, an extent size hint and
>>> go run fsx on it...
>>>
>>
>> The current idea was that the forcealign feature would be required for the
>> second phase for atomic write support - non-rtvol support. Darrick did send
>> that series out separately over the New Year's break.
>
> This is the wrong approach: this needs to be integrated into the
> same patchset so we can review the changes for atomic writes as a
> whole, not as two separate, independent on-disk format changes. The
> on-disk format change that atomic writes need is aligned BMBT extent
> manipulations, regardless of whether we are targeting the rtdev or
> the datadev, and trying to separate them is leading you down
> entirely the wrong path.
>

ok, fine.

BTW, going back to the original discussion on the extent zeroing and
your idea to do this in the iomap dio path, my impression is that we
require changes like this (a sketch of the first item follows below):

- in iomap_dio_bio_iter(), we need to zero out the extent according to
  the extsize (and not just the FS block size)
- xfs_dio_write_end_io() -> xfs_iomap_write_unwritten() also needs to
  consider the whole extent being written, and not assume a single FS
  block
- per-inode locking similar to what is done in
  xfs_file_dio_write_unaligned() - I need to check that further, though,
  as I am not yet sure how we decide whether to use this exclusive lock

Does this sound sane?

Thanks,
John
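A self-contained sketch of the arithmetic behind the first item above:
the zeroing range rounds out to the extent size hint rather than the
FS block size. This only illustrates the granularity change, not the
actual iomap code:

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

/*
 * Round a write at [pos, pos + len) out to @gran-sized units; this is
 * the range whose uncovered head/tail would need zeroing. @gran must
 * be a power of two.
 */
static void zero_range(uint64_t pos, uint64_t len, uint64_t gran,
		       uint64_t *start, uint64_t *end)
{
	*start = pos & ~(gran - 1);
	*end = (pos + len + gran - 1) & ~(gran - 1);
}

int main(void)
{
	uint64_t start, end;

	/* today: zeroing is bounded by the FS block size (e.g. 4K) */
	zero_range(8192, 4096, 4096, &start, &end);
	printf("block granularity:   zero [%" PRIu64 ", %" PRIu64 ")\n",
	       start, end);

	/*
	 * with forcealign: bounded by the extent size hint (e.g. 64K),
	 * so a 4K write into a fresh 64K extent zeroes the other 60K
	 */
	zero_range(8192, 4096, 65536, &start, &end);
	printf("extsize granularity: zero [%" PRIu64 ", %" PRIu64 ")\n",
	       start, end);

	return 0;
}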