Message ID: 20250310183946.932054-4-john.g.garry@oracle.com (mailing list archive)
State: New
Series: large atomic writes for xfs with CoW
On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> Refactor xfs_reflink_end_cow_extent() into separate parts which process
> the CoW range and commit the transaction.
>
> This refactoring will be used in future for when it is required to commit
> a range of extents as a single transaction, similar to how it was done
> pre-commit d6f215f359637.

Darrick pointed out that if you do more than just a tiny number of
extents per transaction you run out of log reservations very quickly
here:

https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/

how does your scheme deal with that?
On 12/03/2025 07:24, Christoph Hellwig wrote: > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote: >> Refactor xfs_reflink_end_cow_extent() into separate parts which process >> the CoW range and commit the transaction. >> >> This refactoring will be used in future for when it is required to commit >> a range of extents as a single transaction, similar to how it was done >> pre-commit d6f215f359637. > > Darrick pointed out that if you do more than just a tiny number > of extents per transactions you run out of log reservations very > quickly here: > > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$ > > how does your scheme deal with that? > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this, right? Or does the log reservation have a hard size limit, regardless of that calculation? Thanks, John
On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> On 12/03/2025 07:24, Christoph Hellwig wrote:
> > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > the CoW range and commit the transaction.
> > >
> > > This refactoring will be used in future for when it is required to commit
> > > a range of extents as a single transaction, similar to how it was done
> > > pre-commit d6f215f359637.
> >
> > Darrick pointed out that if you do more than just a tiny number
> > of extents per transactions you run out of log reservations very
> > quickly here:
> >
> > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$
> >
> > how does your scheme deal with that?
> >
> The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> right? Or does the log reservation have a hard size limit, regardless of
> that calculation?

The resblks calculated there are the reserved disk blocks and have
nothing to do with the log reservation, which comes from the tr_write
field passed in. There is some kind of upper limit to it, obviously set
by the log size, although I'm not sure if we've formalized that
somewhere. Dave might be the right person to ask about that.
On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote: > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote: > > On 12/03/2025 07:24, Christoph Hellwig wrote: > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote: > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process > > > > the CoW range and commit the transaction. > > > > > > > > This refactoring will be used in future for when it is required to commit > > > > a range of extents as a single transaction, similar to how it was done > > > > pre-commit d6f215f359637. > > > > > > Darrick pointed out that if you do more than just a tiny number > > > of extents per transactions you run out of log reservations very > > > quickly here: > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$ > > > > > > how does your scheme deal with that? > > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this, > > right? Or does the log reservation have a hard size limit, regardless of > > that calculation? > > The resblks calculated there are the reserved disk blocks and have > nothing to do with the log reservations, which comes from the > tr_write field passed in. There is some kind of upper limited to it > obviously by the log size, although I'm not sure if we've formalized > that somewhere. Dave might be the right person to ask about that. The (very very rough) upper limit for how many intent items you can attach to a tr_write transaction is: per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size) max_blocks = tr_write::tr_logres / per_extent_cost (ili_size is the inode log item size) ((I would halve that for the sake of paranoia)) since you have to commit all those intent items into the first transaction in the chain. The difficulty we've always had is computing the size of an intent item in the ondisk log, since that's a (somewhat minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI, but then there' could be overhead for the ondisk log headers themselves. Maybe we ought to formalize the computation of that since reap.c also has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub transaction periodically... because I'd prefer we not add another hardcoded limit. My guess is that the software fallback can probably support any awu_max that a hardware wants to throw at us, but let's actually figure out the min(sw, hw) that we can support and cap it at that. --D
On 12/03/2025 15:46, Darrick J. Wong wrote: > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote: >> On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote: >>> On 12/03/2025 07:24, Christoph Hellwig wrote: >>>> On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote: >>>>> Refactor xfs_reflink_end_cow_extent() into separate parts which process >>>>> the CoW range and commit the transaction. >>>>> >>>>> This refactoring will be used in future for when it is required to commit >>>>> a range of extents as a single transaction, similar to how it was done >>>>> pre-commit d6f215f359637. >>>> >>>> Darrick pointed out that if you do more than just a tiny number >>>> of extents per transactions you run out of log reservations very >>>> quickly here: >>>> >>>> https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$ >>>> >>>> how does your scheme deal with that? >>>> >>> The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this, >>> right? Or does the log reservation have a hard size limit, regardless of >>> that calculation? >> >> The resblks calculated there are the reserved disk blocks and have >> nothing to do with the log reservations, which comes from the >> tr_write field passed in. There is some kind of upper limited to it >> obviously by the log size, although I'm not sure if we've formalized >> that somewhere. Dave might be the right person to ask about that. > > The (very very rough) upper limit for how many intent items you can > attach to a tr_write transaction is: > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size) > max_blocks = tr_write::tr_logres / per_extent_cost > > (ili_size is the inode log item size) So will it be something like this: static size_t xfs_compute_awu_max_extents( struct xfs_mount *mp) { struct xfs_trans_res *resp = &M_RES(mp)->tr_write; size_t logtotal = xfs_bui_log_format_sizeof(1)+ xfs_cui_log_format_sizeof(1) + xfs_efi_log_format_sizeof(1) + xfs_rui_log_format_sizeof(1) + sizeof(struct xfs_inode_log_format); return rounddown_pow_of_two(resp->tr_logres / logtotal); } static inline void xfs_compute_awu_max( struct xfs_mount *mp, int jjcount) { .... mp->m_awu_max = min_t(unsigned int, awu_max, xfs_compute_awu_max_extents(mp)); } > > ((I would halve that for the sake of paranoia)) > > since you have to commit all those intent items into the first > transaction in the chain. The difficulty we've always had is computing > the size of an intent item in the ondisk log, since that's a (somewhat > minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI, > but then there' could be overhead for the ondisk log headers themselves. > > Maybe we ought to formalize the computation of that since reap.c also > has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub > transaction periodically... because I'd prefer we not add another > hardcoded limit. My guess is that the software fallback can probably > support any awu_max that a hardware wants to throw at us, but let's > actually figure out the min(sw, hw) that we can support and cap it at > that. > > --D
On Wed, Mar 12, 2025 at 10:06:11PM +0000, John Garry wrote: > On 12/03/2025 15:46, Darrick J. Wong wrote: > > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote: > > > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote: > > > > On 12/03/2025 07:24, Christoph Hellwig wrote: > > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote: > > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process > > > > > > the CoW range and commit the transaction. > > > > > > > > > > > > This refactoring will be used in future for when it is required to commit > > > > > > a range of extents as a single transaction, similar to how it was done > > > > > > pre-commit d6f215f359637. > > > > > > > > > > Darrick pointed out that if you do more than just a tiny number > > > > > of extents per transactions you run out of log reservations very > > > > > quickly here: > > > > > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$ > > > > > > > > > > how does your scheme deal with that? > > > > > > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this, > > > > right? Or does the log reservation have a hard size limit, regardless of > > > > that calculation? > > > > > > The resblks calculated there are the reserved disk blocks and have > > > nothing to do with the log reservations, which comes from the > > > tr_write field passed in. There is some kind of upper limited to it > > > obviously by the log size, although I'm not sure if we've formalized > > > that somewhere. Dave might be the right person to ask about that. > > > > The (very very rough) upper limit for how many intent items you can > > attach to a tr_write transaction is: > > > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size) > > max_blocks = tr_write::tr_logres / per_extent_cost > > > > (ili_size is the inode log item size) > > So will it be something like this: > > static size_t > xfs_compute_awu_max_extents( > struct xfs_mount *mp) > { > struct xfs_trans_res *resp = &M_RES(mp)->tr_write; > size_t logtotal = xfs_bui_log_format_sizeof(1)+ Might want to call it "per_extent_logres" since that's what it is. > xfs_cui_log_format_sizeof(1) + > xfs_efi_log_format_sizeof(1) + > xfs_rui_log_format_sizeof(1) + > sizeof(struct xfs_inode_log_format); Something like that, yeah. You should probably add xfs_log_dinode_size(ip->i_mount) to that. What you're really doing is summing the *nbytes output of the ->iop_size() call for each possible log item. For the four log intent items it's the xfs_FOO_log_format_sizeof() function like you have above. For inode items it's: *nbytes += sizeof(struct xfs_inode_log_format) + xfs_log_dinode_size(ip->i_mount); > return rounddown_pow_of_two(resp->tr_logres / logtotal); and like I said earlier, you should double logtotal to be on the safe side with a 2x safety margin: /* 100% safety margin for safety's sake */ return rounddown_pow_of_two(resp->tr_logres / (2 * per_extent_logres)); I'm curious what number you get back from this function? Hopefully it's at least a few hundred blocks. Thanks for putting that together. :) --D > } > > static inline void > xfs_compute_awu_max( > struct xfs_mount *mp, int jjcount) > { > .... 
> mp->m_awu_max = > min_t(unsigned int, awu_max, xfs_compute_awu_max_extents(mp)); > } > > > > > ((I would halve that for the sake of paranoia)) > > > > since you have to commit all those intent items into the first > > transaction in the chain. The difficulty we've always had is computing > > the size of an intent item in the ondisk log, since that's a (somewhat > > minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI, > > but then there' could be overhead for the ondisk log headers themselves. > > > > Maybe we ought to formalize the computation of that since reap.c also > > has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub > > transaction periodically... because I'd prefer we not add another > > hardcoded limit. My guess is that the software fallback can probably > > support any awu_max that a hardware wants to throw at us, but let's > > actually figure out the min(sw, hw) that we can support and cap it at > > that. > > > > --D > >
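Pulling the feedback above together, the helper might end up looking
something like this; the name and the set of log items counted follow
the suggestions in this subthread, and the 2x factor is the safety
margin suggested above (a sketch, not the final patch):

static size_t
xfs_compute_awu_max_extents(
        struct xfs_mount        *mp)
{
        struct xfs_trans_res    *resp = &M_RES(mp)->tr_write;
        /* Sum of the per-extent intent sizes plus the inode log item. */
        size_t                  per_extent_logres =
                                xfs_bui_log_format_sizeof(1) +
                                xfs_cui_log_format_sizeof(1) +
                                xfs_efi_log_format_sizeof(1) +
                                xfs_rui_log_format_sizeof(1) +
                                sizeof(struct xfs_inode_log_format) +
                                xfs_log_dinode_size(mp);

        /* 100% safety margin for safety's sake. */
        return rounddown_pow_of_two(resp->tr_logres /
                                    (2 * per_extent_logres));
}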
On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote: > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote: > > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote: > > > On 12/03/2025 07:24, Christoph Hellwig wrote: > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote: > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process > > > > > the CoW range and commit the transaction. > > > > > > > > > > This refactoring will be used in future for when it is required to commit > > > > > a range of extents as a single transaction, similar to how it was done > > > > > pre-commit d6f215f359637. > > > > > > > > Darrick pointed out that if you do more than just a tiny number > > > > of extents per transactions you run out of log reservations very > > > > quickly here: > > > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$ > > > > > > > > how does your scheme deal with that? > > > > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this, > > > right? Or does the log reservation have a hard size limit, regardless of > > > that calculation? > > > > The resblks calculated there are the reserved disk blocks Used for btree block allocations that might be needed during the processing of the transaction. > > and have > > nothing to do with the log reservations, which comes from the > > tr_write field passed in. There is some kind of upper limited to it > > obviously by the log size, although I'm not sure if we've formalized > > that somewhere. Dave might be the right person to ask about that. > > The (very very rough) upper limit for how many intent items you can > attach to a tr_write transaction is: > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size) > max_blocks = tr_write::tr_logres / per_extent_cost > > (ili_size is the inode log item size) That doesn't sound right. The number of intents we can log is not dependent on the aggregated size of all intent types. We do not log all those intent types in a single transaction, nor do we process more than one type of intent in a given transaction. Also, we only log the inode once per transaction, so that is not a per-extent overhead. Realistically, the tr_write transaction is goign to be at least a 100kB because it has to be big enough to log full splits of multiple btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB filesystem spits out: xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4 About 190kB. However, intents are typically very small - around 32 bytes in size plus another 12 bytes for the log region ophdr. This implies that we can fit thousands of individual intents in a single tr_write log reservation on any given filesystem, and the number of loop iterations in a transaction is therefore dependent largely on how many intents are logged per iteration. Hence if we are walking a range of extents in the BMBT to unmap them, then we should only be generating 2 intents per loop - a BUI for the BMBT removal and a CUI for the shared refcount decrease. That means we should be able to run at least a thousand iterations of that loop per transaction without getting anywhere near the transaction reservation limits. 
That means we should be able to run at least a thousand iterations of
that loop per transaction without getting anywhere near the
transaction reservation limits.

*However!*

We have to relog every intent we haven't processed in the deferred
batch every-so-often to prevent the outstanding intents from pinning
the tail of the log. Hence the larger the number of intents in the
initial batch, the more work we have to do later on (and the more
overall log space and bandwidth they will consume) to relog them
over and over again until they pop to the head of the processing
queue.

Hence there is no real performance advantage to creating massive intent
batches because we end up doing more work later on to relog those
intents to prevent journal space deadlocks. It also doesn't speed up
processing, because we still process the intent chains one at a time
from start to completion before moving on to the next high level
intent chain that needs to be processed.

Further, after the first couple of intent chains have been processed,
the initial log space reservation will have run out, and we are now
asking for a new reservation on every transaction roll we do. i.e. we
are now doing a log space reservation on every transaction roll in the
processing chain instead of only doing it once per high level intent
chain.

Hence from a log space accounting perspective (the hottest code path
in the journal), it is far more efficient to perform a single high
level transaction per extent unmap operation than it is to batch
intents into a single high level transaction.

My advice is this: we should never batch high level iterative
intent-based operations into a single transaction because it's a
false optimisation. It might look like it is an efficiency
improvement from the high level, but it ends up hammering the hot,
performance critical paths in the transaction subsystem much, much
harder and so will end up being slower than the single transaction
per intent-based operation algorithm when it matters most....

-Dave.
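To put rough numbers on Dave's estimate, using the figures above (a
193528 byte tr_write log reservation on a small 4kB filesystem, and
roughly 32 bytes per intent plus a ~12 byte ophdr), a remap loop that
logs one BUI and one CUI per extent works out to something like:

        per intent   ~ 32 + 12      = 44 bytes
        per extent   ~ 2 * 44       = 88 bytes   (one BUI + one CUI)
        extents/tx   ~ 193528 / 88  ~ 2200

These are illustrative estimates only, but they back up the claim that
a couple of thousand extents (i.e. thousands of intents) fit within a
single tr_write reservation.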
On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote: > On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote: > > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote: > > > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote: > > > > On 12/03/2025 07:24, Christoph Hellwig wrote: > > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote: > > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process > > > > > > the CoW range and commit the transaction. > > > > > > > > > > > > This refactoring will be used in future for when it is required to commit > > > > > > a range of extents as a single transaction, similar to how it was done > > > > > > pre-commit d6f215f359637. > > > > > > > > > > Darrick pointed out that if you do more than just a tiny number > > > > > of extents per transactions you run out of log reservations very > > > > > quickly here: > > > > > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$ > > > > > > > > > > how does your scheme deal with that? > > > > > > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this, > > > > right? Or does the log reservation have a hard size limit, regardless of > > > > that calculation? > > > > > > The resblks calculated there are the reserved disk blocks > > Used for btree block allocations that might be needed during the > processing of the transaction. > > > > and have > > > nothing to do with the log reservations, which comes from the > > > tr_write field passed in. There is some kind of upper limited to it > > > obviously by the log size, although I'm not sure if we've formalized > > > that somewhere. Dave might be the right person to ask about that. > > > > The (very very rough) upper limit for how many intent items you can > > attach to a tr_write transaction is: > > > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size) > > max_blocks = tr_write::tr_logres / per_extent_cost > > > > (ili_size is the inode log item size) > > That doesn't sound right. The number of intents we can log is not > dependent on the aggregated size of all intent types. We do not log > all those intent types in a single transaction, nor do we process > more than one type of intent in a given transaction. Also, we only > log the inode once per transaction, so that is not a per-extent > overhead. > > Realistically, the tr_write transaction is goign to be at least a > 100kB because it has to be big enough to log full splits of multiple > btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB > filesystem spits out: > > xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4 > > About 190kB. > > However, intents are typically very small - around 32 bytes in size > plus another 12 bytes for the log region ophdr. > > This implies that we can fit thousands of individual intents in a > single tr_write log reservation on any given filesystem, and the > number of loop iterations in a transaction is therefore dependent > largely on how many intents are logged per iteration. > > Hence if we are walking a range of extents in the BMBT to unmap > them, then we should only be generating 2 intents per loop - a BUI > for the BMBT removal and a CUI for the shared refcount decrease. 
> That means we should be able to run at least a thousand iterations > of that loop per transaction without getting anywhere near the > transaction reservation limits. > > *However!* > > We have to relog every intent we haven't processed in the deferred > batch every-so-often to prevent the outstanding intents from pinning > the tail of the log. Hence the larger the number of intents in the > initial batch, the more work we have to do later on (and the more > overall log space and bandwidth they will consume) to relog them > them over and over again until they pop to the head of the > processing queue. > > Hence there is no real perforamce advantage to creating massive intent > batches because we end up doing more work later on to relog those > intents to prevent journal space deadlocks. It also doesn't speed up > processing, because we still process the intent chains one at a time > from start to completion before moving on to the next high level > intent chain that needs to be processed. > > Further, after the first couple of intent chains have been > processed, the initial log space reservation will have run out, and > we are now asking for a new resrevation on every transaction roll we > do. i.e. we now are now doing a log space reservation on every > transaction roll in the processing chain instead of only doing it > once per high level intent chain. > > Hence from a log space accounting perspective (the hottest code path > in the journal), it is far more efficient to perform a single high > level transaction per extent unmap operation than it is to batch > intents into a single high level transaction. > > My advice is this: we should never batch high level iterative > intent-based operations into a single transaction because it's a > false optimisation. It might look like it is an efficiency > improvement from the high level, but it ends up hammering the hot, > performance critical paths in the transaction subsystem much, much > harder and so will end up being slower than the single transaction > per intent-based operation algorithm when it matters most.... How specifically do you propose remapping all the extents in a file range after an untorn write? The regular cow ioend does a single transaction per extent across the entire ioend range and cannot deliver untorn writes. This latest proposal does, but now you've torn that idea down too. At this point I have run out of ideas and conclude that can only submit to your superior intellect. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On 13/03/2025 04:51, Darrick J. Wong wrote: >> Hence if we are walking a range of extents in the BMBT to unmap >> them, then we should only be generating 2 intents per loop - a BUI >> for the BMBT removal and a CUI for the shared refcount decrease. >> That means we should be able to run at least a thousand iterations >> of that loop per transaction without getting anywhere near the >> transaction reservation limits. >> >> *However!* >> >> We have to relog every intent we haven't processed in the deferred >> batch every-so-often to prevent the outstanding intents from pinning >> the tail of the log. Hence the larger the number of intents in the >> initial batch, the more work we have to do later on (and the more >> overall log space and bandwidth they will consume) to relog them >> them over and over again until they pop to the head of the >> processing queue. >> >> Hence there is no real perforamce advantage to creating massive intent >> batches because we end up doing more work later on to relog those >> intents to prevent journal space deadlocks. It also doesn't speed up >> processing, because we still process the intent chains one at a time >> from start to completion before moving on to the next high level >> intent chain that needs to be processed. >> >> Further, after the first couple of intent chains have been >> processed, the initial log space reservation will have run out, and >> we are now asking for a new resrevation on every transaction roll we >> do. i.e. we now are now doing a log space reservation on every >> transaction roll in the processing chain instead of only doing it >> once per high level intent chain. >> >> Hence from a log space accounting perspective (the hottest code path >> in the journal), it is far more efficient to perform a single high >> level transaction per extent unmap operation than it is to batch >> intents into a single high level transaction. >> >> My advice is this: we should never batch high level iterative >> intent-based operations into a single transaction because it's a >> false optimisation. It might look like it is an efficiency >> improvement from the high level, but it ends up hammering the hot, >> performance critical paths in the transaction subsystem much, much >> harder and so will end up being slower than the single transaction >> per intent-based operation algorithm when it matters most.... > How specifically do you propose remapping all the extents in a file > range after an untorn write? The regular cow ioend does a single > transaction per extent across the entire ioend range and cannot deliver > untorn writes. This latest proposal does, but now you've torn that idea > down too. > > At this point I have run out of ideas and conclude that can only submit > to your superior intellect. > > --D I'm hearing that we can fit thousands without getting anywhere the limits - this is good. But then also it is not optimal in terms of performance to batch, right? Performance is not so important here. This is for a software fallback, which we should not frequently hit. And even if we do, we're still typically not going to have many extents. For our specific purpose, we want 16KB atomic writes - that is max of 4 extents. So this does not really sound like something to be concerned with for these atomic write sizes. We can add some arbitrary FS awu max, like 64KB, if that makes people feel more comfortable. Thanks, John
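For scale, the 16KB case works out roughly as follows on a 4kB block
size filesystem, reusing the per-intent estimate from earlier in the
thread (illustrative only):

        16KB write     = 4 blocks     -> at most 4 CoW extents to remap
        intents        = 4 * 2        = 8   (one BUI + one CUI per extent)
        intent logres  ~ 8 * 44 bytes ~ 352 bytes

which is negligible next to a ~190kB tr_write reservation, so the log
reservation is not the limiting factor at these write sizes.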
On Wed, Mar 12, 2025 at 09:51:21PM -0700, Darrick J. Wong wrote: > On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote: > > On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote: > > > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote: > > > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process > > > > > > > the CoW range and commit the transaction. > > > > > > > > > > > > > > This refactoring will be used in future for when it is required to commit > > > > > > > a range of extents as a single transaction, similar to how it was done > > > > > > > pre-commit d6f215f359637. > > > > > > > > > > > > Darrick pointed out that if you do more than just a tiny number > > > > > > of extents per transactions you run out of log reservations very > > > > > > quickly here: > > > > > > > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$ > > > > > > > > > > > > how does your scheme deal with that? > > > > > > > > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this, > > > > > right? Or does the log reservation have a hard size limit, regardless of > > > > > that calculation? > > > > > > > > The resblks calculated there are the reserved disk blocks > > > > Used for btree block allocations that might be needed during the > > processing of the transaction. > > > > > > and have > > > > nothing to do with the log reservations, which comes from the > > > > tr_write field passed in. There is some kind of upper limited to it > > > > obviously by the log size, although I'm not sure if we've formalized > > > > that somewhere. Dave might be the right person to ask about that. > > > > > > The (very very rough) upper limit for how many intent items you can > > > attach to a tr_write transaction is: > > > > > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size) > > > max_blocks = tr_write::tr_logres / per_extent_cost > > > > > > (ili_size is the inode log item size) > > > > That doesn't sound right. The number of intents we can log is not > > dependent on the aggregated size of all intent types. We do not log > > all those intent types in a single transaction, nor do we process > > more than one type of intent in a given transaction. Also, we only > > log the inode once per transaction, so that is not a per-extent > > overhead. > > > > Realistically, the tr_write transaction is goign to be at least a > > 100kB because it has to be big enough to log full splits of multiple > > btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB > > filesystem spits out: > > > > xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4 > > > > About 190kB. > > > > However, intents are typically very small - around 32 bytes in size > > plus another 12 bytes for the log region ophdr. > > > > This implies that we can fit thousands of individual intents in a > > single tr_write log reservation on any given filesystem, and the > > number of loop iterations in a transaction is therefore dependent > > largely on how many intents are logged per iteration. > > > > Hence if we are walking a range of extents in the BMBT to unmap > > them, then we should only be generating 2 intents per loop - a BUI > > for the BMBT removal and a CUI for the shared refcount decrease. 
> > That means we should be able to run at least a thousand iterations > > of that loop per transaction without getting anywhere near the > > transaction reservation limits. > > > > *However!* > > > > We have to relog every intent we haven't processed in the deferred > > batch every-so-often to prevent the outstanding intents from pinning > > the tail of the log. Hence the larger the number of intents in the > > initial batch, the more work we have to do later on (and the more > > overall log space and bandwidth they will consume) to relog them > > them over and over again until they pop to the head of the > > processing queue. > > > > Hence there is no real perforamce advantage to creating massive intent > > batches because we end up doing more work later on to relog those > > intents to prevent journal space deadlocks. It also doesn't speed up > > processing, because we still process the intent chains one at a time > > from start to completion before moving on to the next high level > > intent chain that needs to be processed. > > > > Further, after the first couple of intent chains have been > > processed, the initial log space reservation will have run out, and > > we are now asking for a new resrevation on every transaction roll we > > do. i.e. we now are now doing a log space reservation on every > > transaction roll in the processing chain instead of only doing it > > once per high level intent chain. > > > > Hence from a log space accounting perspective (the hottest code path > > in the journal), it is far more efficient to perform a single high > > level transaction per extent unmap operation than it is to batch > > intents into a single high level transaction. > > > > My advice is this: we should never batch high level iterative > > intent-based operations into a single transaction because it's a > > false optimisation. It might look like it is an efficiency > > improvement from the high level, but it ends up hammering the hot, > > performance critical paths in the transaction subsystem much, much > > harder and so will end up being slower than the single transaction > > per intent-based operation algorithm when it matters most.... > > How specifically do you propose remapping all the extents in a file > range after an untorn write? Sorry, I didn't realise that was the context of the question that was asked - there was not enough context in the email I replied to to indicate this important detail. hence it just looked like a question about "how many intents can we batch into a single write transaction reservation". I gave that answer (thousands) and then recommended against doing batching like this as an optimisation. Batching operations into a single context is normally done as an optimisation, so that is what I assumed was being talked about here.... > The regular cow ioend does a single > transaction per extent across the entire ioend range and cannot deliver > untorn writes. This latest proposal does, but now you've torn that idea > down too. > > At this point I have run out of ideas and conclude that can only submit > to your superior intellect. I think you're jumping to incorrect conclusions, and then making needless personal attacks. This is unacceptable behaviour, Darrick, and if you keep it up you are going to end up having to explain yourself to the CoC committee.... Anyway.... 
Now I understand the assumed context was batching for atomicity and
not optimisation, so I'll address that aspect of the suggested
approach: overloading the existing write reservation with a special
case like this is the wrong way to define a new atomic operation.

That is: trying to infer limits of special case behaviour by adding a
heuristic calculation based on the write log reservation is poor
design. The write reservation varies according to the current
filesystem's geometry (filesystem block size, AG size, capacity, etc)
and kernel version. Hence the batch size supported for atomic writes
would then vary unpredictably from filesystem to filesystem and even
within the same filesystem over time.

Given that we have to abort atomic writes up front if the mapping
covering the atomic write range is more fragmented than this
calculated value, this unpredictable limit will be directly exposed to
userspace via the errors it generates. We do not want anything even
vaguely related to transaction reservation sizes exposed to userspace.
They are, and should always be, entirely internal filesystem
implementation details.

A much better way to define an atomic unmap operation is to set a
fixed maximum number of extents that a batched atomic unmap operation
needs to support. With a fixed extent count, we can then base the
transaction reservation size required for the operation on this
number.

We know that the write length is never going to be larger than 2GB due
to MAX_RW_COUNT bounding of user IO. Hence there is a maximum number
of extents that a single write can map to. Whether that's the size we
try to support is a separate discussion, but I point it out as a hard
upper maximum.

Regardless of what we define as the maximum number of extents for the
atomic unmap, we can now calculate the exact log space reservation
that the unmap intents will require. We can then max() that with the
log space reservation that any of those specific intents will require
to process. Now we have an exact log reservation for an atomic unmap
of a known, fixed number of extents.

Then we can look at what the common unmap behaviours are (e.g. through
tracing various test cases), and determine how many extents we are
typically going to convert in a single atomic unmap operation. That
then guides us to an optimal log count for the atomic unmap
reservation.

This gives us a single log space reservation that can handle a known,
fixed number of extents on any filesystem. It gives us an optimised
log count to minimise the number of log space reservations the common
case code is going to need. It gives us a reservation size that will
contribute to defining the minimum supported log size for the features
enabled in the filesystem. It gives us a consistent behaviour that we
can directly exercise from userspace with relatively simple test code
based on constant values.

It means the high level unmap code doesn't rely on heuristics to
prevent transaction reservation overflows. It means the high level
code can use bound loops and compile time constants without having to
worry that they will ever overrun the capability of the underlying
filesystem or kernel.

Simple, huh?

-Dave.
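As a rough illustration of the structure Dave describes - a fixed
extent cap with an exactly calculated reservation - a sketch might
look like the following. The constant, the helper name, and the use of
the existing write reservation as a stand-in for the per-intent
processing reservation are all assumptions for illustration only:

/* Hypothetical cap on the number of extents one atomic remap handles. */
#define XFS_ATOMIC_WRITE_MAX_EXTENTS    1024

/*
 * Exact log space needed to log one BUI and one CUI per extent for a
 * full batch, plus the inode log item, max()ed against the space
 * needed to process any single step of the chain.
 */
static unsigned int
xfs_calc_atomic_unmap_reservation(
        struct xfs_mount        *mp)
{
        unsigned int            intent_res =
                XFS_ATOMIC_WRITE_MAX_EXTENTS *
                        (xfs_bui_log_format_sizeof(1) +
                         xfs_cui_log_format_sizeof(1)) +
                sizeof(struct xfs_inode_log_format) +
                xfs_log_dinode_size(mp);

        return max(intent_res, M_RES(mp)->tr_write.tr_logres);
}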
On Thu, Mar 13, 2025 at 06:11:12AM +0000, John Garry wrote: > On 13/03/2025 04:51, Darrick J. Wong wrote: > > > Hence if we are walking a range of extents in the BMBT to unmap > > > them, then we should only be generating 2 intents per loop - a BUI > > > for the BMBT removal and a CUI for the shared refcount decrease. > > > That means we should be able to run at least a thousand iterations > > > of that loop per transaction without getting anywhere near the > > > transaction reservation limits. > > > > > > *However!* > > > > > > We have to relog every intent we haven't processed in the deferred > > > batch every-so-often to prevent the outstanding intents from pinning > > > the tail of the log. Hence the larger the number of intents in the > > > initial batch, the more work we have to do later on (and the more > > > overall log space and bandwidth they will consume) to relog them > > > them over and over again until they pop to the head of the > > > processing queue. > > > > > > Hence there is no real perforamce advantage to creating massive intent > > > batches because we end up doing more work later on to relog those > > > intents to prevent journal space deadlocks. It also doesn't speed up > > > processing, because we still process the intent chains one at a time > > > from start to completion before moving on to the next high level > > > intent chain that needs to be processed. > > > > > > Further, after the first couple of intent chains have been > > > processed, the initial log space reservation will have run out, and > > > we are now asking for a new resrevation on every transaction roll we > > > do. i.e. we now are now doing a log space reservation on every > > > transaction roll in the processing chain instead of only doing it > > > once per high level intent chain. > > > > > > Hence from a log space accounting perspective (the hottest code path > > > in the journal), it is far more efficient to perform a single high > > > level transaction per extent unmap operation than it is to batch > > > intents into a single high level transaction. > > > > > > My advice is this: we should never batch high level iterative > > > intent-based operations into a single transaction because it's a > > > false optimisation. It might look like it is an efficiency > > > improvement from the high level, but it ends up hammering the hot, > > > performance critical paths in the transaction subsystem much, much > > > harder and so will end up being slower than the single transaction > > > per intent-based operation algorithm when it matters most.... > > How specifically do you propose remapping all the extents in a file > > range after an untorn write? The regular cow ioend does a single > > transaction per extent across the entire ioend range and cannot deliver > > untorn writes. This latest proposal does, but now you've torn that idea > > down too. > > > > At this point I have run out of ideas and conclude that can only submit > > to your superior intellect. > > > > --D > > I'm hearing that we can fit thousands without getting anywhere the limits - > this is good. > > But then also it is not optimal in terms of performance to batch, right? > Performance is not so important here. This is for a software fallback, which > we should not frequently hit. And even if we do, we're still typically not > going to have many extents. > > For our specific purpose, we want 16KB atomic writes - that is max of 4 > extents. So this does not really sound like something to be concerned with > for these atomic write sizes. 
Apart from the fact that we should not be overloading some other
transaction reservation definition for this special case?

Saying "it should work" does not justify not thinking about
constraints, layered design, application exposure to error cases,
overruns, etc. i.e. the whole point of the software fallback is to
make atomic writes largely generic. Saying "if we limit them to 16kB"
means it's not really generic, is it?

> We can add some arbitrary FS awu max, like 64KB, if that makes people feel
> more comfortable.

I was thinking more like 4-16MB as a usable maximum size for atomic
writes. i.e. allow for whole file atomic overwrites for small-medium
sized files, and decent IO sizes for performance when overwriting
large files.

If we set the max at 4MB, that's 1024 extents on a 4kB block size
filesystem. That gives us 2048 intents in a single unmap operation,
for which we can directly calculate the transaction reservation size
it will need.

We need to do this as two separate reservation steps with a max()
calculation, because the processing reservation size reduces with
filesystem block size but the extent unmap intent overhead goes up as
the block count increases with decreasing block size. i.e. the two
components of the transaction reservation scale in different
directions.

If we are adding a new atomic transaction, we really need to design it
properly from the ground up, not hack around an existing transaction
reservation whilst handwaving about how "it should be enough".

This "maximum size" is going to be exposed directly to userspace,
hence we need to think carefully about what the maximum supported
limits are going to be and how we can support them with a minimum of
effort long into the future. Hacking around the existing write
transaction isn't the way to do this, especially as that may change in
future as we modify internal allocation behaviour over time.

Saying "we only need 16kB right now, so that's all we should support"
isn't the right approach to take here....

-Dave.
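Putting rough numbers on the 4MB example (assuming 4kB blocks, two
intents per extent and the ~44 bytes per intent estimated earlier in
the thread):

        4MB / 4kB      = 1024 extents
        intents        = 1024 * 2   = 2048  (one BUI + one CUI per extent)
        intent logres  ~ 2048 * 44  ~ 88KiB

All of those intents have to be committed in the first transaction of
the chain, which is why the intent-logging component and the
per-intent processing component need to be sized separately and
combined with a max(), rather than assuming the existing write
reservation happens to cover both.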
On Thu, Mar 13, 2025 at 06:21:29PM +1100, Dave Chinner wrote: > On Wed, Mar 12, 2025 at 09:51:21PM -0700, Darrick J. Wong wrote: > > On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote: > > > On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote: > > > > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote: > > > > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process > > > > > > > > the CoW range and commit the transaction. > > > > > > > > > > > > > > > > This refactoring will be used in future for when it is required to commit > > > > > > > > a range of extents as a single transaction, similar to how it was done > > > > > > > > pre-commit d6f215f359637. > > > > > > > > > > > > > > Darrick pointed out that if you do more than just a tiny number > > > > > > > of extents per transactions you run out of log reservations very > > > > > > > quickly here: > > > > > > > > > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/__;!!ACWV5N9M2RV99hQ!PWLcBof1tKimKUObvCj4vOhljWjFmjtzVHLx9apcU5Rah1xZnmp_3PIq6eSwx6TdEXzMLYYyBfmZLgvj$ > > > > > > > > > > > > > > how does your scheme deal with that? > > > > > > > > > > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this, > > > > > > right? Or does the log reservation have a hard size limit, regardless of > > > > > > that calculation? > > > > > > > > > > The resblks calculated there are the reserved disk blocks > > > > > > Used for btree block allocations that might be needed during the > > > processing of the transaction. > > > > > > > > and have > > > > > nothing to do with the log reservations, which comes from the > > > > > tr_write field passed in. There is some kind of upper limited to it > > > > > obviously by the log size, although I'm not sure if we've formalized > > > > > that somewhere. Dave might be the right person to ask about that. > > > > > > > > The (very very rough) upper limit for how many intent items you can > > > > attach to a tr_write transaction is: > > > > > > > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size) > > > > max_blocks = tr_write::tr_logres / per_extent_cost > > > > > > > > (ili_size is the inode log item size) > > > > > > That doesn't sound right. The number of intents we can log is not > > > dependent on the aggregated size of all intent types. We do not log > > > all those intent types in a single transaction, nor do we process > > > more than one type of intent in a given transaction. Also, we only > > > log the inode once per transaction, so that is not a per-extent > > > overhead. > > > > > > Realistically, the tr_write transaction is goign to be at least a > > > 100kB because it has to be big enough to log full splits of multiple > > > btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB > > > filesystem spits out: > > > > > > xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4 > > > > > > About 190kB. > > > > > > However, intents are typically very small - around 32 bytes in size > > > plus another 12 bytes for the log region ophdr. > > > > > > This implies that we can fit thousands of individual intents in a > > > single tr_write log reservation on any given filesystem, and the > > > number of loop iterations in a transaction is therefore dependent > > > largely on how many intents are logged per iteration. 
> > > > > > Hence if we are walking a range of extents in the BMBT to unmap > > > them, then we should only be generating 2 intents per loop - a BUI > > > for the BMBT removal and a CUI for the shared refcount decrease. > > > That means we should be able to run at least a thousand iterations > > > of that loop per transaction without getting anywhere near the > > > transaction reservation limits. > > > > > > *However!* > > > > > > We have to relog every intent we haven't processed in the deferred > > > batch every-so-often to prevent the outstanding intents from pinning > > > the tail of the log. Hence the larger the number of intents in the > > > initial batch, the more work we have to do later on (and the more > > > overall log space and bandwidth they will consume) to relog them > > > them over and over again until they pop to the head of the > > > processing queue. > > > > > > Hence there is no real perforamce advantage to creating massive intent > > > batches because we end up doing more work later on to relog those > > > intents to prevent journal space deadlocks. It also doesn't speed up > > > processing, because we still process the intent chains one at a time > > > from start to completion before moving on to the next high level > > > intent chain that needs to be processed. > > > > > > Further, after the first couple of intent chains have been > > > processed, the initial log space reservation will have run out, and > > > we are now asking for a new resrevation on every transaction roll we > > > do. i.e. we now are now doing a log space reservation on every > > > transaction roll in the processing chain instead of only doing it > > > once per high level intent chain. > > > > > > Hence from a log space accounting perspective (the hottest code path > > > in the journal), it is far more efficient to perform a single high > > > level transaction per extent unmap operation than it is to batch > > > intents into a single high level transaction. > > > > > > My advice is this: we should never batch high level iterative > > > intent-based operations into a single transaction because it's a > > > false optimisation. It might look like it is an efficiency > > > improvement from the high level, but it ends up hammering the hot, > > > performance critical paths in the transaction subsystem much, much > > > harder and so will end up being slower than the single transaction > > > per intent-based operation algorithm when it matters most.... > > > > How specifically do you propose remapping all the extents in a file > > range after an untorn write? > > Sorry, I didn't realise that was the context of the question that > was asked - there was not enough context in the email I replied to > to indicate this important detail. hence it just looked like a > question about "how many intents can we batch into a single write > transaction reservation". > > I gave that answer (thousands) and then recommended against doing > batching like this as an optimisation. Batching operations into a > single context is normally done as an optimisation, so that is what > I assumed was being talked about here.... > > > The regular cow ioend does a single > > transaction per extent across the entire ioend range and cannot deliver > > untorn writes. This latest proposal does, but now you've torn that idea > > down too. > > > > At this point I have run out of ideas and conclude that can only submit > > to your superior intellect. 
>
> I think you're jumping to incorrect conclusions, and then making
> needless personal attacks. This is unacceptable behaviour, Darrick,
> and if you keep it up you are going to end up having to explain
> yourself to the CoC committee....

I apologize to everyone here for using sarcasm. Anyone should be able
to participate in the FOSS community without fear of being snarked at
or having their choices undermined. Senior developers ought to enable
collaboration within community norms in any way they can.

I will take this opportunity to remind myself of how I would like to
interact with other people.

--D
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index e9791e567bdf..e3e594c65745 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -788,35 +788,19 @@ xfs_reflink_update_quota(
  * requirements as low as possible.
  */
 STATIC int
-xfs_reflink_end_cow_extent(
+xfs_reflink_end_cow_extent_locked(
+        struct xfs_trans        *tp,
         struct xfs_inode        *ip,
         xfs_fileoff_t           *offset_fsb,
         xfs_fileoff_t           end_fsb)
 {
         struct xfs_iext_cursor  icur;
         struct xfs_bmbt_irec    got, del, data;
-        struct xfs_mount        *mp = ip->i_mount;
-        struct xfs_trans        *tp;
         struct xfs_ifork        *ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
-        unsigned int            resblks;
         int                     nmaps;
         bool                    isrt = XFS_IS_REALTIME_INODE(ip);
         int                     error;
 
-        resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
-                        XFS_TRANS_RESERVE, &tp);
-        if (error)
-                return error;
-
-        /*
-         * Lock the inode. We have to ijoin without automatic unlock because
-         * the lead transaction is the refcountbt record deletion; the data
-         * fork update follows as a deferred log item.
-         */
-        xfs_ilock(ip, XFS_ILOCK_EXCL);
-        xfs_trans_ijoin(tp, ip, 0);
-
         /*
          * In case of racing, overlapping AIO writes no COW extents might be
          * left by the time I/O completes for the loser of the race. In that
@@ -825,7 +809,7 @@ xfs_reflink_end_cow_extent(
         if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
             got.br_startoff >= end_fsb) {
                 *offset_fsb = end_fsb;
-                goto out_cancel;
+                return 0;
         }
 
         /*
@@ -839,7 +823,7 @@ xfs_reflink_end_cow_extent(
                 if (!xfs_iext_next_extent(ifp, &icur, &got) ||
                     got.br_startoff >= end_fsb) {
                         *offset_fsb = end_fsb;
-                        goto out_cancel;
+                        return 0;
                 }
         }
         del = got;
@@ -848,14 +832,14 @@ xfs_reflink_end_cow_extent(
         error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
                         XFS_IEXT_REFLINK_END_COW_CNT);
         if (error)
-                goto out_cancel;
+                return error;
 
         /* Grab the corresponding mapping in the data fork. */
         nmaps = 1;
         error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
                         &nmaps, 0);
         if (error)
-                goto out_cancel;
+                return error;
 
         /* We can only remap the smaller of the two extent sizes. */
         data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
@@ -884,7 +868,7 @@ xfs_reflink_end_cow_extent(
                 error = xfs_bunmapi(NULL, ip, data.br_startoff,
                                 data.br_blockcount, 0, 1, &done);
                 if (error)
-                        goto out_cancel;
+                        return error;
                 ASSERT(done);
         }
 
@@ -901,17 +885,46 @@ xfs_reflink_end_cow_extent(
         /* Remove the mapping from the CoW fork. */
         xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 
-        error = xfs_trans_commit(tp);
-        xfs_iunlock(ip, XFS_ILOCK_EXCL);
-        if (error)
-                return error;
-
         /* Update the caller about how much progress we made. */
         *offset_fsb = del.br_startoff + del.br_blockcount;
         return 0;
+}
-
-out_cancel:
-        xfs_trans_cancel(tp);
+
+
+/*
+ * Remap part of the CoW fork into the data fork.
+ *
+ * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
+ * into the data fork; this function will remap what it can (at the end of the
+ * range) and update @end_fsb appropriately. Each remap gets its own
+ * transaction because we can end up merging and splitting bmbt blocks for
+ * every remap operation and we'd like to keep the block reservation
+ * requirements as low as possible.
+ */
+STATIC int
+xfs_reflink_end_cow_extent(
+        struct xfs_inode        *ip,
+        xfs_fileoff_t           *offset_fsb,
+        xfs_fileoff_t           end_fsb)
+{
+        struct xfs_mount        *mp = ip->i_mount;
+        struct xfs_trans        *tp;
+        unsigned int            resblks;
+        int                     error;
+
+        resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+                        XFS_TRANS_RESERVE, &tp);
+        if (error)
+                return error;
+        xfs_ilock(ip, XFS_ILOCK_EXCL);
+        xfs_trans_ijoin(tp, ip, 0);
+
+        error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);
+        if (error)
+                xfs_trans_cancel(tp);
+        else
+                error = xfs_trans_commit(tp);
         xfs_iunlock(ip, XFS_ILOCK_EXCL);
         return error;
 }
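For context on how this split is used later in the series: the thread
above refers to xfs_reflink_end_atomic_cow(), which commits a whole
range of CoW extents in a single transaction. A minimal sketch of that
idea, driving the new locked helper in a loop, might look like the
following; the reservation sizing and the exact signature are
assumptions for illustration, not the actual patch:

STATIC int
xfs_reflink_end_atomic_cow(
        struct xfs_inode        *ip,
        xfs_off_t               offset,
        xfs_off_t               count)
{
        struct xfs_mount        *mp = ip->i_mount;
        xfs_fileoff_t           offset_fsb = XFS_B_TO_FSBT(mp, offset);
        xfs_fileoff_t           end_fsb = XFS_B_TO_FSB(mp, offset + count);
        struct xfs_trans        *tp;
        unsigned int            resblks;
        int                     error;

        /*
         * Reserve blocks for every extent that may need remapping, not just
         * one; sizing the log reservation for the whole batch is the open
         * question discussed in this thread.
         */
        resblks = XFS_NEXTENTADD_SPACE_RES(mp, end_fsb - offset_fsb,
                        XFS_DATA_FORK);
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
                        XFS_TRANS_RESERVE, &tp);
        if (error)
                return error;
        xfs_ilock(ip, XFS_ILOCK_EXCL);
        xfs_trans_ijoin(tp, ip, 0);

        /* Queue the remap intents for the whole range, then commit once. */
        while (offset_fsb < end_fsb) {
                error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
                                end_fsb);
                if (error)
                        break;
        }
        if (error)
                xfs_trans_cancel(tp);
        else
                error = xfs_trans_commit(tp);
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
        return error;
}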