
[v5,03/10] xfs: Refactor xfs_reflink_end_cow_extent()

Message ID 20250310183946.932054-4-john.g.garry@oracle.com (mailing list archive)
State New
Series large atomic writes for xfs with CoW

Commit Message

John Garry March 10, 2025, 6:39 p.m. UTC
Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.

This refactoring will be used in future when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit d6f215f359637.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_reflink.c | 73 ++++++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 30 deletions(-)

Comments

Christoph Hellwig March 12, 2025, 7:24 a.m. UTC | #1
On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> Refactor xfs_reflink_end_cow_extent() into separate parts which process
> the CoW range and commit the transaction.
> 
> This refactoring will be used in future for when it is required to commit
> a range of extents as a single transaction, similar to how it was done
> pre-commit d6f215f359637.

Darrick pointed out that if you do more than just a tiny number
of extents per transactions you run out of log reservations very
quickly here:

https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/

how does your scheme deal with that?
John Garry March 12, 2025, 8:27 a.m. UTC | #2
On 12/03/2025 07:24, Christoph Hellwig wrote:
> On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
>> Refactor xfs_reflink_end_cow_extent() into separate parts which process
>> the CoW range and commit the transaction.
>>
>> This refactoring will be used in future for when it is required to commit
>> a range of extents as a single transaction, similar to how it was done
>> pre-commit d6f215f359637.
> 
> Darrick pointed out that if you do more than just a tiny number
> of extents per transactions you run out of log reservations very
> quickly here:
> 
> https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> 
> how does your scheme deal with that?
> 
The resblks calculation in xfs_reflink_end_atomic_cow() takes care of 
this, right? Or does the log reservation have a hard size limit, 
regardless of that calculation?

Thanks,
John
Christoph Hellwig March 12, 2025, 8:35 a.m. UTC | #3
On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> On 12/03/2025 07:24, Christoph Hellwig wrote:
> > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > the CoW range and commit the transaction.
> > > 
> > > This refactoring will be used in future for when it is required to commit
> > > a range of extents as a single transaction, similar to how it was done
> > > pre-commit d6f215f359637.
> > 
> > Darrick pointed out that if you do more than just a tiny number
> > of extents per transactions you run out of log reservations very
> > quickly here:
> > 
> > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > 
> > how does your scheme deal with that?
> > 
> The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> right? Or does the log reservation have a hard size limit, regardless of
> that calculation?

The resblks calculated there are the reserved disk blocks and have
nothing to do with the log reservation, which comes from the
tr_write field passed in.  There is some kind of upper limit to it,
obviously imposed by the log size, although I'm not sure if we've
formalized that somewhere.  Dave might be the right person to ask
about that.
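
For reference, the two show up as separate arguments in the allocation
call this patch moves around; resblks is the disk block reservation and
the log reservation rides in via the tr_write resv:

	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
			XFS_TRANS_RESERVE, &tp);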
Darrick J. Wong March 12, 2025, 3:46 p.m. UTC | #4
On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > the CoW range and commit the transaction.
> > > > 
> > > > This refactoring will be used in future for when it is required to commit
> > > > a range of extents as a single transaction, similar to how it was done
> > > > pre-commit d6f215f359637.
> > > 
> > > Darrick pointed out that if you do more than just a tiny number
> > > of extents per transactions you run out of log reservations very
> > > quickly here:
> > > 
> > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > 
> > > how does your scheme deal with that?
> > > 
> > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > right? Or does the log reservation have a hard size limit, regardless of
> > that calculation?
> 
> The resblks calculated there are the reserved disk blocks and have
> nothing to do with the log reservations, which comes from the
> tr_write field passed in.  There is some kind of upper limited to it
> obviously by the log size, although I'm not sure if we've formalized
> that somewhere.  Dave might be the right person to ask about that.

The (very very rough) upper limit for how many intent items you can
attach to a tr_write transaction is:

per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
max_blocks = tr_write::tr_logres / per_extent_cost

(ili_size is the inode log item size)

((I would halve that for the sake of paranoia))

since you have to commit all those intent items into the first
transaction in the chain.  The difficulty we've always had is computing
the size of an intent item in the ondisk log, since that's a (somewhat
minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI,
but then there could be overhead for the ondisk log headers themselves.

Maybe we ought to formalize the computation of that since reap.c also
has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub
transaction periodically... because I'd prefer we not add another
hardcoded limit.  My guess is that the software fallback can probably
support any awu_max that the hardware wants to throw at us, but let's
actually figure out the min(sw, hw) that we can support and cap it at
that.

--D
John Garry March 12, 2025, 10:06 p.m. UTC | #5
On 12/03/2025 15:46, Darrick J. Wong wrote:
> On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
>> On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
>>> On 12/03/2025 07:24, Christoph Hellwig wrote:
>>>> On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
>>>>> Refactor xfs_reflink_end_cow_extent() into separate parts which process
>>>>> the CoW range and commit the transaction.
>>>>>
>>>>> This refactoring will be used in future for when it is required to commit
>>>>> a range of extents as a single transaction, similar to how it was done
>>>>> pre-commit d6f215f359637.
>>>>
>>>> Darrick pointed out that if you do more than just a tiny number
>>>> of extents per transactions you run out of log reservations very
>>>> quickly here:
>>>>
>>>> https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
>>>>
>>>> how does your scheme deal with that?
>>>>
>>> The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
>>> right? Or does the log reservation have a hard size limit, regardless of
>>> that calculation?
>>
>> The resblks calculated there are the reserved disk blocks and have
>> nothing to do with the log reservations, which comes from the
>> tr_write field passed in.  There is some kind of upper limited to it
>> obviously by the log size, although I'm not sure if we've formalized
>> that somewhere.  Dave might be the right person to ask about that.
> 
> The (very very rough) upper limit for how many intent items you can
> attach to a tr_write transaction is:
> 
> per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> max_blocks = tr_write::tr_logres / per_extent_cost
> 
> (ili_size is the inode log item size)

So will it be something like this:

static size_t
xfs_compute_awu_max_extents(
	struct xfs_mount	*mp)
{
	struct xfs_trans_res	*resp = &M_RES(mp)->tr_write;
	size_t			logtotal = xfs_bui_log_format_sizeof(1)+
				xfs_cui_log_format_sizeof(1) +
				xfs_efi_log_format_sizeof(1) +
				xfs_rui_log_format_sizeof(1) +
				sizeof(struct xfs_inode_log_format);

	return rounddown_pow_of_two(resp->tr_logres / logtotal);
}

static inline void
xfs_compute_awu_max(
	struct xfs_mount	*mp, int jjcount)
{
....
	mp->m_awu_max =
	min_t(unsigned int, awu_max, xfs_compute_awu_max_extents(mp));
}

> 
> ((I would halve that for the sake of paranoia))
> 
> since you have to commit all those intent items into the first
> transaction in the chain.  The difficulty we've always had is computing
> the size of an intent item in the ondisk log, since that's a (somewhat
> minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI,
> but then there' could be overhead for the ondisk log headers themselves.
> 
> Maybe we ought to formalize the computation of that since reap.c also
> has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub
> transaction periodically... because I'd prefer we not add another
> hardcoded limit.  My guess is that the software fallback can probably
> support any awu_max that a hardware wants to throw at us, but let's
> actually figure out the min(sw, hw) that we can support and cap it at
> that.
> 
> --D
Darrick J. Wong March 12, 2025, 11:22 p.m. UTC | #6
On Wed, Mar 12, 2025 at 10:06:11PM +0000, John Garry wrote:
> On 12/03/2025 15:46, Darrick J. Wong wrote:
> > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > the CoW range and commit the transaction.
> > > > > > 
> > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > pre-commit d6f215f359637.
> > > > > 
> > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > of extents per transactions you run out of log reservations very
> > > > > quickly here:
> > > > > 
> > > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > > 
> > > > > how does your scheme deal with that?
> > > > > 
> > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > that calculation?
> > > 
> > > The resblks calculated there are the reserved disk blocks and have
> > > nothing to do with the log reservations, which comes from the
> > > tr_write field passed in.  There is some kind of upper limited to it
> > > obviously by the log size, although I'm not sure if we've formalized
> > > that somewhere.  Dave might be the right person to ask about that.
> > 
> > The (very very rough) upper limit for how many intent items you can
> > attach to a tr_write transaction is:
> > 
> > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > max_blocks = tr_write::tr_logres / per_extent_cost
> > 
> > (ili_size is the inode log item size)
> 
> So will it be something like this:
> 
> static size_t
> xfs_compute_awu_max_extents(
> 	struct xfs_mount	*mp)
> {
> 	struct xfs_trans_res	*resp = &M_RES(mp)->tr_write;
> 	size_t			logtotal = xfs_bui_log_format_sizeof(1)+

Might want to call it "per_extent_logres" since that's what it is.

> 				xfs_cui_log_format_sizeof(1) +
> 				xfs_efi_log_format_sizeof(1) +
> 				xfs_rui_log_format_sizeof(1) +
> 				sizeof(struct xfs_inode_log_format);

Something like that, yeah.  You should probably add
xfs_log_dinode_size(ip->i_mount) to that.

What you're really doing is summing the *nbytes output of the
->iop_size() call for each possible log item.  For the four log intent
items it's the xfs_FOO_log_format_sizeof() function like you have above.
For inode items it's:

	*nbytes += sizeof(struct xfs_inode_log_format) +
		   xfs_log_dinode_size(ip->i_mount);

> 	return rounddown_pow_of_two(resp->tr_logres / logtotal);

and like I said earlier, you should double logtotal to be on the safe
side with a 2x safety margin:

	/* 100% safety margin for safety's sake */
	return rounddown_pow_of_two(resp->tr_logres /
				    (2 * per_extent_logres));
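
Folding those changes into the earlier sketch, the helper might end up
looking roughly like this (untested; the set of items summed is the
assumed worst case of one of every intent type plus the inode item):

static size_t
xfs_compute_awu_max_extents(
	struct xfs_mount	*mp)
{
	struct xfs_trans_res	*resp = &M_RES(mp)->tr_write;
	/* ->iop_size() contribution for one extent's worth of log items */
	size_t			per_extent_logres =
				xfs_bui_log_format_sizeof(1) +
				xfs_cui_log_format_sizeof(1) +
				xfs_efi_log_format_sizeof(1) +
				xfs_rui_log_format_sizeof(1) +
				sizeof(struct xfs_inode_log_format) +
				xfs_log_dinode_size(mp);

	/* 100% safety margin for safety's sake */
	return rounddown_pow_of_two(resp->tr_logres /
				    (2 * per_extent_logres));
}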

I'm curious what number you get back from this function?  Hopefully it's
at least a few hundred blocks.

Thanks for putting that together.  :)

--D

> }
> 
> static inline void
> xfs_compute_awu_max(
> 	struct xfs_mount	*mp, int jjcount)
> {
> ....
> 	mp->m_awu_max =
> 	min_t(unsigned int, awu_max, xfs_compute_awu_max_extents(mp));
> }
> 
> > 
> > ((I would halve that for the sake of paranoia))
> > 
> > since you have to commit all those intent items into the first
> > transaction in the chain.  The difficulty we've always had is computing
> > the size of an intent item in the ondisk log, since that's a (somewhat
> > minor) layering violation -- it's xfs_cui_log_format_sizeof() for a CUI,
> > but then there' could be overhead for the ondisk log headers themselves.
> > 
> > Maybe we ought to formalize the computation of that since reap.c also
> > has a handwavy XREAP_MAX_DEFER_CHAIN that it uses to roll the scrub
> > transaction periodically... because I'd prefer we not add another
> > hardcoded limit.  My guess is that the software fallback can probably
> > support any awu_max that a hardware wants to throw at us, but let's
> > actually figure out the min(sw, hw) that we can support and cap it at
> > that.
> > 
> > --D
> 
>
Dave Chinner March 13, 2025, 1:25 a.m. UTC | #7
On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > the CoW range and commit the transaction.
> > > > > 
> > > > > This refactoring will be used in future for when it is required to commit
> > > > > a range of extents as a single transaction, similar to how it was done
> > > > > pre-commit d6f215f359637.
> > > > 
> > > > Darrick pointed out that if you do more than just a tiny number
> > > > of extents per transactions you run out of log reservations very
> > > > quickly here:
> > > > 
> > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > 
> > > > how does your scheme deal with that?
> > > > 
> > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > right? Or does the log reservation have a hard size limit, regardless of
> > > that calculation?
> > 
> > The resblks calculated there are the reserved disk blocks

Used for btree block allocations that might be needed during the
processing of the transaction.

> > and have
> > nothing to do with the log reservations, which comes from the
> > tr_write field passed in.  There is some kind of upper limited to it
> > obviously by the log size, although I'm not sure if we've formalized
> > that somewhere.  Dave might be the right person to ask about that.
> 
> The (very very rough) upper limit for how many intent items you can
> attach to a tr_write transaction is:
> 
> per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> max_blocks = tr_write::tr_logres / per_extent_cost
> 
> (ili_size is the inode log item size)

That doesn't sound right. The number of intents we can log is not
dependent on the aggregated size of all intent types. We do not log
all those intent types in a single transaction, nor do we process
more than one type of intent in a given transaction. Also, we only
log the inode once per transaction, so that is not a per-extent
overhead.

Realistically, the tr_write transaction is going to be at least
100kB because it has to be big enough to log full splits of multiple
btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
filesystem spits out:

xfs_trans_resv_calc:  dev 7:0 type 0 logres 193528 logcount 5 flags 0x4

About 190kB.

However, intents are typically very small - around 32 bytes in size
plus another 12 bytes for the log region ophdr.

This implies that we can fit thousands of individual intents in a
single tr_write log reservation on any given filesystem, and the
number of loop iterations in a transaction is therefore dependent
largely on how many intents are logged per iteration.

Hence if we are walking a range of extents in the BMBT to unmap
them, then we should only be generating 2 intents per loop - a BUI
for the BMBT removal and a CUI for the shared refcount decrease.
That means we should be able to run at least a thousand iterations
of that loop per transaction without getting anywhere near the
transaction reservation limits.
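
As a rough back-of-the-envelope using the numbers above (two intents
per extent, each ~32 bytes plus a ~12 byte ophdr):

	per-extent cost    ~= 2 * (32 + 12) bytes  = 88 bytes
	extents per logres ~= 193528 / 88         ~= 2200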

*However!*

We have to relog every intent we haven't processed in the deferred
batch every-so-often to prevent the outstanding intents from pinning
the tail of the log. Hence the larger the number of intents in the
initial batch, the more work we have to do later on (and the more
overall log space and bandwidth they will consume) to relog them
over and over again until they pop to the head of the
processing queue.

Hence there is no real performance advantage to creating massive intent
batches because we end up doing more work later on to relog those
intents to prevent journal space deadlocks. It also doesn't speed up
processing, because we still process the intent chains one at a time
from start to completion before moving on to the next high level
intent chain that needs to be processed.

Further, after the first couple of intent chains have been
processed, the initial log space reservation will have run out, and
we are now asking for a new reservation on every transaction roll we
do. i.e. we are now doing a log space reservation on every
transaction roll in the processing chain instead of only doing it
once per high level intent chain.

Hence from a log space accounting perspective (the hottest code path
in the journal), it is far more efficient to perform a single high
level transaction per extent unmap operation than it is to batch
intents into a single high level transaction.

My advice is this: we should never batch high level iterative
intent-based operations into a single transaction because it's a
false optimisation.  It might look like it is an efficiency
improvement from the high level, but it ends up hammering the hot,
performance critical paths in the transaction subsystem much, much
harder and so will end up being slower than the single transaction
per intent-based operation algorithm when it matters most....

-Dave.
Darrick J. Wong March 13, 2025, 4:51 a.m. UTC | #8
On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote:
> On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > the CoW range and commit the transaction.
> > > > > > 
> > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > pre-commit d6f215f359637.
> > > > > 
> > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > of extents per transactions you run out of log reservations very
> > > > > quickly here:
> > > > > 
> > > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > > 
> > > > > how does your scheme deal with that?
> > > > > 
> > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > that calculation?
> > > 
> > > The resblks calculated there are the reserved disk blocks
> 
> Used for btree block allocations that might be needed during the
> processing of the transaction.
> 
> > > and have
> > > nothing to do with the log reservations, which comes from the
> > > tr_write field passed in.  There is some kind of upper limited to it
> > > obviously by the log size, although I'm not sure if we've formalized
> > > that somewhere.  Dave might be the right person to ask about that.
> > 
> > The (very very rough) upper limit for how many intent items you can
> > attach to a tr_write transaction is:
> > 
> > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > max_blocks = tr_write::tr_logres / per_extent_cost
> > 
> > (ili_size is the inode log item size)
> 
> That doesn't sound right. The number of intents we can log is not
> dependent on the aggregated size of all intent types. We do not log
> all those intent types in a single transaction, nor do we process
> more than one type of intent in a given transaction. Also, we only
> log the inode once per transaction, so that is not a per-extent
> overhead.
> 
> Realistically, the tr_write transaction is goign to be at least a
> 100kB because it has to be big enough to log full splits of multiple
> btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
> filesystem spits out:
> 
> xfs_trans_resv_calc:  dev 7:0 type 0 logres 193528 logcount 5 flags 0x4
> 
> About 190kB.
> 
> However, intents are typically very small - around 32 bytes in size
> plus another 12 bytes for the log region ophdr.
> 
> This implies that we can fit thousands of individual intents in a
> single tr_write log reservation on any given filesystem, and the
> number of loop iterations in a transaction is therefore dependent
> largely on how many intents are logged per iteration.
> 
> Hence if we are walking a range of extents in the BMBT to unmap
> them, then we should only be generating 2 intents per loop - a BUI
> for the BMBT removal and a CUI for the shared refcount decrease.
> That means we should be able to run at least a thousand iterations
> of that loop per transaction without getting anywhere near the
> transaction reservation limits.
> 
> *However!*
> 
> We have to relog every intent we haven't processed in the deferred
> batch every-so-often to prevent the outstanding intents from pinning
> the tail of the log. Hence the larger the number of intents in the
> initial batch, the more work we have to do later on (and the more
> overall log space and bandwidth they will consume) to relog them
> them over and over again until they pop to the head of the
> processing queue.
> 
> Hence there is no real perforamce advantage to creating massive intent
> batches because we end up doing more work later on to relog those
> intents to prevent journal space deadlocks. It also doesn't speed up
> processing, because we still process the intent chains one at a time
> from start to completion before moving on to the next high level
> intent chain that needs to be processed.
> 
> Further, after the first couple of intent chains have been
> processed, the initial log space reservation will have run out, and
> we are now asking for a new resrevation on every transaction roll we
> do. i.e. we now are now doing a log space reservation on every
> transaction roll in the processing chain instead of only doing it
> once per high level intent chain.
> 
> Hence from a log space accounting perspective (the hottest code path
> in the journal), it is far more efficient to perform a single high
> level transaction per extent unmap operation than it is to batch
> intents into a single high level transaction.
> 
> My advice is this: we should never batch high level iterative
> intent-based operations into a single transaction because it's a
> false optimisation.  It might look like it is an efficiency
> improvement from the high level, but it ends up hammering the hot,
> performance critical paths in the transaction subsystem much, much
> harder and so will end up being slower than the single transaction
> per intent-based operation algorithm when it matters most....

How specifically do you propose remapping all the extents in a file
range after an untorn write?  The regular cow ioend does a single
transaction per extent across the entire ioend range and cannot deliver
untorn writes.  This latest proposal does, but now you've torn that idea
down too.

At this point I have run out of ideas and conclude that I can only submit
to your superior intellect.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
John Garry March 13, 2025, 6:11 a.m. UTC | #9
On 13/03/2025 04:51, Darrick J. Wong wrote:
>> Hence if we are walking a range of extents in the BMBT to unmap
>> them, then we should only be generating 2 intents per loop - a BUI
>> for the BMBT removal and a CUI for the shared refcount decrease.
>> That means we should be able to run at least a thousand iterations
>> of that loop per transaction without getting anywhere near the
>> transaction reservation limits.
>>
>> *However!*
>>
>> We have to relog every intent we haven't processed in the deferred
>> batch every-so-often to prevent the outstanding intents from pinning
>> the tail of the log. Hence the larger the number of intents in the
>> initial batch, the more work we have to do later on (and the more
>> overall log space and bandwidth they will consume) to relog them
>> them over and over again until they pop to the head of the
>> processing queue.
>>
>> Hence there is no real perforamce advantage to creating massive intent
>> batches because we end up doing more work later on to relog those
>> intents to prevent journal space deadlocks. It also doesn't speed up
>> processing, because we still process the intent chains one at a time
>> from start to completion before moving on to the next high level
>> intent chain that needs to be processed.
>>
>> Further, after the first couple of intent chains have been
>> processed, the initial log space reservation will have run out, and
>> we are now asking for a new resrevation on every transaction roll we
>> do. i.e. we now are now doing a log space reservation on every
>> transaction roll in the processing chain instead of only doing it
>> once per high level intent chain.
>>
>> Hence from a log space accounting perspective (the hottest code path
>> in the journal), it is far more efficient to perform a single high
>> level transaction per extent unmap operation than it is to batch
>> intents into a single high level transaction.
>>
>> My advice is this: we should never batch high level iterative
>> intent-based operations into a single transaction because it's a
>> false optimisation.  It might look like it is an efficiency
>> improvement from the high level, but it ends up hammering the hot,
>> performance critical paths in the transaction subsystem much, much
>> harder and so will end up being slower than the single transaction
>> per intent-based operation algorithm when it matters most....
> How specifically do you propose remapping all the extents in a file
> range after an untorn write?  The regular cow ioend does a single
> transaction per extent across the entire ioend range and cannot deliver
> untorn writes.  This latest proposal does, but now you've torn that idea
> down too.
> 
> At this point I have run out of ideas and conclude that can only submit
> to your superior intellect.
> 
> --D

I'm hearing that we can fit thousands without getting anywhere near the
limits - this is good.

But then also it is not optimal in terms of performance to batch, right? 
Performance is not so important here. This is for a software fallback, 
which we should not frequently hit. And even if we do, we're still 
typically not going to have many extents.

For our specific purpose, we want 16KB atomic writes - that is a max of 4
extents. So this does not really sound like something to be concerned 
with for these atomic write sizes.

We can add some arbitrary FS awu max, like 64KB, if that makes people 
feel more comfortable.

Thanks,
John
Dave Chinner March 13, 2025, 7:21 a.m. UTC | #10
On Wed, Mar 12, 2025 at 09:51:21PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote:
> > On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> > > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > > the CoW range and commit the transaction.
> > > > > > > 
> > > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > > pre-commit d6f215f359637.
> > > > > > 
> > > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > > of extents per transactions you run out of log reservations very
> > > > > > quickly here:
> > > > > > 
> > > > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > > > 
> > > > > > how does your scheme deal with that?
> > > > > > 
> > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > > that calculation?
> > > > 
> > > > The resblks calculated there are the reserved disk blocks
> > 
> > Used for btree block allocations that might be needed during the
> > processing of the transaction.
> > 
> > > > and have
> > > > nothing to do with the log reservations, which comes from the
> > > > tr_write field passed in.  There is some kind of upper limited to it
> > > > obviously by the log size, although I'm not sure if we've formalized
> > > > that somewhere.  Dave might be the right person to ask about that.
> > > 
> > > The (very very rough) upper limit for how many intent items you can
> > > attach to a tr_write transaction is:
> > > 
> > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > > max_blocks = tr_write::tr_logres / per_extent_cost
> > > 
> > > (ili_size is the inode log item size)
> > 
> > That doesn't sound right. The number of intents we can log is not
> > dependent on the aggregated size of all intent types. We do not log
> > all those intent types in a single transaction, nor do we process
> > more than one type of intent in a given transaction. Also, we only
> > log the inode once per transaction, so that is not a per-extent
> > overhead.
> > 
> > Realistically, the tr_write transaction is goign to be at least a
> > 100kB because it has to be big enough to log full splits of multiple
> > btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
> > filesystem spits out:
> > 
> > xfs_trans_resv_calc:  dev 7:0 type 0 logres 193528 logcount 5 flags 0x4
> > 
> > About 190kB.
> > 
> > However, intents are typically very small - around 32 bytes in size
> > plus another 12 bytes for the log region ophdr.
> > 
> > This implies that we can fit thousands of individual intents in a
> > single tr_write log reservation on any given filesystem, and the
> > number of loop iterations in a transaction is therefore dependent
> > largely on how many intents are logged per iteration.
> > 
> > Hence if we are walking a range of extents in the BMBT to unmap
> > them, then we should only be generating 2 intents per loop - a BUI
> > for the BMBT removal and a CUI for the shared refcount decrease.
> > That means we should be able to run at least a thousand iterations
> > of that loop per transaction without getting anywhere near the
> > transaction reservation limits.
> > 
> > *However!*
> > 
> > We have to relog every intent we haven't processed in the deferred
> > batch every-so-often to prevent the outstanding intents from pinning
> > the tail of the log. Hence the larger the number of intents in the
> > initial batch, the more work we have to do later on (and the more
> > overall log space and bandwidth they will consume) to relog them
> > them over and over again until they pop to the head of the
> > processing queue.
> > 
> > Hence there is no real perforamce advantage to creating massive intent
> > batches because we end up doing more work later on to relog those
> > intents to prevent journal space deadlocks. It also doesn't speed up
> > processing, because we still process the intent chains one at a time
> > from start to completion before moving on to the next high level
> > intent chain that needs to be processed.
> > 
> > Further, after the first couple of intent chains have been
> > processed, the initial log space reservation will have run out, and
> > we are now asking for a new resrevation on every transaction roll we
> > do. i.e. we now are now doing a log space reservation on every
> > transaction roll in the processing chain instead of only doing it
> > once per high level intent chain.
> > 
> > Hence from a log space accounting perspective (the hottest code path
> > in the journal), it is far more efficient to perform a single high
> > level transaction per extent unmap operation than it is to batch
> > intents into a single high level transaction.
> > 
> > My advice is this: we should never batch high level iterative
> > intent-based operations into a single transaction because it's a
> > false optimisation.  It might look like it is an efficiency
> > improvement from the high level, but it ends up hammering the hot,
> > performance critical paths in the transaction subsystem much, much
> > harder and so will end up being slower than the single transaction
> > per intent-based operation algorithm when it matters most....
> 
> How specifically do you propose remapping all the extents in a file
> range after an untorn write?

Sorry, I didn't realise that was the context of the question that
was asked - there was not enough context in the email I replied to
to indicate this important detail. Hence it just looked like a
question about "how many intents can we batch into a single write
transaction reservation".

I gave that answer (thousands) and then recommended against doing
batching like this as an optimisation. Batching operations into a
single context is normally done as an optimisation, so that is what
I assumed was being talked about here....

> The regular cow ioend does a single
> transaction per extent across the entire ioend range and cannot deliver
> untorn writes.  This latest proposal does, but now you've torn that idea
> down too.
>
> At this point I have run out of ideas and conclude that can only submit
> to your superior intellect.

I think you're jumping to incorrect conclusions, and then making
needless personal attacks. This is unacceptable behaviour, Darrick,
and if you keep it up you are going to end up having to explain
yourself to the CoC committee....

Anyway....

Now I understand the assumed context was batching for atomicity and
not optimisation, I'll address that aspect of the suggested
approach: overloading the existing write reservation with a special
case like this is the wrong way to define a new atomic operation.

That is: trying to infer limits of special case behaviour by adding
a heuristic calculation based on the write log reservation
is poor design.

The write reservation varies according to the current filesystem's
geometry (filesystem block size, AG size, capacity, etc) and kernel
version. Hence the batch size supported for atomic writes would then
vary unpredictably from filesystem to filesystem and even within the
same filesystem over time.

Given that we have to abort atomic writes up front if the mapping
covering the atomic write range is more fragmented than this
calculated value, this unpredictable limit will be directly exposed
to userspace via the errors it generates.

We do not want anything even vaguely related to transaction
reservation sizes exposed to userspace. They are, and should always
be, entirely internal filesystem implementation details.

A much better way to define an atomic unmap operation is to set a
fixed maximum number of extents that a batched atomic unmap
operation needs to support. With a fixed extent count, we can then
base the transaction reservation size required for the operation on
this number.

We know that the write length is never going to be larger than 2GB
due to MAX_RW_COUNT bounding of user IO. Hence there is a maximum
number of extents that a single write can map to. Whether that's the
size we try to support is separate discussion, but I point it out as
a hard upper maximum.

Regardless of what we define as the maximum number of extents for
the atomic unmap, we can now calculate the exact log space
reservation that the unmap intents will require. We can then max()
that with the log space reservation needed to process any of those
specific intents. Now we have an exact log reservation for an atomic
unmap of a known, fixed number of extents.
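
A minimal sketch of the intent-logging half of that calculation (the
cap value and helper name here are illustrative assumptions only):

#define XFS_ATOMIC_UNMAP_MAX_EXTENTS	1024	/* fixed cap, value TBD */

static unsigned int
xfs_calc_atomic_unmap_intent_res(void)
{
	/* one BUI (bmap unmap) plus one CUI (refcount decrease) per extent */
	return XFS_ATOMIC_UNMAP_MAX_EXTENTS *
		(xfs_bui_log_format_sizeof(1) +
		 xfs_cui_log_format_sizeof(1));
}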

Then we can look at what the common unmap behaviour is (e.g. through
tracing various test cases), and determine how many extents we
are typically going to convert in a single atomic unmap operation.
That then guides us to an optimal log count for the atomic unmap
reservation.

This gives us a single log space reservation that can handle a known,
fixed number of extents on any filesystem. 

It gives us an optimised log count to minimise the number of log
space reservations the common case code is going to need.

It gives us a reservation size that will contribute to defining the
minimum supported log size for the features enabled in the
filesystem.

It gives us a consistent behaviour that we can directly exercise
from userspace with relatively simple test code based on constant
values.

It means the high level unmap code doesn't rely on heuristics to
prevent transaction reservation overflows.

It means the high level code can use bound loops and compile time
constants without having to worry that they will ever overrun the
capability of the underlying filesystem or kernel.

Simple, huh?

-Dave.
Dave Chinner March 18, 2025, 12:43 a.m. UTC | #11
On Thu, Mar 13, 2025 at 06:11:12AM +0000, John Garry wrote:
> On 13/03/2025 04:51, Darrick J. Wong wrote:
> > > Hence if we are walking a range of extents in the BMBT to unmap
> > > them, then we should only be generating 2 intents per loop - a BUI
> > > for the BMBT removal and a CUI for the shared refcount decrease.
> > > That means we should be able to run at least a thousand iterations
> > > of that loop per transaction without getting anywhere near the
> > > transaction reservation limits.
> > > 
> > > *However!*
> > > 
> > > We have to relog every intent we haven't processed in the deferred
> > > batch every-so-often to prevent the outstanding intents from pinning
> > > the tail of the log. Hence the larger the number of intents in the
> > > initial batch, the more work we have to do later on (and the more
> > > overall log space and bandwidth they will consume) to relog them
> > > them over and over again until they pop to the head of the
> > > processing queue.
> > > 
> > > Hence there is no real perforamce advantage to creating massive intent
> > > batches because we end up doing more work later on to relog those
> > > intents to prevent journal space deadlocks. It also doesn't speed up
> > > processing, because we still process the intent chains one at a time
> > > from start to completion before moving on to the next high level
> > > intent chain that needs to be processed.
> > > 
> > > Further, after the first couple of intent chains have been
> > > processed, the initial log space reservation will have run out, and
> > > we are now asking for a new resrevation on every transaction roll we
> > > do. i.e. we now are now doing a log space reservation on every
> > > transaction roll in the processing chain instead of only doing it
> > > once per high level intent chain.
> > > 
> > > Hence from a log space accounting perspective (the hottest code path
> > > in the journal), it is far more efficient to perform a single high
> > > level transaction per extent unmap operation than it is to batch
> > > intents into a single high level transaction.
> > > 
> > > My advice is this: we should never batch high level iterative
> > > intent-based operations into a single transaction because it's a
> > > false optimisation.  It might look like it is an efficiency
> > > improvement from the high level, but it ends up hammering the hot,
> > > performance critical paths in the transaction subsystem much, much
> > > harder and so will end up being slower than the single transaction
> > > per intent-based operation algorithm when it matters most....
> > How specifically do you propose remapping all the extents in a file
> > range after an untorn write?  The regular cow ioend does a single
> > transaction per extent across the entire ioend range and cannot deliver
> > untorn writes.  This latest proposal does, but now you've torn that idea
> > down too.
> > 
> > At this point I have run out of ideas and conclude that can only submit
> > to your superior intellect.
> > 
> > --D
> 
> I'm hearing that we can fit thousands without getting anywhere the limits -
> this is good.
> 
> But then also it is not optimal in terms of performance to batch, right?
> Performance is not so important here. This is for a software fallback, which
> we should not frequently hit. And even if we do, we're still typically not
> going to have many extents.
> 
> For our specific purpose, we want 16KB atomic writes - that is max of 4
> extents. So this does not really sound like something to be concerned with
> for these atomic write sizes.

Apart from the fact that we should not be overloading some other
transaction reservation definition for this special case? Saying
"it should work" does not justify not thinking about constraints,
layered design, application exposure to error cases, overruns, etc.

i.e. the whole point of the software fallback is to make atomic
writes largely generic. Saying "if we limit them to 16kB" it's not
really generic, is it?

> We can add some arbitrary FS awu max, like 64KB, if that makes people feel
> more comfortable.

I was thinking more like 4-16MB as a usable maximum size for atomic
writes. i.e. allow for whole file atomic overwrites for small-medium
sized files, and decent IO sizes for performance when overwriting
large files.

If we set the max at 4MB, that's 1024 extents on a 4kB
block size filesystem. That gives us 2048 intents in a single unmap
operation, for which we can directly calculate the transaction
reservation size needed. We need to do this as two separate
reservation steps with a max() calculation, because the processing
reservation size reduces with filesystem block size but the extent
unmap intent overhead goes up as the block count increases with
decreasing block size. i.e. the two components of the transaction
reservation scale in different directions.
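
As a rough illustration, reusing the ~44 bytes per logged intent from
earlier in the thread:

	 4kB blocks: 4MB = 1024 extents -> 2048 intents -> ~88kB of intent logres
	64kB blocks: 4MB =   64 extents ->  128 intents -> ~5.5kB of intent logres

while, per the above, the processing component of the reservation moves
in the opposite direction as the block size changes.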

If we are adding a new atomic transaction, we really need to design
it properly from the ground up, not hack around an existing
transaction reservation whilst handwaving about how "it should be
enough".

This "maximum size" is going to be exposed directly to userspace,
hence we need to think carefully about what the maximum supported
limits are going to be and how we can support them with a minimum of
effort long into the future. Hacking around the existing write
transaction isn't the way to do this, especially as that may change
in future as we modify internal allocation behaviour over time.

Saying "we only need 16kB right now, so that's all we should
support" isn't the right approach to take here....

-Dave.
Darrick J. Wong March 22, 2025, 5:19 a.m. UTC | #12
On Thu, Mar 13, 2025 at 06:21:29PM +1100, Dave Chinner wrote:
> On Wed, Mar 12, 2025 at 09:51:21PM -0700, Darrick J. Wong wrote:
> > On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> > > > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > > > the CoW range and commit the transaction.
> > > > > > > > 
> > > > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > > > pre-commit d6f215f359637.
> > > > > > > 
> > > > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > > > of extents per transactions you run out of log reservations very
> > > > > > > quickly here:
> > > > > > > 
> > > > > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > > > > 
> > > > > > > how does your scheme deal with that?
> > > > > > > 
> > > > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > > > that calculation?
> > > > > 
> > > > > The resblks calculated there are the reserved disk blocks
> > > 
> > > Used for btree block allocations that might be needed during the
> > > processing of the transaction.
> > > 
> > > > > and have
> > > > > nothing to do with the log reservations, which comes from the
> > > > > tr_write field passed in.  There is some kind of upper limited to it
> > > > > obviously by the log size, although I'm not sure if we've formalized
> > > > > that somewhere.  Dave might be the right person to ask about that.
> > > > 
> > > > The (very very rough) upper limit for how many intent items you can
> > > > attach to a tr_write transaction is:
> > > > 
> > > > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > > > max_blocks = tr_write::tr_logres / per_extent_cost
> > > > 
> > > > (ili_size is the inode log item size)
> > > 
> > > That doesn't sound right. The number of intents we can log is not
> > > dependent on the aggregated size of all intent types. We do not log
> > > all those intent types in a single transaction, nor do we process
> > > more than one type of intent in a given transaction. Also, we only
> > > log the inode once per transaction, so that is not a per-extent
> > > overhead.
> > > 
> > > Realistically, the tr_write transaction is goign to be at least a
> > > 100kB because it has to be big enough to log full splits of multiple
> > > btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
> > > filesystem spits out:
> > > 
> > > xfs_trans_resv_calc:  dev 7:0 type 0 logres 193528 logcount 5 flags 0x4
> > > 
> > > About 190kB.
> > > 
> > > However, intents are typically very small - around 32 bytes in size
> > > plus another 12 bytes for the log region ophdr.
> > > 
> > > This implies that we can fit thousands of individual intents in a
> > > single tr_write log reservation on any given filesystem, and the
> > > number of loop iterations in a transaction is therefore dependent
> > > largely on how many intents are logged per iteration.
> > > 
> > > Hence if we are walking a range of extents in the BMBT to unmap
> > > them, then we should only be generating 2 intents per loop - a BUI
> > > for the BMBT removal and a CUI for the shared refcount decrease.
> > > That means we should be able to run at least a thousand iterations
> > > of that loop per transaction without getting anywhere near the
> > > transaction reservation limits.
> > > 
> > > *However!*
> > > 
> > > We have to relog every intent we haven't processed in the deferred
> > > batch every-so-often to prevent the outstanding intents from pinning
> > > the tail of the log. Hence the larger the number of intents in the
> > > initial batch, the more work we have to do later on (and the more
> > > overall log space and bandwidth they will consume) to relog them
> > > them over and over again until they pop to the head of the
> > > processing queue.
> > > 
> > > Hence there is no real perforamce advantage to creating massive intent
> > > batches because we end up doing more work later on to relog those
> > > intents to prevent journal space deadlocks. It also doesn't speed up
> > > processing, because we still process the intent chains one at a time
> > > from start to completion before moving on to the next high level
> > > intent chain that needs to be processed.
> > > 
> > > Further, after the first couple of intent chains have been
> > > processed, the initial log space reservation will have run out, and
> > > we are now asking for a new resrevation on every transaction roll we
> > > do. i.e. we now are now doing a log space reservation on every
> > > transaction roll in the processing chain instead of only doing it
> > > once per high level intent chain.
> > > 
> > > Hence from a log space accounting perspective (the hottest code path
> > > in the journal), it is far more efficient to perform a single high
> > > level transaction per extent unmap operation than it is to batch
> > > intents into a single high level transaction.
> > > 
> > > My advice is this: we should never batch high level iterative
> > > intent-based operations into a single transaction because it's a
> > > false optimisation.  It might look like it is an efficiency
> > > improvement from the high level, but it ends up hammering the hot,
> > > performance critical paths in the transaction subsystem much, much
> > > harder and so will end up being slower than the single transaction
> > > per intent-based operation algorithm when it matters most....
> > 
> > How specifically do you propose remapping all the extents in a file
> > range after an untorn write?
> 
> Sorry, I didn't realise that was the context of the question that
> was asked - there was not enough context in the email I replied to
> to indicate this important detail. hence it just looked like a
> question about "how many intents can we batch into a single write
> transaction reservation".
> 
> I gave that answer (thousands) and then recommended against doing
> batching like this as an optimisation. Batching operations into a
> single context is normally done as an optimisation, so that is what
> I assumed was being talked about here....
> 
> > The regular cow ioend does a single
> > transaction per extent across the entire ioend range and cannot deliver
> > untorn writes.  This latest proposal does, but now you've torn that idea
> > down too.
> >
> > At this point I have run out of ideas and conclude that can only submit
> > to your superior intellect.
> 
> I think you're jumping to incorrect conclusions, and then making
> needless personal attacks. This is unacceptable behaviour, Darrick,
> and if you keep it up you are going to end up having to explain
> yourself to the CoC committee....

I apologize to everyone here for using sarcasm.  Anyone should be able
to participate in the FOSS community without fear of being snarked at
or having their choices undermined.  Senior developers ought to enable
collaboration within community norms in any way they can.  I will take
this opportunity to remind myself of how I would like to interact with
other people.

--D

Patch

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index e9791e567bdf..e3e594c65745 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -788,35 +788,19 @@  xfs_reflink_update_quota(
  * requirements as low as possible.
  */
 STATIC int
-xfs_reflink_end_cow_extent(
+xfs_reflink_end_cow_extent_locked(
+	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		*offset_fsb,
 	xfs_fileoff_t		end_fsb)
 {
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_irec	got, del, data;
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
 	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
-	unsigned int		resblks;
 	int			nmaps;
 	bool			isrt = XFS_IS_REALTIME_INODE(ip);
 	int			error;
 
-	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
-			XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-
-	/*
-	 * Lock the inode.  We have to ijoin without automatic unlock because
-	 * the lead transaction is the refcountbt record deletion; the data
-	 * fork update follows as a deferred log item.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
-
 	/*
 	 * In case of racing, overlapping AIO writes no COW extents might be
 	 * left by the time I/O completes for the loser of the race.  In that
@@ -825,7 +809,7 @@  xfs_reflink_end_cow_extent(
 	if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
 	    got.br_startoff >= end_fsb) {
 		*offset_fsb = end_fsb;
-		goto out_cancel;
+		return 0;
 	}
 
 	/*
@@ -839,7 +823,7 @@  xfs_reflink_end_cow_extent(
 		if (!xfs_iext_next_extent(ifp, &icur, &got) ||
 		    got.br_startoff >= end_fsb) {
 			*offset_fsb = end_fsb;
-			goto out_cancel;
+			return 0;
 		}
 	}
 	del = got;
@@ -848,14 +832,14 @@  xfs_reflink_end_cow_extent(
 	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
 			XFS_IEXT_REFLINK_END_COW_CNT);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* Grab the corresponding mapping in the data fork. */
 	nmaps = 1;
 	error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
 			&nmaps, 0);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* We can only remap the smaller of the two extent sizes. */
 	data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
@@ -884,7 +868,7 @@  xfs_reflink_end_cow_extent(
 		error = xfs_bunmapi(NULL, ip, data.br_startoff,
 				data.br_blockcount, 0, 1, &done);
 		if (error)
-			goto out_cancel;
+			return error;
 		ASSERT(done);
 	}
 
@@ -901,17 +885,46 @@  xfs_reflink_end_cow_extent(
 	/* Remove the mapping from the CoW fork. */
 	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 
-	error = xfs_trans_commit(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		return error;
-
 	/* Update the caller about how much progress we made. */
 	*offset_fsb = del.br_startoff + del.br_blockcount;
 	return 0;
+}
 
-out_cancel:
-	xfs_trans_cancel(tp);
+
+/*
+ * Remap part of the CoW fork into the data fork.
+ *
+ * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
+ * into the data fork; this function will remap what it can (at the end of the
+ * range) and update @end_fsb appropriately.  Each remap gets its own
+ * transaction because we can end up merging and splitting bmbt blocks for
+ * every remap operation and we'd like to keep the block reservation
+ * requirements as low as possible.
+ */
+STATIC int
+xfs_reflink_end_cow_extent(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		*offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	unsigned int		resblks;
+	int			error;
+
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);
+	if (error)
+		xfs_trans_cancel(tp);
+	else
+		error = xfs_trans_commit(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }