
[1/1] xfs: fix severe performance problems when fstrimming a subset of an AG

Message ID 171150385535.3220448.4852463781154330350.stgit@frogsfrogsfrogs (mailing list archive)
State Superseded

Commit Message

Darrick J. Wong March 27, 2024, 2:07 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

XFS issues discard IOs while holding the free space btree and the AGF
buffers locked.  If the discard IOs are slow, this can lead to long
stalls for every other thread trying to access that AG.  On a 10TB high
performance flash storage device with a severely fragmented free space
btree in every AG, this results in many threads tripping the hangcheck
warnings while waiting for the AGF.  This happens even after we've run
fstrim a few times and waited for the nvme namespace utilization
counters to stabilize.

Strace for the entire 100TB looks like:
ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>

Reducing the size of the FITRIM requests to a single AG at a time
produces lower times for each individual call, but even this isn't quite
acceptable, because the lock hold times are still high enough to cause
stall warnings:

Strace for the first 4x 1TB AGs looks like (2):
ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>

The fstrim code has to synchronize discards with block allocations, so
we must hold the AGF lock while issuing discard IOs.  Breaking up the
calls into smaller start/len segments ought to reduce the lock hold time
and allow other threads a chance to make progress.  Unfortunately, the
current fstrim implementation handles this poorly because it walks the
entire free space by length index (cntbt) and it's not clear if we can
cycle the AGF periodically to reduce latency because there's no
less-than btree lookup.

The first solution I thought of was to limit latency by scanning parts
of an AG at a time, but this doesn't solve the stalling problem when the
free space is heavily fragmented because each sub-AG scan has to walk
the entire cntbt to find free space that fits within the given range.
In fact, this dramatically increases the runtime!  This itself is a
problem, because sub-AG fstrim runtime is unnecessarily high.

For sub-AG scans, create a second implementation that will walk the
bnobt and perform the trims in block number order.  Since the cursor has
an obviously monotonically increasing value, it is easy to cycle the AGF
periodically to allow other threads to do work.  This implementation
avoids the worst problems of the original code, though it lacks the
desirable attribute of freeing the biggest chunks first.

On the other hand, this second implementation makes it much easier to
constrain the locking latency, and much easier to report fstrim progress
to anyone who's running xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_discard.c |  172 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 169 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig March 27, 2024, 11:35 a.m. UTC | #1
On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> periodically to allow other threads to do work.  This implementation
> avoids the worst problems of the original code, though it lacks the
> desirable attribute of freeing the biggest chunks first.

Do we really care much about freeing larger areas first?  I don't think
it really matters for FITRIM at all.

In other words, I suspect we're better off with only the by-bno
implementation.
Dave Chinner March 27, 2024, 10:15 p.m. UTC | #2
On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> XFS issues discard IOs while holding the free space btree and the AGF
> buffers locked.  If the discard IOs are slow, this can lead to long
> stalls for every other thread trying to access that AG.  On a 10TB high
> performance flash storage device with a severely fragmented free space
> btree in every AG, this results in many threads tripping the hangcheck
> warnings while waiting for the AGF.  This happens even after we've run
> fstrim a few times and waited for the nvme namespace utilization
> counters to stabilize.
> 
> Strace for the entire 100TB looks like:
> ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>
> 
> Reducing the size of the FITRIM requests to a single AG at a time
> produces lower times for each individual call, but even this isn't quite
> acceptable, because the lock hold times are still high enough to cause
> stall warnings:
> 
> Strace for the first 4x 1TB AGs looks like (2):
> ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
> ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
> ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
> ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>

Large FITRIM runtime is not actually an AGF locking latency problem
anymore....

> The fstrim code has to synchronize discards with block allocations, so
> we must hold the AGF lock while issuing discard IOs.

... because I fixed this problem in commit 89cfa899608f ("xfs:
reduce AGF hold times during fstrim operations").

> Breaking up the
> calls into smaller start/len segments ought to reduce the lock hold time
> and allow other threads a chance to make progress.  Unfortunately, the
> current fstrim implementation handles this poorly because it walks the
> entire free space by length index (cntbt) and it's not clear if we can
> cycle the AGF periodically to reduce latency because there's no
> less-than btree lookup.

That commit also fixed this problem.

> The first solution I thought of was to limit latency by scanning parts
> of an AG at a time,

It already does this.

> but this doesn't solve the stalling problem when the
> free space is heavily fragmented because each sub-AG scan has to walk
> the entire cntbt to find free space that fits within the given range.
> In fact, this dramatically increases the runtime!  This itself is a
> problem, because sub-AG fstrim runtime is unnecessarily high.

Ah, so this is a completely different problem to what you describe
above.  i.e. The problem with "sub-ag" fstrim is simply that finding
block range limited free extents is costly in terms of overall time
and CPU usage when you do it via the by-count btree instead of the
by-bno btree...


> For sub-AG scans, create a second implementation that will walk the
> bnobt and perform the trims in block number order.  Since the cursor has
> an obviously monotonically increasing value, it is easy to cycle the AGF
> periodically to allow other threads to do work.  This implementation
> avoids the worst problems of the original code, though it lacks the
> desirable attribute of freeing the biggest chunks first.

Ok, it's fine to do a by-bno search in this case...

> On the other hand, this second implementation will be much easier to
> constrain the locking latency, and makes it much easier to report fstrim
> progress to anyone who's running xfs_scrub.

... but it doesn't change the locking latency of fstrim at all.
Locks are still only held for batches of 100 free extent lookups...

I think a commit message update is necessary. :)

> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_discard.c |  172 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 169 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> index 268bb734dc0a8..ee7a8759091eb 100644
> --- a/fs/xfs/xfs_discard.c
> +++ b/fs/xfs/xfs_discard.c
> @@ -157,9 +157,9 @@ xfs_trim_gather_extents(
>  	uint64_t		*blocks_trimmed)
>  {
>  	struct xfs_mount	*mp = pag->pag_mount;
> -	struct xfs_trans	*tp;
>  	struct xfs_btree_cur	*cur;
>  	struct xfs_buf		*agbp;
> +	struct xfs_trans	*tp;
>  	int			error;
>  	int			i;
>  	int			batch = 100;
> @@ -292,6 +292,160 @@ xfs_trim_gather_extents(
>  	return error;
>  }
>  
> +/* Trim the free space in this AG by block number. */
> +static inline int
> +xfs_trim_gather_bybno(
> +	struct xfs_perag	*pag,
> +	xfs_daddr_t		start,
> +	xfs_daddr_t		end,
> +	xfs_daddr_t		minlen,
> +	struct xfs_alloc_rec_incore *tcur,
> +	struct xfs_busy_extents	*extents,
> +	uint64_t		*blocks_trimmed)
> +{

I'd prefer that we don't copy-n-paste-n-subtly-modify code like
this.  There's very little different between the two gather cases -
the initial cursor setup and the loop exit criteria - so they should
be easy to use a common core loop.

> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_trans	*tp;
> +	struct xfs_btree_cur	*cur;
> +	struct xfs_buf		*agbp;
> +	xfs_daddr_t		end_daddr;
> +	xfs_agnumber_t		agno = pag->pag_agno;
> +	xfs_agblock_t		start_agbno;
> +	xfs_agblock_t		end_agbno;
> +	xfs_extlen_t		minlen_fsb = XFS_BB_TO_FSB(mp, minlen);
> +	int			i;
> +	int			batch = 100;
> +	int			error;
> +
> +	start = max(start, XFS_AGB_TO_DADDR(mp, agno, 0));
> +	start_agbno = xfs_daddr_to_agbno(mp, start);
> +
> +	end_daddr = XFS_AGB_TO_DADDR(mp, agno, pag->block_count);
> +	end = min(end, end_daddr - 1);
> +	end_agbno = xfs_daddr_to_agbno(mp, end);

I think this is the wrong place to do this.  I agree that it is
sensible to use ag-constrained agbnos here, but we should do it
properly and not make the code unnecessarily difficult to maintain
by using unconstrained daddrs in one gather function and constrained
agbnos in another. 

Here's a completely untested, uncompiled version of this by-bno
search I just wrote up to demonstrate that if we pass ag-confined
agbnos from the top level, we don't need to duplicate this gather
code at all or do trim range constraining inside the gather
functions...

-Dave.

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 268bb734dc0a..0c949dfc097a 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -145,14 +145,18 @@ xfs_discard_extents(
 	return error;
 }
 
+struct xfs_trim_cur {
+	xfs_agblock_t	start;
+	xfs_extlen_t	count;
+	xfs_agblock_t	end;
+	xfs_extlen_t	minlen;
+	bool		by_len;
+};
 
 static int
 xfs_trim_gather_extents(
 	struct xfs_perag	*pag,
-	xfs_daddr_t		start,
-	xfs_daddr_t		end,
-	xfs_daddr_t		minlen,
-	struct xfs_alloc_rec_incore *tcur,
+	struct xfs_trim_cur	*tcur,
 	struct xfs_busy_extents	*extents,
 	uint64_t		*blocks_trimmed)
 {
@@ -179,21 +183,21 @@ xfs_trim_gather_extents(
 	if (error)
 		goto out_trans_cancel;
 
-	cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
 
-	/*
-	 * Look up the extent length requested in the AGF and start with it.
-	 */
-	if (tcur->ar_startblock == NULLAGBLOCK)
-		error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i);
-	else
+	if (tcur->by_len) {
+		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
+		error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock,
+				0, &i);
+	} else {
+		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
 		error = xfs_alloc_lookup_le(cur, tcur->ar_startblock,
 				tcur->ar_blockcount, &i);
+	}
 	if (error)
 		goto out_del_cursor;
 	if (i == 0) {
 		/* nothing of that length left in the AG, we are done */
-		tcur->ar_blockcount = 0;
+		tcur->count = 0;
 		goto out_del_cursor;
 	}
 
@@ -221,25 +225,8 @@ xfs_trim_gather_extents(
 			 * Update the cursor to point at this extent so we
 			 * restart the next batch from this extent.
 			 */
-			tcur->ar_startblock = fbno;
-			tcur->ar_blockcount = flen;
-			break;
-		}
-
-		/*
-		 * use daddr format for all range/len calculations as that is
-		 * the format the range/len variables are supplied in by
-		 * userspace.
-		 */
-		dbno = XFS_AGB_TO_DADDR(mp, pag->pag_agno, fbno);
-		dlen = XFS_FSB_TO_BB(mp, flen);
-
-		/*
-		 * Too small?  Give up.
-		 */
-		if (dlen < minlen) {
-			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
-			tcur->ar_blockcount = 0;
+			tcur->start = fbno;
+			tcur->count = flen;
 			break;
 		}
 
@@ -248,10 +235,35 @@ xfs_trim_gather_extents(
 		 * supposed to discard skip it.  Do not bother to trim
 		 * down partially overlapping ranges for now.
 		 */
-		if (dbno + dlen < start || dbno > end) {
+		if (fbno + flen < tcur->start) {
 			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
 			goto next_extent;
 		}
+		if (fbno > tcur->end) {
+			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
+			if (tcur->by_len) {
+				tcur->count = 0;
+				break;
+			}
+			goto next_extent;
+		}
+
+		/* Trim the extent returned to the range we want. */
+		if (fbno < tcur->start) {
+			flen -= tcur->start - fbno;
+			fbno = tcur->start;
+		}
+		if (fbno + flen > tcur->end + 1)
+			flen = tcur->end - fbno + 1;
+
+		/*
+		 * Too small?  Give up.
+		 */
+		if (flen < minlen) {
+			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
+			tcur->count = 0;
+			break;
+		}
 
 		/*
 		 * If any blocks in the range are still busy, skip the
@@ -266,7 +278,10 @@ xfs_trim_gather_extents(
 				&extents->extent_list);
 		*blocks_trimmed += flen;
 next_extent:
-		error = xfs_btree_decrement(cur, 0, &i);
+		if (tcur->by_len)
+			error = xfs_btree_increment(cur, 0, &i);
+		else
+			error = xfs_btree_decrement(cur, 0, &i);
 		if (error)
 			break;
 
@@ -276,7 +291,7 @@ xfs_trim_gather_extents(
 		 * is no more extents to search.
 		 */
 		if (i == 0)
-			tcur->ar_blockcount = 0;
+			tcur->count = 0;
 	}
 
 	/*
@@ -306,17 +321,22 @@ xfs_trim_should_stop(void)
 static int
 xfs_trim_extents(
 	struct xfs_perag	*pag,
-	xfs_daddr_t		start,
-	xfs_daddr_t		end,
-	xfs_daddr_t		minlen,
+	xfs_agblock_t		start,
+	xfs_agblock_t		end,
+	xfs_extlen_t		minlen,
 	uint64_t		*blocks_trimmed)
 {
-	struct xfs_alloc_rec_incore tcur = {
-		.ar_blockcount = pag->pagf_longest,
-		.ar_startblock = NULLAGBLOCK,
+	struct xfs_trim_cur	tcur = {
+		.start		= start,
+		.count		= pag->pagf_longest,
+		.end		= end,
+		.minlen		= minlen,
 	};
 	int			error = 0;
 
+	if (start != 0 || end != pag->block_count)
+		tcur.by_len = true;
+
 	do {
 		struct xfs_busy_extents	*extents;
 
@@ -330,8 +350,8 @@ xfs_trim_extents(
 		extents->owner = extents;
 		INIT_LIST_HEAD(&extents->extent_list);
 
-		error = xfs_trim_gather_extents(pag, start, end, minlen,
-				&tcur, extents, blocks_trimmed);
+		error = xfs_trim_gather_extents(pag, &tcur, extents,
+					blocks_trimmed);
 		if (error) {
 			kfree(extents);
 			break;
@@ -354,7 +374,7 @@ xfs_trim_extents(
 		if (xfs_trim_should_stop())
 			break;
 
-	} while (tcur.ar_blockcount != 0);
+	} while (tcur.count != 0);
 
 	return error;
 
@@ -378,8 +398,10 @@ xfs_ioc_trim(
 	unsigned int		granularity =
 		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
 	struct fstrim_range	range;
-	xfs_daddr_t		start, end, minlen;
+	xfs_daddr_t		start, end;
+	xfs_extlen_t		minlen;
 	xfs_agnumber_t		agno;
+	xfs_agblock_t		start_agbno, end_agbno;
 	uint64_t		blocks_trimmed = 0;
 	int			error, last_error = 0;
 
@@ -399,7 +421,8 @@ xfs_ioc_trim(
 		return -EFAULT;
 
 	range.minlen = max_t(u64, granularity, range.minlen);
-	minlen = BTOBB(range.minlen);
+	minlen = XFS_B_TO_FSB(mp, range.minlen);
+
 	/*
 	 * Truncating down the len isn't actually quite correct, but using
 	 * BBTOB would mean we trivially get overflows for values
@@ -415,12 +438,23 @@ xfs_ioc_trim(
 	start = BTOBB(range.start);
 	end = start + BTOBBT(range.len) - 1;
 
-	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
-		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1;
+	start_agno = xfs_daddr_to_agno(mp, start);
+	start_agbno = xfs_daddr_to_agbno(mp, start);
+	end_agno = xfs_daddr_to_agno(mp, end);
+	end_agbno = xfs_daddr_to_agbno(mp, end);
 
-	agno = xfs_daddr_to_agno(mp, start);
-	for_each_perag_range(mp, agno, xfs_daddr_to_agno(mp, end), pag) {
-		error = xfs_trim_extents(pag, start, end, minlen,
+	if (end_agno >= mp->m_sb.sb_agcount ||
+	    !xfs_verify_agno_agbno(mp, end_agno, end_agbno)) {
+		end_agno = mp->m_sb.sb_agcount - 1;
+		end_agbno = xfs_ag_block_count(mp, end_agno);
+	}
+
+	for_each_perag_range(mp, start_agno, end_agno, pag) {
+		xfs_agblock_t end = mp->m_sb.sb_agblocks;
+
+		if (start_agno == end_agno)
+			end = end_agbno;
+		error = xfs_trim_extents(pag, start_agbno, end, minlen,
 					  &blocks_trimmed);
 		if (error)
 			last_error = error;
@@ -429,6 +463,7 @@ xfs_ioc_trim(
 			xfs_perag_rele(pag);
 			break;
 		}
+		start_agbno = 0;
 	}
 
 	if (last_error)
Darrick J. Wong March 29, 2024, 9:35 p.m. UTC | #3
On Wed, Mar 27, 2024 at 04:35:12AM -0700, Christoph Hellwig wrote:
> On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > periodically to allow other threads to do work.  This implementation
> > avoids the worst problems of the original code, though it lacks the
> > desirable attribute of freeing the biggest chunks first.
> 
> Do we really care much about freeing larger area first?  I don't think
> it really matters for FITRIM at all.
> 
> In other words, I suspect we're better off with only the by-bno
> implementation.

Welll... you could argue that if the underlying "thin provisioning" is
actually just an xfs file, that punching tons of tiny holes in that file
could increase the height of the bmbt.  In that case, you'd want to
reduce the chances of the punch failing with ENOSPC by punching out
larger ranges to free up more blocks.

--D
Darrick J. Wong March 29, 2024, 10:51 p.m. UTC | #4
On Thu, Mar 28, 2024 at 09:15:25AM +1100, Dave Chinner wrote:
> On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > XFS issues discard IOs while holding the free space btree and the AGF
> > buffers locked.  If the discard IOs are slow, this can lead to long
> > stalls for every other thread trying to access that AG.  On a 10TB high
> > performance flash storage device with a severely fragmented free space
> > btree in every AG, this results in many threads tripping the hangcheck
> > warnings while waiting for the AGF.  This happens even after we've run
> > fstrim a few times and waited for the nvme namespace utilization
> > counters to stabilize.
> > 
> > Strace for the entire 100TB looks like:
> > ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>
> > 
> > Reducing the size of the FITRIM requests to a single AG at a time
> > produces lower times for each individual call, but even this isn't quite
> > acceptable, because the lock hold times are still high enough to cause
> > stall warnings:
> > 
> > Strace for the first 4x 1TB AGs looks like (2):
> > ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
> > ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
> > ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
> > ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>
> 
> Large FITRIM runtime is not actually an AGF locking latency problem
> anymore....
> 
> > The fstrim code has to synchronize discards with block allocations, so
> > we must hold the AGF lock while issuing discard IOs.
> 
> ... because I fixed this problem in commit 89cfa899608f ("xfs:
> reduce AGF hold times during fstrim operations").
> 
> > Breaking up the
> > calls into smaller start/len segments ought to reduce the lock hold time
> > and allow other threads a chance to make progress.  Unfortunately, the
> > current fstrim implementation handles this poorly because it walks the
> > entire free space by length index (cntbt) and it's not clear if we can
> > cycle the AGF periodically to reduce latency because there's no
> > less-than btree lookup.
> 
> That commit also fixed this problem.

> > The first solution I thought of was to limit latency by scanning parts
> > of an AG at a time,
> 
> It already does this.

Yeah.  I sent this patch to the list 10 months ago, but nobody ever
responded.  I sent it again at the end of last year, and still nothing.

I guess I forgot to update the commit message on this patch after you
fixed the AGF hold times, but there's still the problem that asking to
trim a subset of an AG causes XFS to walk the entire by-length btree
just to select out the records that actually fit within the criteria.

> > but this doesn't solve the stalling problem when the
> > free space is heavily fragmented because each sub-AG scan has to walk
> > the entire cntbt to find free space that fits within the given range.
> > In fact, this dramatically increases the runtime!  This itself is a
> > problem, because sub-AG fstrim runtime is unnecessarily high.
> 
> Ah, so this is a completely different problem to what you describe
> above.  i.e. The problem with "sub-ag" fstrim is simply that finding
> block range limited free extents is costly in terms of overall time
> and CPU usage when you do it via the by-count btree instead of the
> by-bno btree...

Yes.

> > For sub-AG scans, create a second implementation that will walk the
> > bnobt and perform the trims in block number order.  Since the cursor has
> > an obviously monotonically increasing value, it is easy to cycle the AGF
> > periodically to allow other threads to do work.  This implementation
> > avoids the worst problems of the original code, though it lacks the
> > desirable attribute of freeing the biggest chunks first.
> 
> Ok, it's fine to do a by-bno search in this case...
> 
> > On the other hand, this second implementation will be much easier to
> > constrain the locking latency, and makes it much easier to report fstrim
> > progress to anyone who's running xfs_scrub.
> 
> ... but It doesn't change the locking latency of fstrim at all.
> Locks are still only held for batches of 100 free extent lookups...
> 
> I think a commit message update is necessary. :)

xfs: fix performance problems when fstrimming a subset of a fragmented AG

On a 10TB filesystem where the free space in each AG is heavily
fragmented, I noticed some very high runtimes on a FITRIM call for the
entire filesystem.  xfs_scrub likes to report progress information on
each phase of the scrub, which means that a strace for the entire
filesystem:

ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>

shows that scrub is uncommunicative for the entire duration.  Reducing
the size of the FITRIM requests to a single AG at a time produces lower
times for each individual call, but even this isn't quite acceptable,
because the time between progress reports is still very high:

Strace for the first 4x 1TB AGs looks like (2):
ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>

I then had the idea to limit the length parameter of each call to a
smallish amount (~11GB) so that we could report progress relatively
quickly, but much to my surprise, each FITRIM call still took ~68
seconds!

Unfortunately, the by-length fstrim implementation handles this poorly
because it walks the entire free space by length index (cntbt), which is
a very inefficient way to walk a subset of the blocks of an AG.

Therefore, create a second implementation that will walk the bnobt and
perform the trims in block number order.  This implementation avoids the
worst problems of the original code, though it lacks the desirable
attribute of freeing the biggest chunks first.

On the other hand, this second implementation makes it much easier to
constrain the system call latency, and much easier to report fstrim
progress to anyone who's running xfs_scrub.

> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_discard.c |  172 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 169 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> > index 268bb734dc0a8..ee7a8759091eb 100644
> > --- a/fs/xfs/xfs_discard.c
> > +++ b/fs/xfs/xfs_discard.c
> > @@ -157,9 +157,9 @@ xfs_trim_gather_extents(
> >  	uint64_t		*blocks_trimmed)
> >  {
> >  	struct xfs_mount	*mp = pag->pag_mount;
> > -	struct xfs_trans	*tp;
> >  	struct xfs_btree_cur	*cur;
> >  	struct xfs_buf		*agbp;
> > +	struct xfs_trans	*tp;
> >  	int			error;
> >  	int			i;
> >  	int			batch = 100;
> > @@ -292,6 +292,160 @@ xfs_trim_gather_extents(
> >  	return error;
> >  }
> >  
> > +/* Trim the free space in this AG by block number. */
> > +static inline int
> > +xfs_trim_gather_bybno(
> > +	struct xfs_perag	*pag,
> > +	xfs_daddr_t		start,
> > +	xfs_daddr_t		end,
> > +	xfs_daddr_t		minlen,
> > +	struct xfs_alloc_rec_incore *tcur,
> > +	struct xfs_busy_extents	*extents,
> > +	uint64_t		*blocks_trimmed)
> > +{
> 
> I'd prefer that we don't copy-n-paste-n-subtly-modify code like
> this.  There's very little different between the two gather cases -
> the initial cursor setup and the loop exit criteria - so they should
> be easy to use a common core loop.

<nod>

> > +	struct xfs_mount	*mp = pag->pag_mount;
> > +	struct xfs_trans	*tp;
> > +	struct xfs_btree_cur	*cur;
> > +	struct xfs_buf		*agbp;
> > +	xfs_daddr_t		end_daddr;
> > +	xfs_agnumber_t		agno = pag->pag_agno;
> > +	xfs_agblock_t		start_agbno;
> > +	xfs_agblock_t		end_agbno;
> > +	xfs_extlen_t		minlen_fsb = XFS_BB_TO_FSB(mp, minlen);
> > +	int			i;
> > +	int			batch = 100;
> > +	int			error;
> > +
> > +	start = max(start, XFS_AGB_TO_DADDR(mp, agno, 0));
> > +	start_agbno = xfs_daddr_to_agbno(mp, start);
> > +
> > +	end_daddr = XFS_AGB_TO_DADDR(mp, agno, pag->block_count);
> > +	end = min(end, end_daddr - 1);
> > +	end_agbno = xfs_daddr_to_agbno(mp, end);
> 
> I think this is the wrong place to do this.  I agree that it is
> sensible to use ag-constrained agbnos here, but we should do it
> properly and not make the code unnecessarily difficult to maintain
> by using unconstrained daddrs in one gather function and constrained
> agbnos in another. 
> 
> Here's a completely untested, uncompiled version of this by-bno
> search I just wrote up to demonstrate that if we pass ag-confined
> agbnos from the top level, we don't need to duplicate this gather
> code at all or do trim range constraining inside the gather
> functions...

Mostly looks ok, but let's dig in--

> -Dave.
> 
> diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> index 268bb734dc0a..0c949dfc097a 100644
> --- a/fs/xfs/xfs_discard.c
> +++ b/fs/xfs/xfs_discard.c
> @@ -145,14 +145,18 @@ xfs_discard_extents(
>  	return error;
>  }
>  
> +struct xfs_trim_cur {
> +	xfs_agblock_t	start;
> +	xfs_extlen_t	count;
> +	xfs_agblock_t	end;
> +	xfs_extlen_t	minlen;
> +	bool		by_len;
> +};
>  
>  static int
>  xfs_trim_gather_extents(
>  	struct xfs_perag	*pag,
> -	xfs_daddr_t		start,
> -	xfs_daddr_t		end,
> -	xfs_daddr_t		minlen,
> -	struct xfs_alloc_rec_incore *tcur,
> +	struct xfs_trim_cur	*tcur,
>  	struct xfs_busy_extents	*extents,
>  	uint64_t		*blocks_trimmed)
>  {
> @@ -179,21 +183,21 @@ xfs_trim_gather_extents(
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
>  
> -	/*
> -	 * Look up the extent length requested in the AGF and start with it.
> -	 */
> -	if (tcur->ar_startblock == NULLAGBLOCK)
> -		error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i);
> -	else
> +	if (tcur->by_len) {
> +		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);

I'm confused: why are we searching the by-bno btree if "by_len" is set?
Granted, the logic for setting by_len triggers only for FITRIMs of
subsets of an AG, so this functions correctly...

> +		error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock,
> +				0, &i);

...insofar as this skips a free extent that starts before and ends after
tcur->start.

> +	} else {
> +		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
>  		error = xfs_alloc_lookup_le(cur, tcur->ar_startblock,
>  				tcur->ar_blockcount, &i);
> +	}

How about this initialization logic:

	if (tcur->by_bno) {
		/* sub-AG discard request always starts at tcur->start */
		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
		error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i);
	} else if (tcur->start == NULLAGBLOCK) {
		/* first time through a by-len starts with max length */
		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
		error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i);
	} else {
		/* nth time through a by-len starts where we left off */
		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
		error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i);
	}

>  	if (error)
>  		goto out_del_cursor;
>  	if (i == 0) {
>  		/* nothing of that length left in the AG, we are done */
> -		tcur->ar_blockcount = 0;
> +		tcur->count = 0;
>  		goto out_del_cursor;
>  	}
>  
> @@ -221,25 +225,8 @@ xfs_trim_gather_extents(
>  			 * Update the cursor to point at this extent so we
>  			 * restart the next batch from this extent.
>  			 */
> -			tcur->ar_startblock = fbno;
> -			tcur->ar_blockcount = flen;
> -			break;
> -		}
> -
> -		/*
> -		 * use daddr format for all range/len calculations as that is
> -		 * the format the range/len variables are supplied in by
> -		 * userspace.
> -		 */
> -		dbno = XFS_AGB_TO_DADDR(mp, pag->pag_agno, fbno);
> -		dlen = XFS_FSB_TO_BB(mp, flen);
> -
> -		/*
> -		 * Too small?  Give up.
> -		 */
> -		if (dlen < minlen) {
> -			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
> -			tcur->ar_blockcount = 0;
> +			tcur->start = fbno;
> +			tcur->count = flen;
>  			break;
>  		}
>  
> @@ -248,10 +235,35 @@ xfs_trim_gather_extents(
>  		 * supposed to discard skip it.  Do not bother to trim
>  		 * down partially overlapping ranges for now.
>  		 */
> -		if (dbno + dlen < start || dbno > end) {
> +		if (fbno + flen < tcur->start) {
>  			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
>  			goto next_extent;
>  		}
> +		if (fbno > tcur->end) {
> +			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
> +			if (tcur->by_len) {
> +				tcur->count = 0;
> +				break;
> +			}
> +			goto next_extent;
> +		}
> +
> +		/* Trim the extent returned to the range we want. */
> +		if (fbno < tcur->start) {
> +			flen -= tcur->start - fbno;
> +			fbno = tcur->start;
> +		}
> +		if (fbno + flen > tcur->end + 1)
> +			flen = tcur->end - fbno + 1;
> +
> +		/*
> +		 * Too small?  Give up.
> +		 */
> +		if (flen < minlen) {
> +			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
> +			tcur->count = 0;
> +			break;
> +		}

For a by-bno search, this logic skips the entire rest of the AG after
the first free extent that's smaller than tcur->minlen.  Instead, it
should goto next_extent, yes?

>  
>  		/*
>  		 * If any blocks in the range are still busy, skip the
> @@ -266,7 +278,10 @@ xfs_trim_gather_extents(
>  				&extents->extent_list);
>  		*blocks_trimmed += flen;
>  next_extent:
> -		error = xfs_btree_decrement(cur, 0, &i);
> +		if (tcur->by_len)
> +			error = xfs_btree_increment(cur, 0, &i);
> +		else
> +			error = xfs_btree_decrement(cur, 0, &i);
>  		if (error)
>  			break;
>  
> @@ -276,7 +291,7 @@ xfs_trim_gather_extents(
>  		 * is no more extents to search.
>  		 */
>  		if (i == 0)
> -			tcur->ar_blockcount = 0;
> +			tcur->count = 0;
>  	}
>  
>  	/*
> @@ -306,17 +321,22 @@ xfs_trim_should_stop(void)
>  static int
>  xfs_trim_extents(
>  	struct xfs_perag	*pag,
> -	xfs_daddr_t		start,
> -	xfs_daddr_t		end,
> -	xfs_daddr_t		minlen,
> +	xfs_agblock_t		start,
> +	xfs_agblock_t		end,
> +	xfs_extlen_t		minlen,
>  	uint64_t		*blocks_trimmed)
>  {
> -	struct xfs_alloc_rec_incore tcur = {
> -		.ar_blockcount = pag->pagf_longest,
> -		.ar_startblock = NULLAGBLOCK,
> +	struct xfs_trim_cur	tcur = {
> +		.start		= start,
> +		.count		= pag->pagf_longest,
> +		.end		= end,
> +		.minlen		= minlen,
>  	};
>  	int			error = 0;
>  
> +	if (start != 0 || end != pag->block_count)
> +		tcur.by_len = true;
> +
>  	do {
>  		struct xfs_busy_extents	*extents;
>  
> @@ -330,8 +350,8 @@ xfs_trim_extents(
>  		extents->owner = extents;
>  		INIT_LIST_HEAD(&extents->extent_list);
>  
> -		error = xfs_trim_gather_extents(pag, start, end, minlen,
> -				&tcur, extents, blocks_trimmed);
> +		error = xfs_trim_gather_extents(pag, &tcur, extents,
> +					blocks_trimmed);
>  		if (error) {
>  			kfree(extents);
>  			break;
> @@ -354,7 +374,7 @@ xfs_trim_extents(
>  		if (xfs_trim_should_stop())
>  			break;
>  
> -	} while (tcur.ar_blockcount != 0);
> +	} while (tcur.count != 0);
>  
>  	return error;
>  
> @@ -378,8 +398,10 @@ xfs_ioc_trim(
>  	unsigned int		granularity =
>  		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
>  	struct fstrim_range	range;
> -	xfs_daddr_t		start, end, minlen;
> +	xfs_daddr_t		start, end;
> +	xfs_extlen_t		minlen;
>  	xfs_agnumber_t		agno;
> +	xfs_agblock_t		start_agbno, end_agbno;
>  	uint64_t		blocks_trimmed = 0;
>  	int			error, last_error = 0;
>  
> @@ -399,7 +421,8 @@ xfs_ioc_trim(
>  		return -EFAULT;
>  
>  	range.minlen = max_t(u64, granularity, range.minlen);
> -	minlen = BTOBB(range.minlen);
> +	minlen = XFS_B_TO_FSB(mp, range.minlen);
> +
>  	/*
>  	 * Truncating down the len isn't actually quite correct, but using
>  	 * BBTOB would mean we trivially get overflows for values
> @@ -415,12 +438,23 @@ xfs_ioc_trim(
>  	start = BTOBB(range.start);
>  	end = start + BTOBBT(range.len) - 1;
>  
> -	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
> -		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1;

Couldn't this simply be:

	end = min_t(xfs_daddr_t, start + BTOBBT(range.len) - 1,
		    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1);


--D

> +	start_agno = xfs_daddr_to_agno(mp, start);
> +	start_agbno = xfs_daddr_to_agbno(mp, start);
> +	end_agno = xfs_daddr_to_agno(mp, end);
> +	end_agbno = xfs_daddr_to_agbno(mp, end);
>  
> -	agno = xfs_daddr_to_agno(mp, start);
> -	for_each_perag_range(mp, agno, xfs_daddr_to_agno(mp, end), pag) {
> -		error = xfs_trim_extents(pag, start, end, minlen,
> +	if (end_agno >= mp->m_sb.sb_agcount ||
> +	    !xfs_verify_agno_agbno(mp, end_agno, end_agbno)) {
> +		end_agno = mp->m_sb.sb_agcount - 1;
> +		end_agbno = xfs_ag_block_count(mp, end_agno);
> +	}
> +
> +	for_each_perag_range(mp, start_agno, end_agno, pag) {
> +		xfs_agblock_t end = mp->m_sb.sb_agblocks;
> +
> +		if (start_agno == end_agno)
> +			end = end_agbno;
> +		error = xfs_trim_extents(pag, start_agbno, end, minlen,
>  					  &blocks_trimmed);
>  		if (error)
>  			last_error = error;
> @@ -429,6 +463,7 @@ xfs_ioc_trim(
>  			xfs_perag_rele(pag);
>  			break;
>  		}
> +		start_agbno = 0;
>  	}
>  
>  	if (last_error)
> -- 
> Dave Chinner
> david@fromorbit.com
>
Christoph Hellwig March 30, 2024, 5:38 a.m. UTC | #5
On Fri, Mar 29, 2024 at 02:35:20PM -0700, Darrick J. Wong wrote:
> Welll... you could argue that if the underlying "thin provisioning" is
> actually just an xfs file, that punching tons of tiny holes in that file
> could increase the height of the bmbt.  In that case, you'd want to
> reduce the chances of the punch failing with ENOSPC by punching out
> larger ranges to free up more blocks.

That is a somewhat reasonable line of thought.  But is an underprovisioned
device near ENOSPC really the target here vs an SSD?  Either way,
if we have good reasons for by-cnt except for it being a little simpler
I can live with keeping it.  When in doubt I just prefer to have simple code
and one implementation instead of two.
Dave Chinner March 30, 2024, 9:15 p.m. UTC | #6
On Fri, Mar 29, 2024 at 02:35:20PM -0700, Darrick J. Wong wrote:
> On Wed, Mar 27, 2024 at 04:35:12AM -0700, Christoph Hellwig wrote:
> > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > > periodically to allow other threads to do work.  This implementation
> > > avoids the worst problems of the original code, though it lacks the
> > > desirable attribute of freeing the biggest chunks first.
> > 
> > Do we really care much about freeing larger area first?  I don't think
> > it really matters for FITRIM at all.
> > 
> > In other words, I suspect we're better off with only the by-bno
> > implementation.
> 
> Welll... you could argue that if the underlying "thin provisioning" is
> actually just an xfs file, that punching tons of tiny holes in that file
> could increase the height of the bmbt.  In that case, you'd want to
> reduce the chances of the punch failing with ENOSPC by punching out
> larger ranges to free up more blocks.

That's true, but it's a consideration for dm-thinp, too. I've always
considered FITRIM as an optimisation for dm-thinp systems (rather
than a hardware device optimisation), and so "large extents first"
makes a whole lot more sense from that perspective.

That said, there's another reason for by-size searches being the
default behaviour: FITRIM has a minimum length control parameter.
Hence for fragmented filesystems where there are few extents larger
than the minimum length, we needed to avoid doing an exhaustive
search of the by-bno tree to find extents longer than the minimum
length....

-Dave.
Dave Chinner March 30, 2024, 9:51 p.m. UTC | #7
On Fri, Mar 29, 2024 at 03:51:49PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 28, 2024 at 09:15:25AM +1100, Dave Chinner wrote:
> > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
....
> > I think a commit message update is necessary. :)
> 
> xfs: fix performance problems when fstrimming a subset of a fragmented AG
> 
> On a 10TB filesystem where the free space in each AG is heavily
> fragmented, I noticed some very high runtimes on a FITRIM call for the
> entire filesystem.  xfs_scrub likes to report progress information on
> each phase of the scrub, which means that a strace for the entire
> filesystem:
> 
> ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>
> 
> shows that scrub is uncommunicative for the entire duration.  Reducing
> the size of the FITRIM requests to a single AG at a time produces lower
> times for each individual call, but even this isn't quite acceptable,
> because the time between progress reports are still very high:
> 
> Strace for the first 4x 1TB AGs looks like (2):
> ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
> ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
> ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
> ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>
> 
> I then had the idea to limit the length parameter of each call to a
> smallish amount (~11GB) so that we could report progress relatively
> quickly, but much to my surprise, each FITRIM call still took ~68
> seconds!
> 
> Unfortunately, the by-length fstrim implementation handles this poorly
> because it walks the entire free space by length index (cntbt), which is
> a very inefficient way to walk a subset of the blocks of an AG.
> 
> Therefore, create a second implementation that will walk the bnobt and
> perform the trims in block number order.  This implementation avoids the
> worst problems of the original code, though it lacks the desirable
> attribute of freeing the biggest chunks first.
> 
> On the other hand, this second implementation will be much easier to
> constrain the system call latency, and makes it much easier to report
> fstrim progress to anyone who's running xfs_scrub.

Much better :)

> > Here's a completely untested, uncompiled version of this by-bno
> > search I just wrote up to demonstrate that if we pass ag-confined
> > agbnos from the top level, we don't need to duplicate this gather
> > code at all or do trim range constraining inside the gather
> > functions...
> 
> Mostly looks ok, but let's dig in--
> 
> > -Dave.
> > 
> > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> > index 268bb734dc0a..0c949dfc097a 100644
> > --- a/fs/xfs/xfs_discard.c
> > +++ b/fs/xfs/xfs_discard.c
> > @@ -145,14 +145,18 @@ xfs_discard_extents(
> >  	return error;
> >  }
> >  
> > +struct xfs_trim_cur {
> > +	xfs_agblock_t	start;
> > +	xfs_extlen_t	count;
> > +	xfs_agblock_t	end;
> > +	xfs_extlen_t	minlen;
> > +	bool		by_len;
> > +};
> >  
> >  static int
> >  xfs_trim_gather_extents(
> >  	struct xfs_perag	*pag,
> > -	xfs_daddr_t		start,
> > -	xfs_daddr_t		end,
> > -	xfs_daddr_t		minlen,
> > -	struct xfs_alloc_rec_incore *tcur,
> > +	struct xfs_trim_cur	*tcur,
> >  	struct xfs_busy_extents	*extents,
> >  	uint64_t		*blocks_trimmed)
> >  {
> > @@ -179,21 +183,21 @@ xfs_trim_gather_extents(
> >  	if (error)
> >  		goto out_trans_cancel;
> >  
> > -	cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> >  
> > -	/*
> > -	 * Look up the extent length requested in the AGF and start with it.
> > -	 */
> > -	if (tcur->ar_startblock == NULLAGBLOCK)
> > -		error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i);
> > -	else
> > +	if (tcur->by_len) {
> > +		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
> 
> I'm confused, why are we searching the by-bno btree if "by_len" is set?

Because I changed the logic half way through writing it and forgot
to change the variable name....

> Granted the logic for setting by_len triggers only for FITRIMs of
> subsets of an AG so this functions correctly...
> 
> > +		error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock,
> > +				0, &i);
> 
> ...insofar as this skips a free extent that starts before and ends after
> tcur->start.

Right - did I mention this wasn't even compiled? :)

> 
> > +	} else {
> > +		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> >  		error = xfs_alloc_lookup_le(cur, tcur->ar_startblock,
> >  				tcur->ar_blockcount, &i);
> > +	}
> 
> How about this initialization logic:
> 
> 	if (tcur->by_bno) {
> 		/* sub-AG discard request always starts at tcur->start */
> 		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
> 		error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i);

OK.

> 	} else if (tcur->start == NULLAGBLOCK) {
> 		/* first time through a by-len starts with max length */
> 		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> 		error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i);
> 	} else {
> 		/* nth time through a by-len starts where we left off */
> 		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> 		error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i);
> 	}

... but that is what I was explicitly trying to avoid because the
initial search on a by-count config is for tcur->count =
pag->pagf_longest.

IOWs, there is no "greater" sized free extent than tcur->count,
every free extent *must* be less than or equal to pag->pagf_longest.
The _ge() search is done because the start block of zero is less
than every agbno in the tree. Hence we have to do a _ge() search to
"find" the pag->pagf_longest extent based on start block criteria.

If we pass in a tcur->start = NULLAGBLOCK (0xffffffff) or
pag->pag_blocks, then every agbno is "less than" the start block, and
so it should return the extent at the right most edge of the tree
with a _le() search regardless of where in the AG it is located.

Hence the initial search can also be a _le() search and we don't
have to special case it - the by-count tree has two components in
its search key and if we set the initial values of both components
correctly one search works for all cases...
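A toy model makes that two-component key concrete.  A sorted array and a
linear scan stand in for the cntbt here (this illustrates only the search
semantics, not the kernel's btree code): a _le() lookup keyed on
(pagf_longest, max-agbno) lands on the right-most record, so the initial
search needs no special casing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rec { uint32_t len; uint32_t bno; };

/* Compare two (len, bno) keys - the cntbt's two search components. */
static int key_cmp(uint32_t alen, uint32_t abno, uint32_t blen, uint32_t bbno)
{
	if (alen != blen)
		return alen < blen ? -1 : 1;
	if (abno != bbno)
		return abno < bbno ? -1 : 1;
	return 0;
}

/*
 * Index of the right-most record with key <= (len, bno), or -1 if none.
 * recs[] must be sorted by (len, bno), as the by-count tree is.
 */
static int lookup_le(const struct rec *recs, size_t n,
		     uint32_t len, uint32_t bno)
{
	int found = -1;

	for (size_t i = 0; i < n; i++)
		if (key_cmp(recs[i].len, recs[i].bno, len, bno) <= 0)
			found = (int)i;
	return found;
}
```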

> > @@ -248,10 +235,35 @@ xfs_trim_gather_extents(
> >  		 * supposed to discard skip it.  Do not bother to trim
> >  		 * down partially overlapping ranges for now.
> >  		 */
> > -		if (dbno + dlen < start || dbno > end) {
> > +		if (fbno + flen < tcur->start) {
> >  			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
> >  			goto next_extent;
> >  		}
> > +		if (fbno > tcur->end) {
> > +			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
> > +			if (tcur->by_len) {
> > +				tcur->count = 0;
> > +				break;
> > +			}
> > +			goto next_extent;
> > +		}
> > +
> > +		/* Trim the extent returned to the range we want. */
> > +		if (fbno < tcur->start) {
> > +			flen -= tcur->start - fbno;
> > +			fbno = tcur->start;
> > +		}
> > +		if (fbno + flen > tcur->end + 1)
> > +			flen = tcur->end - fbno + 1;
> > +
> > +		/*
> > +		 * Too small?  Give up.
> > +		 */
> > +		if (flen < minlen) {
> > +			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
> > +			tcur->count = 0;
> > +			break;
> > +		}
> 
> For a by-bno search, this logic skips the entire rest of the AG after
> the first free extent that's smaller than tcur->minlen.  Instead, it
> should goto next_extent, yes?

Yes.

> > @@ -415,12 +438,23 @@ xfs_ioc_trim(
> >  	start = BTOBB(range.start);
> >  	end = start + BTOBBT(range.len) - 1;
> >  
> > -	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
> > -		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1;
> 
> Couldn't this simply be:
> 
> 	end = min_t(xfs_daddr_t, start + BTOBBT(range.len) - 1,
> 		    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1);

Yes.

-Dave.
Darrick J. Wong March 31, 2024, 10:44 p.m. UTC | #8
On Sun, Mar 31, 2024 at 08:15:49AM +1100, Dave Chinner wrote:
> On Fri, Mar 29, 2024 at 02:35:20PM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 27, 2024 at 04:35:12AM -0700, Christoph Hellwig wrote:
> > > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > > > periodically to allow other threads to do work.  This implementation
> > > > avoids the worst problems of the original code, though it lacks the
> > > > desirable attribute of freeing the biggest chunks first.
> > > 
> > > Do we really care much about freeing larger area first?  I don't think
> > > it really matters for FITRIM at all.
> > > 
> > > In other words, I suspect we're better off with only the by-bno
> > > implementation.
> > 
> > Welll... you could argue that if the underlying "thin provisioning" is
> > actually just an xfs file, that punching tons of tiny holes in that file
> > could increase the height of the bmbt.  In that case, you'd want to
> > reduce the chances of the punch failing with ENOSPC by punching out
> > larger ranges to free up more blocks.
> 
> That's true, but it's a consideration for dm-thinp, too. I've always
> considered FITRIM as an optimisation for dm-thinp systems (rather
> than a hardware device optimisation), and so "large extents first"
> make a whole lot more sense from that perspective.
> 
> That said, there's another reason for by-size searches being the
> default behaviour: FITRIM has a minimum length control parameter.
> Hence for fragmented filesystems where there are few extents larger
> than the minimum length, we needed to avoid doing an exhaustive
> search of the by-bno tree to find extents longer than the minimum
> length....

Ooooh, that's a /very/ good point.  xfs_scrub also now constructs a
histogram of free space extent lengths to see if it can reduce the
runtime of phase 8 (FITRIM) by setting minlen to something large enough
to reduce runtime by 4-5x while not missing more than a few percent of
the free space.
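As a rough sketch of that heuristic (the function name, the power-of-two
cutoffs, and the coverage threshold are all illustrative assumptions here,
not xfs_scrub's actual algorithm): scan the extent-length histogram and take
the largest minlen that still keeps most of the free space eligible for
trimming.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the largest minlen cutoff that still leaves at least keep_pct
 * percent of the free space covered by extents of at least that length.
 * A hypothetical sketch: real code would bucket lengths into a histogram
 * once instead of rescanning, and pick thresholds differently.
 */
static uint64_t pick_minlen(const uint32_t *lens, size_t n, unsigned keep_pct)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += lens[i];

	/* try power-of-two minlen cutoffs, largest first */
	for (uint32_t minlen = 1u << 30; minlen > 1; minlen >>= 1) {
		uint64_t covered = 0;

		for (size_t i = 0; i < n; i++)
			if (lens[i] >= minlen)
				covered += lens[i];
		if (covered * 100 >= total * keep_pct)
			return minlen;
	}
	return 1;
}
```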

--D
Darrick J. Wong March 31, 2024, 10:44 p.m. UTC | #9
On Sun, Mar 31, 2024 at 08:51:23AM +1100, Dave Chinner wrote:
> On Fri, Mar 29, 2024 at 03:51:49PM -0700, Darrick J. Wong wrote:
> > On Thu, Mar 28, 2024 at 09:15:25AM +1100, Dave Chinner wrote:
> > > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> ....
> > > I think a commit message update is necessary. :)
> > 
> > xfs: fix performance problems when fstrimming a subset of a fragmented AG
> > 
> > On a 10TB filesystem where the free space in each AG is heavily
> > fragmented, I noticed some very high runtimes on a FITRIM call for the
> > entire filesystem.  xfs_scrub likes to report progress information on
> > each phase of the scrub, which means that a strace for the entire
> > filesystem:
> > 
> > ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>
> > 
> > shows that scrub is uncommunicative for the entire duration.  Reducing
> > the size of the FITRIM requests to a single AG at a time produces lower
> > times for each individual call, but even this isn't quite acceptable,
> > because the time between progress reports are still very high:
> > 
> > Strace for the first 4x 1TB AGs looks like (2):
> > ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
> > ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
> > ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
> > ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>
> > 
> > I then had the idea to limit the length parameter of each call to a
> > smallish amount (~11GB) so that we could report progress relatively
> > quickly, but much to my surprise, each FITRIM call still took ~68
> > seconds!
> > 
> > Unfortunately, the by-length fstrim implementation handles this poorly
> > because it walks the entire free space by length index (cntbt), which is
> > a very inefficient way to walk a subset of the blocks of an AG.
> > 
> > Therefore, create a second implementation that will walk the bnobt and
> > perform the trims in block number order.  This implementation avoids the
> > worst problems of the original code, though it lacks the desirable
> > attribute of freeing the biggest chunks first.
> > 
> > On the other hand, this second implementation will be much easier to
> > constrain the system call latency, and makes it much easier to report
> > fstrim progress to anyone who's running xfs_scrub.
> 
> Much better :)
> 
> > > Here's a completely untested, uncompiled version of this by-bno
> > > search I just wrote up to demonstrate that if we pass ag-confined
> > > agbnos from the top level, we don't need to duplicate this gather
> > > code at all or do trim range constraining inside the gather
> > > functions...
> > 
> > Mostly looks ok, but let's dig in--
> > 
> > > -Dave.
> > > 
> > > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> > > index 268bb734dc0a..0c949dfc097a 100644
> > > --- a/fs/xfs/xfs_discard.c
> > > +++ b/fs/xfs/xfs_discard.c
> > > @@ -145,14 +145,18 @@ xfs_discard_extents(
> > >  	return error;
> > >  }
> > >  
> > > +struct xfs_trim_cur {
> > > +	xfs_agblock_t	start;
> > > +	xfs_extlen_t	count;
> > > +	xfs_agblock_t	end;
> > > +	xfs_extlen_t	minlen;
> > > +	bool		by_len;
> > > +};
> > >  
> > >  static int
> > >  xfs_trim_gather_extents(
> > >  	struct xfs_perag	*pag,
> > > -	xfs_daddr_t		start,
> > > -	xfs_daddr_t		end,
> > > -	xfs_daddr_t		minlen,
> > > -	struct xfs_alloc_rec_incore *tcur,
> > > +	struct xfs_trim_cur	*tcur,
> > >  	struct xfs_busy_extents	*extents,
> > >  	uint64_t		*blocks_trimmed)
> > >  {
> > > @@ -179,21 +183,21 @@ xfs_trim_gather_extents(
> > >  	if (error)
> > >  		goto out_trans_cancel;
> > >  
> > > -	cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> > >  
> > > -	/*
> > > -	 * Look up the extent length requested in the AGF and start with it.
> > > -	 */
> > > -	if (tcur->ar_startblock == NULLAGBLOCK)
> > > -		error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i);
> > > -	else
> > > +	if (tcur->by_len) {
> > > +		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
> > 
> > I'm confused, why are we searching the by-bno btree if "by_len" is set?
> 
> Because I changed the logic half way through writing it and forgot
> to change the variable name....
> 
> > Granted the logic for setting by_len triggers only for FITRIMs of
> > subsets of an AG so this functions correctly...
> > 
> > > +		error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock,
> > > +				0, &i);
> > 
> > ...insofar as this skips a free extent that starts before and ends after
> > tcur->start.
> 
> Right - did I mention this wasn't even compiled? :)
> 
> > 
> > > +	} else {
> > > +		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> > >  		error = xfs_alloc_lookup_le(cur, tcur->ar_startblock,
> > >  				tcur->ar_blockcount, &i);
> > > +	}
> > 
> > How about this initialization logic:
> > 
> > 	if (tcur->by_bno) {
> > 		/* sub-AG discard request always starts at tcur->start */
> > 		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
> > 		error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i);
> 
> OK.
> 
> > 	} else if (tcur->start == NULLAGBLOCK) {
> > 		/* first time through a by-len starts with max length */
> > 		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> > 		error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i);
> > 	} else {
> > 		/* nth time through a by-len starts where we left off */
> > 		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> > 		error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i);
> > 	}
> 
> ... but that is what I was explicilty trying to avoid because the
> initial search on a by-count config is for tcur->count =
> pag->pagf_longest.
> 
> IOWs, there is no "greater" sized free extent than tcur->count,
> every free extent *must* be less than or equal to pag->pagf_longest.
> The _ge() search is done because the start block of zero is less
> than every agbno in the tree. Hence we have to do a _ge() search to
> "find" the pag->pagf_longest extent based on start block criteria.
> 
> If we pass in a tcur->start = NULLAGBLOCK (0xffffffff) or
> pag->pag_blocks then every agbno is "less than" the start block, and
> so it should return the extent at the right most edge of the tree
> with a _le() search regardless of where in the AG it is located.
> 
> Hence the initial search can also be a _le() search and we don't
> have to special case it - the by-count tree has two components in
> it's search key and if we set the initial values of both components
> correctly one search works for all cases...
> 
> > > @@ -248,10 +235,35 @@ xfs_trim_gather_extents(
> > >  		 * supposed to discard skip it.  Do not bother to trim
> > >  		 * down partially overlapping ranges for now.
> > >  		 */
> > > -		if (dbno + dlen < start || dbno > end) {
> > > +		if (fbno + flen < tcur->start) {
> > >  			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
> > >  			goto next_extent;
> > >  		}
> > > +		if (fbno > tcur->end) {
> > > +			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
> > > +			if (tcur->by_len) {
> > > +				tcur->count = 0;
> > > +				break;
> > > +			}
> > > +			goto next_extent;
> > > +		}
> > > +
> > > +		/* Trim the extent returned to the range we want. */
> > > +		if (fbno < tcur->start) {
> > > +			flen -= tcur->start - fbno;
> > > +			fbno = tcur->start;
> > > +		}
> > > +		if (fbno + flen > tcur->end + 1)
> > > +			flen = tcur->end - fbno + 1;
> > > +
> > > +		/*
> > > +		 * Too small?  Give up.
> > > +		 */
> > > +		if (flen < minlen) {
> > > +			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
> > > +			tcur->count = 0;
> > > +			break;
> > > +		}
> > 
> > For a by-bno search, this logic skips the entire rest of the AG after
> > the first free extent that's smaller than tcur->minlen.  Instead, it
> > should goto next_extent, yes?
> 
> Yes.
> 
> > > @@ -415,12 +438,23 @@ xfs_ioc_trim(
> > >  	start = BTOBB(range.start);
> > >  	end = start + BTOBBT(range.len) - 1;
> > >  
> > > -	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
> > > -		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1;
> > 
> > Couldn't this simply be:
> > 
> > 	end = min_t(xfs_daddr_t, start + BTOBBT(range.len) - 1,
> > 		    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1);
> 
> Yes.

How's this for a replacement patch, then?

--D

Subject: xfs: fix performance problems when fstrimming a subset of a fragmented AG

On a 10TB filesystem where the free space in each AG is heavily
fragmented, I noticed some very high runtimes on a FITRIM call for the
entire filesystem.  xfs_scrub likes to report progress information on
each phase of the scrub, which means that a strace for the entire
filesystem:

ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>

shows that scrub is uncommunicative for the entire duration.  Reducing
the size of the FITRIM requests to a single AG at a time produces lower
times for each individual call, but even this isn't quite acceptable,
because the time between progress reports are still very high:

Strace for the first 4x 1TB AGs looks like (2):
ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>

I then had the idea to limit the length parameter of each call to a
smallish amount (~11GB) so that we could report progress relatively
quickly, but much to my surprise, each FITRIM call still took ~68
seconds!

Unfortunately, the by-length fstrim implementation handles this poorly
because it walks the entire free space by length index (cntbt), which is
a very inefficient way to walk a subset of the blocks of an AG.

Therefore, create a second implementation that will walk the bnobt and
perform the trims in block number order.  This implementation avoids the
worst problems of the original code, though it lacks the desirable
attribute of freeing the biggest chunks first.

On the other hand, this second implementation will be much easier to
constrain the system call latency, and makes it much easier to report
fstrim progress to anyone who's running xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_discard.c |  153 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 93 insertions(+), 60 deletions(-)

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 268bb734dc0a8..25fe3b932b5a6 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -145,14 +145,18 @@ xfs_discard_extents(
 	return error;
 }
 
+struct xfs_trim_cur {
+	xfs_agblock_t	start;
+	xfs_extlen_t	count;
+	xfs_agblock_t	end;
+	xfs_extlen_t	minlen;
+	bool		by_bno;
+};
 
 static int
 xfs_trim_gather_extents(
 	struct xfs_perag	*pag,
-	xfs_daddr_t		start,
-	xfs_daddr_t		end,
-	xfs_daddr_t		minlen,
-	struct xfs_alloc_rec_incore *tcur,
+	struct xfs_trim_cur	*tcur,
 	struct xfs_busy_extents	*extents,
 	uint64_t		*blocks_trimmed)
 {
@@ -179,21 +183,26 @@ xfs_trim_gather_extents(
 	if (error)
 		goto out_trans_cancel;
 
-	cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
-
-	/*
-	 * Look up the extent length requested in the AGF and start with it.
-	 */
-	if (tcur->ar_startblock == NULLAGBLOCK)
-		error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i);
-	else
-		error = xfs_alloc_lookup_le(cur, tcur->ar_startblock,
-				tcur->ar_blockcount, &i);
+	if (tcur->by_bno) {
+		/* sub-AG discard request always starts at tcur->start */
+		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
+		error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i);
+		if (!error && !i)
+			error = xfs_alloc_lookup_ge(cur, tcur->start, 0, &i);
+	} else if (tcur->start == 0) {
+		/* first time through a by-len starts with max length */
+		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
+		error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i);
+	} else {
+		/* nth time through a by-len starts where we left off */
+		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
+		error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i);
+	}
 	if (error)
 		goto out_del_cursor;
 	if (i == 0) {
 		/* nothing of that length left in the AG, we are done */
-		tcur->ar_blockcount = 0;
+		tcur->count = 0;
 		goto out_del_cursor;
 	}
 
@@ -204,8 +213,6 @@ xfs_trim_gather_extents(
 	while (i) {
 		xfs_agblock_t	fbno;
 		xfs_extlen_t	flen;
-		xfs_daddr_t	dbno;
-		xfs_extlen_t	dlen;
 
 		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
 		if (error)
@@ -221,37 +228,45 @@ xfs_trim_gather_extents(
 			 * Update the cursor to point at this extent so we
 			 * restart the next batch from this extent.
 			 */
-			tcur->ar_startblock = fbno;
-			tcur->ar_blockcount = flen;
-			break;
-		}
-
-		/*
-		 * use daddr format for all range/len calculations as that is
-		 * the format the range/len variables are supplied in by
-		 * userspace.
-		 */
-		dbno = XFS_AGB_TO_DADDR(mp, pag->pag_agno, fbno);
-		dlen = XFS_FSB_TO_BB(mp, flen);
-
-		/*
-		 * Too small?  Give up.
-		 */
-		if (dlen < minlen) {
-			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
-			tcur->ar_blockcount = 0;
+			tcur->start = fbno;
+			tcur->count = flen;
 			break;
 		}
 
 		/*
 		 * If the extent is entirely outside of the range we are
-		 * supposed to discard skip it.  Do not bother to trim
-		 * down partially overlapping ranges for now.
+		 * supposed to discard, skip it.  Do not bother to trim down
+		 * overlapping ranges for now.
 		 */
-		if (dbno + dlen < start || dbno > end) {
+		if (fbno + flen < tcur->start) {
 			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
 			goto next_extent;
 		}
+		if (fbno > tcur->end) {
+			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
+			if (tcur->by_bno) {
+				tcur->count = 0;
+				break;
+			}
+			goto next_extent;
+		}
+
+		/* Trim the extent returned to the range we want. */
+		if (fbno < tcur->start) {
+			flen -= tcur->start - fbno;
+			fbno = tcur->start;
+		}
+		if (fbno + flen > tcur->end + 1)
+			flen = tcur->end - fbno + 1;
+
+		/* Too small?  Give up. */
+		if (flen < tcur->minlen) {
+			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
+			if (tcur->by_bno)
+				goto next_extent;
+			tcur->count = 0;
+			break;
+		}
 
 		/*
 		 * If any blocks in the range are still busy, skip the
@@ -266,7 +281,10 @@ xfs_trim_gather_extents(
 				&extents->extent_list);
 		*blocks_trimmed += flen;
 next_extent:
-		error = xfs_btree_decrement(cur, 0, &i);
+		if (tcur->by_bno)
+			error = xfs_btree_increment(cur, 0, &i);
+		else
+			error = xfs_btree_decrement(cur, 0, &i);
 		if (error)
 			break;
 
@@ -276,7 +294,7 @@ xfs_trim_gather_extents(
 		 * is no more extents to search.
 		 */
 		if (i == 0)
-			tcur->ar_blockcount = 0;
+			tcur->count = 0;
 	}
 
 	/*
@@ -306,17 +324,22 @@ xfs_trim_should_stop(void)
 static int
 xfs_trim_extents(
 	struct xfs_perag	*pag,
-	xfs_daddr_t		start,
-	xfs_daddr_t		end,
-	xfs_daddr_t		minlen,
+	xfs_agblock_t		start,
+	xfs_agblock_t		end,
+	xfs_extlen_t		minlen,
 	uint64_t		*blocks_trimmed)
 {
-	struct xfs_alloc_rec_incore tcur = {
-		.ar_blockcount = pag->pagf_longest,
-		.ar_startblock = NULLAGBLOCK,
+	struct xfs_trim_cur	tcur = {
+		.start		= start,
+		.count		= pag->pagf_longest,
+		.end		= end,
+		.minlen		= minlen,
 	};
 	int			error = 0;
 
+	if (start != 0 || end != pag->block_count)
+		tcur.by_bno = true;
+
 	do {
 		struct xfs_busy_extents	*extents;
 
@@ -330,8 +353,8 @@ xfs_trim_extents(
 		extents->owner = extents;
 		INIT_LIST_HEAD(&extents->extent_list);
 
-		error = xfs_trim_gather_extents(pag, start, end, minlen,
-				&tcur, extents, blocks_trimmed);
+		error = xfs_trim_gather_extents(pag, &tcur, extents,
+				blocks_trimmed);
 		if (error) {
 			kfree(extents);
 			break;
@@ -354,7 +377,7 @@ xfs_trim_extents(
 		if (xfs_trim_should_stop())
 			break;
 
-	} while (tcur.ar_blockcount != 0);
+	} while (tcur.count != 0);
 
 	return error;
 
@@ -378,8 +401,10 @@ xfs_ioc_trim(
 	unsigned int		granularity =
 		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
 	struct fstrim_range	range;
-	xfs_daddr_t		start, end, minlen;
-	xfs_agnumber_t		agno;
+	xfs_daddr_t		start, end;
+	xfs_extlen_t		minlen;
+	xfs_agnumber_t		start_agno, end_agno;
+	xfs_agblock_t		start_agbno, end_agbno;
 	uint64_t		blocks_trimmed = 0;
 	int			error, last_error = 0;
 
@@ -399,7 +424,8 @@ xfs_ioc_trim(
 		return -EFAULT;
 
 	range.minlen = max_t(u64, granularity, range.minlen);
-	minlen = BTOBB(range.minlen);
+	minlen = XFS_B_TO_FSB(mp, range.minlen);
+
 	/*
 	 * Truncating down the len isn't actually quite correct, but using
 	 * BBTOB would mean we trivially get overflows for values
@@ -413,15 +439,21 @@ xfs_ioc_trim(
 		return -EINVAL;
 
 	start = BTOBB(range.start);
-	end = start + BTOBBT(range.len) - 1;
+	end = min_t(xfs_daddr_t, start + BTOBBT(range.len),
+		    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) - 1;
 
-	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
-		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1;
+	start_agno = xfs_daddr_to_agno(mp, start);
+	start_agbno = xfs_daddr_to_agbno(mp, start);
+	end_agno = xfs_daddr_to_agno(mp, end);
+	end_agbno = xfs_daddr_to_agbno(mp, end);
 
-	agno = xfs_daddr_to_agno(mp, start);
-	for_each_perag_range(mp, agno, xfs_daddr_to_agno(mp, end), pag) {
-		error = xfs_trim_extents(pag, start, end, minlen,
-					  &blocks_trimmed);
+	for_each_perag_range(mp, start_agno, end_agno, pag) {
+		xfs_agblock_t	agend = pag->block_count;
+
+		if (start_agno == end_agno)
+			agend = end_agbno;
+		error = xfs_trim_extents(pag, start_agbno, agend, minlen,
+				&blocks_trimmed);
 		if (error)
 			last_error = error;
 
@@ -429,6 +461,7 @@ xfs_ioc_trim(
 			xfs_perag_rele(pag);
 			break;
 		}
+		start_agbno = 0;
 	}
 
 	if (last_error)
Dave Chinner April 1, 2024, 10:12 p.m. UTC | #10
On Sun, Mar 31, 2024 at 03:44:45PM -0700, Darrick J. Wong wrote:
> 
> How's this for a replacement patch, then?

Looks OK to me.

-Dave.

Patch

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 268bb734dc0a8..ee7a8759091eb 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -157,9 +157,9 @@  xfs_trim_gather_extents(
 	uint64_t		*blocks_trimmed)
 {
 	struct xfs_mount	*mp = pag->pag_mount;
-	struct xfs_trans	*tp;
 	struct xfs_btree_cur	*cur;
 	struct xfs_buf		*agbp;
+	struct xfs_trans	*tp;
 	int			error;
 	int			i;
 	int			batch = 100;
@@ -292,6 +292,160 @@  xfs_trim_gather_extents(
 	return error;
 }
 
+/* Trim the free space in this AG by block number. */
+static inline int
+xfs_trim_gather_bybno(
+	struct xfs_perag	*pag,
+	xfs_daddr_t		start,
+	xfs_daddr_t		end,
+	xfs_daddr_t		minlen,
+	struct xfs_alloc_rec_incore *tcur,
+	struct xfs_busy_extents	*extents,
+	uint64_t		*blocks_trimmed)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_trans	*tp;
+	struct xfs_btree_cur	*cur;
+	struct xfs_buf		*agbp;
+	xfs_daddr_t		end_daddr;
+	xfs_agnumber_t		agno = pag->pag_agno;
+	xfs_agblock_t		start_agbno;
+	xfs_agblock_t		end_agbno;
+	xfs_extlen_t		minlen_fsb = XFS_BB_TO_FSB(mp, minlen);
+	int			i;
+	int			batch = 100;
+	int			error;
+
+	start = max(start, XFS_AGB_TO_DADDR(mp, agno, 0));
+	start_agbno = xfs_daddr_to_agbno(mp, start);
+
+	end_daddr = XFS_AGB_TO_DADDR(mp, agno, pag->block_count);
+	end = min(end, end_daddr - 1);
+	end_agbno = xfs_daddr_to_agbno(mp, end);
+
+	error = xfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
+	if (error)
+		goto out_trans_cancel;
+
+	cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
+
+	/*
+	 * If this is our first time, look for any extent crossing start_agbno.
+	 * Otherwise, continue at the next extent after wherever we left off.
+	 */
+	if (tcur->ar_startblock == NULLAGBLOCK) {
+		error = xfs_alloc_lookup_le(cur, start_agbno, 0, &i);
+		if (error)
+			goto out_del_cursor;
+
+		/*
+		 * If we didn't find anything at or below start_agbno,
+		 * increment the cursor to see if there's another record above
+		 * it.
+		 */
+		if (!i)
+			error = xfs_btree_increment(cur, 0, &i);
+	} else {
+		error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock, 0, &i);
+	}
+	if (error)
+		goto out_del_cursor;
+	if (!i) {
+		/* nothing left in the AG, we are done */
+		tcur->ar_blockcount = 0;
+		goto out_del_cursor;
+	}
+
+	/* Loop the entire range that was asked for. */
+	while (i) {
+		xfs_agblock_t	fbno;
+		xfs_extlen_t	flen;
+
+		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
+		if (error)
+			break;
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
+			error = -EFSCORRUPTED;
+			break;
+		}
+
+		if (--batch <= 0) {
+			/*
+			 * Update the cursor to point at this extent so we
+			 * restart the next batch from this extent.
+			 */
+			tcur->ar_startblock = fbno;
+			tcur->ar_blockcount = flen;
+			break;
+		}
+
+		/* Exit on extents entirely outside of the range. */
+		if (fbno >= end_agbno) {
+			tcur->ar_blockcount = 0;
+			break;
+		}
+		if (fbno + flen < start_agbno)
+			goto next_extent;
+
+		/* Trim the extent returned to the range we want. */
+		if (fbno < start_agbno) {
+			flen -= start_agbno - fbno;
+			fbno = start_agbno;
+		}
+		if (fbno + flen > end_agbno + 1)
+			flen = end_agbno - fbno + 1;
+
+		/* Ignore too small. */
+		if (flen < minlen_fsb) {
+			trace_xfs_discard_toosmall(mp, agno, fbno, flen);
+			goto next_extent;
+		}
+
+		/*
+		 * If any blocks in the range are still busy, skip the
+		 * discard and try again the next time.
+		 */
+		if (xfs_extent_busy_search(mp, pag, fbno, flen)) {
+			trace_xfs_discard_busy(mp, agno, fbno, flen);
+			goto next_extent;
+		}
+
+		xfs_extent_busy_insert_discard(pag, fbno, flen,
+				&extents->extent_list);
+		*blocks_trimmed += flen;
+next_extent:
+		error = xfs_btree_increment(cur, 0, &i);
+		if (error)
+			break;
+
+		/*
+	 * If there are no more records in the tree, we are done. Set the
+		 * cursor block count to 0 to indicate to the caller that there
+		 * are no more extents to search.
+		 */
+		if (i == 0)
+			tcur->ar_blockcount = 0;
+	}
+
+	/*
+	 * If there was an error, release all the gathered busy extents because
+	 * we aren't going to issue a discard on them any more.
+	 */
+	if (error)
+		xfs_extent_busy_clear(mp, &extents->extent_list, false);
+
+out_del_cursor:
+	xfs_btree_del_cursor(cur, error);
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
 static bool
 xfs_trim_should_stop(void)
 {
@@ -315,8 +469,15 @@  xfs_trim_extents(
 		.ar_blockcount = pag->pagf_longest,
 		.ar_startblock = NULLAGBLOCK,
 	};
+	struct xfs_mount	*mp = pag->pag_mount;
+	bool			by_len = true;
 	int			error = 0;
 
+	/* Are we only trimming part of this AG? */
+	if (start > XFS_AGB_TO_DADDR(mp, pag->pag_agno, 0) ||
+	    end < XFS_AGB_TO_DADDR(mp, pag->pag_agno, pag->block_count - 1))
+		by_len = false;
+
 	do {
 		struct xfs_busy_extents	*extents;
 
@@ -330,8 +491,13 @@  xfs_trim_extents(
 		extents->owner = extents;
 		INIT_LIST_HEAD(&extents->extent_list);
 
-		error = xfs_trim_gather_extents(pag, start, end, minlen,
-				&tcur, extents, blocks_trimmed);
+		if (by_len)
+			error = xfs_trim_gather_extents(pag, start, end,
+					minlen, &tcur, extents,
+					blocks_trimmed);
+		else
+			error = xfs_trim_gather_bybno(pag, start, end, minlen,
+					&tcur, extents, blocks_trimmed);
 		if (error) {
 			kfree(extents);
 			break;