| Message ID | 171150385535.3220448.4852463781154330350.stgit@frogsfrogsfrogs (mailing list archive) |
|---|---|
| State | Superseded |
| Series | [1/1] xfs: fix severe performance problems when fstrimming a subset of an AG |
On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> periodically to allow other threads to do work. This implementation
> avoids the worst problems of the original code, though it lacks the
> desirable attribute of freeing the biggest chunks first.

Do we really care much about freeing larger areas first? I don't think
it really matters for FITRIM at all.

In other words, I suspect we're better off with only the by-bno
implementation.
On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> XFS issues discard IOs while holding the free space btree and the AGF
> buffers locked. If the discard IOs are slow, this can lead to long
> stalls for every other thread trying to access that AG. On a 10TB high
> performance flash storage device with a severely fragmented free space
> btree in every AG, this results in many threads tripping the hangcheck
> warnings while waiting for the AGF. This happens even after we've run
> fstrim a few times and waited for the nvme namespace utilization
> counters to stabilize.
>
> Strace for the entire 100TB looks like:
> ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>
>
> Reducing the size of the FITRIM requests to a single AG at a time
> produces lower times for each individual call, but even this isn't quite
> acceptable, because the lock hold times are still high enough to cause
> stall warnings:
>
> Strace for the first 4x 1TB AGs looks like (2):
> ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
> ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
> ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
> ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>

Large FITRIM runtime is not actually an AGF locking latency problem
anymore....

> The fstrim code has to synchronize discards with block allocations, so
> we must hold the AGF lock while issuing discard IOs.

... because I fixed this problem in commit 89cfa899608f ("xfs: reduce
AGF hold times during fstrim operations").

> Breaking up the
> calls into smaller start/len segments ought to reduce the lock hold time
> and allow other threads a chance to make progress.
> Unfortunately, the
> current fstrim implementation handles this poorly because it walks the
> entire free space by length index (cntbt) and it's not clear if we can
> cycle the AGF periodically to reduce latency because there's no
> less-than btree lookup.

That commit also fixed this problem.

> The first solution I thought of was to limit latency by scanning parts
> of an AG at a time,

It already does this.

> but this doesn't solve the stalling problem when the
> free space is heavily fragmented because each sub-AG scan has to walk
> the entire cntbt to find free space that fits within the given range.
> In fact, this dramatically increases the runtime! This itself is a
> problem, because sub-AG fstrim runtime is unnecessarily high.

Ah, so this is a completely different problem to what you describe
above. i.e. The problem with "sub-ag" fstrim is simply that finding
block range limited free extents is costly in terms of overall time
and CPU usage when you do it via the by-count btree instead of the
by-bno btree...

> For sub-AG scans, create a second implementation that will walk the
> bnobt and perform the trims in block number order. Since the cursor has
> an obviously monotonically increasing value, it is easy to cycle the AGF
> periodically to allow other threads to do work. This implementation
> avoids the worst problems of the original code, though it lacks the
> desirable attribute of freeing the biggest chunks first.

Ok, it's fine to do a by-bno search in this case...

> On the other hand, this second implementation will be much easier to
> constrain the locking latency, and makes it much easier to report fstrim
> progress to anyone who's running xfs_scrub.

... but it doesn't change the locking latency of fstrim at all.
Locks are still only held for batches of 100 free extent lookups...

I think a commit message update is necessary. :)

>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_discard.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 169 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> index 268bb734dc0a8..ee7a8759091eb 100644
> --- a/fs/xfs/xfs_discard.c
> +++ b/fs/xfs/xfs_discard.c
> @@ -157,9 +157,9 @@ xfs_trim_gather_extents(
> 	uint64_t		*blocks_trimmed)
> {
> 	struct xfs_mount	*mp = pag->pag_mount;
> -	struct xfs_trans	*tp;
> 	struct xfs_btree_cur	*cur;
> 	struct xfs_buf		*agbp;
> +	struct xfs_trans	*tp;
> 	int			error;
> 	int			i;
> 	int			batch = 100;
> @@ -292,6 +292,160 @@ xfs_trim_gather_extents(
> 	return error;
> }
>
> +/* Trim the free space in this AG by block number. */
> +static inline int
> +xfs_trim_gather_bybno(
> +	struct xfs_perag	*pag,
> +	xfs_daddr_t		start,
> +	xfs_daddr_t		end,
> +	xfs_daddr_t		minlen,
> +	struct xfs_alloc_rec_incore *tcur,
> +	struct xfs_busy_extents	*extents,
> +	uint64_t		*blocks_trimmed)
> +{

I'd prefer that we don't copy-n-paste-n-subtly-modify code like
this. There's very little difference between the two gather cases -
the initial cursor setup and the loop exit criteria - so they should
be easy to use a common core loop.

> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_trans	*tp;
> +	struct xfs_btree_cur	*cur;
> +	struct xfs_buf		*agbp;
> +	xfs_daddr_t		end_daddr;
> +	xfs_agnumber_t		agno = pag->pag_agno;
> +	xfs_agblock_t		start_agbno;
> +	xfs_agblock_t		end_agbno;
> +	xfs_extlen_t		minlen_fsb = XFS_BB_TO_FSB(mp, minlen);
> +	int			i;
> +	int			batch = 100;
> +	int			error;
> +
> +	start = max(start, XFS_AGB_TO_DADDR(mp, agno, 0));
> +	start_agbno = xfs_daddr_to_agbno(mp, start);
> +
> +	end_daddr = XFS_AGB_TO_DADDR(mp, agno, pag->block_count);
> +	end = min(end, end_daddr - 1);
> +	end_agbno = xfs_daddr_to_agbno(mp, end);

I think this is the wrong place to do this.
I agree that it is sensible to use ag-constrained agbnos here, but we
should do it properly and not make the code unnecessarily difficult to
maintain by using unconstrained daddrs in one gather function and
constrained agbnos in another.

Here's a completely untested, uncompiled version of this by-bno search
I just wrote up to demonstrate that if we pass ag-confined agbnos from
the top level, we don't need to duplicate this gather code at all or
do trim range constraining inside the gather functions...

-Dave.

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 268bb734dc0a..0c949dfc097a 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -145,14 +145,18 @@ xfs_discard_extents(
 	return error;
 }
 
+struct xfs_trim_cur {
+	xfs_agblock_t	start;
+	xfs_extlen_t	count;
+	xfs_agblock_t	end;
+	xfs_extlen_t	minlen;
+	bool		by_len;
+};
 
 static int
 xfs_trim_gather_extents(
 	struct xfs_perag	*pag,
-	xfs_daddr_t		start,
-	xfs_daddr_t		end,
-	xfs_daddr_t		minlen,
-	struct xfs_alloc_rec_incore *tcur,
+	struct xfs_trim_cur	*tcur,
 	struct xfs_busy_extents	*extents,
 	uint64_t		*blocks_trimmed)
 {
@@ -179,21 +183,21 @@ xfs_trim_gather_extents(
 	if (error)
 		goto out_trans_cancel;
 
-	cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
-	/*
-	 * Look up the extent length requested in the AGF and start with it.
-	 */
-	if (tcur->ar_startblock == NULLAGBLOCK)
-		error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i);
-	else
+	if (tcur->by_len) {
+		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
+		error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock,
+				0, &i);
+	} else {
+		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
 		error = xfs_alloc_lookup_le(cur, tcur->ar_startblock,
 				tcur->ar_blockcount, &i);
+	}
 	if (error)
 		goto out_del_cursor;
 	if (i == 0) {
 		/* nothing of that length left in the AG, we are done */
-		tcur->ar_blockcount = 0;
+		tcur->count = 0;
 		goto out_del_cursor;
 	}
 
@@ -221,25 +225,8 @@ xfs_trim_gather_extents(
 		 * Update the cursor to point at this extent so we
 		 * restart the next batch from this extent.
 		 */
-		tcur->ar_startblock = fbno;
-		tcur->ar_blockcount = flen;
-		break;
-	}
-
-	/*
-	 * use daddr format for all range/len calculations as that is
-	 * the format the range/len variables are supplied in by
-	 * userspace.
-	 */
-	dbno = XFS_AGB_TO_DADDR(mp, pag->pag_agno, fbno);
-	dlen = XFS_FSB_TO_BB(mp, flen);
-
-	/*
-	 * Too small? Give up.
-	 */
-	if (dlen < minlen) {
-		trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
-		tcur->ar_blockcount = 0;
+		tcur->start = fbno;
+		tcur->count = flen;
 		break;
 	}
 
@@ -248,10 +235,35 @@ xfs_trim_gather_extents(
 		 * supposed to discard skip it. Do not bother to trim
 		 * down partially overlapping ranges for now.
 		 */
-		if (dbno + dlen < start || dbno > end) {
+		if (fbno + flen < tcur->start) {
 			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
 			goto next_extent;
 		}
+		if (fbno > tcur->end) {
+			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
+			if (tcur->by_len) {
+				tcur->count = 0;
+				break;
+			}
+			goto next_extent;
+		}
+
+		/* Trim the extent returned to the range we want. */
+		if (fbno < tcur->start) {
+			flen -= tcur->start - fbno;
+			fbno = tcur->start;
+		}
+		if (fbno + flen > tcur->end + 1)
+			flen = tcur->end - fbno + 1;
+
+		/*
+		 * Too small? Give up.
+		 */
+		if (flen < minlen) {
+			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
+			tcur->count = 0;
+			break;
+		}
 
 		/*
 		 * If any blocks in the range are still busy, skip the
@@ -266,7 +278,10 @@ xfs_trim_gather_extents(
 				&extents->extent_list);
 		*blocks_trimmed += flen;
 next_extent:
-		error = xfs_btree_decrement(cur, 0, &i);
+		if (tcur->by_len)
+			error = xfs_btree_increment(cur, 0, &i);
+		else
+			error = xfs_btree_decrement(cur, 0, &i);
 		if (error)
 			break;
 
@@ -276,7 +291,7 @@ xfs_trim_gather_extents(
 		 * is no more extents to search.
 		 */
 		if (i == 0)
-			tcur->ar_blockcount = 0;
+			tcur->count = 0;
 	}
 
 	/*
@@ -306,17 +321,22 @@ xfs_trim_should_stop(void)
 static int
 xfs_trim_extents(
 	struct xfs_perag	*pag,
-	xfs_daddr_t		start,
-	xfs_daddr_t		end,
-	xfs_daddr_t		minlen,
+	xfs_agblock_t		start,
+	xfs_agblock_t		end,
+	xfs_extlen_t		minlen,
 	uint64_t		*blocks_trimmed)
 {
-	struct xfs_alloc_rec_incore tcur = {
-		.ar_blockcount = pag->pagf_longest,
-		.ar_startblock = NULLAGBLOCK,
+	struct xfs_trim_cur	tcur = {
+		.start		= start,
+		.count		= pag->pagf_longest,
+		.end		= end,
+		.minlen		= minlen,
 	};
 	int			error = 0;
 
+	if (start != 0 || end != pag->block_count)
+		tcur.by_len = true;
+
 	do {
 		struct xfs_busy_extents	*extents;
 
@@ -330,8 +350,8 @@ xfs_trim_extents(
 		extents->owner = extents;
 		INIT_LIST_HEAD(&extents->extent_list);
 
-		error = xfs_trim_gather_extents(pag, start, end, minlen,
-				&tcur, extents, blocks_trimmed);
+		error = xfs_trim_gather_extents(pag, &tcur, extents,
+				blocks_trimmed);
 		if (error) {
 			kfree(extents);
 			break;
@@ -354,7 +374,7 @@ xfs_trim_extents(
 		if (xfs_trim_should_stop())
 			break;
 
-	} while (tcur.ar_blockcount != 0);
+	} while (tcur.count != 0);
 
 	return error;
 
@@ -378,8 +398,10 @@ xfs_ioc_trim(
 	unsigned int		granularity =
 		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
 	struct fstrim_range	range;
-	xfs_daddr_t		start, end, minlen;
+	xfs_daddr_t		start, end;
+	xfs_extlen_t		minlen;
 	xfs_agnumber_t		agno;
+	xfs_agblock_t		start_agbno, end_agbno;
 	uint64_t		blocks_trimmed = 0;
 	int			error, last_error = 0;
 
@@ -399,7 +421,8 @@ xfs_ioc_trim(
 		return -EFAULT;
 
 	range.minlen = max_t(u64, granularity, range.minlen);
-	minlen = BTOBB(range.minlen);
+	minlen = XFS_B_TO_FSB(mp, range.minlen);
+
 	/*
 	 * Truncating down the len isn't actually quite correct, but using
 	 * BBTOB would mean we trivially get overflows for values
@@ -415,12 +438,23 @@ xfs_ioc_trim(
 	start = BTOBB(range.start);
 	end = start + BTOBBT(range.len) - 1;
 
-	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
-		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1;
+	start_agno = xfs_daddr_to_agno(mp, start);
+	start_agbno = xfs_daddr_to_agbno(mp, start);
+	end_agno = xfs_daddr_to_agno(mp, end);
+	end_agbno = xfs_daddr_to_agbno(mp, end);
 
-	agno = xfs_daddr_to_agno(mp, start);
-	for_each_perag_range(mp, agno, xfs_daddr_to_agno(mp, end), pag) {
-		error = xfs_trim_extents(pag, start, end, minlen,
+	if (end_agno >= mp->m_sb.sb_agcount ||
+	    !xfs_verify_agno_agbno(mp, end_agno, end_agbno)) {
+		end_agno = mp->m_sb.sb_agcount - 1;
+		end_agbno = xfs_ag_block_count(mp, end_agno);
+	}
+
+	for_each_perag_range(mp, start_agno, end_agno, pag) {
+		xfs_agblock_t	end = mp->m_sb.sb_agblocks;
+
+		if (start_agno == end_agno)
+			end = end_agbno;
+		error = xfs_trim_extents(pag, start_agbno, end, minlen,
 				&blocks_trimmed);
 		if (error)
 			last_error = error;
@@ -429,6 +463,7 @@ xfs_ioc_trim(
 			xfs_perag_rele(pag);
 			break;
 		}
+		start_agbno = 0;
 	}
 
 	if (last_error)
On Wed, Mar 27, 2024 at 04:35:12AM -0700, Christoph Hellwig wrote:
> On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > periodically to allow other threads to do work. This implementation
> > avoids the worst problems of the original code, though it lacks the
> > desirable attribute of freeing the biggest chunks first.
>
> Do we really care much about freeing larger areas first? I don't think
> it really matters for FITRIM at all.
>
> In other words, I suspect we're better off with only the by-bno
> implementation.

Welll... you could argue that if the underlying "thin provisioning" is
actually just an xfs file, that punching tons of tiny holes in that file
could increase the height of the bmbt. In that case, you'd want to
reduce the chances of the punch failing with ENOSPC by punching out
larger ranges to free up more blocks.

--D
On Thu, Mar 28, 2024 at 09:15:25AM +1100, Dave Chinner wrote:
> On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > XFS issues discard IOs while holding the free space btree and the AGF
> > buffers locked. If the discard IOs are slow, this can lead to long
> > stalls for every other thread trying to access that AG. On a 10TB high
> > performance flash storage device with a severely fragmented free space
> > btree in every AG, this results in many threads tripping the hangcheck
> > warnings while waiting for the AGF. This happens even after we've run
> > fstrim a few times and waited for the nvme namespace utilization
> > counters to stabilize.
> >
> > Strace for the entire 100TB looks like:
> > ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>
> >
> > Reducing the size of the FITRIM requests to a single AG at a time
> > produces lower times for each individual call, but even this isn't quite
> > acceptable, because the lock hold times are still high enough to cause
> > stall warnings:
> >
> > Strace for the first 4x 1TB AGs looks like (2):
> > ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
> > ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
> > ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
> > ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>
>
> Large FITRIM runtime is not actually an AGF locking latency problem
> anymore....
>
> > The fstrim code has to synchronize discards with block allocations, so
> > we must hold the AGF lock while issuing discard IOs.
>
> ... because I fixed this problem in commit 89cfa899608f ("xfs:
> reduce AGF hold times during fstrim operations").
>
> > Breaking up the
> > calls into smaller start/len segments ought to reduce the lock hold time
> > and allow other threads a chance to make progress.
> > Unfortunately, the
> > current fstrim implementation handles this poorly because it walks the
> > entire free space by length index (cntbt) and it's not clear if we can
> > cycle the AGF periodically to reduce latency because there's no
> > less-than btree lookup.
>
> That commit also fixed this problem.
>
> > The first solution I thought of was to limit latency by scanning parts
> > of an AG at a time,
>
> It already does this.

Yeah. I sent this patch to the list 10 months ago, but nobody ever
responded. I sent it again at the end of last year, and still nothing.
I guess I forgot to update the commit message on this patch after you
fixed the AGF hold times, but there's still the problem that asking to
trim a subset of an AG causes XFS to walk the entire by-length btree
just to select out the records that actually fit within the criteria.

> > but this doesn't solve the stalling problem when the
> > free space is heavily fragmented because each sub-AG scan has to walk
> > the entire cntbt to find free space that fits within the given range.
> > In fact, this dramatically increases the runtime! This itself is a
> > problem, because sub-AG fstrim runtime is unnecessarily high.
>
> Ah, so this is a completely different problem to what you describe
> above. i.e. The problem with "sub-ag" fstrim is simply that finding
> block range limited free extents is costly in terms of overall time
> and CPU usage when you do it via the by-count btree instead of the
> by-bno btree...

Yes.

> > For sub-AG scans, create a second implementation that will walk the
> > bnobt and perform the trims in block number order. Since the cursor has
> > an obviously monotonically increasing value, it is easy to cycle the AGF
> > periodically to allow other threads to do work. This implementation
> > avoids the worst problems of the original code, though it lacks the
> > desirable attribute of freeing the biggest chunks first.
>
> Ok, it's fine to do a by-bno search in this case...
>
> > On the other hand, this second implementation will be much easier to
> > constrain the locking latency, and makes it much easier to report fstrim
> > progress to anyone who's running xfs_scrub.
>
> ... but it doesn't change the locking latency of fstrim at all.
> Locks are still only held for batches of 100 free extent lookups...
>
> I think a commit message update is necessary. :)

xfs: fix performance problems when fstrimming a subset of a fragmented AG

On a 10TB filesystem where the free space in each AG is heavily
fragmented, I noticed some very high runtimes on a FITRIM call for the
entire filesystem. xfs_scrub likes to report progress information on
each phase of the scrub, which means that a strace for the entire
filesystem:

ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>

shows that scrub is uncommunicative for the entire duration. Reducing
the size of the FITRIM requests to a single AG at a time produces lower
times for each individual call, but even this isn't quite acceptable,
because the time between progress reports is still very high:

Strace for the first 4x 1TB AGs looks like (2):
ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>

I then had the idea to limit the length parameter of each call to a
smallish amount (~11GB) so that we could report progress relatively
quickly, but much to my surprise, each FITRIM call still took ~68
seconds!

Unfortunately, the by-length fstrim implementation handles this poorly
because it walks the entire free space by length index (cntbt), which is
a very inefficient way to walk a subset of the blocks of an AG.
Therefore, create a second implementation that will walk the bnobt and
perform the trims in block number order. This implementation avoids the
worst problems of the original code, though it lacks the desirable
attribute of freeing the biggest chunks first.

On the other hand, this second implementation will be much easier to
constrain the system call latency, and makes it much easier to report
fstrim progress to anyone who's running xfs_scrub.

> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_discard.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 169 insertions(+), 3 deletions(-)
> >
> >
> > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> > index 268bb734dc0a8..ee7a8759091eb 100644
> > --- a/fs/xfs/xfs_discard.c
> > +++ b/fs/xfs/xfs_discard.c
> > @@ -157,9 +157,9 @@ xfs_trim_gather_extents(
> > 	uint64_t		*blocks_trimmed)
> > {
> > 	struct xfs_mount	*mp = pag->pag_mount;
> > -	struct xfs_trans	*tp;
> > 	struct xfs_btree_cur	*cur;
> > 	struct xfs_buf		*agbp;
> > +	struct xfs_trans	*tp;
> > 	int			error;
> > 	int			i;
> > 	int			batch = 100;
> > @@ -292,6 +292,160 @@ xfs_trim_gather_extents(
> > 	return error;
> > }
> >
> > +/* Trim the free space in this AG by block number. */
> > +static inline int
> > +xfs_trim_gather_bybno(
> > +	struct xfs_perag	*pag,
> > +	xfs_daddr_t		start,
> > +	xfs_daddr_t		end,
> > +	xfs_daddr_t		minlen,
> > +	struct xfs_alloc_rec_incore *tcur,
> > +	struct xfs_busy_extents	*extents,
> > +	uint64_t		*blocks_trimmed)
> > +{
>
> I'd prefer that we don't copy-n-paste-n-subtly-modify code like
> this. There's very little difference between the two gather cases -
> the initial cursor setup and the loop exit criteria - so they should
> be easy to use a common core loop.
<nod>

> > +	struct xfs_mount	*mp = pag->pag_mount;
> > +	struct xfs_trans	*tp;
> > +	struct xfs_btree_cur	*cur;
> > +	struct xfs_buf		*agbp;
> > +	xfs_daddr_t		end_daddr;
> > +	xfs_agnumber_t		agno = pag->pag_agno;
> > +	xfs_agblock_t		start_agbno;
> > +	xfs_agblock_t		end_agbno;
> > +	xfs_extlen_t		minlen_fsb = XFS_BB_TO_FSB(mp, minlen);
> > +	int			i;
> > +	int			batch = 100;
> > +	int			error;
> > +
> > +	start = max(start, XFS_AGB_TO_DADDR(mp, agno, 0));
> > +	start_agbno = xfs_daddr_to_agbno(mp, start);
> > +
> > +	end_daddr = XFS_AGB_TO_DADDR(mp, agno, pag->block_count);
> > +	end = min(end, end_daddr - 1);
> > +	end_agbno = xfs_daddr_to_agbno(mp, end);
>
> I think this is the wrong place to do this. I agree that it is
> sensible to use ag-constrained agbnos here, but we should do it
> properly and not make the code unnecessarily difficult to maintain
> by using unconstrained daddrs in one gather function and constrained
> agbnos in another.
>
> Here's a completely untested, uncompiled version of this by-bno
> search I just wrote up to demonstrate that if we pass ag-confined
> agbnos from the top level, we don't need to duplicate this gather
> code at all or do trim range constraining inside the gather
> functions...

Mostly looks ok, but let's dig in--

> -Dave.
>
> diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> index 268bb734dc0a..0c949dfc097a 100644
> --- a/fs/xfs/xfs_discard.c
> +++ b/fs/xfs/xfs_discard.c
> @@ -145,14 +145,18 @@ xfs_discard_extents(
> 	return error;
> }
>
> +struct xfs_trim_cur {
> +	xfs_agblock_t	start;
> +	xfs_extlen_t	count;
> +	xfs_agblock_t	end;
> +	xfs_extlen_t	minlen;
> +	bool		by_len;
> +};
>
> static int
> xfs_trim_gather_extents(
> 	struct xfs_perag	*pag,
> -	xfs_daddr_t		start,
> -	xfs_daddr_t		end,
> -	xfs_daddr_t		minlen,
> -	struct xfs_alloc_rec_incore *tcur,
> +	struct xfs_trim_cur	*tcur,
> 	struct xfs_busy_extents	*extents,
> 	uint64_t		*blocks_trimmed)
> {
> @@ -179,21 +183,21 @@ xfs_trim_gather_extents(
> 	if (error)
> 		goto out_trans_cancel;
>
> -	cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
>
> -	/*
> -	 * Look up the extent length requested in the AGF and start with it.
> -	 */
> -	if (tcur->ar_startblock == NULLAGBLOCK)
> -		error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i);
> -	else
> +	if (tcur->by_len) {
> +		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);

I'm confused, why are we searching the by-bno btree if "by_len" is set?
Granted the logic for setting by_len triggers only for FITRIMs of
subsets of an AG so this functions correctly...

> +		error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock,
> +				0, &i);

...insofar as this skips a free extent that starts before and ends
after tcur->start.
> +	} else {
> +		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> 		error = xfs_alloc_lookup_le(cur, tcur->ar_startblock,
> 				tcur->ar_blockcount, &i);
> +	}

How about this initialization logic:

	if (tcur->by_bno) {
		/* sub-AG discard request always starts at tcur->start */
		cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag);
		error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i);
	} else if (tcur->start == NULLAGBLOCK) {
		/* first time through a by-len starts with max length */
		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
		error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i);
	} else {
		/* nth time through a by-len starts where we left off */
		cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
		error = xfs_alloc_lookup_le(cur, tcur->start,
				tcur->count, &i);
	}

> 	if (error)
> 		goto out_del_cursor;
> 	if (i == 0) {
> 		/* nothing of that length left in the AG, we are done */
> -		tcur->ar_blockcount = 0;
> +		tcur->count = 0;
> 		goto out_del_cursor;
> 	}
>
> @@ -221,25 +225,8 @@ xfs_trim_gather_extents(
> 		 * Update the cursor to point at this extent so we
> 		 * restart the next batch from this extent.
> 		 */
> -		tcur->ar_startblock = fbno;
> -		tcur->ar_blockcount = flen;
> -		break;
> -	}
> -
> -	/*
> -	 * use daddr format for all range/len calculations as that is
> -	 * the format the range/len variables are supplied in by
> -	 * userspace.
> -	 */
> -	dbno = XFS_AGB_TO_DADDR(mp, pag->pag_agno, fbno);
> -	dlen = XFS_FSB_TO_BB(mp, flen);
> -
> -	/*
> -	 * Too small? Give up.
> -	 */
> -	if (dlen < minlen) {
> -		trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
> -		tcur->ar_blockcount = 0;
> +		tcur->start = fbno;
> +		tcur->count = flen;
> 		break;
> 	}
>
> @@ -248,10 +235,35 @@ xfs_trim_gather_extents(
> 		 * supposed to discard skip it. Do not bother to trim
> 		 * down partially overlapping ranges for now.
> 		 */
> -		if (dbno + dlen < start || dbno > end) {
> +		if (fbno + flen < tcur->start) {
> 			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
> 			goto next_extent;
> 		}
> +		if (fbno > tcur->end) {
> +			trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen);
> +			if (tcur->by_len) {
> +				tcur->count = 0;
> +				break;
> +			}
> +			goto next_extent;
> +		}
> +
> +		/* Trim the extent returned to the range we want. */
> +		if (fbno < tcur->start) {
> +			flen -= tcur->start - fbno;
> +			fbno = tcur->start;
> +		}
> +		if (fbno + flen > tcur->end + 1)
> +			flen = tcur->end - fbno + 1;
> +
> +		/*
> +		 * Too small? Give up.
> +		 */
> +		if (flen < minlen) {
> +			trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
> +			tcur->count = 0;
> +			break;
> +		}

For a by-bno search, this logic skips the entire rest of the AG after
the first free extent that's smaller than tcur->minlen. Instead, it
should goto next_extent, yes?

>
> 		/*
> 		 * If any blocks in the range are still busy, skip the
> @@ -266,7 +278,10 @@ xfs_trim_gather_extents(
> 				&extents->extent_list);
> 		*blocks_trimmed += flen;
> next_extent:
> -		error = xfs_btree_decrement(cur, 0, &i);
> +		if (tcur->by_len)
> +			error = xfs_btree_increment(cur, 0, &i);
> +		else
> +			error = xfs_btree_decrement(cur, 0, &i);
> 		if (error)
> 			break;
>
> @@ -276,7 +291,7 @@ xfs_trim_gather_extents(
> 		 * is no more extents to search.
> 		 */
> 		if (i == 0)
> -			tcur->ar_blockcount = 0;
> +			tcur->count = 0;
> 	}
>
> 	/*
> @@ -306,17 +321,22 @@ xfs_trim_should_stop(void)
> static int
> xfs_trim_extents(
> 	struct xfs_perag	*pag,
> -	xfs_daddr_t		start,
> -	xfs_daddr_t		end,
> -	xfs_daddr_t		minlen,
> +	xfs_agblock_t		start,
> +	xfs_agblock_t		end,
> +	xfs_extlen_t		minlen,
> 	uint64_t		*blocks_trimmed)
> {
> -	struct xfs_alloc_rec_incore tcur = {
> -		.ar_blockcount = pag->pagf_longest,
> -		.ar_startblock = NULLAGBLOCK,
> +	struct xfs_trim_cur	tcur = {
> +		.start		= start,
> +		.count		= pag->pagf_longest,
> +		.end		= end,
> +		.minlen		= minlen,
> 	};
> 	int			error = 0;
>
> +	if (start != 0 || end != pag->block_count)
> +		tcur.by_len = true;
> +
> 	do {
> 		struct xfs_busy_extents	*extents;
>
> @@ -330,8 +350,8 @@ xfs_trim_extents(
> 		extents->owner = extents;
> 		INIT_LIST_HEAD(&extents->extent_list);
>
> -		error = xfs_trim_gather_extents(pag, start, end, minlen,
> -				&tcur, extents, blocks_trimmed);
> +		error = xfs_trim_gather_extents(pag, &tcur, extents,
> +				blocks_trimmed);
> 		if (error) {
> 			kfree(extents);
> 			break;
> @@ -354,7 +374,7 @@ xfs_trim_extents(
> 		if (xfs_trim_should_stop())
> 			break;
>
> -	} while (tcur.ar_blockcount != 0);
> +	} while (tcur.count != 0);
>
> 	return error;
>
> @@ -378,8 +398,10 @@ xfs_ioc_trim(
> 	unsigned int		granularity =
> 		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
> 	struct fstrim_range	range;
> -	xfs_daddr_t		start, end, minlen;
> +	xfs_daddr_t		start, end;
> +	xfs_extlen_t		minlen;
> 	xfs_agnumber_t		agno;
> +	xfs_agblock_t		start_agbno, end_agbno;
> 	uint64_t		blocks_trimmed = 0;
> 	int			error, last_error = 0;
>
> @@ -399,7 +421,8 @@ xfs_ioc_trim(
> 		return -EFAULT;
>
> 	range.minlen = max_t(u64, granularity, range.minlen);
> -	minlen = BTOBB(range.minlen);
> +	minlen = XFS_B_TO_FSB(mp, range.minlen);
> +
> 	/*
> 	 * Truncating down the len isn't actually quite correct, but using
> 	 * BBTOB would mean we trivially get overflows for values
> @@ -415,12 +438,23 @@ xfs_ioc_trim(
> 	start = BTOBB(range.start);
> 	end = start + BTOBBT(range.len) - 1;
>
> -	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
> -		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1;

Couldn't this simply be:

	end = min_t(xfs_daddr_t, start + BTOBBT(range.len) - 1,
			XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1);

--D

> +	start_agno = xfs_daddr_to_agno(mp, start);
> +	start_agbno = xfs_daddr_to_agbno(mp, start);
> +	end_agno = xfs_daddr_to_agno(mp, end);
> +	end_agbno = xfs_daddr_to_agbno(mp, end);
>
> -	agno = xfs_daddr_to_agno(mp, start);
> -	for_each_perag_range(mp, agno, xfs_daddr_to_agno(mp, end), pag) {
> -		error = xfs_trim_extents(pag, start, end, minlen,
> +	if (end_agno >= mp->m_sb.sb_agcount ||
> +	    !xfs_verify_agno_agbno(mp, end_agno, end_agbno)) {
> +		end_agno = mp->m_sb.sb_agcount - 1;
> +		end_agbno = xfs_ag_block_count(mp, end_agno);
> +	}
> +
> +	for_each_perag_range(mp, start_agno, end_agno, pag) {
> +		xfs_agblock_t	end = mp->m_sb.sb_agblocks;
> +
> +		if (start_agno == end_agno)
> +			end = end_agbno;
> +		error = xfs_trim_extents(pag, start_agbno, end, minlen,
> 				&blocks_trimmed);
> 		if (error)
> 			last_error = error;
> @@ -429,6 +463,7 @@ xfs_ioc_trim(
> 			xfs_perag_rele(pag);
> 			break;
> 		}
> +		start_agbno = 0;
> 	}
>
> 	if (last_error)
> --
> Dave Chinner
> david@fromorbit.com
On Fri, Mar 29, 2024 at 02:35:20PM -0700, Darrick J. Wong wrote:
> Welll... you could argue that if the underlying "thin provisioning" is
> actually just an xfs file, that punching tons of tiny holes in that file
> could increase the height of the bmbt. In that case, you'd want to
> reduce the chances of the punch failing with ENOSPC by punching out
> larger ranges to free up more blocks.

That is a somewhat reasonable line of thought. But is an
underprovisioned device near ENOSPC really the target here vs an SSD?

Either way, if we have good reasons for by-cnt except for it being a
little simpler I can live with keeping it. In doubt I just prefer to
have simple code and one implementation instead of two.
On Fri, Mar 29, 2024 at 02:35:20PM -0700, Darrick J. Wong wrote:
> On Wed, Mar 27, 2024 at 04:35:12AM -0700, Christoph Hellwig wrote:
> > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote:
> > > periodically to allow other threads to do work. This implementation
> > > avoids the worst problems of the original code, though it lacks the
> > > desirable attribute of freeing the biggest chunks first.
> >
> > Do we really care much about freeing larger areas first? I don't think
> > it really matters for FITRIM at all.
> >
> > In other words, I suspect we're better off with only the by-bno
> > implementation.
>
> Welll... you could argue that if the underlying "thin provisioning" is
> actually just an xfs file, that punching tons of tiny holes in that file
> could increase the height of the bmbt. In that case, you'd want to
> reduce the chances of the punch failing with ENOSPC by punching out
> larger ranges to free up more blocks.

That's true, but it's a consideration for dm-thinp, too. I've always
considered FITRIM as an optimisation for dm-thinp systems (rather than
a hardware device optimisation), and so "large extents first" makes a
whole lot more sense from that perspective.

That said, there's another reason for by-size searches being the
default behaviour: FITRIM has a minimum length control parameter.
Hence for fragmented filesystems where there are few extents larger
than the minimum length, we needed to avoid doing an exhaustive search
of the by-bno tree to find extents longer than the minimum length....

-Dave.
On Fri, Mar 29, 2024 at 03:51:49PM -0700, Darrick J. Wong wrote: > On Thu, Mar 28, 2024 at 09:15:25AM +1100, Dave Chinner wrote: > > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote: > > > From: Darrick J. Wong <djwong@kernel.org> .... > > I think a commit message update is necessary. :) > > xfs: fix performance problems when fstrimming a subset of a fragmented AG > > On a 10TB filesystem where the free space in each AG is heavily > fragmented, I noticed some very high runtimes on a FITRIM call for the > entire filesystem. xfs_scrub likes to report progress information on > each phase of the scrub, which means that a strace for the entire > filesystem: > > ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839> > > shows that scrub is uncommunicative for the entire duration. Reducing > the size of the FITRIM requests to a single AG at a time produces lower > times for each individual call, but even this isn't quite acceptable, > because the time between progress reports are still very high: > > Strace for the first 4x 1TB AGs looks like (2): > ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033> > ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323> > ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226> > ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744> > > I then had the idea to limit the length parameter of each call to a > smallish amount (~11GB) so that we could report progress relatively > quickly, but much to my surprise, each FITRIM call still took ~68 > seconds! > > Unfortunately, the by-length fstrim implementation handles this poorly > because it walks the entire free space by length index (cntbt), which is > a very inefficient way to walk a subset of the blocks of an AG. > > Therefore, create a second implementation that will walk the bnobt and > perform the trims in block number order. 
This implementation avoids the > worst problems of the original code, though it lacks the desirable > attribute of freeing the biggest chunks first. > > On the other hand, this second implementation will be much easier to > constrain the system call latency, and makes it much easier to report > fstrim progress to anyone who's running xfs_scrub. Much better :) > > Here's a completely untested, uncompiled version of this by-bno > > search I just wrote up to demonstrate that if we pass ag-confined > > agbnos from the top level, we don't need to duplicate this gather > > code at all or do trim range constraining inside the gather > > functions... > > Mostly looks ok, but let's dig in-- > > > -Dave. > > > > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c > > index 268bb734dc0a..0c949dfc097a 100644 > > --- a/fs/xfs/xfs_discard.c > > +++ b/fs/xfs/xfs_discard.c > > @@ -145,14 +145,18 @@ xfs_discard_extents( > > return error; > > } > > > > +struct xfs_trim_cur { > > + xfs_agblock_t start; > > + xfs_extlen_t count; > > + xfs_agblock_t end; > > + xfs_extlen_t minlen; > > + bool by_len; > > +}; > > > > static int > > xfs_trim_gather_extents( > > struct xfs_perag *pag, > > - xfs_daddr_t start, > > - xfs_daddr_t end, > > - xfs_daddr_t minlen, > > - struct xfs_alloc_rec_incore *tcur, > > + struct xfs_trim_cur *tcur, > > struct xfs_busy_extents *extents, > > uint64_t *blocks_trimmed) > > { > > @@ -179,21 +183,21 @@ xfs_trim_gather_extents( > > if (error) > > goto out_trans_cancel; > > > > - cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > > > - /* > > - * Look up the extent length requested in the AGF and start with it. > > - */ > > - if (tcur->ar_startblock == NULLAGBLOCK) > > - error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i); > > - else > > + if (tcur->by_len) { > > + cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); > > I'm confused, why are we searching the by-bno btree if "by_len" is set? 
Because I changed the logic halfway through writing it and forgot to change the variable name.... > Granted the logic for setting by_len triggers only for FITRIMs of > subsets of an AG so this functions correctly... > > > + error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock, > > + 0, &i); > > ...insofar as this skips a free extent that starts before and ends after > tcur->start. Right - did I mention this wasn't even compiled? :) > > > + } else { > > + cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > error = xfs_alloc_lookup_le(cur, tcur->ar_startblock, > > tcur->ar_blockcount, &i); > > + } > > How about this initialization logic: > > if (tcur->by_bno) { > /* sub-AG discard request always starts at tcur->start */ > cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); > error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i); OK. > } else if (tcur->start == NULLAGBLOCK) { > /* first time through a by-len starts with max length */ > cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i); > } else { > /* nth time through a by-len starts where we left off */ > cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i); > } ... but that is what I was explicitly trying to avoid because the initial search on a by-count config is for tcur->count = pag->pagf_longest. IOWs, there is no "greater" sized free extent than tcur->count, every free extent *must* be less than or equal to pag->pagf_longest. The _ge() search is done because the start block of zero is less than every agbno in the tree. Hence we have to do a _ge() search to "find" the pag->pagf_longest extent based on start block criteria. If we pass in a tcur->start = NULLAGBLOCK (0xffffffff) or pag->pag_blocks then every agbno is "less than" the start block, and so it should return the extent at the rightmost edge of the tree with a _le() search regardless of where in the AG it is located.
Hence the initial search can also be a _le() search and we don't have to special case it - the by-count tree has two components in its search key and if we set the initial values of both components correctly one search works for all cases... > > @@ -248,10 +235,35 @@ xfs_trim_gather_extents( > > * supposed to discard skip it. Do not bother to trim > > * down partially overlapping ranges for now. > > */ > > - if (dbno + dlen < start || dbno > end) { > > + if (fbno + flen < tcur->start) { > > trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); > > goto next_extent; > > } > > + if (fbno > tcur->end) { > > + trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); > > + if (tcur->by_len) { > > + tcur->count = 0; > > + break; > > + } > > + goto next_extent; > > + } > > + > > + /* Trim the extent returned to the range we want. */ > > + if (fbno < tcur->start) { > > + flen -= tcur->start - fbno; > > + fbno = tcur->start; > > + } > > + if (fbno + flen > tcur->end + 1) > > + flen = tcur->end - fbno + 1; > > + > > + /* > > + * Too small? Give up. > > + */ > > + if (flen < minlen) { > > + trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen); > > + tcur->count = 0; > > + break; > > + } > > For a by-bno search, this logic skips the entire rest of the AG after > the first free extent that's smaller than tcur->minlen. Instead, it > should goto next_extent, yes? Yes. > > @@ -415,12 +438,23 @@ xfs_ioc_trim( > > start = BTOBB(range.start); > > end = start + BTOBBT(range.len) - 1; > > > > - if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1) > > - end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1; > > Couldn't this simply be: > > end = min_t(xfs_daddr_t, start + BTOBBT(range.len) - 1, > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1); Yes. -Dave.
On Sun, Mar 31, 2024 at 08:15:49AM +1100, Dave Chinner wrote: > On Fri, Mar 29, 2024 at 02:35:20PM -0700, Darrick J. Wong wrote: > > On Wed, Mar 27, 2024 at 04:35:12AM -0700, Christoph Hellwig wrote: > > > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote: > > > > periodically to allow other threads to do work. This implementation > > > > avoids the worst problems of the original code, though it lacks the > > > > desirable attribute of freeing the biggest chunks first. > > > > > > Do we really care much about freeing larger area first? I don't think > > > it really matters for FITRIM at all. > > > > > > In other words, I suspect we're better off with only the by-bno > > > implementation. > > > > Welll... you could argue that if the underlying "thin provisioning" is > > actually just an xfs file, that punching tons of tiny holes in that file > > could increase the height of the bmbt. In that case, you'd want to > > reduce the chances of the punch failing with ENOSPC by punching out > > larger ranges to free up more blocks. > > That's true, but it's a consideration for dm-thinp, too. I've always > considered FITRIM as an optimisation for dm-thinp systems (rather > than a hardware device optimisation), and so "large extents first" > make a whole lot more sense from that perspective. > > That said, there's another reason for by-size searches being the > default behaviour: FITRIM has a minimum length control parameter. > Hence for fragmented filesystems where there are few extents larger > than the minimum length, we needed to avoid doing an exhaustive > search of the by-bno tree to find extents longer than the minimum > length.... Ooooh, that's a /very/ good point. xfs_scrub also now constructs a histogram of free space extent lengths to see if it can reduce the runtime of phase 8 (FITRIM) by setting minlen to something large enough to reduce runtime by 4-5x while not missing more than a few percent of the free space. --D
On Sun, Mar 31, 2024 at 08:51:23AM +1100, Dave Chinner wrote: > On Fri, Mar 29, 2024 at 03:51:49PM -0700, Darrick J. Wong wrote: > > On Thu, Mar 28, 2024 at 09:15:25AM +1100, Dave Chinner wrote: > > > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <djwong@kernel.org> > .... > > > I think a commit message update is necessary. :) > > > > xfs: fix performance problems when fstrimming a subset of a fragmented AG > > > > On a 10TB filesystem where the free space in each AG is heavily > > fragmented, I noticed some very high runtimes on a FITRIM call for the > > entire filesystem. xfs_scrub likes to report progress information on > > each phase of the scrub, which means that a strace for the entire > > filesystem: > > > > ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839> > > > > shows that scrub is uncommunicative for the entire duration. Reducing > > the size of the FITRIM requests to a single AG at a time produces lower > > times for each individual call, but even this isn't quite acceptable, > > because the time between progress reports are still very high: > > > > Strace for the first 4x 1TB AGs looks like (2): > > ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033> > > ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323> > > ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226> > > ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744> > > > > I then had the idea to limit the length parameter of each call to a > > smallish amount (~11GB) so that we could report progress relatively > > quickly, but much to my surprise, each FITRIM call still took ~68 > > seconds! 
> > > > Unfortunately, the by-length fstrim implementation handles this poorly > > because it walks the entire free space by length index (cntbt), which is > > a very inefficient way to walk a subset of the blocks of an AG. > > > > Therefore, create a second implementation that will walk the bnobt and > > perform the trims in block number order. This implementation avoids the > > worst problems of the original code, though it lacks the desirable > > attribute of freeing the biggest chunks first. > > > > On the other hand, this second implementation will be much easier to > > constrain the system call latency, and makes it much easier to report > > fstrim progress to anyone who's running xfs_scrub. > > Much better :) > > > > Here's a completely untested, uncompiled version of this by-bno > > > search I just wrote up to demonstrate that if we pass ag-confined > > > agbnos from the top level, we don't need to duplicate this gather > > > code at all or do trim range constraining inside the gather > > > functions... > > > > Mostly looks ok, but let's dig in-- > > > > > -Dave. 
> > > > > > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c > > > index 268bb734dc0a..0c949dfc097a 100644 > > > --- a/fs/xfs/xfs_discard.c > > > +++ b/fs/xfs/xfs_discard.c > > > @@ -145,14 +145,18 @@ xfs_discard_extents( > > > return error; > > > } > > > > > > +struct xfs_trim_cur { > > > + xfs_agblock_t start; > > > + xfs_extlen_t count; > > > + xfs_agblock_t end; > > > + xfs_extlen_t minlen; > > > + bool by_len; > > > +}; > > > > > > static int > > > xfs_trim_gather_extents( > > > struct xfs_perag *pag, > > > - xfs_daddr_t start, > > > - xfs_daddr_t end, > > > - xfs_daddr_t minlen, > > > - struct xfs_alloc_rec_incore *tcur, > > > + struct xfs_trim_cur *tcur, > > > struct xfs_busy_extents *extents, > > > uint64_t *blocks_trimmed) > > > { > > > @@ -179,21 +183,21 @@ xfs_trim_gather_extents( > > > if (error) > > > goto out_trans_cancel; > > > > > > - cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > > > > > - /* > > > - * Look up the extent length requested in the AGF and start with it. > > > - */ > > > - if (tcur->ar_startblock == NULLAGBLOCK) > > > - error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i); > > > - else > > > + if (tcur->by_len) { > > > + cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); > > > > I'm confused, why are we searching the by-bno btree if "by_len" is set? > > Because I changed the logic half way through writing it and forgot > to change the variable name.... > > > Granted the logic for setting by_len triggers only for FITRIMs of > > subsets of an AG so this functions conrrectly... > > > > > + error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock, > > > + 0, &i); > > > > ...insofar as this skips a free extent that starts before and ends after > > tcur->start. > > Right - did I mention this wasn't even compiled? 
:) > > > > > > + } else { > > > + cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > > error = xfs_alloc_lookup_le(cur, tcur->ar_startblock, > > > tcur->ar_blockcount, &i); > > > + } > > > > How about this initialization logic: > > > > if (tcur->by_bno) { > > /* sub-AG discard request always starts at tcur->start */ > > cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); > > error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i); > > OK. > > > } else if (tcur->start == NULLAGBLOCK) { > > /* first time through a by-len starts with max length */ > > cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i); > > } else { > > /* nth time through a by-len starts where we left off */ > > cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i); > > } > > ... but that is what I was explicilty trying to avoid because the > initial search on a by-count config is for tcur->count = > pag->pagf_longest. > > IOWs, there is no "greater" sized free extent than tcur->count, > every free extent *must* be less than or equal to pag->pagf_longest. > The _ge() search is done because the start block of zero is less > than every agbno in the tree. Hence we have to do a _ge() search to > "find" the pag->pagf_longest extent based on start block criteria. > > If we pass in a tcur->start = NULLAGBLOCK (0xffffffff) or > pag->pag_blocks then every agbno is "less than" the start block, and > so it should return the extent at the right most edge of the tree > with a _le() search regardless of where in the AG it is located. > > Hence the initial search can also be a _le() search and we don't > have to special case it - the by-count tree has two components in > it's search key and if we set the initial values of both components > correctly one search works for all cases... > > > > @@ -248,10 +235,35 @@ xfs_trim_gather_extents( > > > * supposed to discard skip it. 
Do not bother to trim > > > * down partially overlapping ranges for now. > > > */ > > > - if (dbno + dlen < start || dbno > end) { > > > + if (fbno + flen < tcur->start) { > > > trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); > > > goto next_extent; > > > } > > > + if (fbno > tcur->end) { > > > + trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); > > > + if (tcur->by_len) { > > > + tcur->count = 0; > > > + break; > > > + } > > > + goto next_extent; > > > + } > > > + > > > + /* Trim the extent returned to the range we want. */ > > > + if (fbno < tcur->start) { > > > + flen -= tcur->start - fbno; > > > + fbno = tcur->start; > > > + } > > > + if (fbno + flen > tcur->end + 1) > > > + flen = tcur->end - fbno + 1; > > > + > > > + /* > > > + * Too small? Give up. > > > + */ > > > + if (flen < minlen) { > > > + trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen); > > > + tcur->count = 0; > > > + break; > > > + } > > > > For a by-bno search, this logic skips the entire rest of the AG after > > the first free extent that's smaller than tcur->minlen. Instead, it > > should goto next_extent, yes? > > Yes. > > > > @@ -415,12 +438,23 @@ xfs_ioc_trim( > > > start = BTOBB(range.start); > > > end = start + BTOBBT(range.len) - 1; > > > > > > - if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1) > > > - end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1; > > > > Couldn't this simply be: > > > > end = min_t(xfs_daddr_t, start + BTOBBT(range.len) - 1, > > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1); > > Yes. How's this for a replacement patch, then? --D Subject: xfs: fix performance problems when fstrimming a subset of a fragmented AG On a 10TB filesystem where the free space in each AG is heavily fragmented, I noticed some very high runtimes on a FITRIM call for the entire filesystem. 
xfs_scrub likes to report progress information on each phase of the scrub, which means that a strace for the entire filesystem: ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839> shows that scrub is uncommunicative for the entire duration. Reducing the size of the FITRIM requests to a single AG at a time produces lower times for each individual call, but even this isn't quite acceptable, because the times between progress reports are still very high: Strace for the first 4x 1TB AGs looks like (2): ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033> ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323> ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226> ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744> I then had the idea to limit the length parameter of each call to a smallish amount (~11GB) so that we could report progress relatively quickly, but much to my surprise, each FITRIM call still took ~68 seconds! Unfortunately, the by-length fstrim implementation handles this poorly because it walks the entire free space by length index (cntbt), which is a very inefficient way to walk a subset of the blocks of an AG. Therefore, create a second implementation that will walk the bnobt and perform the trims in block number order. This implementation avoids the worst problems of the original code, though it lacks the desirable attribute of freeing the biggest chunks first. On the other hand, this second implementation makes it much easier to constrain the system call latency and to report fstrim progress to anyone who's running xfs_scrub. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> --- fs/xfs/xfs_discard.c | 153 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 93 insertions(+), 60 deletions(-) diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c index 268bb734dc0a8..25fe3b932b5a6 100644 --- a/fs/xfs/xfs_discard.c +++ b/fs/xfs/xfs_discard.c @@ -145,14 +145,18 @@ xfs_discard_extents( return error; } +struct xfs_trim_cur { + xfs_agblock_t start; + xfs_extlen_t count; + xfs_agblock_t end; + xfs_extlen_t minlen; + bool by_bno; +}; static int xfs_trim_gather_extents( struct xfs_perag *pag, - xfs_daddr_t start, - xfs_daddr_t end, - xfs_daddr_t minlen, - struct xfs_alloc_rec_incore *tcur, + struct xfs_trim_cur *tcur, struct xfs_busy_extents *extents, uint64_t *blocks_trimmed) { @@ -179,21 +183,26 @@ xfs_trim_gather_extents( if (error) goto out_trans_cancel; - cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); - - /* - * Look up the extent length requested in the AGF and start with it. - */ - if (tcur->ar_startblock == NULLAGBLOCK) - error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i); - else - error = xfs_alloc_lookup_le(cur, tcur->ar_startblock, - tcur->ar_blockcount, &i); + if (tcur->by_bno) { + /* sub-AG discard request always starts at tcur->start */ + cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); + error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i); + if (!error && !i) + error = xfs_alloc_lookup_ge(cur, tcur->start, 0, &i); + } else if (tcur->start == 0) { + /* first time through a by-len starts with max length */ + cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); + error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i); + } else { + /* nth time through a by-len starts where we left off */ + cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); + error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i); + } if (error) goto out_del_cursor; if (i == 0) { /* nothing of that length left in the AG, we are done */ - tcur->ar_blockcount = 0; + tcur->count = 0; goto out_del_cursor; } @@ -204,8 +213,6 
@@ xfs_trim_gather_extents( while (i) { xfs_agblock_t fbno; xfs_extlen_t flen; - xfs_daddr_t dbno; - xfs_extlen_t dlen; error = xfs_alloc_get_rec(cur, &fbno, &flen, &i); if (error) @@ -221,37 +228,45 @@ xfs_trim_gather_extents( * Update the cursor to point at this extent so we * restart the next batch from this extent. */ - tcur->ar_startblock = fbno; - tcur->ar_blockcount = flen; - break; - } - - /* - * use daddr format for all range/len calculations as that is - * the format the range/len variables are supplied in by - * userspace. - */ - dbno = XFS_AGB_TO_DADDR(mp, pag->pag_agno, fbno); - dlen = XFS_FSB_TO_BB(mp, flen); - - /* - * Too small? Give up. - */ - if (dlen < minlen) { - trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen); - tcur->ar_blockcount = 0; + tcur->start = fbno; + tcur->count = flen; break; } /* * If the extent is entirely outside of the range we are - * supposed to discard skip it. Do not bother to trim - * down partially overlapping ranges for now. + * supposed to skip it. Do not bother to trim down partially + * overlapping ranges for now. */ - if (dbno + dlen < start || dbno > end) { + if (fbno + flen < tcur->start) { trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); goto next_extent; } + if (fbno > tcur->end) { + trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); + if (tcur->by_bno) { + tcur->count = 0; + break; + } + goto next_extent; + } + + /* Trim the extent returned to the range we want. */ + if (fbno < tcur->start) { + flen -= tcur->start - fbno; + fbno = tcur->start; + } + if (fbno + flen > tcur->end + 1) + flen = tcur->end - fbno + 1; + + /* Too small? Give up. 
*/ + if (flen < tcur->minlen) { + trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen); + if (tcur->by_bno) + goto next_extent; + tcur->count = 0; + break; + } /* * If any blocks in the range are still busy, skip the @@ -266,7 +281,10 @@ xfs_trim_gather_extents( &extents->extent_list); *blocks_trimmed += flen; next_extent: - error = xfs_btree_decrement(cur, 0, &i); + if (tcur->by_bno) + error = xfs_btree_increment(cur, 0, &i); + else + error = xfs_btree_decrement(cur, 0, &i); if (error) break; @@ -276,7 +294,7 @@ xfs_trim_gather_extents( * is no more extents to search. */ if (i == 0) - tcur->ar_blockcount = 0; + tcur->count = 0; } /* @@ -306,17 +324,22 @@ xfs_trim_should_stop(void) static int xfs_trim_extents( struct xfs_perag *pag, - xfs_daddr_t start, - xfs_daddr_t end, - xfs_daddr_t minlen, + xfs_agblock_t start, + xfs_agblock_t end, + xfs_extlen_t minlen, uint64_t *blocks_trimmed) { - struct xfs_alloc_rec_incore tcur = { - .ar_blockcount = pag->pagf_longest, - .ar_startblock = NULLAGBLOCK, + struct xfs_trim_cur tcur = { + .start = start, + .count = pag->pagf_longest, + .end = end, + .minlen = minlen, }; int error = 0; + if (start != 0 || end != pag->block_count) + tcur.by_bno = true; + do { struct xfs_busy_extents *extents; @@ -330,8 +353,8 @@ xfs_trim_extents( extents->owner = extents; INIT_LIST_HEAD(&extents->extent_list); - error = xfs_trim_gather_extents(pag, start, end, minlen, - &tcur, extents, blocks_trimmed); + error = xfs_trim_gather_extents(pag, &tcur, extents, + blocks_trimmed); if (error) { kfree(extents); break; @@ -354,7 +377,7 @@ xfs_trim_extents( if (xfs_trim_should_stop()) break; - } while (tcur.ar_blockcount != 0); + } while (tcur.count != 0); return error; @@ -378,8 +401,10 @@ xfs_ioc_trim( unsigned int granularity = bdev_discard_granularity(mp->m_ddev_targp->bt_bdev); struct fstrim_range range; - xfs_daddr_t start, end, minlen; - xfs_agnumber_t agno; + xfs_daddr_t start, end; + xfs_extlen_t minlen; + xfs_agnumber_t start_agno, end_agno; 
+ xfs_agblock_t start_agbno, end_agbno; uint64_t blocks_trimmed = 0; int error, last_error = 0; @@ -399,7 +424,8 @@ xfs_ioc_trim( return -EFAULT; range.minlen = max_t(u64, granularity, range.minlen); - minlen = BTOBB(range.minlen); + minlen = XFS_B_TO_FSB(mp, range.minlen); + /* * Truncating down the len isn't actually quite correct, but using * BBTOB would mean we trivially get overflows for values @@ -413,15 +439,21 @@ xfs_ioc_trim( return -EINVAL; start = BTOBB(range.start); - end = start + BTOBBT(range.len) - 1; + end = min_t(xfs_daddr_t, start + BTOBBT(range.len), + XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) - 1; - if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1) - end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1; + start_agno = xfs_daddr_to_agno(mp, start); + start_agbno = xfs_daddr_to_agbno(mp, start); + end_agno = xfs_daddr_to_agno(mp, end); + end_agbno = xfs_daddr_to_agbno(mp, end); - agno = xfs_daddr_to_agno(mp, start); - for_each_perag_range(mp, agno, xfs_daddr_to_agno(mp, end), pag) { - error = xfs_trim_extents(pag, start, end, minlen, - &blocks_trimmed); + for_each_perag_range(mp, start_agno, end_agno, pag) { + xfs_agblock_t agend = pag->block_count; + + if (start_agno == end_agno) + agend = end_agbno; + error = xfs_trim_extents(pag, start_agbno, agend, minlen, + &blocks_trimmed); if (error) last_error = error; @@ -429,6 +461,7 @@ xfs_ioc_trim( xfs_perag_rele(pag); break; } + start_agbno = 0; } if (last_error)
On Sun, Mar 31, 2024 at 03:44:45PM -0700, Darrick J. Wong wrote: > > How's this for a replacement patch, then? Looks OK to me. -Dave.
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c index 268bb734dc0a8..ee7a8759091eb 100644 --- a/fs/xfs/xfs_discard.c +++ b/fs/xfs/xfs_discard.c @@ -157,9 +157,9 @@ xfs_trim_gather_extents( uint64_t *blocks_trimmed) { struct xfs_mount *mp = pag->pag_mount; - struct xfs_trans *tp; struct xfs_btree_cur *cur; struct xfs_buf *agbp; + struct xfs_trans *tp; int error; int i; int batch = 100; @@ -292,6 +292,160 @@ xfs_trim_gather_extents( return error; } +/* Trim the free space in this AG by block number. */ +static inline int +xfs_trim_gather_bybno( + struct xfs_perag *pag, + xfs_daddr_t start, + xfs_daddr_t end, + xfs_daddr_t minlen, + struct xfs_alloc_rec_incore *tcur, + struct xfs_busy_extents *extents, + uint64_t *blocks_trimmed) +{ + struct xfs_mount *mp = pag->pag_mount; + struct xfs_trans *tp; + struct xfs_btree_cur *cur; + struct xfs_buf *agbp; + xfs_daddr_t end_daddr; + xfs_agnumber_t agno = pag->pag_agno; + xfs_agblock_t start_agbno; + xfs_agblock_t end_agbno; + xfs_extlen_t minlen_fsb = XFS_BB_TO_FSB(mp, minlen); + int i; + int batch = 100; + int error; + + start = max(start, XFS_AGB_TO_DADDR(mp, agno, 0)); + start_agbno = xfs_daddr_to_agbno(mp, start); + + end_daddr = XFS_AGB_TO_DADDR(mp, agno, pag->block_count); + end = min(end, end_daddr - 1); + end_agbno = xfs_daddr_to_agbno(mp, end); + + error = xfs_trans_alloc_empty(mp, &tp); + if (error) + return error; + + error = xfs_alloc_read_agf(pag, tp, 0, &agbp); + if (error) + goto out_trans_cancel; + + cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); + + /* + * If this is our first time, look for any extent crossing start_agbno. + * Otherwise, continue at the next extent after wherever we left off. + */ + if (tcur->ar_startblock == NULLAGBLOCK) { + error = xfs_alloc_lookup_le(cur, start_agbno, 0, &i); + if (error) + goto out_del_cursor; + + /* + * If we didn't find anything at or below start_agbno, + * increment the cursor to see if there's another record above + * it. 
+ */ + if (!i) + error = xfs_btree_increment(cur, 0, &i); + } else { + error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock, 0, &i); + } + if (error) + goto out_del_cursor; + if (!i) { + /* nothing left in the AG, we are done */ + tcur->ar_blockcount = 0; + goto out_del_cursor; + } + + /* Loop the entire range that was asked for. */ + while (i) { + xfs_agblock_t fbno; + xfs_extlen_t flen; + + error = xfs_alloc_get_rec(cur, &fbno, &flen, &i); + if (error) + break; + if (XFS_IS_CORRUPT(mp, i != 1)) { + xfs_btree_mark_sick(cur); + error = -EFSCORRUPTED; + break; + } + + if (--batch <= 0) { + /* + * Update the cursor to point at this extent so we + * restart the next batch from this extent. + */ + tcur->ar_startblock = fbno; + tcur->ar_blockcount = flen; + break; + } + + /* Exit on extents entirely outside of the range. */ + if (fbno >= end_agbno) { + tcur->ar_blockcount = 0; + break; + } + if (fbno + flen < start_agbno) + goto next_extent; + + /* Trim the extent returned to the range we want. */ + if (fbno < start_agbno) { + flen -= start_agbno - fbno; + fbno = start_agbno; + } + if (fbno + flen > end_agbno + 1) + flen = end_agbno - fbno + 1; + + /* Ignore too small. */ + if (flen < minlen_fsb) { + trace_xfs_discard_toosmall(mp, agno, fbno, flen); + goto next_extent; + } + + /* + * If any blocks in the range are still busy, skip the + * discard and try again the next time. + */ + if (xfs_extent_busy_search(mp, pag, fbno, flen)) { + trace_xfs_discard_busy(mp, agno, fbno, flen); + goto next_extent; + } + + xfs_extent_busy_insert_discard(pag, fbno, flen, + &extents->extent_list); + *blocks_trimmed += flen; +next_extent: + error = xfs_btree_increment(cur, 0, &i); + if (error) + break; + + /* + * If there's no more records in the tree, we are done. Set the + * cursor block count to 0 to indicate to the caller that there + * are no more extents to search. 
+ */ + if (i == 0) + tcur->ar_blockcount = 0; + } + + /* + * If there was an error, release all the gathered busy extents because + * we aren't going to issue a discard on them any more. + */ + if (error) + xfs_extent_busy_clear(mp, &extents->extent_list, false); + +out_del_cursor: + xfs_btree_del_cursor(cur, error); +out_trans_cancel: + xfs_trans_cancel(tp); + return error; +} + static bool xfs_trim_should_stop(void) { @@ -315,8 +469,15 @@ xfs_trim_extents( .ar_blockcount = pag->pagf_longest, .ar_startblock = NULLAGBLOCK, }; + struct xfs_mount *mp = pag->pag_mount; + bool by_len = true; int error = 0; + /* Are we only trimming part of this AG? */ + if (start > XFS_AGB_TO_DADDR(mp, pag->pag_agno, 0) || + end < XFS_AGB_TO_DADDR(mp, pag->pag_agno, pag->block_count - 1)) + by_len = false; + do { struct xfs_busy_extents *extents; @@ -330,8 +491,13 @@ xfs_trim_extents( extents->owner = extents; INIT_LIST_HEAD(&extents->extent_list); - error = xfs_trim_gather_extents(pag, start, end, minlen, - &tcur, extents, blocks_trimmed); + if (by_len) + error = xfs_trim_gather_extents(pag, start, end, + minlen, &tcur, extents, + blocks_trimmed); + else + error = xfs_trim_gather_bybno(pag, start, end, minlen, + &tcur, extents, blocks_trimmed); if (error) { kfree(extents); break;