[1/6] xfs: stop artificially limiting the length of bunmap calls

Message ID 164997687142.383881.7160925177236303538.stgit@magnolia (mailing list archive)
State Accepted
Series xfs: fix reflink inefficiencies

Commit Message

Darrick J. Wong April 14, 2022, 10:54 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
data forks of shared files to avoid two failure scenarios: one where the
extent being unmapped is so sparsely shared that we exceed the
transaction reservation with the sheer number of refcount btree updates
and EFI intent items; and the other where we attach so many deferred
updates to the transaction that we pin the log tail and later the log
head meets the tail, causing the log to livelock.

We avoid triggering the first problem by tracking the number of ops in
the refcount btree cursor and forcing a requeue of the refcount intent
item any time we think that we might be close to overflowing.  This has
been baked into XFS since before the original e1a4 patch.
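
For reference, the requeue check is roughly the following (a simplified
sketch of xfs_refcount_still_have_space() in fs/xfs/libxfs/xfs_refcount.c;
the real function also budgets for btree shape changes and honors error
injection):

	/*
	 * Stop and let the refcount intent be requeued once the
	 * updates logged so far get close to the transaction
	 * reservation.
	 */
	static bool
	xfs_refcount_still_have_space(
		struct xfs_btree_cur	*cur)
	{
		if (cur->bc_ag.refc.nr_ops == 0)
			return true;
		return cur->bc_tp->t_log_res >
		       cur->bc_ag.refc.nr_ops * XFS_REFCOUNT_ITEM_OVERHEAD;
	}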

A recent patchset fixed the second problem by changing the deferred ops
code to finish all the work items created by each round of trying to
complete a refcount intent item, which eliminates the long chains of
deferred items (27dad); and by making long-running transactions relog
their intent log items when space in the log gets low (74f4d).

Because this clamp applies to /any/ unmapping request regardless of how
heavily shared the component blocks are, it degrades the performance of
all large unmapping requests -- whereas an unshared file can unmap
millions of blocks in one go, a shared file is limited to a few thousand
blocks at a time, forcing the upper level code to spin in a bunmapi loop
even when that isn't needed.
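
For scale, the clamp being removed below is xfs_refcount_max_unmap(),
which allows (log_res * 3 / 4) / XFS_REFCOUNT_ITEM_OVERHEAD blocks per
call.  With an illustrative 256KiB transaction log reservation:

	(262144 * 3 / 4) / 32 = 6144 blocks per __xfs_bunmapi call

so a request covering millions of blocks gets chopped into hundreds of
round trips through the caller's bunmapi loop.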

This also eliminates one more place where log recovery behavior can
differ from online behavior, because bunmapi operations no longer need
to requeue.

Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops are finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c     |   22 +---------------------
 fs/xfs/libxfs/xfs_refcount.c |    5 ++---
 fs/xfs/libxfs/xfs_refcount.h |    8 ++------
 3 files changed, 5 insertions(+), 30 deletions(-)

Comments

Dave Chinner April 22, 2022, 10:01 p.m. UTC | #1
On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> [...]

This looks reasonable, but I'm wondering how the original problem
was discovered and whether this has been tested against that
original problem situation to ensure we aren't introducing a
regression here....

> diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> index 9eb01edbd89d..6b265f6075b8 100644
> --- a/fs/xfs/libxfs/xfs_refcount.h
> +++ b/fs/xfs/libxfs/xfs_refcount.h
> @@ -66,15 +66,11 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
>   * reservation and crash the fs.  Each record adds 12 bytes to the
>   * log (plus any key updates) so we'll conservatively assume 32 bytes
>   * per record.  We must also leave space for btree splits on both ends
> - * of the range and space for the CUD and a new CUI.
> + * of the range and space for the CUD and a new CUI.  Each EFI that we
> + * attach to the transaction also consumes ~32 bytes.
>   */
>  #define XFS_REFCOUNT_ITEM_OVERHEAD	32

FWIW, I think this is a low-ball number - each EFI also consumes an
ophdr (12 bytes) for the region identifier in the log, so it's
actually 44 bytes, not 32 bytes that will be consumed.  It is not
necessary to address this in this patchset, though.

Cheers,

Dave.
Darrick J. Wong April 22, 2022, 10:18 p.m. UTC | #2
On Sat, Apr 23, 2022 at 08:01:20AM +1000, Dave Chinner wrote:
> On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> > [...]
> 
> This looks reasonable, but I'm wondering how the original problem
> was discovered and whether this has been tested against that
> original problem situation to ensure we aren't introducing a
> regression here....

generic/447, and yes, I have forced it to run a deletion of 1 million
extents without incident. :)

I should probably amend that test to note that it's an exerciser for
e1a4e37cc7b6.

> > diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> > index 9eb01edbd89d..6b265f6075b8 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.h
> > +++ b/fs/xfs/libxfs/xfs_refcount.h
> > @@ -66,15 +66,11 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
> >   * reservation and crash the fs.  Each record adds 12 bytes to the
> >   * log (plus any key updates) so we'll conservatively assume 32 bytes
> >   * per record.  We must also leave space for btree splits on both ends
> > - * of the range and space for the CUD and a new CUI.
> > + * of the range and space for the CUD and a new CUI.  Each EFI that we
> > + * attach to the transaction also consumes ~32 bytes.
> >   */
> >  #define XFS_REFCOUNT_ITEM_OVERHEAD	32
> 
> FWIW, I think this is a low-ball number - each EFI also consumes an
> ophdr (12 bytes) for the region identifier in the log, so it's
> actually 44 bytes, not 32 bytes that will be consumed.  It is not
> necessary to address this in this patchset, though.

<Nod>

--D

Dave Chinner April 22, 2022, 11:51 p.m. UTC | #3
On Fri, Apr 22, 2022 at 03:18:20PM -0700, Darrick J. Wong wrote:
> On Sat, Apr 23, 2022 at 08:01:20AM +1000, Dave Chinner wrote:
> > On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> > > [...]
> > 
> > This looks reasonable, but I'm wondering how the original problem
> > was discovered and whether this has been tested against that
> > original problem situation to ensure we aren't introducing a
> > regression here....
> 
> generic/447, and yes, I have forced it to run a deletion of 1 million
> extents without incident. :)

Ok, that's all I wanted to know :)

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Christoph Hellwig April 26, 2022, 1:46 p.m. UTC | #4
On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 327ba25e9e17..a07ebaecba73 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -960,6 +960,7 @@ xfs_refcount_adjust_extents(
>  			 * Either cover the hole (increment) or
>  			 * delete the range (decrement).
>  			 */
> +			cur->bc_ag.refc.nr_ops++;
>  			if (tmp.rc_refcount) {
>  				error = xfs_refcount_insert(cur, &tmp,
>  						&found_tmp);
> @@ -970,7 +971,6 @@ xfs_refcount_adjust_extents(
>  					error = -EFSCORRUPTED;
>  					goto out_error;
>  				}
> -				cur->bc_ag.refc.nr_ops++;

How do these changes fit into the rest of the patch?
Darrick J. Wong April 26, 2022, 2:52 p.m. UTC | #5
On Tue, Apr 26, 2022 at 06:46:42AM -0700, Christoph Hellwig wrote:
> On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > index 327ba25e9e17..a07ebaecba73 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.c
> > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > @@ -960,6 +960,7 @@ xfs_refcount_adjust_extents(
> >  			 * Either cover the hole (increment) or
> >  			 * delete the range (decrement).
> >  			 */
> > +			cur->bc_ag.refc.nr_ops++;
> >  			if (tmp.rc_refcount) {
> >  				error = xfs_refcount_insert(cur, &tmp,
> >  						&found_tmp);
> > @@ -970,7 +971,6 @@ xfs_refcount_adjust_extents(
> >  					error = -EFSCORRUPTED;
> >  					goto out_error;
> >  				}
> > -				cur->bc_ag.refc.nr_ops++;
> 
> How do these changes fit into the rest of the patch?

Long ago, before we used deferred ops to update the refcount btree, we
would restrict the number of blocks that could be bunmapped in order to
avoid overflowing the transaction reservation while doing refcount
updates for an unmapped extent.

Later on, I added deferred refcount updates so at least the refcountbt
updates went in their own transaction, which meant that if we started
running out of space because the unmapped extent had many, many different
refcount records, we could at least return EAGAIN to get a clean
transaction to continue.  I needed a way to guess when we were getting
close to the reservation limit, so I added this nr_ops counter to track
the number of record updates.  Granted, 32 bytes per refcountbt record
update is quite an overestimate, since the records are contiguous in a
btree leaf block.

In the past, the deferred ops code had a defect where it didn't maintain
correct ordering of deferred ops when one dfops' ->finish_item function
queued even more dfops to the transaction, which is what happens when a
refcount update queues EFIs for blocks that are no longer referenced.
This resulted in enormous chains of defer ops, so I left the bunmap
piece in to throttle the chain lengths.

Now we've fixed the deferred ops code to maintain correct dfops order
between existing ops and newly queued ones, so the chaining problem went
away, and we can get rid of the bunmap throttle.  However, as part of
doing that, it's necessary to track the number of EFIs logged to the
transaction as well, which is what the above code change fixes.
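
Schematically, the hoisted increment charges the budget for both
outcomes (a condensed sketch of the xfs_refcount_adjust_extents() hunk
quoted above; error handling and the deferred-free call are elided):

	cur->bc_ag.refc.nr_ops++;	/* charge this adjustment up front */
	if (tmp.rc_refcount) {
		/* still referenced: insert the new refcount record */
		error = xfs_refcount_insert(cur, &tmp, &found_tmp);
	} else {
		/*
		 * Refcount fell to zero: freeing the extent queues an
		 * EFI, which consumes log reservation too.
		 */
		fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
				cur->bc_ag.pag->pag_agno,
				tmp.rc_startblock);
		/* ...queue deferred free of (fsbno, tmp.rc_blockcount)... */
	}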

Granted, Dave has also commented that EFIs use a bit more than 32 bytes,
so I think it would be warranted to capture the above comment and the
xfs_refcount.c changes in a separate patch now.

--D

Patch

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 74198dd82b03..432597e425c1 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5299,7 +5299,6 @@  __xfs_bunmapi(
 	int			whichfork;	/* data or attribute fork */
 	xfs_fsblock_t		sum;
 	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
-	xfs_fileoff_t		max_len;
 	xfs_fileoff_t		end;
 	struct xfs_iext_cursor	icur;
 	bool			done = false;
@@ -5318,16 +5317,6 @@  __xfs_bunmapi(
 	ASSERT(len > 0);
 	ASSERT(nexts >= 0);
 
-	/*
-	 * Guesstimate how many blocks we can unmap without running the risk of
-	 * blowing out the transaction with a mix of EFIs and reflink
-	 * adjustments.
-	 */
-	if (tp && xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK)
-		max_len = min(len, xfs_refcount_max_unmap(tp->t_log_res));
-	else
-		max_len = len;
-
 	error = xfs_iread_extents(tp, ip, whichfork);
 	if (error)
 		return error;
@@ -5366,7 +5355,7 @@  __xfs_bunmapi(
 
 	extno = 0;
 	while (end != (xfs_fileoff_t)-1 && end >= start &&
-	       (nexts == 0 || extno < nexts) && max_len > 0) {
+	       (nexts == 0 || extno < nexts)) {
 		/*
 		 * Is the found extent after a hole in which end lives?
 		 * Just back up to the previous extent, if so.
@@ -5400,14 +5389,6 @@  __xfs_bunmapi(
 		if (del.br_startoff + del.br_blockcount > end + 1)
 			del.br_blockcount = end + 1 - del.br_startoff;
 
-		/* How much can we safely unmap? */
-		if (max_len < del.br_blockcount) {
-			del.br_startoff += del.br_blockcount - max_len;
-			if (!wasdel)
-				del.br_startblock += del.br_blockcount - max_len;
-			del.br_blockcount = max_len;
-		}
-
 		if (!isrt)
 			goto delete;
 
@@ -5543,7 +5524,6 @@  __xfs_bunmapi(
 		if (error)
 			goto error0;
 
-		max_len -= del.br_blockcount;
 		end = del.br_startoff - 1;
 nodelete:
 		/*
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 327ba25e9e17..a07ebaecba73 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -960,6 +960,7 @@  xfs_refcount_adjust_extents(
 			 * Either cover the hole (increment) or
 			 * delete the range (decrement).
 			 */
+			cur->bc_ag.refc.nr_ops++;
 			if (tmp.rc_refcount) {
 				error = xfs_refcount_insert(cur, &tmp,
 						&found_tmp);
@@ -970,7 +971,6 @@  xfs_refcount_adjust_extents(
 					error = -EFSCORRUPTED;
 					goto out_error;
 				}
-				cur->bc_ag.refc.nr_ops++;
 			} else {
 				fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
 						cur->bc_ag.pag->pag_agno,
@@ -1001,11 +1001,11 @@  xfs_refcount_adjust_extents(
 		ext.rc_refcount += adj;
 		trace_xfs_refcount_modify_extent(cur->bc_mp,
 				cur->bc_ag.pag->pag_agno, &ext);
+		cur->bc_ag.refc.nr_ops++;
 		if (ext.rc_refcount > 1) {
 			error = xfs_refcount_update(cur, &ext);
 			if (error)
 				goto out_error;
-			cur->bc_ag.refc.nr_ops++;
 		} else if (ext.rc_refcount == 1) {
 			error = xfs_refcount_delete(cur, &found_rec);
 			if (error)
@@ -1014,7 +1014,6 @@  xfs_refcount_adjust_extents(
 				error = -EFSCORRUPTED;
 				goto out_error;
 			}
-			cur->bc_ag.refc.nr_ops++;
 			goto advloop;
 		} else {
 			fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 9eb01edbd89d..6b265f6075b8 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -66,15 +66,11 @@  extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
  * reservation and crash the fs.  Each record adds 12 bytes to the
  * log (plus any key updates) so we'll conservatively assume 32 bytes
  * per record.  We must also leave space for btree splits on both ends
- * of the range and space for the CUD and a new CUI.
+ * of the range and space for the CUD and a new CUI.  Each EFI that we
+ * attach to the transaction also consumes ~32 bytes.
  */
 #define XFS_REFCOUNT_ITEM_OVERHEAD	32
 
-static inline xfs_fileoff_t xfs_refcount_max_unmap(int log_res)
-{
-	return (log_res * 3 / 4) / XFS_REFCOUNT_ITEM_OVERHEAD;
-}
-
 extern int xfs_refcount_has_record(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
 union xfs_btree_rec;