diff mbox series

xfs: don't use BMBT btree split workers for IO completion

Message ID 20230119010334.1982938-1-david@fromorbit.com (mailing list archive)
State Accepted
Headers show
Series xfs: don't use BMBT btree split workers for IO completion | expand

Commit Message

Dave Chinner Jan. 19, 2023, 1:03 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.

A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:

1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.

2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.

3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run

4. The split work from the unwritten extent conversion must run
first.

5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.

6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.

At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....

This is a highly improbably series of events, but there it is.

There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.

This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.

The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_btree.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig Jan. 19, 2023, 6:33 p.m. UTC | #1
On Thu, Jan 19, 2023 at 12:03:34PM +1100, Dave Chinner wrote:
> The other place we can be called for a BMBT split without a
> preceeding allocation is __xfs_bunmapi() when punching out the
> center of an existing extent. We don't remove extents in the IO
> path, so these operations don't tend to be called with a lot of
> stack consumed. Hence we don't really need to ship the split off to
> a worker thread in these cases, either.

So I agree with the fix.  But the t_firstblock seems a bit opaque.
We do have a lot of comments, which is good but it still feels
a little fragile to me.  Is there a good reason we can't do an
explicit flag to document the intent better?
Dave Chinner Jan. 19, 2023, 8:20 p.m. UTC | #2
On Thu, Jan 19, 2023 at 10:33:26AM -0800, Christoph Hellwig wrote:
> On Thu, Jan 19, 2023 at 12:03:34PM +1100, Dave Chinner wrote:
> > The other place we can be called for a BMBT split without a
> > preceeding allocation is __xfs_bunmapi() when punching out the
> > center of an existing extent. We don't remove extents in the IO
> > path, so these operations don't tend to be called with a lot of
> > stack consumed. Hence we don't really need to ship the split off to
> > a worker thread in these cases, either.
> 
> So I agree with the fix.  But the t_firstblock seems a bit opaque.
> We do have a lot of comments, which is good but it still feels
> a little fragile to me.  Is there a good reason we can't do an
> explicit flag to document the intent better?

I don't see any point in adding flags to high level allocation APIs
that expose deeply internal btree modification implementation
details. Caller have no business knowing about this, have no idea
how much stack they might need or consume, what context a btree split
might occur in, etc. So I don't think there's any reason at all for
exposing this btree split offload implementation detail to any other
code.

As for the "documentation" aspect of the change, see the patchset I
just posted for "per-ag centric allocation". That fixes all the
existing problems with tp->t_firstblock and also converts it to
store an agno (i.e. tp->t_highest_agno) to make it clear that an AG
access restriction has been placed on this transaction.

That's exactly the situation that this deadlock avoidance needs to
be aware of, so I don't see any real need to duplicate this
information given the rework of the t-firstblock infrastructure that
is out for review.

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 35f574421670..da8c769887fd 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2913,9 +2913,22 @@  xfs_btree_split_worker(
 }
 
 /*
- * BMBT split requests often come in with little stack to work on. Push
+ * BMBT split requests often come in with little stack to work on so we push
  * them off to a worker thread so there is lots of stack to use. For the other
  * btree types, just call directly to avoid the context switch overhead here.
+ *
+ * Care must be taken here - the work queue rescuer thread introduces potential
+ * AGF <> worker queue deadlocks if the BMBT block allocation has to lock new
+ * AGFs to allocate blocks. A task being run by the rescuer could attempt to
+ * lock an AGF that is already locked by a task queued to run by the rescuer,
+ * resulting in an ABBA deadlock as the rescuer cannot run the lock holder to
+ * release it until the current thread it is running gains the lock.
+ *
+ * To avoid this issue, we only ever queue BMBT splits that don't have an AGF
+ * already locked to allocate from. The only place that doesn't hold an AGF
+ * locked is unwritten extent conversion at IO completion, but that has already
+ * been offloaded to a worker thread and hence has no stack consumption issues
+ * we have to worry about.
  */
 STATIC int					/* error */
 xfs_btree_split(
@@ -2929,7 +2942,8 @@  xfs_btree_split(
 	struct xfs_btree_split_args	args;
 	DECLARE_COMPLETION_ONSTACK(done);
 
-	if (cur->bc_btnum != XFS_BTNUM_BMAP)
+	if (cur->bc_btnum != XFS_BTNUM_BMAP ||
+	    cur->bc_tp->t_firstblock == NULLFSBLOCK)
 		return __xfs_btree_split(cur, level, ptrp, key, curp, stat);
 
 	args.cur = cur;