Message ID | 20230119010334.1982938-1-david@fromorbit.com (mailing list archive) |
---|---|
State | Accepted |
Series | xfs: don't use BMBT btree split workers for IO completion |
On Thu, Jan 19, 2023 at 12:03:34PM +1100, Dave Chinner wrote:
> The other place we can be called for a BMBT split without a
> preceeding allocation is __xfs_bunmapi() when punching out the
> center of an existing extent. We don't remove extents in the IO
> path, so these operations don't tend to be called with a lot of
> stack consumed. Hence we don't really need to ship the split off to
> a worker thread in these cases, either.

So I agree with the fix. But the t_firstblock check seems a bit opaque.
We do have a lot of comments, which is good, but it still feels a little
fragile to me. Is there a good reason we can't use an explicit flag to
document the intent better?
On Thu, Jan 19, 2023 at 10:33:26AM -0800, Christoph Hellwig wrote:
> On Thu, Jan 19, 2023 at 12:03:34PM +1100, Dave Chinner wrote:
> > The other place we can be called for a BMBT split without a
> > preceeding allocation is __xfs_bunmapi() when punching out the
> > center of an existing extent. We don't remove extents in the IO
> > path, so these operations don't tend to be called with a lot of
> > stack consumed. Hence we don't really need to ship the split off to
> > a worker thread in these cases, either.
>
> So I agree with the fix. But the t_firstblock seems a bit opaque.
> We do have a lot of comments, which is good but it still feels
> a little fragile to me. Is there a good reason we can't do an
> explicit flag to document the intent better?

I don't see any point in adding flags to high level allocation APIs that
expose deeply internal btree modification implementation details.
Callers have no business knowing about this; they have no idea how much
stack they might need or consume, what context a btree split might occur
in, etc. So I don't think there's any reason at all for exposing this
btree split offload implementation detail to any other code.

As for the "documentation" aspect of the change, see the patchset I just
posted for "per-ag centric allocation". That fixes all the existing
problems with tp->t_firstblock and also converts it to store an agno
(i.e. tp->t_highest_agno) to make it clear that an AG access restriction
has been placed on this transaction.

That's exactly the situation that this deadlock avoidance needs to be
aware of, so I don't see any real need to duplicate this information
given the rework of the tp->t_firstblock infrastructure that is out for
review.

Cheers,

Dave.
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 35f574421670..da8c769887fd 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2913,9 +2913,22 @@ xfs_btree_split_worker(
 }
 
 /*
- * BMBT split requests often come in with little stack to work on. Push
+ * BMBT split requests often come in with little stack to work on so we push
  * them off to a worker thread so there is lots of stack to use. For the other
  * btree types, just call directly to avoid the context switch overhead here.
+ *
+ * Care must be taken here - the work queue rescuer thread introduces potential
+ * AGF <> worker queue deadlocks if the BMBT block allocation has to lock new
+ * AGFs to allocate blocks. A task being run by the rescuer could attempt to
+ * lock an AGF that is already locked by a task queued to run by the rescuer,
+ * resulting in an ABBA deadlock as the rescuer cannot run the lock holder to
+ * release it until the current thread it is running gains the lock.
+ *
+ * To avoid this issue, we only ever queue BMBT splits that don't have an AGF
+ * already locked to allocate from. The only place that doesn't hold an AGF
+ * locked is unwritten extent conversion at IO completion, but that has already
+ * been offloaded to a worker thread and hence has no stack consumption issues
+ * we have to worry about.
  */
 STATIC int					/* error */
 xfs_btree_split(
@@ -2929,7 +2942,8 @@ xfs_btree_split(
 	struct xfs_btree_split_args	args;
 	DECLARE_COMPLETION_ONSTACK(done);
 
-	if (cur->bc_btnum != XFS_BTNUM_BMAP)
+	if (cur->bc_btnum != XFS_BTNUM_BMAP ||
+	    cur->bc_tp->t_firstblock == NULLFSBLOCK)
 		return __xfs_btree_split(cur, level, ptrp, key, curp, stat);
 
 	args.cur = cur;