Message ID: 20191205175037.52529-2-bfoster@redhat.com (mailing list archive)
State:      Superseded
Series:     xfs: automatic relogging experiment
On Thu, Dec 05, 2019 at 12:50:36PM -0500, Brian Foster wrote:
> This is an AIL based mechanism to enable automatic relogging of
> selected log items. The use case is for particular operations that
> commit an item known to pin the tail of the log for a potentially
> long period of time and otherwise cannot use a rolling transaction.
> While this does not provide the deadlock avoidance guarantees of a
> rolling transaction, it ties the relog transaction into AIL pushing
> pressure such that we should expect the transaction to reserve the
> necessary log space long before deadlock becomes a problem.
>
> To enable relogging, a bit is set on the log item before it is first
> committed to the log subsystem. Once the item commits to the on-disk
> log and inserts to the AIL, AIL pushing dictates when the item is
> ready for a relog. When that occurs, the item relogs in an
> independent transaction to ensure the log tail keeps moving without
> intervention from the original committer. To disable relogging, the
> original committer clears the log item bit and optionally waits on
> relogging activity to cease if it needs to reuse the item before the
> operation completes.
>
> While the current use case for automatic relogging is limited, the
> mechanism is AIL based because it 1.) provides existing callbacks
> into all possible log item types for future support and 2.) has
> applicable context to determine when to relog particular items (such
> as when an item pins the log tail). This provides enough flexibility
> to support various log item types and future workloads without
> introducing complexity up front for currently unknown use cases.
> Further complexity, such as preallocated or regranted relog
> transaction reservation or custom relog handlers, can be considered
> as the need arises.

....

> +/*
> + * Relog log items on the AIL relog queue.
> + */
> +static void
> +xfs_ail_relog(
> +        struct work_struct      *work)
> +{
> +        struct xfs_ail          *ailp = container_of(work, struct xfs_ail,
> +                                        ail_relog_work);
> +        struct xfs_mount        *mp = ailp->ail_mount;
> +        struct xfs_trans        *tp;
> +        struct xfs_log_item     *lip, *lipp;
> +        int                     error;

Probably need PF_MEMALLOC here - this will need to make progress in
low memory situations, just like the xfsaild.

> +        /* XXX: define a ->tr_relog reservation */
> +        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> +        if (error)
> +                return;

Also needs memalloc_nofs_save() around this whole function, because
we most definitely don't want this recursing into the filesystem and
running transactions when we are trying to ensure we don't pin the
tail of the log...

> +
> +        spin_lock(&ailp->ail_lock);
> +        list_for_each_entry_safe(lip, lipp, &ailp->ail_relog_list, li_trans) {
> +                list_del_init(&lip->li_trans);
> +                xfs_trans_add_item(tp, lip);
> +                set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +                tp->t_flags |= XFS_TRANS_DIRTY;
> +        }
> +        spin_unlock(&ailp->ail_lock);
> +
> +        error = xfs_trans_commit(tp);
> +        ASSERT(!error);
> +}

Simple, but I see several issues here regarding how generic this
approach is.

1. Memory reclaim deadlock. Needs to be run under
memalloc_nofs_save() context (and possibly PF_MEMALLOC) because
transaction allocation can trigger reclaim and that can run
filesystem transactions and can block waiting for log space. If we
already pin the tail of the log, then we deadlock in memory
reclaim...
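Concretely, that protection would look something like the following
(an untested sketch that mirrors what the xfsaild thread itself does;
the allocation/commit path is elided):

static void
xfs_ail_relog(
        struct work_struct      *work)
{
        struct xfs_ail          *ailp = container_of(work, struct xfs_ail,
                                        ail_relog_work);
        unsigned int            nofs_flags;

        /*
         * Sketch: relogging must make progress under memory pressure,
         * so flag the task PF_MEMALLOC like xfsaild does, and forbid
         * recursion into the filesystem from any allocation made while
         * the relog transaction is built and committed.
         */
        current->flags |= PF_MEMALLOC;
        nofs_flags = memalloc_nofs_save();

        /* ... allocate, populate and commit the relog transaction ... */

        memalloc_nofs_restore(nofs_flags);
        current->flags &= ~PF_MEMALLOC;
}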
2. Transaction reservation deadlock. If the item being relogged is
already at the tail of the log, or if the item that pins the tail
preventing further log reservations from finding space is in the
relog list, we will effectively deadlock the filesystem here.

3. It's going to require a log reservation for every item on the
relog list, in total. That cannot be known ahead of time as the
reservation is dependent on the dirty size of the items being added
to the transaction. Hence, if this is a generic mechanism, the
required reservation could be quite large... If it's too large, see
#1.

4. xfs_trans_commit() requires objects being logged to be locked
against concurrent modification. Locking in the AIL context is
non-blocking, so the item needs to be locked against modification
before it is added to the ail_relog_list, and won't get unlocked
until it is committed again.

5. Locking random items in random order into the ail_relog_list
opens us up to ABBA deadlocks with locking in ongoing modifications.

6. If the item is already dirty in a transaction (i.e. currently
locked and joined to another transaction - being relogged) then
xfs_trans_add_item() is either going to fire asserts because
XFS_LI_DIRTY is already set or it's going to corrupt the item list
of that already running transaction.

Given this, I'm skeptical this can be made into a useful, reliable
generic async relogging mechanism.

>  /*
>   * The cursor keeps track of where our current traversal is up to by tracking
>   * the next item in the list for us. However, for this to be safe, removing an
> @@ -363,7 +395,7 @@ static long
>  xfsaild_push(
>          struct xfs_ail          *ailp)
>  {
> -        xfs_mount_t             *mp = ailp->ail_mount;
> +        struct xfs_mount        *mp = ailp->ail_mount;
>          struct xfs_ail_cursor   cur;
>          struct xfs_log_item     *lip;
>          xfs_lsn_t               lsn;
> @@ -425,6 +457,13 @@ xfsaild_push(
>                          ailp->ail_last_pushed_lsn = lsn;
>                          break;
>
> +                case XFS_ITEM_RELOG:
> +                        trace_xfs_ail_relog(lip);
> +                        ASSERT(list_empty(&lip->li_trans));
> +                        list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> +                        set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> +                        break;
> +
>                  case XFS_ITEM_FLUSHING:
>                          /*
>                           * The item or its backing buffer is already being
> @@ -491,6 +530,9 @@ xfsaild_push(
>          if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
>                  ailp->ail_log_flush++;
>
> +        if (!list_empty(&ailp->ail_relog_list))
> +                queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> +

Hmmm. Nothing here appears to guarantee forwards progress of the
relogged items? We just queue the items and move on? What if we scan
the ail and trip over the item again before it's been processed by
the relog worker function?

>          if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
>  out_done:
>                  /*
> @@ -834,15 +876,24 @@ xfs_trans_ail_init(
>          spin_lock_init(&ailp->ail_lock);
>          INIT_LIST_HEAD(&ailp->ail_buf_list);
>          init_waitqueue_head(&ailp->ail_empty);
> +        INIT_LIST_HEAD(&ailp->ail_relog_list);
> +        INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> +
> +        ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> +                        mp->m_super->s_id);

It'll need WQ_MEM_RECLAIM so it has a rescuer thread (relogging has
to work when we are low on memory), and possibly WQ_HIGHPRI so that
it doesn't get stuck behind other workqueues that run transactions
and run the log out of space before we try to reserve log space for
the relogging....

Cheers,

Dave.
On Fri, Dec 06, 2019 at 08:02:11AM +1100, Dave Chinner wrote:
> On Thu, Dec 05, 2019 at 12:50:36PM -0500, Brian Foster wrote:
> > This is an AIL based mechanism to enable automatic relogging of
> > selected log items. The use case is for particular operations that
> > commit an item known to pin the tail of the log for a potentially
> > long period of time and otherwise cannot use a rolling transaction.
> > While this does not provide the deadlock avoidance guarantees of a
> > rolling transaction, it ties the relog transaction into AIL pushing
> > pressure such that we should expect the transaction to reserve the
> > necessary log space long before deadlock becomes a problem.
> >
> > To enable relogging, a bit is set on the log item before it is first
> > committed to the log subsystem. Once the item commits to the on-disk
> > log and inserts to the AIL, AIL pushing dictates when the item is
> > ready for a relog. When that occurs, the item relogs in an
> > independent transaction to ensure the log tail keeps moving without
> > intervention from the original committer. To disable relogging, the
> > original committer clears the log item bit and optionally waits on
> > relogging activity to cease if it needs to reuse the item before the
> > operation completes.
> >
> > While the current use case for automatic relogging is limited, the
> > mechanism is AIL based because it 1.) provides existing callbacks
> > into all possible log item types for future support and 2.) has
> > applicable context to determine when to relog particular items (such
> > as when an item pins the log tail). This provides enough flexibility
> > to support various log item types and future workloads without
> > introducing complexity up front for currently unknown use cases.
> > Further complexity, such as preallocated or regranted relog
> > transaction reservation or custom relog handlers, can be considered
> > as the need arises.
>
> ....
>
> > +/*
> > + * Relog log items on the AIL relog queue.
> > + */
> > +static void
> > +xfs_ail_relog(
> > +        struct work_struct      *work)
> > +{
> > +        struct xfs_ail          *ailp = container_of(work, struct xfs_ail,
> > +                                        ail_relog_work);
> > +        struct xfs_mount        *mp = ailp->ail_mount;
> > +        struct xfs_trans        *tp;
> > +        struct xfs_log_item     *lip, *lipp;
> > +        int                     error;
>
> Probably need PF_MEMALLOC here - this will need to make progress in
> low memory situations, just like the xfsaild.
>
> > +        /* XXX: define a ->tr_relog reservation */
> > +        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > +        if (error)
> > +                return;
>
> Also needs memalloc_nofs_save() around this whole function, because
> we most definitely don't want this recursing into the filesystem and
> running transactions when we are trying to ensure we don't pin the
> tail of the log...
>

Ok. Note that this will likely change to not allocate the transaction
here, but I'll revisit this when appropriate.

> > +
> > +        spin_lock(&ailp->ail_lock);
> > +        list_for_each_entry_safe(lip, lipp, &ailp->ail_relog_list, li_trans) {
> > +                list_del_init(&lip->li_trans);
> > +                xfs_trans_add_item(tp, lip);
> > +                set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +                tp->t_flags |= XFS_TRANS_DIRTY;
> > +        }
> > +        spin_unlock(&ailp->ail_lock);
> > +
> > +        error = xfs_trans_commit(tp);
> > +        ASSERT(!error);
> > +}
>
> Simple, but I see several issues here regarding how generic this
> approach is.
>
> 1. Memory reclaim deadlock. Needs to be run under
> memalloc_nofs_save() context (and possibly PF_MEMALLOC) because
> transaction allocation can trigger reclaim and that can run
> filesystem transactions and can block waiting for log space. If we
> already pin the tail of the log, then we deadlock in memory
> reclaim...
>
> 2. Transaction reservation deadlock. If the item being relogged is
> already at the tail of the log, or if the item that pins the tail
> preventing further log reservations from finding space is in the
> relog list, we will effectively deadlock the filesystem here.
>

Yes, this was by design. There are a couple of angles to this..

First, my testing to this point suggests that the current use case
doesn't require this level of guarantee. This deadlock can manifest
in a few seconds when combined with a parallel workload and a small
log. I've been stalling the quotaoff completion for minutes under
those same conditions to this point without reproducing any sort of
issues.

That said, I think there are other issues with this approach. This
hasn't been tested thoroughly yet, for one. :) I'm not sure the
practical behavior would apply to the scrub use case (more relogged
items) or non-intent items (with lock contention). Finally, I'm not
sure it's totally appropriate to lock items for relog before we
acquire reservation for the relog transaction. All in all, I'm still
considering changes in this area. It's just a matter of finding an
approach that is simple enough and isn't overkill in terms of
overhead or complexity...

> 3. It's going to require a log reservation for every item on the
> relog list, in total. That cannot be known ahead of time as the
> reservation is dependent on the dirty size of the items being added
> to the transaction. Hence, if this is a generic mechanism, the
> required reservation could be quite large... If it's too large, see
> #1.
>

Yeah, I punted on the transaction management for this RFC because 1.)
it's independent from the mechanics of how to relog items and so not
necessary to demonstrate that concept and 2.) I can think of at least
three or four different ways to manage transaction reservation to
address this problem, with varying levels of complexity/flexibility.
See the previous RFCs for examples of what I mean.

I've been thinking about doing something similar for this variant,
but a bit more simple:

- The original (i.e. quotaoff) transaction becomes a permanent
  transaction.

- Immediately after transaction allocation, explicitly charge the
  relogging subsystem with a certain amount of (fixed size)
  reservation and roll the transaction.

- The rolled transaction proceeds as the original would have, sets
  the relog bit on the associated item(s) and commits. The relogged
  items are thus added to the log subsystem with the relog
  reservation already charged.

Note that the aforementioned relog reservation is undone in the event
of error/cancel before the final transaction commits.

On the backend/relog side of things:

- Establish a max item count for the relog transaction (probably 1
  for now). When the max is hit, the relog worker instance rolls the
  transaction and repeats until it drains the current relog item list
  (similar to a truncate operation, for example).

- Keep a global count of the number of active relog items somewhere.
  Once this drops to zero, the relog transaction is free to be
  cancelled and the reservation returned to the grant heads.

The latter bit essentially means the first relog transaction is
responsible for charging the relog system and doing the little
transaction roll dance, while subsequent parallel users effectively
acquire a reference count on the existence of the permanent relog
transaction until there are zero active reloggable items in the
subsystem. The transaction itself probably starts out as a fixed
->tr_relog reservation defined to the maximum supported reloggable
transaction (i.e. quotaoff, for now).

ISTM this keeps things reasonably simple, doesn't introduce runtime
overhead unless relogging is active and guarantees deadlock
avoidance.
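Roughly, in code form (a sketch only; xfs_relog_charge(),
xfs_relog_uncharge(), xfs_relog_pop() and XFS_RELOG_MAX_ITEMS are
hypothetical names, and most error handling is elided):

        /* committer side: charge the relog subsystem, then roll */
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
        if (error)
                return error;
        xfs_relog_charge(mp, tp);                       /* hypothetical */
        error = xfs_trans_roll(&tp);
        if (error) {
                xfs_relog_uncharge(mp);                 /* hypothetical */
                goto out_cancel;
        }
        /*
         * tp now proceeds as the original transaction would have: set
         * XFS_LI_RELOG on the item(s) and commit. The items enter the
         * log subsystem with the relog reservation already held.
         */

        /* relog worker side: drain the queue, rolling at the max count */
        count = 0;
        while ((lip = xfs_relog_pop(ailp)) != NULL) {   /* hypothetical */
                xfs_trans_add_item(tp, lip);
                set_bit(XFS_LI_DIRTY, &lip->li_flags);
                tp->t_flags |= XFS_TRANS_DIRTY;
                if (++count >= XFS_RELOG_MAX_ITEMS) {   /* probably 1 */
                        error = xfs_trans_roll(&tp);
                        count = 0;
                }
        }
        error = xfs_trans_commit(tp);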
Thoughts on that?

> 4. xfs_trans_commit() requires objects being logged to be locked
> against concurrent modification. Locking in the AIL context is
> non-blocking, so the item needs to be locked against modification
> before it is added to the ail_relog_list, and won't get unlocked
> until it is committed again.
>

Yep..

> 5. Locking random items in random order into the ail_relog_list
> opens us up to ABBA deadlocks with locking in ongoing modifications.
>

This relies on locking behavior as already implemented by xfsaild.
The whole point is to tie into existing infrastructure rather than
duplicate it or introduce new access patterns. All this does is
change how we handle certain log items once they are locked. It
doesn't introduce any new locking patterns that don't already occur.

> 6. If the item is already dirty in a transaction (i.e. currently
> locked and joined to another transaction - being relogged) then
> xfs_trans_add_item() is either going to fire asserts because
> XFS_LI_DIRTY is already set or it's going to corrupt the item list
> of that already running transaction.
>

As above, the item isn't processed unless it is locked or otherwise
exclusive. In a general sense, there are no new log item access
patterns here. Relogging follows the same log item access rules as
the rest of the code.

> Given this, I'm skeptical this can be made into a useful, reliable
> generic async relogging mechanism.
>

Fair, but that's also not the current goal. The goal is to address
the current use cases of quotaoff and scrub with consideration for
similar, basic handling of arbitrary log items in the future. It's
fairly trivial to implement relogging of a buffer or inode on top of
this, for example. If more complex use cases arise, we can build in
more complexity as needed or further genericize the mechanism, but I
see no need to build in complexity or introduce major abstractions
given the current couple of fairly quirky use cases and otherwise
lack of further requirements. I view the lack of infrastructure as an
advantage, so I'm intentionally trying to keep things simple.

IOW, worst case, the mechanism in this approach is easy to repurpose
because, after the reservation management bits, it's just a couple of
log item states, a list and a workqueue job. The locking issues noted
above strike me as a misunderstanding one way or another (which is
easy to do given the undefined use case). Given that, and with the
transaction issues resolved, I don't see the remaining concerns as
blockers. If you have thoughts on some future use cases that
obviously conflict or might warrant further complexity/requirements
right now, I'd be interested to hear more about it. Otherwise I'm
just kind of guessing at things in that regard; it's hard to design a
system around unknown requirements...

(I could also look into wiring up an arbitrary non-intent log item as
a canary, I suppose. That would be a final RFC patch that serves no
real purpose and wouldn't be merged, but would hook up the mechanism
for testing purposes, help prove sanity and provide some actual code
to reason about.)

> >  /*
> >   * The cursor keeps track of where our current traversal is up to by tracking
> >   * the next item in the list for us. However, for this to be safe, removing an
> > @@ -363,7 +395,7 @@ static long
> >  xfsaild_push(
> >          struct xfs_ail          *ailp)
> >  {
> > -        xfs_mount_t             *mp = ailp->ail_mount;
> > +        struct xfs_mount        *mp = ailp->ail_mount;
> >          struct xfs_ail_cursor   cur;
> >          struct xfs_log_item     *lip;
> >          xfs_lsn_t               lsn;
> > @@ -425,6 +457,13 @@ xfsaild_push(
> >                          ailp->ail_last_pushed_lsn = lsn;
> >                          break;
> >
> > +                case XFS_ITEM_RELOG:
> > +                        trace_xfs_ail_relog(lip);
> > +                        ASSERT(list_empty(&lip->li_trans));
> > +                        list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> > +                        set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> > +                        break;
> > +
> >                  case XFS_ITEM_FLUSHING:
> >                          /*
> >                           * The item or its backing buffer is already being
> > @@ -491,6 +530,9 @@ xfsaild_push(
> >          if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> >                  ailp->ail_log_flush++;
> >
> > +        if (!list_empty(&ailp->ail_relog_list))
> > +                queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> > +
>
> Hmmm. Nothing here appears to guarantee forwards progress of the
> relogged items? We just queue the items and move on? What if we scan
> the ail and trip over the item again before it's been processed by
> the relog worker function?
>

See the XFS_LI_RELOGGED bit (and specifically how it is used in the
next patch). This could be lifted up a layer if that is more clear.

> >          if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
> >  out_done:
> >                  /*
> > @@ -834,15 +876,24 @@ xfs_trans_ail_init(
> >          spin_lock_init(&ailp->ail_lock);
> >          INIT_LIST_HEAD(&ailp->ail_buf_list);
> >          init_waitqueue_head(&ailp->ail_empty);
> > +        INIT_LIST_HEAD(&ailp->ail_relog_list);
> > +        INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> > +
> > +        ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> > +                        mp->m_super->s_id);
>
> It'll need WQ_MEM_RECLAIM so it has a rescuer thread (relogging has
> to work when we are low on memory), and possibly WQ_HIGHPRI so that
> it doesn't get stuck behind other workqueues that run transactions
> and run the log out of space before we try to reserve log space for
> the relogging....
>

Ok, I had dropped those for now because it wasn't immediately clear
if they were needed. I'll take another look at this.
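FWIW, with both of those flags the allocation would read something
like this (sketch only; the exact flag combination is still open):

        /*
         * Relogging must make progress in low memory conditions
         * (WQ_MEM_RECLAIM provides a rescuer thread) and shouldn't
         * queue behind transaction-running work (WQ_HIGHPRI).
         */
        ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s",
                        WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_HIGHPRI, 0,
                        mp->m_super->s_id);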
Thanks for the feedback...

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c13bb3655e48..6c2a9cdadd03 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
 
 DECLARE_EVENT_CLASS(xfs_ail_class,
         TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3b208f9a865c..f2c06cdd1074 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -763,6 +763,35 @@ xfs_trans_del_item(
         list_del_init(&lip->li_trans);
 }
 
+void
+xfs_trans_enable_relog(
+        struct xfs_log_item     *lip)
+{
+        set_bit(XFS_LI_RELOG, &lip->li_flags);
+}
+
+void
+xfs_trans_disable_relog(
+        struct xfs_log_item     *lip,
+        bool                    drain)  /* wait for relogging to cease */
+{
+        struct xfs_mount        *mp = lip->li_mountp;
+
+        clear_bit(XFS_LI_RELOG, &lip->li_flags);
+
+        if (!drain)
+                return;
+
+        /*
+         * Some operations might require relog activity to cease before they
+         * can proceed. For example, an operation must wait before including
+         * a non-lockable log item (i.e. intent) in another transaction.
+         */
+        while (wait_on_bit_timeout(&lip->li_flags, XFS_LI_RELOGGED,
+                                   TASK_UNINTERRUPTIBLE, HZ))
+                xfs_log_force(mp, XFS_LOG_SYNC);
+}
+
 /* Detach and unlock all of the items in a transaction */
 static void
 xfs_trans_free_items(
@@ -848,6 +877,7 @@ xfs_trans_committed_bulk(
 
                 if (aborted)
                         set_bit(XFS_LI_ABORTED, &lip->li_flags);
+                clear_and_wake_up_bit(XFS_LI_RELOGGED, &lip->li_flags);
 
                 if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) {
                         lip->li_ops->iop_release(lip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 64d7f171ebd3..6d4311d82c4c 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -59,12 +59,16 @@ struct xfs_log_item {
 #define XFS_LI_ABORTED  1
 #define XFS_LI_FAILED   2
 #define XFS_LI_DIRTY    3       /* log item dirty in transaction */
+#define XFS_LI_RELOG    4       /* automatic relogging */
+#define XFS_LI_RELOGGED 5       /* relogged by xfsaild */
 
 #define XFS_LI_FLAGS \
         { (1 << XFS_LI_IN_AIL),         "IN_AIL" }, \
         { (1 << XFS_LI_ABORTED),        "ABORTED" }, \
         { (1 << XFS_LI_FAILED),         "FAILED" }, \
-        { (1 << XFS_LI_DIRTY),          "DIRTY" }
+        { (1 << XFS_LI_DIRTY),          "DIRTY" }, \
+        { (1 << XFS_LI_RELOG),          "RELOG" }, \
+        { (1 << XFS_LI_RELOGGED),       "RELOGGED" }
 
 struct xfs_item_ops {
         unsigned flags;
@@ -95,6 +99,7 @@ void xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
 #define XFS_ITEM_PINNED         1
 #define XFS_ITEM_LOCKED         2
 #define XFS_ITEM_FLUSHING       3
+#define XFS_ITEM_RELOG          4
 
 /*
  * Deferred operation item relogging limits.
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 00cc5b8734be..bb54d00ae095 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -143,6 +143,38 @@ xfs_ail_max_lsn(
         return lsn;
 }
 
+/*
+ * Relog log items on the AIL relog queue.
+ */
+static void
+xfs_ail_relog(
+        struct work_struct      *work)
+{
+        struct xfs_ail          *ailp = container_of(work, struct xfs_ail,
+                                        ail_relog_work);
+        struct xfs_mount        *mp = ailp->ail_mount;
+        struct xfs_trans        *tp;
+        struct xfs_log_item     *lip, *lipp;
+        int                     error;
+
+        /* XXX: define a ->tr_relog reservation */
+        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
+        if (error)
+                return;
+
+        spin_lock(&ailp->ail_lock);
+        list_for_each_entry_safe(lip, lipp, &ailp->ail_relog_list, li_trans) {
+                list_del_init(&lip->li_trans);
+                xfs_trans_add_item(tp, lip);
+                set_bit(XFS_LI_DIRTY, &lip->li_flags);
+                tp->t_flags |= XFS_TRANS_DIRTY;
+        }
+        spin_unlock(&ailp->ail_lock);
+
+        error = xfs_trans_commit(tp);
+        ASSERT(!error);
+}
+
 /*
  * The cursor keeps track of where our current traversal is up to by tracking
  * the next item in the list for us. However, for this to be safe, removing an
@@ -363,7 +395,7 @@ static long
 xfsaild_push(
         struct xfs_ail          *ailp)
 {
-        xfs_mount_t             *mp = ailp->ail_mount;
+        struct xfs_mount        *mp = ailp->ail_mount;
         struct xfs_ail_cursor   cur;
         struct xfs_log_item     *lip;
         xfs_lsn_t               lsn;
@@ -425,6 +457,13 @@ xfsaild_push(
                         ailp->ail_last_pushed_lsn = lsn;
                         break;
 
+                case XFS_ITEM_RELOG:
+                        trace_xfs_ail_relog(lip);
+                        ASSERT(list_empty(&lip->li_trans));
+                        list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
+                        set_bit(XFS_LI_RELOGGED, &lip->li_flags);
+                        break;
+
                 case XFS_ITEM_FLUSHING:
                         /*
                          * The item or its backing buffer is already being
@@ -491,6 +530,9 @@ xfsaild_push(
         if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
                 ailp->ail_log_flush++;
 
+        if (!list_empty(&ailp->ail_relog_list))
+                queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
+
         if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
 out_done:
                 /*
@@ -834,15 +876,24 @@ xfs_trans_ail_init(
         spin_lock_init(&ailp->ail_lock);
         INIT_LIST_HEAD(&ailp->ail_buf_list);
         init_waitqueue_head(&ailp->ail_empty);
+        INIT_LIST_HEAD(&ailp->ail_relog_list);
+        INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
+
+        ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
+                        mp->m_super->s_id);
+        if (!ailp->ail_relog_wq)
+                goto out_free_ailp;
 
         ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
                         ailp->ail_mount->m_super->s_id);
         if (IS_ERR(ailp->ail_task))
-                goto out_free_ailp;
+                goto out_destroy_wq;
 
         mp->m_ail = ailp;
         return 0;
 
+out_destroy_wq:
+        destroy_workqueue(ailp->ail_relog_wq);
 out_free_ailp:
         kmem_free(ailp);
         return -ENOMEM;
@@ -855,5 +906,6 @@ xfs_trans_ail_destroy(
         struct xfs_ail  *ailp = mp->m_ail;
 
         kthread_stop(ailp->ail_task);
+        destroy_workqueue(ailp->ail_relog_wq);
         kmem_free(ailp);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 2e073c1c4614..3cefc821350e 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -16,6 +16,8 @@ struct xfs_log_vec;
 void    xfs_trans_init(struct xfs_mount *);
 void    xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
 void    xfs_trans_del_item(struct xfs_log_item *);
+void    xfs_trans_enable_relog(struct xfs_log_item *);
+void    xfs_trans_disable_relog(struct xfs_log_item *, bool);
 void    xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
 
 void    xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
@@ -61,6 +63,9 @@ struct xfs_ail {
         int                     ail_log_flush;
         struct list_head        ail_buf_list;
         wait_queue_head_t       ail_empty;
+        struct work_struct      ail_relog_work;
+        struct list_head        ail_relog_list;
+        struct workqueue_struct *ail_relog_wq;
 };
 
 /*
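For context, a caller would use the new hooks along these lines. This
is an illustrative sketch only; the quotaoff conversion itself belongs
to a later patch in the series, and the qoff variable here is a
hypothetical stand-in for the committer's log item:

        /* tag the intent for automatic relogging before its first commit */
        xfs_trans_enable_relog(&qoff->qql_item);
        error = xfs_trans_commit(tp);

        /* ... long-running work; xfsaild relogs the intent as needed ... */

        /* stop relogging and drain any in-flight relog before item reuse */
        xfs_trans_disable_relog(&qoff->qql_item, true);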
This is an AIL based mechanism to enable automatic relogging of
selected log items. The use case is for particular operations that
commit an item known to pin the tail of the log for a potentially
long period of time and otherwise cannot use a rolling transaction.
While this does not provide the deadlock avoidance guarantees of a
rolling transaction, it ties the relog transaction into AIL pushing
pressure such that we should expect the transaction to reserve the
necessary log space long before deadlock becomes a problem.

To enable relogging, a bit is set on the log item before it is first
committed to the log subsystem. Once the item commits to the on-disk
log and inserts to the AIL, AIL pushing dictates when the item is
ready for a relog. When that occurs, the item relogs in an
independent transaction to ensure the log tail keeps moving without
intervention from the original committer. To disable relogging, the
original committer clears the log item bit and optionally waits on
relogging activity to cease if it needs to reuse the item before the
operation completes.

While the current use case for automatic relogging is limited, the
mechanism is AIL based because it 1.) provides existing callbacks
into all possible log item types for future support and 2.) has
applicable context to determine when to relog particular items (such
as when an item pins the log tail). This provides enough flexibility
to support various log item types and future workloads without
introducing complexity up front for currently unknown use cases.
Further complexity, such as preallocated or regranted relog
transaction reservation or custom relog handlers, can be considered
as the need arises.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trace.h      |  1 +
 fs/xfs/xfs_trans.c      | 30 ++++++++++++++++++++++
 fs/xfs/xfs_trans.h      |  7 +++++-
 fs/xfs/xfs_trans_ail.c  | 56 +++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_trans_priv.h |  5 ++++
 5 files changed, 96 insertions(+), 3 deletions(-)