[RFC,v4,1/2] xfs: automatic log item relog mechanism

Message ID 20191205175037.52529-2-bfoster@redhat.com (mailing list archive)
State Superseded
Series xfs: automatic relogging experiment

Commit Message

Brian Foster Dec. 5, 2019, 5:50 p.m. UTC
This is an AIL based mechanism to enable automatic relogging of
selected log items. The use case is for particular operations that
commit an item known to pin the tail of the log for a potentially
long period of time and otherwise cannot use a rolling transaction.
While this does not provide the deadlock avoidance guarantees of a
rolling transaction, it ties the relog transaction into AIL pushing
pressure such that we should expect the transaction to reserve the
necessary log space long before deadlock becomes a problem.

To enable relogging, a bit is set on the log item before it is first
committed to the log subsystem. Once the item commits to the on-disk
log and inserts to the AIL, AIL pushing dictates when the item is
ready for a relog. When that occurs, the item relogs in an
independent transaction to ensure the log tail keeps moving without
intervention from the original committer.  To disable relogging, the
original committer clears the log item bit and optionally waits on
relogging activity to cease if it needs to reuse the item before the
operation completes.
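
As a rough usage sketch (hypothetical caller, error handling elided;
only the enable/disable calls below are added by this patch):

	/* opt the item in before its first commit */
	xfs_trans_enable_relog(lip);
	error = xfs_trans_commit(tp);

	/* ... long-running operation; the AIL relogs the item as needed ... */

	/* opt back out and wait for any relog activity to drain */
	xfs_trans_disable_relog(lip, true);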

While the current use case for automatic relogging is limited, the
mechanism is AIL based because it 1.) provides existing callbacks
into all possible log item types for future support and 2.) has
applicable context to determine when to relog particular items (such
as when an item pins the log tail). This provides enough flexibility
to support various log item types and future workloads without
introducing complexity up front for currently unknown use cases.
Further complexity, such as preallocated or regranted relog
transaction reservation or custom relog handlers can be considered
as the need arises.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trace.h      |  1 +
 fs/xfs/xfs_trans.c      | 30 ++++++++++++++++++++++
 fs/xfs/xfs_trans.h      |  7 +++++-
 fs/xfs/xfs_trans_ail.c  | 56 +++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_trans_priv.h |  5 ++++
 5 files changed, 96 insertions(+), 3 deletions(-)

Comments

Dave Chinner Dec. 5, 2019, 9:02 p.m. UTC | #1
On Thu, Dec 05, 2019 at 12:50:36PM -0500, Brian Foster wrote:
> This is an AIL based mechanism to enable automatic relogging of
> selected log items. The use case is for particular operations that
> commit an item known to pin the tail of the log for a potentially
> long period of time and otherwise cannot use a rolling transaction.
> While this does not provide the deadlock avoidance guarantees of a
> rolling transaction, it ties the relog transaction into AIL pushing
> pressure such that we should expect the transaction to reserve the
> necessary log space long before deadlock becomes a problem.
> 
> To enable relogging, a bit is set on the log item before it is first
> committed to the log subsystem. Once the item commits to the on-disk
> log and inserts to the AIL, AIL pushing dictates when the item is
> ready for a relog. When that occurs, the item relogs in an
> independent transaction to ensure the log tail keeps moving without
> intervention from the original committer.  To disable relogging, the
> original committer clears the log item bit and optionally waits on
> relogging activity to cease if it needs to reuse the item before the
> operation completes.
> 
> While the current use case for automatic relogging is limited, the
> mechanism is AIL based because it 1.) provides existing callbacks
> into all possible log item types for future support and 2.) has
> applicable context to determine when to relog particular items (such
> as when an item pins the log tail). This provides enough flexibility
> to support various log item types and future workloads without
> introducing complexity up front for currently unknown use cases.
> Further complexity, such as preallocated or regranted relog
> transaction reservation or custom relog handlers can be considered
> as the need arises.

....

> +/*
> + * Relog log items on the AIL relog queue.
> + */
> +static void
> +xfs_ail_relog(
> +	struct work_struct	*work)
> +{
> +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> +						     ail_relog_work);
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +	struct xfs_trans	*tp;
> +	struct xfs_log_item	*lip, *lipp;
> +	int			error;

Probably need PF_MEMALLOC here - this will need to make progress in
low memory situations, just like the xfsaild.

> +	/* XXX: define a ->tr_relog reservation */
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> +	if (error)
> +		return;

Also needs memalloc_nofs_save() around this whole function, because
we most definitely don't want this recursing into the filesystem and
running transactions when we are trying to ensure we don't pin the
tail of the log...
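
A sketch of that, assuming the rest of the function body stays as it
is in the patch:

	static void
	xfs_ail_relog(
		struct work_struct	*work)
	{
		unsigned int		nofs_flag;

		/*
		 * Transaction allocation can recurse into the filesystem via
		 * direct reclaim, so run the whole relog worker under NOFS.
		 */
		nofs_flag = memalloc_nofs_save();

		/* existing xfs_trans_alloc()/relog list/commit logic here */

		memalloc_nofs_restore(nofs_flag);
	}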

> +
> +	spin_lock(&ailp->ail_lock);
> +	list_for_each_entry_safe(lip, lipp, &ailp->ail_relog_list, li_trans) {
> +		list_del_init(&lip->li_trans);
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		tp->t_flags |= XFS_TRANS_DIRTY;
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +
> +	error = xfs_trans_commit(tp);
> +	ASSERT(!error);
> +}

Simple, but I see several issues here regarding how generic this
approach is.

1. Memory reclaim deadlock. Needs to be run under
memalloc_nofs_save() context (and possibly PF_MEMALLOC) because
transaction allocation can trigger reclaim and that can run
filesystem transactions and can block waiting for log space. If we
already pin the tail of the log, then we deadlock in memory
reclaim...

2. Transaction reservation deadlock. If the item being relogged is
already at the tail of the log, or if the item that pins the tail
preventing further log reservations from finding space is in the
relog list, we will effectively deadlock the filesystem here.

3. it's going to require a log reservation for every item on the
relog list, in total. That cannot be known ahead of time as the
reservation is dependent on the dirty size of the items being added
to the transaction. Hence, if this is a generic mechanism, the
required reservation could be quite large... If it's too large, see
#1.

4. xfs_trans_commit() requires objects being logged to be locked
against concurrent modification.  Locking in the AIL context is
non-blocking, so the item needs to be locked against modification
before it is added to the ail_relog_list, and won't get unlocked
until it is committed again.

5. Locking random items in random order into the ail_relog_list
opens us up to ABBA deadlocks with locking in ongoing modifications.

6. If the item is already dirty in a transaction (i.e. currently
locked and joined to another transaction - being relogged) then
xfs_trans_add_item() is either going to fire asserts because
XFS_LI_DIRTY is already set or it's going to corrupt the item list
of that already running transaction.

Given this, I'm skeptical this can be made into a useful, reliable
generic async relogging mechanism.

>  /*
>   * The cursor keeps track of where our current traversal is up to by tracking
>   * the next item in the list for us. However, for this to be safe, removing an
> @@ -363,7 +395,7 @@ static long
>  xfsaild_push(
>  	struct xfs_ail		*ailp)
>  {
> -	xfs_mount_t		*mp = ailp->ail_mount;
> +	struct xfs_mount	*mp = ailp->ail_mount;
>  	struct xfs_ail_cursor	cur;
>  	struct xfs_log_item	*lip;
>  	xfs_lsn_t		lsn;
> @@ -425,6 +457,13 @@ xfsaild_push(
>  			ailp->ail_last_pushed_lsn = lsn;
>  			break;
>  
> +		case XFS_ITEM_RELOG:
> +			trace_xfs_ail_relog(lip);
> +			ASSERT(list_empty(&lip->li_trans));
> +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> +			break;
> +
>  		case XFS_ITEM_FLUSHING:
>  			/*
>  			 * The item or its backing buffer is already being
> @@ -491,6 +530,9 @@ xfsaild_push(
>  	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
>  		ailp->ail_log_flush++;
>  
> +	if (!list_empty(&ailp->ail_relog_list))
> +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> +

Hmmm. Nothing here appears to guarantee forwards progress of the
relogged items? We just queue the items and move on? What if we scan
the ail and trip over the item again before it's been processed by
the relog worker function?

>  	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
>  out_done:
>  		/*
> @@ -834,15 +876,24 @@ xfs_trans_ail_init(
>  	spin_lock_init(&ailp->ail_lock);
>  	INIT_LIST_HEAD(&ailp->ail_buf_list);
>  	init_waitqueue_head(&ailp->ail_empty);
> +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> +
> +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> +					     mp->m_super->s_id);

It'll need WQ_MEM_RECLAIM so it has a rescuer thread (relogging has
to work when we are low on memory), and possibly WQ_HIGHPRI so that it
doesn't get stuck behind other workqueues that run transactions and
run the log out of space before we try to reserve log space for the
relogging....
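
IOWs, something like this (untested) for the allocation in
xfs_trans_ail_init():

	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s",
			WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_HIGHPRI, 0,
			mp->m_super->s_id);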

Cheers,

Dave.
Brian Foster Dec. 6, 2019, 2:56 p.m. UTC | #2
On Fri, Dec 06, 2019 at 08:02:11AM +1100, Dave Chinner wrote:
> On Thu, Dec 05, 2019 at 12:50:36PM -0500, Brian Foster wrote:
> > This is an AIL based mechanism to enable automatic relogging of
> > selected log items. The use case is for particular operations that
> > commit an item known to pin the tail of the log for a potentially
> > long period of time and otherwise cannot use a rolling transaction.
> > While this does not provide the deadlock avoidance guarantees of a
> > rolling transaction, it ties the relog transaction into AIL pushing
> > pressure such that we should expect the transaction to reserve the
> > necessary log space long before deadlock becomes a problem.
> > 
> > To enable relogging, a bit is set on the log item before it is first
> > committed to the log subsystem. Once the item commits to the on-disk
> > log and inserts to the AIL, AIL pushing dictates when the item is
> > ready for a relog. When that occurs, the item relogs in an
> > independent transaction to ensure the log tail keeps moving without
> > intervention from the original committer.  To disable relogging, the
> > original committer clears the log item bit and optionally waits on
> > relogging activity to cease if it needs to reuse the item before the
> > operation completes.
> > 
> > While the current use case for automatic relogging is limited, the
> > mechanism is AIL based because it 1.) provides existing callbacks
> > into all possible log item types for future support and 2.) has
> > applicable context to determine when to relog particular items (such
> > as when an item pins the log tail). This provides enough flexibility
> > to support various log item types and future workloads without
> > introducing complexity up front for currently unknown use cases.
> > Further complexity, such as preallocated or regranted relog
> > transaction reservation or custom relog handlers can be considered
> > as the need arises.
> 
> ....
> 
> > +/*
> > + * Relog log items on the AIL relog queue.
> > + */
> > +static void
> > +xfs_ail_relog(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> > +						     ail_relog_work);
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> > +	struct xfs_trans	*tp;
> > +	struct xfs_log_item	*lip, *lipp;
> > +	int			error;
> 
> Probably need PF_MEMALLOC here - this will need to make progress in
> low memory situations, just like the xfsaild.
> 
> > +	/* XXX: define a ->tr_relog reservation */
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > +	if (error)
> > +		return;
> 
> Also needs memalloc_nofs_save() around this whole function, because
> we most definitely don't want this recursing into the filesystem and
> running transactions when we are trying to ensure we don't pin the
> tail of the log...
> 

Ok. Note that this will likely change to not allocate the transaction
here, but I'll revisit this when appropriate.

> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	list_for_each_entry_safe(lip, lipp, &ailp->ail_relog_list, li_trans) {
> > +		list_del_init(&lip->li_trans);
> > +		xfs_trans_add_item(tp, lip);
> > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > +	}
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +	error = xfs_trans_commit(tp);
> > +	ASSERT(!error);
> > +}
> 
> Simple, but I see several issues here regarding how generic this
> approach is.
> 
> 1. Memory reclaim deadlock. Needs to be run under
> memalloc_nofs_save() context (and possibly PF_MEMALLOC) because
> transaction allocation can trigger reclaim and that can run
> filesystem transactions and can block waiting for log space. If we
> already pin the tail of the log, then we deadlock in memory
> reclaim...
> 
> 2. Transaction reservation deadlock. If the item being relogged is
> already at the tail of the log, or if the item that pins the tail
> preventing further log reservations from finding space is in the
> relog list, we will effectively deadlock the filesystem here.
> 

Yes, this was by design. There are a couple angles to this..

First, my testing to this point suggests that the current use case
doesn't require this level of guarantee. This deadlock can manifest in a
few seconds when combined with a parallel workload and small log. I've
been stalling the quotaoff completion for minutes under those same
conditions without reproducing any issues.

That said, I think there are other issues with this approach. This
hasn't been tested thoroughly yet, for one. :) I'm not sure the
practical behavior would apply to the scrub use case (more relogged
items) or non-intent items (with lock contention). Finally, I'm not sure
it's totally appropriate to lock items for relog before we acquire
reservation for the relog transaction. All in all, I'm still considering
changes in this area. It's just a matter of finding an approach that is
simple enough and isn't overkill in terms of overhead or complexity...

> 3. it's going to require a log reservation for every item on the
> relog list, in total. That cannot be known ahead of time as the
> reservation is dependent on the dirty size of the items being added
> to the transaction. Hence, if this is a generic mechanism, the
> required reservation could be quite large... If it's too large, see
> #1.
> 

Yeah, I punted on the transaction management for this RFC because 1.)
it's independent from the mechanics of how to relog items and so not
necessary to demonstrate that concept and 2.) I can think of at least
three or four different ways to manage transaction reservation to
address this problem, with varying levels of complexity/flexibility.

See the previous RFCs for examples of what I mean. I've been thinking
about doing something similar for this variant, but a bit more simple:

- Original (i.e. quotaoff) transaction becomes a permanent transaction.
- Immediately after transaction allocation, explicitly charge the
  relogging subsystem with a certain amount of (fixed size) reservation
  and roll the transaction.
- The rolled transaction proceeds as the original would have, sets the
  relog bit on associated item(s) and commits. The relogged items are
  thus added to the log subsystem with the relog reservation already
  charged.

Note that the aforementioned relog reservation is undone in the event of
error/cancel before the final transaction commits.

On the backend/relog side of things:

- Establish a max item count for the relog transaction (probably 1 for
  now). When the max is hit, the relog worker instance rolls the
  transaction and repeats until it drains the current relog item list
  (similar to a truncate operation, for example).
- Keep a global count of the number of active relog items somewhere.
  Once this drops to zero, the relog transaction is free to be cancelled
  and the reservation returned to the grant heads.

The latter bit essentially means the first relog transaction is
responsible for charging the relog system and doing the little
transaction roll dance, while subsequent parallel users effectively
acquire a reference count on the existence of the permanent relog
transaction until there are zero active reloggable items in the
subsystem. The transaction itself probably starts out as a fixed
->tr_relog reservation defined to the maximum supported reloggable
transaction (i.e. quotaoff, for now).

ISTM this keeps things reasonably simple, doesn't introduce runtime
overhead unless relogging is active and guarantees deadlock avoidance.
Thoughts on that?
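
As a very rough sketch of the back end of that (all names here are
placeholders and not part of this patch; xfs_relog_charge() and
xfs_relog_uncharge() stand in for the reserve/roll and cancel steps
described above):

	/* e.g. in struct xfs_ail */
	atomic_t		ail_relog_items;	/* active reloggable items */

	/* first reloggable item charges the relog system ... */
	if (atomic_inc_return(&ailp->ail_relog_items) == 1)
		error = xfs_relog_charge(tp);

	/* ... last one out cancels and returns the reservation */
	if (atomic_dec_and_test(&ailp->ail_relog_items))
		xfs_relog_uncharge(ailp);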

> 4. xfs_trans_commit() requires objects being logged to be locked
> against concurrent modification.  Locking in the AIL context is
> non-blocking, so the item needs to be locked against modification
> before it is added to the ail_relog_list, and won't get unlocked
> until it is committed again.
> 

Yep..

> 5. Locking random items in random order into the ail_relog_list
> opens us up to ABBA deadlocks with locking in ongoing modifications.
> 

This relies on locking behavior as already implemented by xfsaild. The
whole point is to tie into existing infrastructure rather than duplicate
it or introduce new access patterns. All this does is change how we
handle certain log items once they are locked. It doesn't introduce any
new locking patterns that don't already occur.

> 6. If the item is already dirty in a transaction (i.e. currently
> locked and joined to another transaction - being relogged) then
> xfs_trans_add_item() is either going to fire asserts because
> XFS_LI_DIRTY is already set or it's going to corrupt the item list
> of that already running transaction.
> 

As above, the item isn't processed unless it is locked or otherwise
exclusive. In a general sense, there are no new log item access patterns
here. Relogging follows the same log item access rules as the rest of
the code.

> Given this, I'm skeptical this can be made into a useful, reliable
> generic async relogging mechanism.
> 

Fair, but that's also not the current goal. The goal is to address the
current use cases of quotaoff and scrub with consideration for similar,
basic handling of arbitrary log items in the future. It's fairly trivial
to implement relogging of a buffer or inode on top of this, for example.

If more complex use cases arise, we can build in more complexity as
needed or further genericize the mechanism, but I see no need to build
in complexity or introduce major abstractions given the current couple
of fairly quirky use cases and otherwise lack of further requirements. I
view the lack of infrastructure as an advantage so I'm intentionally
trying to keep things simple. IOW, worst case, the mechanism in this
approach is easy to repurpose because after the reservation management
bits, it's just a couple log item states, a list and a workqueue job.

The locking issues noted above strike me as a misunderstanding one way
or another (which is understandable given the undefined use case).
Given that and with the transaction issues resolved I don't see the
remaining concerns as blockers. If you have thoughts on some future use
cases that obviously conflict or might warrant further
complexity/requirements right now, I'd be interested to hear more about
it. Otherwise I'm just kind of guessing at things in that regard; it's
hard to design a system around unknown requirements...

(I could also look into wiring up an arbitrary non-intent log item as a
canary, I suppose. That would be a final RFC patch that serves no real
purpose and wouldn't be merged, but would hook up the mechanism for
testing purposes, help prove sanity and provide some actual code to
reason about.)

> >  /*
> >   * The cursor keeps track of where our current traversal is up to by tracking
> >   * the next item in the list for us. However, for this to be safe, removing an
> > @@ -363,7 +395,7 @@ static long
> >  xfsaild_push(
> >  	struct xfs_ail		*ailp)
> >  {
> > -	xfs_mount_t		*mp = ailp->ail_mount;
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> >  	struct xfs_ail_cursor	cur;
> >  	struct xfs_log_item	*lip;
> >  	xfs_lsn_t		lsn;
> > @@ -425,6 +457,13 @@ xfsaild_push(
> >  			ailp->ail_last_pushed_lsn = lsn;
> >  			break;
> >  
> > +		case XFS_ITEM_RELOG:
> > +			trace_xfs_ail_relog(lip);
> > +			ASSERT(list_empty(&lip->li_trans));
> > +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> > +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> > +			break;
> > +
> >  		case XFS_ITEM_FLUSHING:
> >  			/*
> >  			 * The item or its backing buffer is already being
> > @@ -491,6 +530,9 @@ xfsaild_push(
> >  	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> >  		ailp->ail_log_flush++;
> >  
> > +	if (!list_empty(&ailp->ail_relog_list))
> > +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> > +
> 
> Hmmm. Nothing here appears to guarantee forwards progress of the
> relogged items? We just queue the items and move on? What if we scan
> the ail and trip over the item again before it's been processed by
> the relog worker function?
> 

See the XFS_LI_RELOGGED bit (and specifically how it is used in the next
patch). This could be lifted up a layer if that is more clear.
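
For illustration, a minimal sketch of how an item's ->iop_push() might
use those bits (hypothetical item type; the real user is in the next
patch):

	STATIC uint
	xfs_foo_item_push(
		struct xfs_log_item	*lip,
		struct list_head	*buffer_list)
	{
		/* a relog is already queued or in flight; don't queue again */
		if (test_bit(XFS_LI_RELOGGED, &lip->li_flags))
			return XFS_ITEM_PINNED;

		/* ask xfsaild to queue this item for relogging */
		if (test_bit(XFS_LI_RELOG, &lip->li_flags))
			return XFS_ITEM_RELOG;

		return XFS_ITEM_PINNED;
	}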

> >  	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
> >  out_done:
> >  		/*
> > @@ -834,15 +876,24 @@ xfs_trans_ail_init(
> >  	spin_lock_init(&ailp->ail_lock);
> >  	INIT_LIST_HEAD(&ailp->ail_buf_list);
> >  	init_waitqueue_head(&ailp->ail_empty);
> > +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> > +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> > +
> > +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> > +					     mp->m_super->s_id);
> 
> It'll need WQ_MEM_RECLAIM so it has a rescuer thread (relogging has
> to work when we are low on memory), and possibly WQ_HIGHPRI so that it
> doesn't get stuck behind other workqueues that run transactions and
> run the log out of space before we try to reserve log space for the
> relogging....
> 

Ok, I had dropped those for now because it wasn't immediately clear if
they were needed. I'll take another look at this. Thanks for the
feedback...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Patch

diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c13bb3655e48..6c2a9cdadd03 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1063,6 +1063,7 @@  DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
 
 DECLARE_EVENT_CLASS(xfs_ail_class,
 	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3b208f9a865c..f2c06cdd1074 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -763,6 +763,35 @@  xfs_trans_del_item(
 	list_del_init(&lip->li_trans);
 }
 
+void
+xfs_trans_enable_relog(
+	struct xfs_log_item	*lip)
+{
+	set_bit(XFS_LI_RELOG, &lip->li_flags);
+}
+
+void
+xfs_trans_disable_relog(
+	struct xfs_log_item	*lip,
+	bool			drain) /* wait for relogging to cease */
+{
+	struct xfs_mount	*mp = lip->li_mountp;
+
+	clear_bit(XFS_LI_RELOG, &lip->li_flags);
+
+	if (!drain)
+		return;
+
+	/*
+	 * Some operations might require relog activity to cease before they can
+	 * proceed. For example, an operation must wait before including a
+	 * non-lockable log item (i.e. intent) in another transaction.
+	 */
+	while (wait_on_bit_timeout(&lip->li_flags, XFS_LI_RELOGGED,
+				   TASK_UNINTERRUPTIBLE, HZ))
+		xfs_log_force(mp, XFS_LOG_SYNC);
+}
+
 /* Detach and unlock all of the items in a transaction */
 static void
 xfs_trans_free_items(
@@ -848,6 +877,7 @@  xfs_trans_committed_bulk(
 
 		if (aborted)
 			set_bit(XFS_LI_ABORTED, &lip->li_flags);
+		clear_and_wake_up_bit(XFS_LI_RELOGGED, &lip->li_flags);
 
 		if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) {
 			lip->li_ops->iop_release(lip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 64d7f171ebd3..6d4311d82c4c 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -59,12 +59,16 @@  struct xfs_log_item {
 #define	XFS_LI_ABORTED	1
 #define	XFS_LI_FAILED	2
 #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
+#define	XFS_LI_RELOG	4	/* automatic relogging */
+#define	XFS_LI_RELOGGED	5	/* relogged by xfsaild */
 
 #define XFS_LI_FLAGS \
 	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
 	{ (1 << XFS_LI_ABORTED),	"ABORTED" }, \
 	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
-	{ (1 << XFS_LI_DIRTY),		"DIRTY" }
+	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
+	{ (1 << XFS_LI_RELOG),		"RELOG" }, \
+	{ (1 << XFS_LI_RELOGGED),	"RELOGGED" }
 
 struct xfs_item_ops {
 	unsigned flags;
@@ -95,6 +99,7 @@  void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
 #define XFS_ITEM_PINNED		1
 #define XFS_ITEM_LOCKED		2
 #define XFS_ITEM_FLUSHING	3
+#define XFS_ITEM_RELOG		4
 
 /*
  * Deferred operation item relogging limits.
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 00cc5b8734be..bb54d00ae095 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -143,6 +143,38 @@  xfs_ail_max_lsn(
 	return lsn;
 }
 
+/*
+ * Relog log items on the AIL relog queue.
+ */
+static void
+xfs_ail_relog(
+	struct work_struct	*work)
+{
+	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
+						     ail_relog_work);
+	struct xfs_mount	*mp = ailp->ail_mount;
+	struct xfs_trans	*tp;
+	struct xfs_log_item	*lip, *lipp;
+	int			error;
+
+	/* XXX: define a ->tr_relog reservation */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
+	if (error)
+		return;
+
+	spin_lock(&ailp->ail_lock);
+	list_for_each_entry_safe(lip, lipp, &ailp->ail_relog_list, li_trans) {
+		list_del_init(&lip->li_trans);
+		xfs_trans_add_item(tp, lip);
+		set_bit(XFS_LI_DIRTY, &lip->li_flags);
+		tp->t_flags |= XFS_TRANS_DIRTY;
+	}
+	spin_unlock(&ailp->ail_lock);
+
+	error = xfs_trans_commit(tp);
+	ASSERT(!error);
+}
+
 /*
  * The cursor keeps track of where our current traversal is up to by tracking
  * the next item in the list for us. However, for this to be safe, removing an
@@ -363,7 +395,7 @@  static long
 xfsaild_push(
 	struct xfs_ail		*ailp)
 {
-	xfs_mount_t		*mp = ailp->ail_mount;
+	struct xfs_mount	*mp = ailp->ail_mount;
 	struct xfs_ail_cursor	cur;
 	struct xfs_log_item	*lip;
 	xfs_lsn_t		lsn;
@@ -425,6 +457,13 @@  xfsaild_push(
 			ailp->ail_last_pushed_lsn = lsn;
 			break;
 
+		case XFS_ITEM_RELOG:
+			trace_xfs_ail_relog(lip);
+			ASSERT(list_empty(&lip->li_trans));
+			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
+			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
+			break;
+
 		case XFS_ITEM_FLUSHING:
 			/*
 			 * The item or its backing buffer is already being
@@ -491,6 +530,9 @@  xfsaild_push(
 	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
 		ailp->ail_log_flush++;
 
+	if (!list_empty(&ailp->ail_relog_list))
+		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
+
 	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
 out_done:
 		/*
@@ -834,15 +876,24 @@  xfs_trans_ail_init(
 	spin_lock_init(&ailp->ail_lock);
 	INIT_LIST_HEAD(&ailp->ail_buf_list);
 	init_waitqueue_head(&ailp->ail_empty);
+	INIT_LIST_HEAD(&ailp->ail_relog_list);
+	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
+
+	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
+					     mp->m_super->s_id);
+	if (!ailp->ail_relog_wq)
+		goto out_free_ailp;
 
 	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->ail_mount->m_super->s_id);
 	if (IS_ERR(ailp->ail_task))
-		goto out_free_ailp;
+		goto out_destroy_wq;
 
 	mp->m_ail = ailp;
 	return 0;
 
+out_destroy_wq:
+	destroy_workqueue(ailp->ail_relog_wq);
 out_free_ailp:
 	kmem_free(ailp);
 	return -ENOMEM;
@@ -855,5 +906,6 @@  xfs_trans_ail_destroy(
 	struct xfs_ail	*ailp = mp->m_ail;
 
 	kthread_stop(ailp->ail_task);
+	destroy_workqueue(ailp->ail_relog_wq);
 	kmem_free(ailp);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 2e073c1c4614..3cefc821350e 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -16,6 +16,8 @@  struct xfs_log_vec;
 void	xfs_trans_init(struct xfs_mount *);
 void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
 void	xfs_trans_del_item(struct xfs_log_item *);
+void	xfs_trans_enable_relog(struct xfs_log_item *);
+void	xfs_trans_disable_relog(struct xfs_log_item *, bool);
 void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
 
 void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
@@ -61,6 +63,9 @@  struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
+	struct work_struct	ail_relog_work;
+	struct list_head	ail_relog_list;
+	struct workqueue_struct	*ail_relog_wq;
 };
 
 /*