
[RFC,v5,5/9] xfs: automatic log item relog mechanism

Message ID 20200227134321.7238-6-bfoster@redhat.com (mailing list archive)
State Superseded, archived
Headers show
Series xfs: automatic relogging experiment | expand

Commit Message

Brian Foster Feb. 27, 2020, 1:43 p.m. UTC
Now that relog reservation is available and relog state tracking is
in place, all that remains to automatically relog items is the relog
mechanism itself. An item with relogging enabled is basically pinned
from writeback until relog is disabled. Instead of being written
back, the item is periodically committed in a new transaction to
move it forward in the physical log. The purpose of moving
the item is to avoid long term tail pinning and thus avoid log
deadlocks for long running operations.

The ideal time to relog an item is in response to tail pushing
pressure. This accommodates the current workload at any given time
as opposed to a fixed time interval or log reservation heuristic,
which risks performance regression. This is essentially the same
heuristic that drives metadata writeback. XFS already implements
various log tail pushing heuristics that attempt to keep the log
progressing on an active filesystem under various workloads.

The act of relogging an item simply requires adding it to a
transaction and committing it. This pushes the already dirty item into a
subsequent log checkpoint and frees up its previous location in the
on-disk log. Joining an item to a transaction of course requires
locking the item first, which means we have to be aware of
type-specific locks and lock ordering wherever the relog takes
place.
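
For example, the core of a relog boils down to a join and commit along
the lines of the sketch below (illustrative only; the helper name is
hypothetical and the item is assumed to already be locked and dirty per
its type-specific rules):

/* hypothetical sketch: relog one already locked, dirty item */
static int
xfs_relog_one_item(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip)
{
	xfs_trans_add_item(tp, lip);
	set_bit(XFS_LI_DIRTY, &lip->li_flags);
	tp->t_flags |= XFS_TRANS_DIRTY;
	return xfs_trans_commit(tp);
}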

Fundamentally, this points to xfsaild as the ideal location to
process relog enabled items. xfsaild already processes log resident
items, is driven by log tail pushing pressure, processes arbitrary
log item types through callbacks, and is sensitive to type-specific
locking rules by design. The fact that automatic relogging
essentially diverts items to either writeback or relog also suggests
xfsaild as an ideal location to process items one way or the other.

Of course, we don't want xfsaild to process transactions as it is a
critical component of the log subsystem for driving metadata
writeback and freeing up log space. Therefore, similar to how
xfsaild builds up a writeback queue of dirty items and queues writes
asynchronously, make xfsaild responsible only for directing pending
relog items into an appropriate queue and create an async
(workqueue) context for processing the queue. The workqueue context
utilizes the pre-reserved relog ticket to drain the queue by rolling
a permanent transaction.

Update the AIL pushing infrastructure to support a new RELOG item
state. If a log item push returns the relog state, queue the item
for relog instead of writeback. On completion of a push cycle,
schedule the relog task at the same point metadata buffer I/O is
submitted. This allows items to be relogged automatically under the
same locking rules and pressure heuristics that govern metadata
writeback.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trace.h      |   1 +
 fs/xfs/xfs_trans.h      |   1 +
 fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trans_priv.h |   3 ++
 4 files changed, 106 insertions(+), 2 deletions(-)

Comments

Allison Henderson Feb. 27, 2020, 10:54 p.m. UTC | #1
On 2/27/20 6:43 AM, Brian Foster wrote:
> Now that relog reservation is available and relog state tracking is
> in place, all that remains to automatically relog items is the relog
> mechanism itself. An item with relogging enabled is basically pinned
> from writeback until relog is disabled. Instead of being written
> back, the item must instead be periodically committed in a new
> transaction to move it in the physical log. The purpose of moving
> the item is to avoid long term tail pinning and thus avoid log
> deadlocks for long running operations.
> 
> The ideal time to relog an item is in response to tail pushing
> pressure. This accommodates the current workload at any given time
> as opposed to a fixed time interval or log reservation heuristic,
> which risks performance regression. This is essentially the same
> heuristic that drives metadata writeback. XFS already implements
> various log tail pushing heuristics that attempt to keep the log
> progressing on an active fileystem under various workloads.
> 
> The act of relogging an item simply requires to add it to a
> transaction and commit. This pushes the already dirty item into a
> subsequent log checkpoint and frees up its previous location in the
> on-disk log. Joining an item to a transaction of course requires
> locking the item first, which means we have to be aware of
> type-specific locks and lock ordering wherever the relog takes
> place.
> 
> Fundamentally, this points to xfsaild as the ideal location to
> process relog enabled items. xfsaild already processes log resident
> items, is driven by log tail pushing pressure, processes arbitrary
> log item types through callbacks, and is sensitive to type-specific
> locking rules by design. The fact that automatic relogging
> essentially diverts items between writeback or relog also suggests
> xfsaild as an ideal location to process items one way or the other.
> 
> Of course, we don't want xfsaild to process transactions as it is a
> critical component of the log subsystem for driving metadata
> writeback and freeing up log space. Therefore, similar to how
> xfsaild builds up a writeback queue of dirty items and queues writes
> asynchronously, make xfsaild responsible only for directing pending
> relog items into an appropriate queue and create an async
> (workqueue) context for processing the queue. The workqueue context
> utilizes the pre-reserved relog ticket to drain the queue by rolling
> a permanent transaction.
> 
> Update the AIL pushing infrastructure to support a new RELOG item
> state. If a log item push returns the relog state, queue the item
> for relog instead of writeback. On completion of a push cycle,
> schedule the relog task at the same point metadata buffer I/O is
> submitted. This allows items to be relogged automatically under the
> same locking rules and pressure heuristics that govern metadata
> writeback.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
Ok, I followed it through, and didn't notice any logic errors.

Reviewed-by: Allison Collins <allison.henderson@oracle.com>
> ---
>   fs/xfs/xfs_trace.h      |   1 +
>   fs/xfs/xfs_trans.h      |   1 +
>   fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
>   fs/xfs/xfs_trans_priv.h |   3 ++
>   4 files changed, 106 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a066617ec54d..df0114ec66f1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
>   DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
>   DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
>   
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index fc4c25b6eee4..1637df32c64c 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
>   #define XFS_ITEM_PINNED		1
>   #define XFS_ITEM_LOCKED		2
>   #define XFS_ITEM_FLUSHING	3
> +#define XFS_ITEM_RELOG		4
>   
>   /*
>    * Deferred operation item relogging limits.
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index a3fb64275baa..71a47faeaae8 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
>   	return lsn;
>   }
>   
> +/*
> + * Relog log items on the AIL relog queue.
> + */
> +static void
> +xfs_ail_relog(
> +	struct work_struct	*work)
> +{
> +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> +						     ail_relog_work);
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +	struct xfs_trans_res	tres = {};
> +	struct xfs_trans	*tp;
> +	struct xfs_log_item	*lip;
> +	int			error;
> +
> +	/*
> +	 * The first transaction to submit a relog item contributed relog
> +	 * reservation to the relog ticket before committing. Create an empty
> +	 * transaction and manually associate the relog ticket.
> +	 */
> +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
> +	ASSERT(!error);
> +	if (error)
> +		return;
> +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
> +
> +	spin_lock(&ailp->ail_lock);
> +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> +					       struct xfs_log_item,
> +					       li_trans)) != NULL) {
> +		/*
> +		 * Drop the AIL processing ticket reference once the relog list
> +		 * is emptied. At this point it's possible for our transaction
> +		 * to hold the only reference.
> +		 */
> +		list_del_init(&lip->li_trans);
> +		if (list_empty(&ailp->ail_relog_list))
> +			xfs_log_ticket_put(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		tp->t_flags |= XFS_TRANS_DIRTY;
> +		/* XXX: include ticket owner task fix */
> +		error = xfs_trans_roll(&tp);
> +		ASSERT(!error);
> +		if (error)
> +			goto out;
> +		spin_lock(&ailp->ail_lock);
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +
> +out:
> +	/* XXX: handle shutdown scenario */
> +	/*
> +	 * Drop the relog reference owned by the transaction separately because
> +	 * we don't want the cancel to release reservation if this isn't the
> +	 * final reference. The relog ticket and associated reservation needs
> +	 * to persist so long as relog items are active in the log subsystem.
> +	 */
> +	xfs_trans_ail_relog_put(mp);
> +
> +	tp->t_ticket = NULL;
> +	xfs_trans_cancel(tp);
> +}
> +
>   /*
>    * The cursor keeps track of where our current traversal is up to by tracking
>    * the next item in the list for us. However, for this to be safe, removing an
> @@ -364,7 +433,7 @@ static long
>   xfsaild_push(
>   	struct xfs_ail		*ailp)
>   {
> -	xfs_mount_t		*mp = ailp->ail_mount;
> +	struct xfs_mount	*mp = ailp->ail_mount;
>   	struct xfs_ail_cursor	cur;
>   	struct xfs_log_item	*lip;
>   	xfs_lsn_t		lsn;
> @@ -426,6 +495,23 @@ xfsaild_push(
>   			ailp->ail_last_pushed_lsn = lsn;
>   			break;
>   
> +		case XFS_ITEM_RELOG:
> +			/*
> +			 * The item requires a relog. Add to the pending relog
> +			 * list and set the relogged bit to prevent further
> +			 * relog requests. The relog bit and ticket reference
> +			 * can be dropped from the item at any point, so hold a
> +			 * relog ticket reference for the pending relog list to
> +			 * ensure the ticket stays around.
> +			 */
> +			trace_xfs_ail_relog(lip);
> +			ASSERT(list_empty(&lip->li_trans));
> +			if (list_empty(&ailp->ail_relog_list))
> +				xfs_log_ticket_get(ailp->ail_relog_tic);
> +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> +			break;
> +
>   		case XFS_ITEM_FLUSHING:
>   			/*
>   			 * The item or its backing buffer is already being
> @@ -492,6 +578,9 @@ xfsaild_push(
>   	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
>   		ailp->ail_log_flush++;
>   
> +	if (!list_empty(&ailp->ail_relog_list))
> +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> +
>   	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
>   out_done:
>   		/*
> @@ -922,15 +1011,24 @@ xfs_trans_ail_init(
>   	spin_lock_init(&ailp->ail_lock);
>   	INIT_LIST_HEAD(&ailp->ail_buf_list);
>   	init_waitqueue_head(&ailp->ail_empty);
> +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> +
> +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> +					     mp->m_super->s_id);
> +	if (!ailp->ail_relog_wq)
> +		goto out_free_ailp;
>   
>   	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
>   			ailp->ail_mount->m_super->s_id);
>   	if (IS_ERR(ailp->ail_task))
> -		goto out_free_ailp;
> +		goto out_destroy_wq;
>   
>   	mp->m_ail = ailp;
>   	return 0;
>   
> +out_destroy_wq:
> +	destroy_workqueue(ailp->ail_relog_wq);
>   out_free_ailp:
>   	kmem_free(ailp);
>   	return -ENOMEM;
> @@ -944,5 +1042,6 @@ xfs_trans_ail_destroy(
>   
>   	ASSERT(ailp->ail_relog_tic == NULL);
>   	kthread_stop(ailp->ail_task);
> +	destroy_workqueue(ailp->ail_relog_wq);
>   	kmem_free(ailp);
>   }
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index d1edec1cb8ad..33a724534869 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -63,6 +63,9 @@ struct xfs_ail {
>   	int			ail_log_flush;
>   	struct list_head	ail_buf_list;
>   	wait_queue_head_t	ail_empty;
> +	struct work_struct	ail_relog_work;
> +	struct list_head	ail_relog_list;
> +	struct workqueue_struct	*ail_relog_wq;
>   	struct xlog_ticket	*ail_relog_tic;
>   };
>   
>
Darrick J. Wong Feb. 28, 2020, 12:13 a.m. UTC | #2
On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> Now that relog reservation is available and relog state tracking is
> in place, all that remains to automatically relog items is the relog
> mechanism itself. An item with relogging enabled is basically pinned
> from writeback until relog is disabled. Instead of being written
> back, the item must instead be periodically committed in a new
> transaction to move it in the physical log. The purpose of moving
> the item is to avoid long term tail pinning and thus avoid log
> deadlocks for long running operations.
> 
> The ideal time to relog an item is in response to tail pushing
> pressure. This accommodates the current workload at any given time
> as opposed to a fixed time interval or log reservation heuristic,
> which risks performance regression. This is essentially the same
> heuristic that drives metadata writeback. XFS already implements
> various log tail pushing heuristics that attempt to keep the log
> progressing on an active fileystem under various workloads.
> 
> The act of relogging an item simply requires to add it to a
> transaction and commit. This pushes the already dirty item into a
> subsequent log checkpoint and frees up its previous location in the
> on-disk log. Joining an item to a transaction of course requires
> locking the item first, which means we have to be aware of
> type-specific locks and lock ordering wherever the relog takes
> place.
> 
> Fundamentally, this points to xfsaild as the ideal location to
> process relog enabled items. xfsaild already processes log resident
> items, is driven by log tail pushing pressure, processes arbitrary
> log item types through callbacks, and is sensitive to type-specific
> locking rules by design. The fact that automatic relogging
> essentially diverts items between writeback or relog also suggests
> xfsaild as an ideal location to process items one way or the other.
> 
> Of course, we don't want xfsaild to process transactions as it is a
> critical component of the log subsystem for driving metadata
> writeback and freeing up log space. Therefore, similar to how
> xfsaild builds up a writeback queue of dirty items and queues writes
> asynchronously, make xfsaild responsible only for directing pending
> relog items into an appropriate queue and create an async
> (workqueue) context for processing the queue. The workqueue context
> utilizes the pre-reserved relog ticket to drain the queue by rolling
> a permanent transaction.

Aha!  I bet that's that workqueue I was musing about earlier.

> Update the AIL pushing infrastructure to support a new RELOG item
> state. If a log item push returns the relog state, queue the item
> for relog instead of writeback. On completion of a push cycle,
> schedule the relog task at the same point metadata buffer I/O is
> submitted. This allows items to be relogged automatically under the
> same locking rules and pressure heuristics that govern metadata
> writeback.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_trace.h      |   1 +
>  fs/xfs/xfs_trans.h      |   1 +
>  fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_trans_priv.h |   3 ++
>  4 files changed, 106 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a066617ec54d..df0114ec66f1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
>  DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
>  DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index fc4c25b6eee4..1637df32c64c 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
>  #define XFS_ITEM_PINNED		1
>  #define XFS_ITEM_LOCKED		2
>  #define XFS_ITEM_FLUSHING	3
> +#define XFS_ITEM_RELOG		4
>  
>  /*
>   * Deferred operation item relogging limits.
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index a3fb64275baa..71a47faeaae8 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
>  	return lsn;
>  }
>  
> +/*
> + * Relog log items on the AIL relog queue.
> + */
> +static void
> +xfs_ail_relog(
> +	struct work_struct	*work)
> +{
> +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> +						     ail_relog_work);
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +	struct xfs_trans_res	tres = {};
> +	struct xfs_trans	*tp;
> +	struct xfs_log_item	*lip;
> +	int			error;
> +
> +	/*
> +	 * The first transaction to submit a relog item contributed relog
> +	 * reservation to the relog ticket before committing. Create an empty
> +	 * transaction and manually associate the relog ticket.
> +	 */
> +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);

Ah, and I see that the work item actually does create its own
transaction to relog the items...

> +	ASSERT(!error);
> +	if (error)
> +		return;
> +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
> +
> +	spin_lock(&ailp->ail_lock);
> +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> +					       struct xfs_log_item,
> +					       li_trans)) != NULL) {

...but this part really cranks up my curiosity about what happens when
there are more items to relog than there is actual reservation in this
transaction?  I think most transaction types reserve enough space that
we could attach hundreds of relogged intent items.

> +		/*
> +		 * Drop the AIL processing ticket reference once the relog list
> +		 * is emptied. At this point it's possible for our transaction
> +		 * to hold the only reference.
> +		 */
> +		list_del_init(&lip->li_trans);
> +		if (list_empty(&ailp->ail_relog_list))
> +			xfs_log_ticket_put(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		tp->t_flags |= XFS_TRANS_DIRTY;
> +		/* XXX: include ticket owner task fix */

XXX?

--D

> +		error = xfs_trans_roll(&tp);
> +		ASSERT(!error);
> +		if (error)
> +			goto out;
> +		spin_lock(&ailp->ail_lock);
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +
> +out:
> +	/* XXX: handle shutdown scenario */
> +	/*
> +	 * Drop the relog reference owned by the transaction separately because
> +	 * we don't want the cancel to release reservation if this isn't the
> +	 * final reference. The relog ticket and associated reservation needs
> +	 * to persist so long as relog items are active in the log subsystem.
> +	 */
> +	xfs_trans_ail_relog_put(mp);
> +
> +	tp->t_ticket = NULL;
> +	xfs_trans_cancel(tp);
> +}
> +
>  /*
>   * The cursor keeps track of where our current traversal is up to by tracking
>   * the next item in the list for us. However, for this to be safe, removing an
> @@ -364,7 +433,7 @@ static long
>  xfsaild_push(
>  	struct xfs_ail		*ailp)
>  {
> -	xfs_mount_t		*mp = ailp->ail_mount;
> +	struct xfs_mount	*mp = ailp->ail_mount;
>  	struct xfs_ail_cursor	cur;
>  	struct xfs_log_item	*lip;
>  	xfs_lsn_t		lsn;
> @@ -426,6 +495,23 @@ xfsaild_push(
>  			ailp->ail_last_pushed_lsn = lsn;
>  			break;
>  
> +		case XFS_ITEM_RELOG:
> +			/*
> +			 * The item requires a relog. Add to the pending relog
> +			 * list and set the relogged bit to prevent further
> +			 * relog requests. The relog bit and ticket reference
> +			 * can be dropped from the item at any point, so hold a
> +			 * relog ticket reference for the pending relog list to
> +			 * ensure the ticket stays around.
> +			 */
> +			trace_xfs_ail_relog(lip);
> +			ASSERT(list_empty(&lip->li_trans));
> +			if (list_empty(&ailp->ail_relog_list))
> +				xfs_log_ticket_get(ailp->ail_relog_tic);
> +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> +			break;
> +
>  		case XFS_ITEM_FLUSHING:
>  			/*
>  			 * The item or its backing buffer is already being
> @@ -492,6 +578,9 @@ xfsaild_push(
>  	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
>  		ailp->ail_log_flush++;
>  
> +	if (!list_empty(&ailp->ail_relog_list))
> +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> +
>  	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
>  out_done:
>  		/*
> @@ -922,15 +1011,24 @@ xfs_trans_ail_init(
>  	spin_lock_init(&ailp->ail_lock);
>  	INIT_LIST_HEAD(&ailp->ail_buf_list);
>  	init_waitqueue_head(&ailp->ail_empty);
> +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> +
> +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> +					     mp->m_super->s_id);
> +	if (!ailp->ail_relog_wq)
> +		goto out_free_ailp;
>  
>  	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
>  			ailp->ail_mount->m_super->s_id);
>  	if (IS_ERR(ailp->ail_task))
> -		goto out_free_ailp;
> +		goto out_destroy_wq;
>  
>  	mp->m_ail = ailp;
>  	return 0;
>  
> +out_destroy_wq:
> +	destroy_workqueue(ailp->ail_relog_wq);
>  out_free_ailp:
>  	kmem_free(ailp);
>  	return -ENOMEM;
> @@ -944,5 +1042,6 @@ xfs_trans_ail_destroy(
>  
>  	ASSERT(ailp->ail_relog_tic == NULL);
>  	kthread_stop(ailp->ail_task);
> +	destroy_workqueue(ailp->ail_relog_wq);
>  	kmem_free(ailp);
>  }
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index d1edec1cb8ad..33a724534869 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -63,6 +63,9 @@ struct xfs_ail {
>  	int			ail_log_flush;
>  	struct list_head	ail_buf_list;
>  	wait_queue_head_t	ail_empty;
> +	struct work_struct	ail_relog_work;
> +	struct list_head	ail_relog_list;
> +	struct workqueue_struct	*ail_relog_wq;
>  	struct xlog_ticket	*ail_relog_tic;
>  };
>  
> -- 
> 2.21.1
>
Brian Foster Feb. 28, 2020, 2:02 p.m. UTC | #3
On Thu, Feb 27, 2020 at 04:13:45PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> > Now that relog reservation is available and relog state tracking is
> > in place, all that remains to automatically relog items is the relog
> > mechanism itself. An item with relogging enabled is basically pinned
> > from writeback until relog is disabled. Instead of being written
> > back, the item must instead be periodically committed in a new
> > transaction to move it in the physical log. The purpose of moving
> > the item is to avoid long term tail pinning and thus avoid log
> > deadlocks for long running operations.
> > 
> > The ideal time to relog an item is in response to tail pushing
> > pressure. This accommodates the current workload at any given time
> > as opposed to a fixed time interval or log reservation heuristic,
> > which risks performance regression. This is essentially the same
> > heuristic that drives metadata writeback. XFS already implements
> > various log tail pushing heuristics that attempt to keep the log
> > progressing on an active fileystem under various workloads.
> > 
> > The act of relogging an item simply requires to add it to a
> > transaction and commit. This pushes the already dirty item into a
> > subsequent log checkpoint and frees up its previous location in the
> > on-disk log. Joining an item to a transaction of course requires
> > locking the item first, which means we have to be aware of
> > type-specific locks and lock ordering wherever the relog takes
> > place.
> > 
> > Fundamentally, this points to xfsaild as the ideal location to
> > process relog enabled items. xfsaild already processes log resident
> > items, is driven by log tail pushing pressure, processes arbitrary
> > log item types through callbacks, and is sensitive to type-specific
> > locking rules by design. The fact that automatic relogging
> > essentially diverts items between writeback or relog also suggests
> > xfsaild as an ideal location to process items one way or the other.
> > 
> > Of course, we don't want xfsaild to process transactions as it is a
> > critical component of the log subsystem for driving metadata
> > writeback and freeing up log space. Therefore, similar to how
> > xfsaild builds up a writeback queue of dirty items and queues writes
> > asynchronously, make xfsaild responsible only for directing pending
> > relog items into an appropriate queue and create an async
> > (workqueue) context for processing the queue. The workqueue context
> > utilizes the pre-reserved relog ticket to drain the queue by rolling
> > a permanent transaction.
> 
> Aha!  I bet that's that workqueue I was musing about earlier.
> 

:)

> > Update the AIL pushing infrastructure to support a new RELOG item
> > state. If a log item push returns the relog state, queue the item
> > for relog instead of writeback. On completion of a push cycle,
> > schedule the relog task at the same point metadata buffer I/O is
> > submitted. This allows items to be relogged automatically under the
> > same locking rules and pressure heuristics that govern metadata
> > writeback.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_trace.h      |   1 +
> >  fs/xfs/xfs_trans.h      |   1 +
> >  fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trans_priv.h |   3 ++
> >  4 files changed, 106 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a066617ec54d..df0114ec66f1 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> > +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
> >  DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
> >  DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
> >  
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index fc4c25b6eee4..1637df32c64c 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> >  #define XFS_ITEM_PINNED		1
> >  #define XFS_ITEM_LOCKED		2
> >  #define XFS_ITEM_FLUSHING	3
> > +#define XFS_ITEM_RELOG		4
> >  
> >  /*
> >   * Deferred operation item relogging limits.
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index a3fb64275baa..71a47faeaae8 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
> >  	return lsn;
> >  }
> >  
> > +/*
> > + * Relog log items on the AIL relog queue.
> > + */
> > +static void
> > +xfs_ail_relog(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> > +						     ail_relog_work);
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> > +	struct xfs_trans_res	tres = {};
> > +	struct xfs_trans	*tp;
> > +	struct xfs_log_item	*lip;
> > +	int			error;
> > +
> > +	/*
> > +	 * The first transaction to submit a relog item contributed relog
> > +	 * reservation to the relog ticket before committing. Create an empty
> > +	 * transaction and manually associate the relog ticket.
> > +	 */
> > +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
> 
> Ah, and I see that the work item actually does create its own
> transaction to relog the items...
> 

Yeah, bit of a hack to allocate a shell transaction and attach the relog
ticket managed by the earlier patch, but it ended up cleaner than having
the transaction allocated and passed into this context.

> > +	ASSERT(!error);
> > +	if (error)
> > +		return;
> > +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> > +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> > +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> > +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> > +					       struct xfs_log_item,
> > +					       li_trans)) != NULL) {
> 
> ...but this part really cranks up my curiosity about what happens when
> there are more items to relog than there is actual reservation in this
> transaction?  I think most transactions types reserve enough space that
> we could attach hundreds of relogged intent items.
> 

See my earlier comment around batching. Right now we only relog one item
at a time and the relog reservation is intended to cover the largest
possible reloggable item in the fs. This needs to increase to support
some kind of batching here, but I think the prospective reloggable items
right now (i.e. 2 or 3 different intent types) allow a fixed reservation
size to work well enough for our needs.
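
Purely as an illustration of what I mean by a fixed calculation (the
helper name is made up and this glosses over log op headers and the
like), the reservation could just be sized to the largest reloggable
intent format:

/* hypothetical sketch, not part of the series */
unsigned int
xfs_relog_logres_size(void)
{
	unsigned int	res;

	res = sizeof(struct xfs_efi_log_format);
	res = max_t(unsigned int, res, sizeof(struct xfs_rui_log_format));
	res = max_t(unsigned int, res, sizeof(struct xfs_cui_log_format));
	res = max_t(unsigned int, res, sizeof(struct xfs_bui_log_format));
	return res;
}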

Note that I think there's a whole separate ball of complexity we could
delve into if we wanted to support something like arbitrary, per-item
(set) relog tickets with different reservation values as opposed to one
global, fixed size ticket. That would require some association between
log items and tickets and perhaps other items covered by the same
ticket, etc., but would provide a much more generic mechanism. As it is,
I think that's hugely overkill for the current use cases, but maybe we
find a reason to evolve this into something like that down the road..

> > +		/*
> > +		 * Drop the AIL processing ticket reference once the relog list
> > +		 * is emptied. At this point it's possible for our transaction
> > +		 * to hold the only reference.
> > +		 */
> > +		list_del_init(&lip->li_trans);
> > +		if (list_empty(&ailp->ail_relog_list))
> > +			xfs_log_ticket_put(ailp->ail_relog_tic);
> > +		spin_unlock(&ailp->ail_lock);
> > +
> > +		xfs_trans_add_item(tp, lip);
> > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > +		/* XXX: include ticket owner task fix */
> 
> XXX?
> 

Oops, this refers to patch 1. I added the patch but forgot to remove the
comment, so this can go away..

Brian

> --D
> 
> > +		error = xfs_trans_roll(&tp);
> > +		ASSERT(!error);
> > +		if (error)
> > +			goto out;
> > +		spin_lock(&ailp->ail_lock);
> > +	}
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +out:
> > +	/* XXX: handle shutdown scenario */
> > +	/*
> > +	 * Drop the relog reference owned by the transaction separately because
> > +	 * we don't want the cancel to release reservation if this isn't the
> > +	 * final reference. The relog ticket and associated reservation needs
> > +	 * to persist so long as relog items are active in the log subsystem.
> > +	 */
> > +	xfs_trans_ail_relog_put(mp);
> > +
> > +	tp->t_ticket = NULL;
> > +	xfs_trans_cancel(tp);
> > +}
> > +
> >  /*
> >   * The cursor keeps track of where our current traversal is up to by tracking
> >   * the next item in the list for us. However, for this to be safe, removing an
> > @@ -364,7 +433,7 @@ static long
> >  xfsaild_push(
> >  	struct xfs_ail		*ailp)
> >  {
> > -	xfs_mount_t		*mp = ailp->ail_mount;
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> >  	struct xfs_ail_cursor	cur;
> >  	struct xfs_log_item	*lip;
> >  	xfs_lsn_t		lsn;
> > @@ -426,6 +495,23 @@ xfsaild_push(
> >  			ailp->ail_last_pushed_lsn = lsn;
> >  			break;
> >  
> > +		case XFS_ITEM_RELOG:
> > +			/*
> > +			 * The item requires a relog. Add to the pending relog
> > +			 * list and set the relogged bit to prevent further
> > +			 * relog requests. The relog bit and ticket reference
> > +			 * can be dropped from the item at any point, so hold a
> > +			 * relog ticket reference for the pending relog list to
> > +			 * ensure the ticket stays around.
> > +			 */
> > +			trace_xfs_ail_relog(lip);
> > +			ASSERT(list_empty(&lip->li_trans));
> > +			if (list_empty(&ailp->ail_relog_list))
> > +				xfs_log_ticket_get(ailp->ail_relog_tic);
> > +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> > +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> > +			break;
> > +
> >  		case XFS_ITEM_FLUSHING:
> >  			/*
> >  			 * The item or its backing buffer is already being
> > @@ -492,6 +578,9 @@ xfsaild_push(
> >  	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> >  		ailp->ail_log_flush++;
> >  
> > +	if (!list_empty(&ailp->ail_relog_list))
> > +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> > +
> >  	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
> >  out_done:
> >  		/*
> > @@ -922,15 +1011,24 @@ xfs_trans_ail_init(
> >  	spin_lock_init(&ailp->ail_lock);
> >  	INIT_LIST_HEAD(&ailp->ail_buf_list);
> >  	init_waitqueue_head(&ailp->ail_empty);
> > +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> > +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> > +
> > +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> > +					     mp->m_super->s_id);
> > +	if (!ailp->ail_relog_wq)
> > +		goto out_free_ailp;
> >  
> >  	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
> >  			ailp->ail_mount->m_super->s_id);
> >  	if (IS_ERR(ailp->ail_task))
> > -		goto out_free_ailp;
> > +		goto out_destroy_wq;
> >  
> >  	mp->m_ail = ailp;
> >  	return 0;
> >  
> > +out_destroy_wq:
> > +	destroy_workqueue(ailp->ail_relog_wq);
> >  out_free_ailp:
> >  	kmem_free(ailp);
> >  	return -ENOMEM;
> > @@ -944,5 +1042,6 @@ xfs_trans_ail_destroy(
> >  
> >  	ASSERT(ailp->ail_relog_tic == NULL);
> >  	kthread_stop(ailp->ail_task);
> > +	destroy_workqueue(ailp->ail_relog_wq);
> >  	kmem_free(ailp);
> >  }
> > diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> > index d1edec1cb8ad..33a724534869 100644
> > --- a/fs/xfs/xfs_trans_priv.h
> > +++ b/fs/xfs/xfs_trans_priv.h
> > @@ -63,6 +63,9 @@ struct xfs_ail {
> >  	int			ail_log_flush;
> >  	struct list_head	ail_buf_list;
> >  	wait_queue_head_t	ail_empty;
> > +	struct work_struct	ail_relog_work;
> > +	struct list_head	ail_relog_list;
> > +	struct workqueue_struct	*ail_relog_wq;
> >  	struct xlog_ticket	*ail_relog_tic;
> >  };
> >  
> > -- 
> > 2.21.1
> > 
>
Dave Chinner March 2, 2020, 7:18 a.m. UTC | #4
On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> Now that relog reservation is available and relog state tracking is
> in place, all that remains to automatically relog items is the relog
> mechanism itself. An item with relogging enabled is basically pinned
> from writeback until relog is disabled. Instead of being written
> back, the item must instead be periodically committed in a new
> transaction to move it in the physical log. The purpose of moving
> the item is to avoid long term tail pinning and thus avoid log
> deadlocks for long running operations.
> 
> The ideal time to relog an item is in response to tail pushing
> pressure. This accommodates the current workload at any given time
> as opposed to a fixed time interval or log reservation heuristic,
> which risks performance regression. This is essentially the same
> heuristic that drives metadata writeback. XFS already implements
> various log tail pushing heuristics that attempt to keep the log
> progressing on an active fileystem under various workloads.
> 
> The act of relogging an item simply requires to add it to a
> transaction and commit. This pushes the already dirty item into a
> subsequent log checkpoint and frees up its previous location in the
> on-disk log. Joining an item to a transaction of course requires
> locking the item first, which means we have to be aware of
> type-specific locks and lock ordering wherever the relog takes
> place.
> 
> Fundamentally, this points to xfsaild as the ideal location to
> process relog enabled items. xfsaild already processes log resident
> items, is driven by log tail pushing pressure, processes arbitrary
> log item types through callbacks, and is sensitive to type-specific
> locking rules by design. The fact that automatic relogging
> essentially diverts items between writeback or relog also suggests
> xfsaild as an ideal location to process items one way or the other.
> 
> Of course, we don't want xfsaild to process transactions as it is a
> critical component of the log subsystem for driving metadata
> writeback and freeing up log space. Therefore, similar to how
> xfsaild builds up a writeback queue of dirty items and queues writes
> asynchronously, make xfsaild responsible only for directing pending
> relog items into an appropriate queue and create an async
> (workqueue) context for processing the queue. The workqueue context
> utilizes the pre-reserved relog ticket to drain the queue by rolling
> a permanent transaction.
> 
> Update the AIL pushing infrastructure to support a new RELOG item
> state. If a log item push returns the relog state, queue the item
> for relog instead of writeback. On completion of a push cycle,
> schedule the relog task at the same point metadata buffer I/O is
> submitted. This allows items to be relogged automatically under the
> same locking rules and pressure heuristics that govern metadata
> writeback.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_trace.h      |   1 +
>  fs/xfs/xfs_trans.h      |   1 +
>  fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_trans_priv.h |   3 ++
>  4 files changed, 106 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a066617ec54d..df0114ec66f1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
>  DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
>  DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index fc4c25b6eee4..1637df32c64c 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
>  #define XFS_ITEM_PINNED		1
>  #define XFS_ITEM_LOCKED		2
>  #define XFS_ITEM_FLUSHING	3
> +#define XFS_ITEM_RELOG		4
>  
>  /*
>   * Deferred operation item relogging limits.
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index a3fb64275baa..71a47faeaae8 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
>  	return lsn;
>  }
>  
> +/*
> + * Relog log items on the AIL relog queue.
> + */
> +static void
> +xfs_ail_relog(
> +	struct work_struct	*work)
> +{
> +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> +						     ail_relog_work);
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +	struct xfs_trans_res	tres = {};
> +	struct xfs_trans	*tp;
> +	struct xfs_log_item	*lip;
> +	int			error;
> +
> +	/*
> +	 * The first transaction to submit a relog item contributed relog
> +	 * reservation to the relog ticket before committing. Create an empty
> +	 * transaction and manually associate the relog ticket.
> +	 */
> +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);

I suspect deadlocks on filesystems in the process of being frozen
when the log is full and relogging is required to make progress.

> +	ASSERT(!error);
> +	if (error)
> +		return;
> +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);

So this assumes you've stolen the log reservation for this ticket
from somewhere else, because otherwise the transaction log
reservation and the ticket don't match.

FWIW, I'm having real trouble keeping all the ail relog ticket
references straight. The code seems to be arbitrarily taking and
dropping references to that ticket, and I can't see a pattern or set
of rules for usage.

Why does this specific transaction need a reference to the ticket,
when the ail_relog_list has a reference, every item that has been
marked as XFS_LI_RELOG already has a reference, etc?

> +	spin_lock(&ailp->ail_lock);
> +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> +					       struct xfs_log_item,
> +					       li_trans)) != NULL) {

I dislike the way the "swiss army knife" list macro makes this
code unreadable.

        while (!list_empty(&ailp->ail_relog_list)) {
		lip = list_first_entry(...)

is much neater and easier to read.

> +		/*
> +		 * Drop the AIL processing ticket reference once the relog list
> +		 * is emptied. At this point it's possible for our transaction
> +		 * to hold the only reference.
> +		 */
> +		list_del_init(&lip->li_trans);
> +		if (list_empty(&ailp->ail_relog_list))
> +			xfs_log_ticket_put(ailp->ail_relog_tic);

This seems completely arbitrary.

> +		spin_unlock(&ailp->ail_lock);
> +
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		tp->t_flags |= XFS_TRANS_DIRTY;
> +		/* XXX: include ticket owner task fix */
> +		error = xfs_trans_roll(&tp);

So the reservation for this ticket is going to be regranted over and
over again and space reserved repeatedly as we work through the
locked relog items one at a time?

Unfortunately, this violates the rule that prevents rolling
transactions from deadlocking. That is, any object that is held
locked across the transaction commit and regrant that *might pin the
tail of the log* must be relogged in the transaction to move the
item forward and prevent it from pinning the tail of the log.

IOWs, if one of the items later in the relog list pins the tail of
the log we will end up sleeping here:

  xfs_trans_roll()
    xfs_trans_reserve
      xfs_log_regrant
        xlog_grant_head_check(need_bytes)
	  xlog_grant_head_wait()

waiting for the write grant head to move. And it never will, because
we hold the lock on that item so the AIL can't push it out.
IOWs, using a rolling transaction per relog item will not work for
processing multiple relog items.

If it's a single transaction, and we join all the locked items
for relogging into it in a single transaction commit, then we are
fine - we don't try to regrant log space while holding locked items
that could pin the tail of the log.

We *can* use a rolling transaction if we do this - the AIL has a
permanent transaction (plus ticket!) allocated at mount time with
a log count of zero; we can then steal reserve/write grant head
space into that ticket at CIL commit time as I mentioned
previously. We do a loop like above, but it's basically:

{
	LIST_HEAD(tmp);

	spin_lock(&ailp->ail_lock);
        if (list_empty(&ailp->ail_relog_list)) {
		spin_unlock(&ailp->ail_lock);
		return;
	}

	list_splice_init(&ailp->ail_relog_list, &tmp);
	spin_unlock(&ailp->ail_lock);

	xfs_ail_relog_items(ailp, &tmp);
}

This allows the AIL to keep working and building up a new relog list
as it goes along. Hence we can work on our list without interruption
or needing to repeatedly take the AIL lock just to get items from the
list.

And xfs_ail_relog_items() does something like this:

{
	struct xfs_trans	*tp;

	/*
	 * Make CIL committers trying to change relog status of log
	 * items wait for us to stabilise the relog transaction
	 * again by committing the current relog list and rolling
	 * the transaction.
	 */
	down_write(&ail->ail_relog_lock);
	tp = ail->relog_trans;

	while (!list_empty(&ailp->ail_relog_list)) {
		lip = list_first_entry(&ailp->ail_relog_list,
					struct xfs_log_item, li_trans);
		list_del_init(&lip->li_trans);

		xfs_trans_add_item(tp, lip);
		set_bit(XFS_LI_DIRTY, &lip->li_flags);
		tp->t_flags |= XFS_TRANS_DIRTY;
	}

	error = xfs_trans_roll(&tp);
	if (error) {
		SHUTDOWN!
	}
	ail->relog_trans = tp;
	up_write(&ail->ail_relog_lock);
}

> @@ -426,6 +495,23 @@ xfsaild_push(
>  			ailp->ail_last_pushed_lsn = lsn;
>  			break;
>  
> +		case XFS_ITEM_RELOG:
> +			/*
> +			 * The item requires a relog. Add to the pending relog
> +			 * list and set the relogged bit to prevent further
> +			 * relog requests. The relog bit and ticket reference
> +			 * can be dropped from the item at any point, so hold a
> +			 * relog ticket reference for the pending relog list to
> +			 * ensure the ticket stays around.
> +			 */
> +			trace_xfs_ail_relog(lip);
> +			ASSERT(list_empty(&lip->li_trans));
> +			if (list_empty(&ailp->ail_relog_list))
> +				xfs_log_ticket_get(ailp->ail_relog_tic);
> +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> +			break;

So the XFS_LI_RELOGGED bit indicates that the item is locked on the
relog list? That means nothing else in the transaction subsystem
will set that until the item is unlocked, right? Which is when it
ends up back on the CIL. This then gets cleared by the CIL being
forced and the item moved forward in the AIL.

IOWs, a log force clears this flag until the AIL relogs it again.
Why do we need this flag to issue wakeups when the wait loop does
blocking log forces? i.e. it will have already waited for the flag
to be cleared by waiting for the log force....

Cheers,

Dave.
Dave Chinner March 2, 2020, 7:32 a.m. UTC | #5
On Fri, Feb 28, 2020 at 09:02:02AM -0500, Brian Foster wrote:
> See my earlier comment around batching. Right now we only relog one item
> at a time and the relog reservation is intended to be the max possible
> reloggable item in the fs. This needs to increase to support some kind
> of batching here, but I think the prospective reloggable items right now
> (i.e. 2 or 3 different intent types) allows a fixed calculation size to
> work well enough for our needs.
> 
> Note that I think there's a whole separate ball of complexity we could
> delve into if we wanted to support something like arbitrary, per-item
> (set) relog tickets with different reservation values as opposed to one
> global, fixed size ticket. That would require some association between
> log items and tickets and perhaps other items covered by the same
> ticket, etc., but would provide a much more generic mechanism. As it is,
> I think that's hugely overkill for the current use cases, but maybe we
> find a reason to evolve this into something like that down the road..

From what I'm seeing, a generic, arbitrary relogging mechanism based
on a reservation stealing concept similar to the CIL is going to be
simpler, less invasive, easier to expand in future and more robust
than this "limited scope" proposal.

I'm firmly in the "do it right the first time" camp here...

Cheers,

Dave.
Brian Foster March 2, 2020, 6:52 p.m. UTC | #6
On Mon, Mar 02, 2020 at 06:18:43PM +1100, Dave Chinner wrote:
> On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> > Now that relog reservation is available and relog state tracking is
> > in place, all that remains to automatically relog items is the relog
> > mechanism itself. An item with relogging enabled is basically pinned
> > from writeback until relog is disabled. Instead of being written
> > back, the item must instead be periodically committed in a new
> > transaction to move it in the physical log. The purpose of moving
> > the item is to avoid long term tail pinning and thus avoid log
> > deadlocks for long running operations.
> > 
> > The ideal time to relog an item is in response to tail pushing
> > pressure. This accommodates the current workload at any given time
> > as opposed to a fixed time interval or log reservation heuristic,
> > which risks performance regression. This is essentially the same
> > heuristic that drives metadata writeback. XFS already implements
> > various log tail pushing heuristics that attempt to keep the log
> > progressing on an active fileystem under various workloads.
> > 
> > The act of relogging an item simply requires to add it to a
> > transaction and commit. This pushes the already dirty item into a
> > subsequent log checkpoint and frees up its previous location in the
> > on-disk log. Joining an item to a transaction of course requires
> > locking the item first, which means we have to be aware of
> > type-specific locks and lock ordering wherever the relog takes
> > place.
> > 
> > Fundamentally, this points to xfsaild as the ideal location to
> > process relog enabled items. xfsaild already processes log resident
> > items, is driven by log tail pushing pressure, processes arbitrary
> > log item types through callbacks, and is sensitive to type-specific
> > locking rules by design. The fact that automatic relogging
> > essentially diverts items between writeback or relog also suggests
> > xfsaild as an ideal location to process items one way or the other.
> > 
> > Of course, we don't want xfsaild to process transactions as it is a
> > critical component of the log subsystem for driving metadata
> > writeback and freeing up log space. Therefore, similar to how
> > xfsaild builds up a writeback queue of dirty items and queues writes
> > asynchronously, make xfsaild responsible only for directing pending
> > relog items into an appropriate queue and create an async
> > (workqueue) context for processing the queue. The workqueue context
> > utilizes the pre-reserved relog ticket to drain the queue by rolling
> > a permanent transaction.
> > 
> > Update the AIL pushing infrastructure to support a new RELOG item
> > state. If a log item push returns the relog state, queue the item
> > for relog instead of writeback. On completion of a push cycle,
> > schedule the relog task at the same point metadata buffer I/O is
> > submitted. This allows items to be relogged automatically under the
> > same locking rules and pressure heuristics that govern metadata
> > writeback.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_trace.h      |   1 +
> >  fs/xfs/xfs_trans.h      |   1 +
> >  fs/xfs/xfs_trans_ail.c  | 103 +++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trans_priv.h |   3 ++
> >  4 files changed, 106 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a066617ec54d..df0114ec66f1 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1063,6 +1063,7 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
> >  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> > +DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
> >  DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
> >  DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
> >  
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index fc4c25b6eee4..1637df32c64c 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -99,6 +99,7 @@ void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
> >  #define XFS_ITEM_PINNED		1
> >  #define XFS_ITEM_LOCKED		2
> >  #define XFS_ITEM_FLUSHING	3
> > +#define XFS_ITEM_RELOG		4
> >  
> >  /*
> >   * Deferred operation item relogging limits.
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index a3fb64275baa..71a47faeaae8 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -144,6 +144,75 @@ xfs_ail_max_lsn(
> >  	return lsn;
> >  }
> >  
> > +/*
> > + * Relog log items on the AIL relog queue.
> > + */
> > +static void
> > +xfs_ail_relog(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> > +						     ail_relog_work);
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> > +	struct xfs_trans_res	tres = {};
> > +	struct xfs_trans	*tp;
> > +	struct xfs_log_item	*lip;
> > +	int			error;
> > +
> > +	/*
> > +	 * The first transaction to submit a relog item contributed relog
> > +	 * reservation to the relog ticket before committing. Create an empty
> > +	 * transaction and manually associate the relog ticket.
> > +	 */
> > +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
> 
> I suspect deadlocks on filesystems in the process of being frozen
> when the log is full and relogging is required to make progress.
> 

I'll have to look into interactions with fs freeze as well once the
basics are nailed down.

> > +	ASSERT(!error);
> > +	if (error)
> > +		return;
> > +	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
> > +	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
> > +	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
> > +	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
> 
> So this assumes you've stolen the log reservation for this ticket
> from somewhere else, because otherwise the transaction log
> reservation and the ticket don't match.
> 
> FWIW, I'm having trouble real keeping all the ail relog ticket
> references straight. Code seems to be
> arbitratily taking and dropping references to that ticket, and I
> can't see a pattern or set of rules for usage.
> 

The reference counting stuff is a mess. I was anticipating needing to
simplify it, perhaps abstract it from the actual log ticket count, but
I'm going to hold off on getting too deep into the weeds of the
reference counting stuff unless the reservation stealing approach
doesn't pan out. Otherwise, this should presumably all go away..

> Why does this specific transaction need a reference to the ticket,
> when the ail_relog_list has a reference, every item that has been
> marked as XFS_LI_RELOG already has a reference, etc?
> 

This transaction has a reference because it uses the log ticket. It will
release a ticket reference on commit. Also note that the relog state
(and thus references) can be cleared from items at any time.

> > +	spin_lock(&ailp->ail_lock);
> > +	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
> > +					       struct xfs_log_item,
> > +					       li_trans)) != NULL) {
> 
> I dislike the way the "swiss army knife" list macro makes this
> code unreadable.
> 
>         while (!list_empty(&ailp->ail_relog_list)) {
> 		lip = list_first_entry(...)
> 
> is much neater and easier to read.
> 

Ok.

> > +		/*
> > +		 * Drop the AIL processing ticket reference once the relog list
> > +		 * is emptied. At this point it's possible for our transaction
> > +		 * to hold the only reference.
> > +		 */
> > +		list_del_init(&lip->li_trans);
> > +		if (list_empty(&ailp->ail_relog_list))
> > +			xfs_log_ticket_put(ailp->ail_relog_tic);
> 
> This seems completely arbitrary.
> 

The list holds a reference for the case where an item is queued for
relog but the relog state is cleared before the relog occurs.

> > +		spin_unlock(&ailp->ail_lock);
> > +
> > +		xfs_trans_add_item(tp, lip);
> > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > +		/* XXX: include ticket owner task fix */
> > +		error = xfs_trans_roll(&tp);
> 
> So the reservation for this ticket is going to be regranted over and
> over again and space reserved repeatedly as we work through the
> locked relog items one at a time?
> 
> Unfortunately, this violates the rule that prevents rolling
> transactions from deadlocking. That is, any object that is held
> locked across the transaction commit and regrant that *might pin the
> tail of the log* must be relogged in the transaction to move the
> item forward and prevent it from pinning the tail of the log.
> 
> IOWs, if one of the items later in the relog list pins the tail of
> the log we will end up sleeping here:
> 
>   xfs_trans_roll()
>     xfs_trans_reserve
>       xfs_log_regrant
>         xlog_grant_head_check(need_bytes)
> 	  xlog_grant_head_wait()
> 
> waiting for the write grant head to move. ANd it never will, because
> we hold the lock on that item so the AIL can't push it out.
> IOWs, using a rolling transaction per relog item will not work for
> processing multiple relog items.
> 

Hm, Ok. I must be missing something about the rolling transaction
guarantees.

> If it's a single transaction, and we join all the locked items
> for relogging into it in a single transaction commit, then we are
> fine - we don't try to regrant log space while holding locked items
> that could pin the tail of the log.
> 
> We *can* use a rolling transaction if we do this - the AIL has a
> permanent transaction (plus ticket!) allocated at mount time with
> a log count of zero, we can then steal reserve/write grant head
> space into that ticket at CIL commit time as I mentioned
> previously. We do a loop like above, but it's basically:
> 
> {
> 	LIST_HEAD(tmp);
> 
> 	spin_lock(&ailp->ail_lock);
>         if (list_empty(&ailp->ail_relog_list)) {
> 		spin_unlock(&ailp->ail_lock);
> 		return;
> 	}
> 
> 	list_splice_init(&ilp->ail_relog_list, &tmp);
> 	spin_unlock(&ailp->ail_lock);
> 
> 	xfs_ail_relog_items(ail, &tmp);
> }
> 
> This allows the AIL to keep working and building up a new relog
> as it goes along. Hence we can work on our list without interruption
> or needing to repeatedly take the AIL lock just to get items from the
> list.
> 

Yeah, I was planning splicing the list as such to avoid cycling the lock
so much regardless..

> And xfs_ail_relog_items() does something like this:
> 
> {
> 	struct xfs_trans	*tp;
> 
> 	/*
> 	 * Make CIL committers trying to change relog status of log
> 	 * items wait for us to stabilise the relog transaction
> 	 * again by committing the current relog list and rolling
> 	 * the transaction.
> 	 */
> 	down_write(&ail->ail_relog_lock);
> 	tp = ail->relog_trans;
> 
> 	while (!list_empty(&ailp->ail_relog_list)) {
> 		lip = list_first_entry(&ailp->ail_relog_list,
> 					struct xfs_log_item, li_trans);
> 		list_del_init(&lip->li_trans);
> 
> 		xfs_trans_add_item(tp, lip);
> 		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> 		tp->t_flags |= XFS_TRANS_DIRTY;
> 	}
> 
> 	error = xfs_trans_roll(&tp);
> 	if (error) {
> 		SHUTDOWN!
> 	}
> 	ail->relog_trans = tp;
> 	up_write(&ail->ail_relog_lock);
> }
> 

I think I follow.. The fundamental difference here is basically that we
commit whatever we locked, right? IOW, the current approach _could_
technically be corrected, but it would have to lock one item at a time
(ugh) rather than build up a queue of locked items..?

The reservation stealing approach facilitates this batching because
instead of only having a guarantee that we can commit one max sized
relog item at a time, we can commit however many we have queued because
sufficient reservation has already been acquired.

That does raise another issue in that we presumably want some kind of
maximum transaction size and/or maximum outstanding relog reservation
with the above approach. Otherwise it could be possible for a workload
to go off the rails without any kind of throttling or heuristics
incorporated in the current (mostly) fixed transaction sizes. Perhaps
it's reasonable enough to cap outstanding relog reservation to the max
transaction size and return an error to callers that attempt to exceed
it..? It's not clear to me if that would impact the prospective scrub
use case. Hmmm.. maybe the right thing to do is cap the size of the
current relog queue so we cap the size of the relog transaction without
necessarily capping the max outstanding relog reservation. Thoughts on
that?
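
As a strawman, the sort of cap I mean might look something like this
(XFS_RELOG_QUEUE_MAX and ail_relog_count are made up names, purely for
illustration):

/*
 * Rough sketch only: refuse to queue more relog work once the pending
 * queue hits a cap, so the relog transaction stays bounded. Called
 * with the ail_lock held, as in the xfsaild push loop.
 */
static bool
xfs_ail_relog_queue(
	struct xfs_ail		*ailp,
	struct xfs_log_item	*lip)
{
	if (ailp->ail_relog_count >= XFS_RELOG_QUEUE_MAX)
		return false;	/* leave the item for a later push */

	if (list_empty(&ailp->ail_relog_list))
		xfs_log_ticket_get(ailp->ail_relog_tic);
	list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
	ailp->ail_relog_count++;
	set_bit(XFS_LI_RELOGGED, &lip->li_flags);
	return true;
}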

> > @@ -426,6 +495,23 @@ xfsaild_push(
> >  			ailp->ail_last_pushed_lsn = lsn;
> >  			break;
> >  
> > +		case XFS_ITEM_RELOG:
> > +			/*
> > +			 * The item requires a relog. Add to the pending relog
> > +			 * list and set the relogged bit to prevent further
> > +			 * relog requests. The relog bit and ticket reference
> > +			 * can be dropped from the item at any point, so hold a
> > +			 * relog ticket reference for the pending relog list to
> > +			 * ensure the ticket stays around.
> > +			 */
> > +			trace_xfs_ail_relog(lip);
> > +			ASSERT(list_empty(&lip->li_trans));
> > +			if (list_empty(&ailp->ail_relog_list))
> > +				xfs_log_ticket_get(ailp->ail_relog_tic);
> > +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> > +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> > +			break;
> 
> So the XFS_LI_RELOGGED bit indicates that the item is locked on the
> relog list? That means nothing else in the transaction subsystem
> will set that until the item is unlocked, right? Which is when it
> ends up back on the CIL. This then gets cleared by the CIL being
> forced and the item moved forward in the AIL.
> 
> IOWs, a log force clears this flag until the AIL relogs it again.
> Why do we need this flag to issue wakeups when the wait loop does
> blocking log forces? i.e. it will have already waited for the flag
> to be cleared by waiting for the log force....
> 

The bit tracks when a relog item has been queued. It's set when the item
is first added to the relog queue and is cleared when the item lands back
in the on-disk log. The primary purpose of the bit is thus to prevent
spurious relogging of an item that has already been dropped back into the
CIL and therefore already has a pending move in the physical log.

The wait/wake is a secondary use that isolates the item to the AIL once
relogging is disabled. The wait covers the case where the item is queued
on the relog list but not yet committed back to the log subsystem;
otherwise the log force serves no purpose.
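
In other words, the disable side effectively does something like the
following (simplified sketch, not the literal code from the series; a
real implementation would sleep on a waitqueue and be woken when the
bit clears rather than polling):

/*
 * Simplified sketch: once relogging is disabled, force the log and wait
 * for any pending relog (XFS_LI_RELOGGED still set) to make it back to
 * the on-disk log so the item is isolated to the AIL.
 */
static void
xfs_relog_wait_relogged(
	struct xfs_log_item	*lip)
{
	while (test_bit(XFS_LI_RELOGGED, &lip->li_flags)) {
		xfs_log_force(lip->li_mountp, XFS_LOG_SYNC);
		if (test_bit(XFS_LI_RELOGGED, &lip->li_flags))
			msleep(100);
	}
}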

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 3, 2020, 12:06 a.m. UTC | #7
On Mon, Mar 02, 2020 at 01:52:52PM -0500, Brian Foster wrote:
> On Mon, Mar 02, 2020 at 06:18:43PM +1100, Dave Chinner wrote:
> > On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> > > +		spin_unlock(&ailp->ail_lock);
> > > +
> > > +		xfs_trans_add_item(tp, lip);
> > > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > > +		/* XXX: include ticket owner task fix */
> > > +		error = xfs_trans_roll(&tp);
> > 
> > So the reservation for this ticket is going to be regranted over and
> > over again and space reserved repeatedly as we work through the
> > locked relog items one at a time?
> > 
> > Unfortunately, this violates the rule that prevents rolling
> > transactions from deadlocking. That is, any object that is held
> > locked across the transaction commit and regrant that *might pin the
> > tail of the log* must be relogged in the transaction to move the
> > item forward and prevent it from pinning the tail of the log.
> > 
> > IOWs, if one of the items later in the relog list pins the tail of
> > the log we will end up sleeping here:
> > 
> >   xfs_trans_roll()
> >     xfs_trans_reserve
> >       xfs_log_regrant
> >         xlog_grant_head_check(need_bytes)
> > 	  xlog_grant_head_wait()
> > 
> > waiting for the write grant head to move. And it never will, because
> > we hold the lock on that item so the AIL can't push it out.
> > IOWs, using a rolling transaction per relog item will not work for
> > processing multiple relog items.
> > 
> 
> Hm, Ok. I must be missing something about the rolling transaction
> guarantees.

Ok, I know you understand some of the rules, but because I don't
know quite which bit of this complex game you are missing, I'll go
over it from the start (sorry for repeating things you know!).

We're not allowed to hold locked items that are in the AIL over a
transaction reservation call because the transaction reservation may
need to push the log to free up space in the log. Writing back
metadata requires locking the item to flush it, and if we hold it
> locked then it can't be flushed. Hence if it pins the tail of the log
> and prevents the reservation from making space available, we
deadlock.

Normally, the only thing that is pinned across a transaction roll is
an inode, and the inode is being logged in every transaction. Hence
it is being continually moved to the head of the log and so can't
pin the tail of the log and prevent the reservation making progress.

The problem here is that having a log reservation for a modification
doesn't guarantee you that log space is immediately available - all
it guarantees you is that the log space will be available if the log
> tail is free to move forward.

That's why there are two grant heads. The reservation grant head is
the one that guarantees that you have space available in the log for
the rolling transaction. That is always immediately regranted during
transaction commit, hence we guarantee the rolling transaction will
always fit in the log. The reserve head ensures we never overcommit
the available log space.

The second grant head is the write head, and this tracks the space
> immediately available to physically write into the log. This
> essentially tracks the space available for physical writes into the
> log. When your write reservation runs out (i.e. after the number of
> rolls the log count in the initial transaction reservation
specifies), we have to regrant physical space for the next
transaction in the rolling chain. If the log is physically full, we
have to wait for physical space to be made available.

The only way to increase the amount of physical space available in
the log is to have the tail move forwards. xfs_trans_reserve() does
that by setting a push target for the AIL to flush all the metadata
older than that target. It then blocks waiting for the tail of the
log to move. When the tail of the log moves, the available write
grant space increases because the log head can now physically move
forwards in the log.

Hence when the log is full and we are in a tail pushing situation,
new transactions wait on the reserve grant head to get the log space
guarantee they require. Long duration rolling transactions already
have a log space guarantee from the reserve grant head, so they
end up waiting for the physical log space they require on the write
grant head.

The tail pinning deadlock rolling transactions can trigger is
against the write grant head, not the reserve grant head. If the
tail of the log cannot move, then the write grant space never
increases and xfs_trans_reserve() blocks forever. Hence we cannot
call xfs_trans_roll() whilst holding items locked that have a high
probability of being at the tail of the log.

Given that this relogging functionality is all about preventing
items from either pinning the tail of the log or disappearing off
the tail of the log because they aren't relogged, we have to be very
careful about holding them locked over operations that require
the AIL to be able to make forwards progress....


> > If it's a single transaction, and we join all the locked items
> > for relogging into it in a single transaction commit, then we are
> > fine - we don't try to regrant log space while holding locked items
> > that could pin the tail of the log.
> > 
> > We *can* use a rolling transaction if we do this - the AIL has a
> > permanent transaction (plus ticket!) allocated at mount time with
> > a log count of zero, we can then steal reserve/write grant head
> > space into that ticket at CIL commit time as I mentioned
> > previously. We do a loop like above, but it's basically:
> > 
> > {
> > 	LIST_HEAD(tmp);
> > 
> > 	spin_lock(&ailp->ail_lock);
> >         if (list_empty(&ailp->ail_relog_list)) {
> > 		spin_unlock(&ailp->ail_lock);
> > 		return;
> > 	}
> > 
> > 	list_splice_init(&ilp->ail_relog_list, &tmp);
> > 	spin_unlock(&ailp->ail_lock);
> > 
> > 	xfs_ail_relog_items(ail, &tmp);
> > }
> > 
> > This allows the AIL to keep working and building up a new relog
> > as it goes along. Hence we can work on our list without interruption
> > or needing to repeatedly take the AIL lock just to get items from the
> > list.
> > 
> 
> Yeah, I was planning splicing the list as such to avoid cycling the lock
> so much regardless..
> 
> > And xfs_ail_relog_items() does something like this:
> > 
> > {
> > 	struct xfs_trans	*tp;
> > 
> > 	/*
> > 	 * Make CIL committers trying to change relog status of log
> > 	 * items wait for us to stabilise the relog transaction
> > 	 * again by committing the current relog list and rolling
> > 	 * the transaction.
> > 	 */
> > 	down_write(&ail->ail_relog_lock);
> > 	tp = ail->relog_trans;
> > 
> > 	while (!list_empty(&ailp->ail_relog_list)) {
> > 		lip = list_first_entry(&ailp->ail_relog_list,
> > 					struct xfs_log_item, li_trans);
> > 		list_del_init(&lip->li_trans);
> > 
> > 		xfs_trans_add_item(tp, lip);
> > 		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > 		tp->t_flags |= XFS_TRANS_DIRTY;
> > 	}
> > 
> > 	error = xfs_trans_roll(&tp);
> > 	if (error) {
> > 		SHUTDOWN!
> > 	}
> > 	ail->relog_trans = tp;
> > 	up_write(&ail->ail_relog_lock);
> > }
> > 
> 
> I think I follow.. The fundamental difference here is basically that we
> commit whatever we locked, right? IOW, the current approach _could_
> technically be corrected, but it would have to lock one item at a time
> (ugh) rather than build up a queue of locked items..?
> 
> The reservation stealing approach facilitates this batching because
> instead of only having a guarantee that we can commit one max sized
> relog item at a time, we can commit however many we have queued because
> sufficient reservation has already been acquired.

Yes.

> That does raise another issue in that we presumably want some kind of
> maximum transaction size and/or maximum outstanding relog reservation
> with the above approach. Otherwise it could be possible for a workload
> > to go off the rails without any kind of throttling or heuristics
> incorporated in the current (mostly) fixed transaction sizes.

Possibly, though I really don't have any intuition on how big the
> relog reservation could possibly grow. Right now it doesn't seem
like there's a space problem (single item!), so perhaps this is
something we can defer until we have some further understanding of
how many relog items are active at any given time?

> Perhaps
> it's reasonable enough to cap outstanding relog reservation to the max
> transaction size and return an error to callers that attempt to exceed

Max transaction size the CIL currently uses for large logs is 32MB.
That's an awful lot of relog items....

> it..? It's not clear to me if that would impact the prospective scrub
> use case. Hmmm.. maybe the right thing to do is cap the size of the
> current relog queue so we cap the size of the relog transaction without
> necessarily capping the max outstanding relog reservation. Thoughts on
> that?

Maybe.

Though this seems like an ideal candidate for a relog reservation
grant head. i.e. the transaction reserve structure we pass to
xfs_trans_reserve() has a new field that contains the relog
reservation needed for this transaction, and we use the existing
lockless grant head accounting infrastructure to throttle relogging
> to an acceptable bound...

Cheers,

Dave.
Brian Foster March 3, 2020, 2:14 p.m. UTC | #8
On Tue, Mar 03, 2020 at 11:06:43AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2020 at 01:52:52PM -0500, Brian Foster wrote:
> > On Mon, Mar 02, 2020 at 06:18:43PM +1100, Dave Chinner wrote:
> > > On Thu, Feb 27, 2020 at 08:43:17AM -0500, Brian Foster wrote:
> > > > +		spin_unlock(&ailp->ail_lock);
> > > > +
> > > > +		xfs_trans_add_item(tp, lip);
> > > > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > > > +		/* XXX: include ticket owner task fix */
> > > > +		error = xfs_trans_roll(&tp);
> > > 
> > > So the reservation for this ticket is going to be regranted over and
> > > over again and space reserved repeatedly as we work through the
> > > locked relog items one at a time?
> > > 
> > > Unfortunately, this violates the rule that prevents rolling
> > > transactions from deadlocking. That is, any object that is held
> > > locked across the transaction commit and regrant that *might pin the
> > > tail of the log* must be relogged in the transaction to move the
> > > item forward and prevent it from pinning the tail of the log.
> > > 
> > > IOWs, if one of the items later in the relog list pins the tail of
> > > the log we will end up sleeping here:
> > > 
> > >   xfs_trans_roll()
> > >     xfs_trans_reserve
> > >       xfs_log_regrant
> > >         xlog_grant_head_check(need_bytes)
> > > 	  xlog_grant_head_wait()
> > > 
> > > waiting for the write grant head to move. And it never will, because
> > > we hold the lock on that item so the AIL can't push it out.
> > > IOWs, using a rolling transaction per relog item will not work for
> > > processing multiple relog items.
> > > 
> > 
> > Hm, Ok. I must be missing something about the rolling transaction
> > guarantees.
> 
> Ok, I know you understand some of the rules, but because I don't
> know quite which bit of this complex game you are missing, I'll go
> over it from the start (sorry for repeating things you know!).
> 
> We're not allowed to hold locked items that are in the AIL over a
> transaction reservation call because the transaction reservation may
> need to push the log to free up space in the log. Writing back
> metadata requires locking the item to flush it, and if we hold it
> > locked then it can't be flushed. Hence if it pins the tail of the log
> > and prevents the reservation from making space available, we
> deadlock.
> 
> Normally, the only thing that is pinned across a transaction roll is
> an inode, and the inode is being logged in every transaction. Hence
> it is being continually moved to the head of the log and so can't
> pin the tail of the log and prevent the reservation making progress.
> 
> The problem here is that having a log reservation for a modification
> doesn't guarantee you that log space is immediately available - all
> it guarantees you is that the log space will be available if the log
> > tail is free to move forward.
> 
> That's why there are two grant heads. The reservation grant head is
> the one that guarantees that you have space available in the log for
> the rolling transaction. That is always immediately regranted during
> transaction commit, hence we guarantee the rolling transaction will
> always fit in the log. The reserve head ensures we never overcommit
> the available log space.
> 
> The second grant head is the write head, and this tracks the space
> > immediately available to physically write into the log. This
> > essentially tracks the space available for physical writes into the
> > log. When your write reservation runs out (i.e. after the number of
> > rolls the log count in the initial transaction reservation
> specifies), we have to regrant physical space for the next
> transaction in the rolling chain. If the log is physically full, we
> have to wait for physical space to be made available.
> 
> The only way to increase the amount of physical space available in
> the log is to have the tail move forwards. xfs_trans_reserve() does
> that by setting a push target for the AIL to flush all the metadata
> older than that target. It then blocks waiting for the tail of the
> log to move. When the tail of the log moves, the available write
> grant space increases because the log head can now physically move
> forwards in the log.
> 
> Hence when the log is full and we are in a tail pushing situation,
> new transactions wait on the reserve grant head to get the log space
> guarantee they require. Long duration rolling transactions already
> have a log space guarantee from the reserve grant head, so they
> end up waiting for the physical log space they require on the write
> grant head.
> 
> The tail pinning deadlock rolling transactions can trigger is
> against the write grant head, not the reserve grant head. If the
> tail of the log cannot move, then the write grant space never
> increases and xfs_trans_reserve() blocks forever. Hence we cannot
> call xfs_trans_roll() whilst holding items locked that have a high
> probability of being at the tail of the log.
> 

Ok. I've stared at the reserve vs. write grant head code before and
recognized the difference in behavior between new transactions bumping
both heads and rolling transactions replenishing straight from the write
head (presumably holding a constant log reservation over the entire
process), but the side effect of that in the rolling transaction case
wasn't always clear to me. This clears up the purpose of the write head
and makes the connection to the deadlock vector. Thanks for the
explanation...
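
To restate the asymmetry back in code form just to make sure I have it
right (toy model only, nothing like the real lockless grant head
accounting in xfs_log.c):

/* Toy model of the two grant heads described above. */
struct toy_log {
	long	reserve_grant;	/* logical log space guaranteed to trans */
	long	write_grant;	/* space physically writable right now */
};

/* New transaction reservation: needs space from both heads. */
static int toy_reserve(struct toy_log *log, long bytes)
{
	if (log->reserve_grant < bytes || log->write_grant < bytes)
		return -EAGAIN;		/* sleep until space frees up */
	log->reserve_grant -= bytes;
	log->write_grant -= bytes;
	return 0;
}

/*
 * Transaction roll after the initial log count is used up: the reserve
 * grant was already regranted at commit time, so only physical (write)
 * space is needed, and that only appears when the AIL moves the tail.
 * Holding a locked item that pins the tail here is the deadlock.
 */
static int toy_regrant(struct toy_log *log, long bytes)
{
	if (log->write_grant < bytes)
		return -EAGAIN;		/* wait for the AIL to push the tail */
	log->write_grant -= bytes;
	return 0;
}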

> Given that this relogging functionality is all about preventing
> items from either pinning the tail of the log or disappearing off
> the tail of the log because they aren't relogged, we have to be very
> careful about holding them locked over operations that require
> the AIL to be able to make forwards progress....
> 

Indeed, and we're intentionally relying on AIL pressure to defer relogs
until items are potentially at or near the tail of the log as well.

> 
> > > If it's a single transaction, and we join all the locked items
> > > for relogging into it in a single transaction commit, then we are
> > > fine - we don't try to regrant log space while holding locked items
> > > that could pin the tail of the log.
> > > 
> > > We *can* use a rolling transaction if we do this - the AIL has a
> > > permanent transaction (plus ticket!) allocated at mount time with
> > > a log count of zero, we can then steal reserve/write grant head
> > > space into that ticket at CIL commit time as I mentioned
> > > previously. We do a loop like above, but it's basically:
> > > 
> > > {
> > > 	LIST_HEAD(tmp);
> > > 
> > > 	spin_lock(&ailp->ail_lock);
> > >         if (list_empty(&ailp->ail_relog_list)) {
> > > 		spin_unlock(&ailp->ail_lock);
> > > 		return;
> > > 	}
> > > 
> > > 	list_splice_init(&ilp->ail_relog_list, &tmp);
> > > 	spin_unlock(&ailp->ail_lock);
> > > 
> > > 	xfs_ail_relog_items(ail, &tmp);
> > > }
> > > 
> > > This allows the AIL to keep working and building up a new relog
> > > as it goes along. Hence we can work on our list without interruption
> > > or needing to repeatedly take the AIL lock just to get items from the
> > > list.
> > > 
> > 
> > Yeah, I was planning splicing the list as such to avoid cycling the lock
> > so much regardless..
> > 
> > > And xfs_ail_relog_items() does something like this:
> > > 
> > > {
> > > 	struct xfs_trans	*tp;
> > > 
> > > 	/*
> > > 	 * Make CIL committers trying to change relog status of log
> > > 	 * items wait for us to stabilise the relog transaction
> > > 	 * again by committing the current relog list and rolling
> > > 	 * the transaction.
> > > 	 */
> > > 	down_write(&ail->ail_relog_lock);
> > > 	tp = ail->relog_trans;
> > > 
> > > 	while (!list_empty(&ailp->ail_relog_list)) {
> > > 		lip = list_first_entry(&ailp->ail_relog_list,
> > > 					struct xfs_log_item, li_trans);
> > > 		list_del_init(&lip->li_trans);
> > > 
> > > 		xfs_trans_add_item(tp, lip);
> > > 		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > > 		tp->t_flags |= XFS_TRANS_DIRTY;
> > > 	}
> > > 
> > > 	error = xfs_trans_roll(&tp);
> > > 	if (error) {
> > > 		SHUTDOWN!
> > > 	}
> > > 	ail->relog_trans = tp;
> > > 	up_write(&ail->ail_relog_lock);
> > > }
> > > 
> > 
> > I think I follow.. The fundamental difference here is basically that we
> > commit whatever we locked, right? IOW, the current approach _could_
> > technically be corrected, but it would have to lock one item at a time
> > (ugh) rather than build up a queue of locked items..?
> > 
> > The reservation stealing approach facilitates this batching because
> > instead of only having a guarantee that we can commit one max sized
> > relog item at a time, we can commit however many we have queued because
> > sufficient reservation has already been acquired.
> 
> Yes.
> 
> > That does raise another issue in that we presumably want some kind of
> > maximum transaction size and/or maximum outstanding relog reservation
> > with the above approach. Otherwise it could be possible for a workload
> > > to go off the rails without any kind of throttling or heuristics
> > incorporated in the current (mostly) fixed transaction sizes.
> 
> Possibly, though I really don't have any intuition on how big the
> > relog reservation could possibly grow. Right now it doesn't seem
> like there's a space problem (single item!), so perhaps this is
> something we can defer until we have some further understanding of
> how many relog items are active at any given time?
> 

Sure, it's certainly not an immediate issue for the current use case.

> > Perhaps
> > it's reasonable enough to cap outstanding relog reservation to the max
> > transaction size and return an error to callers that attempt to exceed
> 
> Max transaction size the CIL currently uses for large logs is 32MB.
> That's an awful lot of relog items....
> 

Yeah. As above, it's not clear to me what the max relog requirement
might be in the worst case of online repair of a metadata btree. This
might warrant some empirical data once the mechanism is in place. The
buffer relogging test code could help with that as well.

> > it..? It's not clear to me if that would impact the prospective scrub
> > use case. Hmmm.. maybe the right thing to do is cap the size of the
> > current relog queue so we cap the size of the relog transaction without
> > necessarily capping the max outstanding relog reservation. Thoughts on
> > that?
> 
> Maybe.
> 
> Though this seems like an ideal candidate for a relog reservation
> grant head. i.e. the transaction reserve structure we pass to
> xfs_trans_reserve() has a new field that contains the relog
> reservation needed for this transaction, and we use the existing
> lockless grant head accounting infrastructure to throttle relogging
> > to an acceptable bound...
> 

That's an interesting idea too. I guess this will mostly be dictated by
the requirements of the repair use case, and there's potential for
flexibility in terms of having a hard cap vs. a throttling approach
independent from the reservation tracking implementation details.
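
For the grant head variant, I'm picturing something along these lines
(tr_relogres is a made up field name, just to illustrate where the
relog reservation might be declared; the actual structure only has the
three existing fields):

/*
 * Illustration only: a transaction that wants its items relogged would
 * declare the relog space it needs up front, and that amount would be
 * accounted against a dedicated relog grant head to bound the total
 * outstanding relog reservation.
 */
struct xfs_trans_res {
	uint	tr_logres;	/* log space in bytes per roll */
	int	tr_logcount;	/* number of log operations */
	int	tr_logflags;	/* e.g. XFS_TRANS_PERM_LOG_RES */
	uint	tr_relogres;	/* hypothetical: relog reservation */
};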

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
diff mbox series

Patch

diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a066617ec54d..df0114ec66f1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1063,6 +1063,7 @@  DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_ail_relog);
 DEFINE_LOG_ITEM_EVENT(xfs_relog_item);
 DEFINE_LOG_ITEM_EVENT(xfs_relog_item_cancel);
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index fc4c25b6eee4..1637df32c64c 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -99,6 +99,7 @@  void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
 #define XFS_ITEM_PINNED		1
 #define XFS_ITEM_LOCKED		2
 #define XFS_ITEM_FLUSHING	3
+#define XFS_ITEM_RELOG		4
 
 /*
  * Deferred operation item relogging limits.
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index a3fb64275baa..71a47faeaae8 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -144,6 +144,75 @@  xfs_ail_max_lsn(
 	return lsn;
 }
 
+/*
+ * Relog log items on the AIL relog queue.
+ */
+static void
+xfs_ail_relog(
+	struct work_struct	*work)
+{
+	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
+						     ail_relog_work);
+	struct xfs_mount	*mp = ailp->ail_mount;
+	struct xfs_trans_res	tres = {};
+	struct xfs_trans	*tp;
+	struct xfs_log_item	*lip;
+	int			error;
+
+	/*
+	 * The first transaction to submit a relog item contributed relog
+	 * reservation to the relog ticket before committing. Create an empty
+	 * transaction and manually associate the relog ticket.
+	 */
+	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
+	ASSERT(!error);
+	if (error)
+		return;
+	tp->t_log_res = M_RES(mp)->tr_relog.tr_logres;
+	tp->t_log_count = M_RES(mp)->tr_relog.tr_logcount;
+	tp->t_flags |= M_RES(mp)->tr_relog.tr_logflags;
+	tp->t_ticket = xfs_log_ticket_get(ailp->ail_relog_tic);
+
+	spin_lock(&ailp->ail_lock);
+	while ((lip = list_first_entry_or_null(&ailp->ail_relog_list,
+					       struct xfs_log_item,
+					       li_trans)) != NULL) {
+		/*
+		 * Drop the AIL processing ticket reference once the relog list
+		 * is emptied. At this point it's possible for our transaction
+		 * to hold the only reference.
+		 */
+		list_del_init(&lip->li_trans);
+		if (list_empty(&ailp->ail_relog_list))
+			xfs_log_ticket_put(ailp->ail_relog_tic);
+		spin_unlock(&ailp->ail_lock);
+
+		xfs_trans_add_item(tp, lip);
+		set_bit(XFS_LI_DIRTY, &lip->li_flags);
+		tp->t_flags |= XFS_TRANS_DIRTY;
+		/* XXX: include ticket owner task fix */
+		error = xfs_trans_roll(&tp);
+		ASSERT(!error);
+		if (error)
+			goto out;
+		spin_lock(&ailp->ail_lock);
+	}
+	spin_unlock(&ailp->ail_lock);
+
+out:
+	/* XXX: handle shutdown scenario */
+	/*
+	 * Drop the relog reference owned by the transaction separately because
+	 * we don't want the cancel to release reservation if this isn't the
+	 * final reference. The relog ticket and associated reservation needs
+	 * to persist so long as relog items are active in the log subsystem.
+	 */
+	xfs_trans_ail_relog_put(mp);
+
+	tp->t_ticket = NULL;
+	xfs_trans_cancel(tp);
+}
+
 /*
  * The cursor keeps track of where our current traversal is up to by tracking
  * the next item in the list for us. However, for this to be safe, removing an
@@ -364,7 +433,7 @@  static long
 xfsaild_push(
 	struct xfs_ail		*ailp)
 {
-	xfs_mount_t		*mp = ailp->ail_mount;
+	struct xfs_mount	*mp = ailp->ail_mount;
 	struct xfs_ail_cursor	cur;
 	struct xfs_log_item	*lip;
 	xfs_lsn_t		lsn;
@@ -426,6 +495,23 @@  xfsaild_push(
 			ailp->ail_last_pushed_lsn = lsn;
 			break;
 
+		case XFS_ITEM_RELOG:
+			/*
+			 * The item requires a relog. Add to the pending relog
+			 * list and set the relogged bit to prevent further
+			 * relog requests. The relog bit and ticket reference
+			 * can be dropped from the item at any point, so hold a
+			 * relog ticket reference for the pending relog list to
+			 * ensure the ticket stays around.
+			 */
+			trace_xfs_ail_relog(lip);
+			ASSERT(list_empty(&lip->li_trans));
+			if (list_empty(&ailp->ail_relog_list))
+				xfs_log_ticket_get(ailp->ail_relog_tic);
+			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
+			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
+			break;
+
 		case XFS_ITEM_FLUSHING:
 			/*
 			 * The item or its backing buffer is already being
@@ -492,6 +578,9 @@  xfsaild_push(
 	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
 		ailp->ail_log_flush++;
 
+	if (!list_empty(&ailp->ail_relog_list))
+		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
+
 	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
 out_done:
 		/*
@@ -922,15 +1011,24 @@  xfs_trans_ail_init(
 	spin_lock_init(&ailp->ail_lock);
 	INIT_LIST_HEAD(&ailp->ail_buf_list);
 	init_waitqueue_head(&ailp->ail_empty);
+	INIT_LIST_HEAD(&ailp->ail_relog_list);
+	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
+
+	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
+					     mp->m_super->s_id);
+	if (!ailp->ail_relog_wq)
+		goto out_free_ailp;
 
 	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->ail_mount->m_super->s_id);
 	if (IS_ERR(ailp->ail_task))
-		goto out_free_ailp;
+		goto out_destroy_wq;
 
 	mp->m_ail = ailp;
 	return 0;
 
+out_destroy_wq:
+	destroy_workqueue(ailp->ail_relog_wq);
 out_free_ailp:
 	kmem_free(ailp);
 	return -ENOMEM;
@@ -944,5 +1042,6 @@  xfs_trans_ail_destroy(
 
 	ASSERT(ailp->ail_relog_tic == NULL);
 	kthread_stop(ailp->ail_task);
+	destroy_workqueue(ailp->ail_relog_wq);
 	kmem_free(ailp);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index d1edec1cb8ad..33a724534869 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -63,6 +63,9 @@  struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
+	struct work_struct	ail_relog_work;
+	struct list_head	ail_relog_list;
+	struct workqueue_struct	*ail_relog_wq;
 	struct xlog_ticket	*ail_relog_tic;
 };