
[RFC,v5,3/9] xfs: automatic relogging reservation management

Message ID 20200227134321.7238-4-bfoster@redhat.com (mailing list archive)
State Deferred, archived
Series xfs: automatic relogging experiment

Commit Message

Brian Foster Feb. 27, 2020, 1:43 p.m. UTC
Automatic item relogging will occur from xfsaild context. xfsaild
cannot acquire log reservation itself because it is also responsible
for writeback and thus making used log reservation available again.
Since there is no guarantee log reservation is available by the time
a relogged item reaches the AIL, this is prone to deadlock.

To guarantee log reservation for automatic relogging, implement a
reservation management scheme where a transaction that is capable of
enabling relogging of an item must contribute the necessary
reservation to the relog mechanism up front. Use reference counting
to tie the lifetime of pending relog reservation to the
lifetime of in-core log items with relogging enabled.

The basic log reservation sequence for a relog-enabled transaction
is as follows:

- A transaction that uses relogging specifies XFS_TRANS_RELOG at
  allocation time.
- Once initialized, RELOG transactions check for the existence of
  the global relog log ticket. If it exists, grab a reference and
  return. If not, allocate an empty ticket and install it into the
  relog subsystem. Seed the relog ticket from the reservation of the
  current transaction. Roll the current transaction to replenish its
  reservation and return to the caller.
- The transaction is used as normal. If an item is relogged in the
  transaction, that item acquires a reference on the global relog
  ticket currently held open by the transaction. The item's reference
  persists until relogging is disabled on the item.
- The RELOG transaction commits and releases its reference to the
  global relog ticket. The global relog ticket is released once its
  reference count drops to zero.

This provides a central relog log ticket that guarantees reservation
availability for relogged items, avoids log reservation deadlocks,
and is allocated and released on demand.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_shared.h |  1 +
 fs/xfs/xfs_trans.c         | 37 +++++++++++++---
 fs/xfs/xfs_trans.h         |  3 ++
 fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans_priv.h    |  1 +
 5 files changed, 126 insertions(+), 5 deletions(-)

Comments

Allison Henderson Feb. 27, 2020, 8:49 p.m. UTC | #1
On 2/27/20 6:43 AM, Brian Foster wrote:
> Automatic item relogging will occur from xfsaild context. xfsaild
> cannot acquire log reservation itself because it is also responsible
> for writeback and thus making used log reservation available again.
> Since there is no guarantee log reservation is available by the time
> a relogged item reaches the AIL, this is prone to deadlock.
> 
> To guarantee log reservation for automatic relogging, implement a
> reservation management scheme where a transaction that is capable of
> enabling relogging of an item must contribute the necessary
> reservation to the relog mechanism up front. Use reference counting
> to associate the lifetime of pending relog reservation to the
> lifetime of in-core log items with relogging enabled.
> 
> The basic log reservation sequence for a relog enabled transaction
> is as follows:
> 
> - A transaction that uses relogging specifies XFS_TRANS_RELOG at
>    allocation time.
> - Once initialized, RELOG transactions check for the existence of
>    the global relog log ticket. If it exists, grab a reference and
>    return. If not, allocate an empty ticket and install into the relog
>    subsystem. Seed the relog ticket from reservation of the current
>    transaction. Roll the current transaction to replenish its
>    reservation and return to the caller.
> - The transaction is used as normal. If an item is relogged in the
>    transaction, that item acquires a reference on the global relog
>    ticket currently held open by the transaction. The item's reference
>    persists until relogging is disabled on the item.
> - The RELOG transaction commits and releases its reference to the
>    global relog ticket. The global relog ticket is released once its
>    reference count drops to zero.
> 
> This provides a central relog log ticket that guarantees reservation
> availability for relogged items, avoids log reservation deadlocks
> and is allocated and released on demand.
> 
Ok, I followed it through and didn't see any obvious errors.
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/libxfs/xfs_shared.h |  1 +
>   fs/xfs/xfs_trans.c         | 37 +++++++++++++---
>   fs/xfs/xfs_trans.h         |  3 ++
>   fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
>   fs/xfs/xfs_trans_priv.h    |  1 +
>   5 files changed, 126 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index c45acbd3add9..0a10ca0853ab 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>    * made then this algorithm will eventually find all the space it needs.
>    */
>   #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
>   
>   /*
>    * Field values for xfs_trans_mod_sb.
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..8ac05ed8deda 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -107,9 +107,14 @@ xfs_trans_dup(
>   
>   	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
>   		       (tp->t_flags & XFS_TRANS_RESERVE) |
> -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> -	/* We gave our writer reference to the new transaction */
> +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> +		       (tp->t_flags & XFS_TRANS_RELOG);
> +	/*
> +	 * The writer reference and relog reference transfer to the new
> +	 * transaction.
> +	 */
>   	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
>   	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
>   
>   	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> @@ -284,15 +289,25 @@ xfs_trans_alloc(
>   	tp->t_firstblock = NULLFSBLOCK;
>   
>   	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> -	if (error) {
> -		xfs_trans_cancel(tp);
> -		return error;
> +	if (error)
> +		goto error;
> +
> +	if (flags & XFS_TRANS_RELOG) {
> +		error = xfs_trans_ail_relog_reserve(&tp);
> +		if (error)
> +			goto error;
>   	}
>   
>   	trace_xfs_trans_alloc(tp, _RET_IP_);
>   
>   	*tpp = tp;
>   	return 0;
> +
> +error:
> +	/* clear relog flag if we haven't acquired a ref */
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
> +	xfs_trans_cancel(tp);
> +	return error;
>   }
>   
>   /*
> @@ -973,6 +988,10 @@ __xfs_trans_commit(
>   
>   	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>   
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +
>   	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   	xfs_trans_free(tp);
>   
> @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
>   			error = -EIO;
>   		tp->t_ticket = NULL;
>   	}
> +	/* release the relog ticket reference if this transaction holds one */
> +	/* XXX: handle RELOG items on transaction abort */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
>   	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   	xfs_trans_free_items(tp, !!error);
>   	xfs_trans_free(tp);
> @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
>   		tp->t_ticket = NULL;
>   	}
>   
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +
>   	/* mark this thread as no longer being in a transaction */
>   	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 752c7fef9de7..a032989943bd 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -236,6 +236,9 @@ int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
>   void		xfs_trans_cancel(xfs_trans_t *);
>   int		xfs_trans_ail_init(struct xfs_mount *);
>   void		xfs_trans_ail_destroy(struct xfs_mount *);
> +int		xfs_trans_ail_relog_reserve(struct xfs_trans **);
> +bool		xfs_trans_ail_relog_get(struct xfs_mount *);
> +int		xfs_trans_ail_relog_put(struct xfs_mount *);
>   
>   void		xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *,
>   				       enum xfs_blft);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 00cc5b8734be..a3fb64275baa 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -17,6 +17,7 @@
>   #include "xfs_errortag.h"
>   #include "xfs_error.h"
>   #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>   
>   #ifdef DEBUG
>   /*
> @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
>   		xfs_log_space_wake(ailp->ail_mount);
>   }
>   
> +bool
> +xfs_trans_ail_relog_get(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	bool			ret = false;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		ret = true;
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +	return ret;
> +}
> +
> +/*
> + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> + * requires a clean, permanent transaction from the caller. Pull reservation
> + * for the relog ticket and roll the caller's transaction back to its fully
> + * reserved state. If the AIL relog ticket is already initialized, grab a
> + * reference and return.
> + */
> +int
> +xfs_trans_ail_relog_reserve(
> +	struct xfs_trans	**tpp)
> +{
> +	struct xfs_trans	*tp = *tpp;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> +
> +	if (xfs_trans_ail_relog_get(mp))
> +		return 0;
> +
> +	/* no active ticket, fall into slow path to allocate one.. */
> +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> +	if (!tic)
> +		return -ENOMEM;
> +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> +
> +	/* check again since we dropped the lock for the allocation */
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +		xfs_log_ticket_put(tic);
> +		return 0;
> +	}
> +
> +	/* attach and reserve space for the ->tr_relog ticket */
> +	ailp->ail_relog_tic = tic;
> +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	return xfs_trans_roll(tpp);
> +}
> +
> +/*
> + * Release a reference to the relog ticket.
> + */
> +int
> +xfs_trans_ail_relog_put(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) {
> +		spin_unlock(&ailp->ail_lock);
> +		return 0;
> +	}
> +
> +	ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1);
> +	tic = ailp->ail_relog_tic;
> +	ailp->ail_relog_tic = NULL;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	xfs_log_done(mp, tic, NULL, false);
> +	return 0;
> +}
> +
>   int
>   xfs_trans_ail_init(
>   	xfs_mount_t	*mp)
> @@ -854,6 +942,7 @@ xfs_trans_ail_destroy(
>   {
>   	struct xfs_ail	*ailp = mp->m_ail;
>   
> +	ASSERT(ailp->ail_relog_tic == NULL);
>   	kthread_stop(ailp->ail_task);
>   	kmem_free(ailp);
>   }
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614..839df6559b9f 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -61,6 +61,7 @@ struct xfs_ail {
>   	int			ail_log_flush;
>   	struct list_head	ail_buf_list;
>   	wait_queue_head_t	ail_empty;
> +	struct xlog_ticket	*ail_relog_tic;
>   };
>   
>   /*
>
Darrick J. Wong Feb. 28, 2020, 12:02 a.m. UTC | #2
On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> Automatic item relogging will occur from xfsaild context. xfsaild
> cannot acquire log reservation itself because it is also responsible
> for writeback and thus making used log reservation available again.
> Since there is no guarantee log reservation is available by the time
> a relogged item reaches the AIL, this is prone to deadlock.
> 
> To guarantee log reservation for automatic relogging, implement a
> reservation management scheme where a transaction that is capable of
> enabling relogging of an item must contribute the necessary
> reservation to the relog mechanism up front.

Ooooh, I had wondered where I was going to find that hook. :)

What does it mean to be capable of enabling relogging of an item?

For the quotaoff example, does this mean that all the transactions that
happen on behalf of a quotaoff operation must specify TRANS_RELOG?

What if the relog thread grinds to a halt while other non-RELOG threads
continue to push things into the log?  Can we starve and/or livelock
waiting around?  Or should the log be able to kick some higher level
thread to inject a TRANS_RELOG transaction to move things along?

> Use reference counting
> to associate the lifetime of pending relog reservation to the
> lifetime of in-core log items with relogging enabled.

Ok, so we only have to pay the relog reservation while there are
reloggable items floating around in the system.

> The basic log reservation sequence for a relog enabled transaction
> is as follows:
> 
> - A transaction that uses relogging specifies XFS_TRANS_RELOG at
>   allocation time.
> - Once initialized, RELOG transactions check for the existence of
>   the global relog log ticket. If it exists, grab a reference and
>   return. If not, allocate an empty ticket and install into the relog
>   subsystem. Seed the relog ticket from reservation of the current
>   transaction. Roll the current transaction to replenish its
>   reservation and return to the caller.

I guess we'd have to be careful that the transaction we're stealing from
actually has enough reservation to re-log a pending item, but that
shouldn't be difficult.

I worry that there might be some operation somewhere that Just Works
because tr_logcount * tr_logres is enough space for it to run without
having to get more reservation, but (tr_logcount - 1) * tr_logres isn't
enough.  Though that might not be a big issue seeing how bloated the
log reservations become when reflink and rmap are turned on. <cough>

> - The transaction is used as normal. If an item is relogged in the
>   transaction, that item acquires a reference on the global relog
>   ticket currently held open by the transaction. The item's reference
>   persists until relogging is disabled on the item.
> - The RELOG transaction commits and releases its reference to the
>   global relog ticket. The global relog ticket is released once its
>   reference count drops to zero.
> 
> This provides a central relog log ticket that guarantees reservation
> availability for relogged items, avoids log reservation deadlocks
> and is allocated and released on demand.

Sounds cool.  /me jumps in.

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_shared.h |  1 +
>  fs/xfs/xfs_trans.c         | 37 +++++++++++++---
>  fs/xfs/xfs_trans.h         |  3 ++
>  fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans_priv.h    |  1 +
>  5 files changed, 126 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index c45acbd3add9..0a10ca0853ab 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>   * made then this algorithm will eventually find all the space it needs.
>   */
>  #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
>  
>  /*
>   * Field values for xfs_trans_mod_sb.
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..8ac05ed8deda 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -107,9 +107,14 @@ xfs_trans_dup(
>  
>  	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
>  		       (tp->t_flags & XFS_TRANS_RESERVE) |
> -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> -	/* We gave our writer reference to the new transaction */
> +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> +		       (tp->t_flags & XFS_TRANS_RELOG);
> +	/*
> +	 * The writer reference and relog reference transfer to the new
> +	 * transaction.
> +	 */
>  	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
>  	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
>  
>  	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> @@ -284,15 +289,25 @@ xfs_trans_alloc(
>  	tp->t_firstblock = NULLFSBLOCK;
>  
>  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> -	if (error) {
> -		xfs_trans_cancel(tp);
> -		return error;
> +	if (error)
> +		goto error;
> +
> +	if (flags & XFS_TRANS_RELOG) {
> +		error = xfs_trans_ail_relog_reserve(&tp);
> +		if (error)
> +			goto error;
>  	}
>  
>  	trace_xfs_trans_alloc(tp, _RET_IP_);
>  
>  	*tpp = tp;
>  	return 0;
> +
> +error:
> +	/* clear relog flag if we haven't acquired a ref */
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
> +	xfs_trans_cancel(tp);
> +	return error;
>  }
>  
>  /*
> @@ -973,6 +988,10 @@ __xfs_trans_commit(
>  
>  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>  
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
>  
> @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
>  			error = -EIO;
>  		tp->t_ticket = NULL;
>  	}
> +	/* release the relog ticket reference if this transaction holds one */
> +	/* XXX: handle RELOG items on transaction abort */

"Handle"?  Hm.  Do the reloggable items end up attached in some way to
this new transaction, or are we purely stealing the reservation so that
the ail can use it to relog the items on its own?  If it's the second,
then I wonder what handling do we need to do?

Or maybe you meant handling the relog items that the caller attached to
this relog transaction?  Won't those get cancelled the same way they do
now?

Mechanically this looks reasonable.

--D

> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free_items(tp, !!error);
>  	xfs_trans_free(tp);
> @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
>  		tp->t_ticket = NULL;
>  	}
>  
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +
>  	/* mark this thread as no longer being in a transaction */
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 752c7fef9de7..a032989943bd 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -236,6 +236,9 @@ int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
>  void		xfs_trans_cancel(xfs_trans_t *);
>  int		xfs_trans_ail_init(struct xfs_mount *);
>  void		xfs_trans_ail_destroy(struct xfs_mount *);
> +int		xfs_trans_ail_relog_reserve(struct xfs_trans **);
> +bool		xfs_trans_ail_relog_get(struct xfs_mount *);
> +int		xfs_trans_ail_relog_put(struct xfs_mount *);
>  
>  void		xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *,
>  				       enum xfs_blft);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 00cc5b8734be..a3fb64275baa 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -17,6 +17,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  
>  #ifdef DEBUG
>  /*
> @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
>  		xfs_log_space_wake(ailp->ail_mount);
>  }
>  
> +bool
> +xfs_trans_ail_relog_get(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	bool			ret = false;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		ret = true;
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +	return ret;
> +}
> +
> +/*
> + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> + * requires a clean, permanent transaction from the caller. Pull reservation
> + * for the relog ticket and roll the caller's transaction back to its fully
> + * reserved state. If the AIL relog ticket is already initialized, grab a
> + * reference and return.
> + */
> +int
> +xfs_trans_ail_relog_reserve(
> +	struct xfs_trans	**tpp)
> +{
> +	struct xfs_trans	*tp = *tpp;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> +
> +	if (xfs_trans_ail_relog_get(mp))
> +		return 0;
> +
> +	/* no active ticket, fall into slow path to allocate one.. */
> +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> +	if (!tic)
> +		return -ENOMEM;
> +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> +
> +	/* check again since we dropped the lock for the allocation */
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +		xfs_log_ticket_put(tic);
> +		return 0;
> +	}
> +
> +	/* attach and reserve space for the ->tr_relog ticket */
> +	ailp->ail_relog_tic = tic;
> +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	return xfs_trans_roll(tpp);
> +}
> +
> +/*
> + * Release a reference to the relog ticket.
> + */
> +int
> +xfs_trans_ail_relog_put(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) {
> +		spin_unlock(&ailp->ail_lock);
> +		return 0;
> +	}
> +
> +	ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1);
> +	tic = ailp->ail_relog_tic;
> +	ailp->ail_relog_tic = NULL;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	xfs_log_done(mp, tic, NULL, false);
> +	return 0;
> +}
> +
>  int
>  xfs_trans_ail_init(
>  	xfs_mount_t	*mp)
> @@ -854,6 +942,7 @@ xfs_trans_ail_destroy(
>  {
>  	struct xfs_ail	*ailp = mp->m_ail;
>  
> +	ASSERT(ailp->ail_relog_tic == NULL);
>  	kthread_stop(ailp->ail_task);
>  	kmem_free(ailp);
>  }
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614..839df6559b9f 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -61,6 +61,7 @@ struct xfs_ail {
>  	int			ail_log_flush;
>  	struct list_head	ail_buf_list;
>  	wait_queue_head_t	ail_empty;
> +	struct xlog_ticket	*ail_relog_tic;
>  };
>  
>  /*
> -- 
> 2.21.1
>
Brian Foster Feb. 28, 2020, 1:55 p.m. UTC | #3
On Thu, Feb 27, 2020 at 04:02:18PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> > Automatic item relogging will occur from xfsaild context. xfsaild
> > cannot acquire log reservation itself because it is also responsible
> > for writeback and thus making used log reservation available again.
> > Since there is no guarantee log reservation is available by the time
> > a relogged item reaches the AIL, this is prone to deadlock.
> > 
> > To guarantee log reservation for automatic relogging, implement a
> > reservation management scheme where a transaction that is capable of
> > enabling relogging of an item must contribute the necessary
> > reservation to the relog mechanism up front.
> 
> Ooooh, I had wondered where I was going to find that hook. :)
> 
> What does it mean to be capable of enabling relogging of an item?
> 

It's basically a requirement introduced by this patch that any
transaction that wants to enable relog on an item needs to check whether
the "central relog ticket" already exists, and if not, it needs to
contribute reservation to it as part of its normal transaction
allocation process.

> For the quotaoff example, does this mean that all the transactions that
> happen on behalf of a quotaoff operation must specify TRANS_RELOG?
> 

No, only the transaction that commits the quotaoff start intent.

> What if the relog thread grinds to a halt while other non-RELOG threads
> continue to push things into the log?  Can we starve and/or livelock
> waiting around?  Or should the log be able to kick some higher level
> thread to inject a TRANS_RELOG transaction to move things along?
> 

I hope not. :P At least I haven't seen such issues in my (fairly limited
and focused) testing so far.

> > Use reference counting
> > to associate the lifetime of pending relog reservation to the
> > lifetime of in-core log items with relogging enabled.
> 
> Ok, so we only have to pay the relog reservation while there are
> reloggable items floating around in the system.
> 

Yeah, I was back and forth on this because the relog reservation
tracking thing is a bit of complexity itself. It's much simpler to have
the AIL or something allocate a relog transaction up front and just keep
it around for the lifetime of the mount. My thinking was that relogs
could be quite rare (quotaoff, scrub), so it's probably an unfair use of
resources. I figured it best to try and provide some flexibility up
front and we can always back off to something more simplified later.

If we ended up with something like a reloggable "zero on recovery"
intent for writeback, that ties more into core fs functionality and the
simpler relog ticket implementation might be justified.

> > The basic log reservation sequence for a relog enabled transaction
> > is as follows:
> > 
> > - A transaction that uses relogging specifies XFS_TRANS_RELOG at
> >   allocation time.
> > - Once initialized, RELOG transactions check for the existence of
> >   the global relog log ticket. If it exists, grab a reference and
> >   return. If not, allocate an empty ticket and install into the relog
> >   subsystem. Seed the relog ticket from reservation of the current
> >   transaction. Roll the current transaction to replenish its
> >   reservation and return to the caller.
> 
> I guess we'd have to be careful that the transaction we're stealing from
> actually has enough reservation to re-log a pending item, but that
> shouldn't be difficult.
> 
> I worry that there might be some operation somewhere that Just Works
> because tr_logcount * tr_logres is enough space for it to run without
> having to get more reservation, but (tr_logcount - 1) * tr_logres isn't
> enough.  Though that might not be a big issue seeing how bloated the
> log reservations become when reflink and rmap are turned on. <cough>
> 

This ties in to the size calculation of the relog transaction and the
fact that the contributing transaction rolls before it carries on to
normal use. IOW, this approach adds zero extra reservation consumption
to existing transaction commits, so there should be no additional risk
of reservation overrun in that regard.

> > - The transaction is used as normal. If an item is relogged in the
> >   transaction, that item acquires a reference on the global relog
> >   ticket currently held open by the transaction. The item's reference
> >   persists until relogging is disabled on the item.
> > - The RELOG transaction commits and releases its reference to the
> >   global relog ticket. The global relog ticket is released once its
> >   reference count drops to zero.
> > 
> > This provides a central relog log ticket that guarantees reservation
> > availability for relogged items, avoids log reservation deadlocks
> > and is allocated and released on demand.
> 
> Sounds cool.  /me jumps in.
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_shared.h |  1 +
> >  fs/xfs/xfs_trans.c         | 37 +++++++++++++---
> >  fs/xfs/xfs_trans.h         |  3 ++
> >  fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trans_priv.h    |  1 +
> >  5 files changed, 126 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> > index c45acbd3add9..0a10ca0853ab 100644
> > --- a/fs/xfs/libxfs/xfs_shared.h
> > +++ b/fs/xfs/libxfs/xfs_shared.h
> > @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
> >   * made then this algorithm will eventually find all the space it needs.
> >   */
> >  #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> > +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
> >  
> >  /*
> >   * Field values for xfs_trans_mod_sb.
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 3b208f9a865c..8ac05ed8deda 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -107,9 +107,14 @@ xfs_trans_dup(
> >  
> >  	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
> >  		       (tp->t_flags & XFS_TRANS_RESERVE) |
> > -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> > -	/* We gave our writer reference to the new transaction */
> > +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> > +		       (tp->t_flags & XFS_TRANS_RELOG);
> > +	/*
> > +	 * The writer reference and relog reference transfer to the new
> > +	 * transaction.
> > +	 */
> >  	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> > +	tp->t_flags &= ~XFS_TRANS_RELOG;
> >  	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
> >  
> >  	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> > @@ -284,15 +289,25 @@ xfs_trans_alloc(
> >  	tp->t_firstblock = NULLFSBLOCK;
> >  
> >  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> > -	if (error) {
> > -		xfs_trans_cancel(tp);
> > -		return error;
> > +	if (error)
> > +		goto error;
> > +
> > +	if (flags & XFS_TRANS_RELOG) {
> > +		error = xfs_trans_ail_relog_reserve(&tp);
> > +		if (error)
> > +			goto error;
> >  	}
> >  
> >  	trace_xfs_trans_alloc(tp, _RET_IP_);
> >  
> >  	*tpp = tp;
> >  	return 0;
> > +
> > +error:
> > +	/* clear relog flag if we haven't acquired a ref */
> > +	tp->t_flags &= ~XFS_TRANS_RELOG;
> > +	xfs_trans_cancel(tp);
> > +	return error;
> >  }
> >  
> >  /*
> > @@ -973,6 +988,10 @@ __xfs_trans_commit(
> >  
> >  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> >  
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> > +
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	xfs_trans_free(tp);
> >  
> > @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
> >  			error = -EIO;
> >  		tp->t_ticket = NULL;
> >  	}
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	/* XXX: handle RELOG items on transaction abort */
> 
> "Handle"?  Hm.  Do the reloggable items end up attached in some way to
> this new transaction, or are we purely stealing the reservation so that
> the ail can use it to relog the items on its own?  If it's the second,
> then I wonder what handling do we need to do?
> 

More the latter...

> Or maybe you meant handling the relog items that the caller attached to
> this relog transaction?  Won't those get cancelled the same way they do
> now?
> 

This comment was more of a note to self that when putting this together
I hadn't thought through the abort/shutdown path and whether the code is
correct (i.e., should the transaction cancel relog state? I still need
to test an abort of a relog commit, etc.). That's still something I need
to work through, but I wouldn't read more into the comment than that.

Brian

> Mechanically this looks reasonable.
> 
> --D
> 
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	xfs_trans_free_items(tp, !!error);
> >  	xfs_trans_free(tp);
> > @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
> >  		tp->t_ticket = NULL;
> >  	}
> >  
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> > +
> >  	/* mark this thread as no longer being in a transaction */
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 752c7fef9de7..a032989943bd 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -236,6 +236,9 @@ int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
> >  void		xfs_trans_cancel(xfs_trans_t *);
> >  int		xfs_trans_ail_init(struct xfs_mount *);
> >  void		xfs_trans_ail_destroy(struct xfs_mount *);
> > +int		xfs_trans_ail_relog_reserve(struct xfs_trans **);
> > +bool		xfs_trans_ail_relog_get(struct xfs_mount *);
> > +int		xfs_trans_ail_relog_put(struct xfs_mount *);
> >  
> >  void		xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *,
> >  				       enum xfs_blft);
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 00cc5b8734be..a3fb64275baa 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -17,6 +17,7 @@
> >  #include "xfs_errortag.h"
> >  #include "xfs_error.h"
> >  #include "xfs_log.h"
> > +#include "xfs_log_priv.h"
> >  
> >  #ifdef DEBUG
> >  /*
> > @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
> >  		xfs_log_space_wake(ailp->ail_mount);
> >  }
> >  
> > +bool
> > +xfs_trans_ail_relog_get(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	bool			ret = false;
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	if (ailp->ail_relog_tic) {
> > +		xfs_log_ticket_get(ailp->ail_relog_tic);
> > +		ret = true;
> > +	}
> > +	spin_unlock(&ailp->ail_lock);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> > + * requires a clean, permanent transaction from the caller. Pull reservation
> > + * for the relog ticket and roll the caller's transaction back to its fully
> > + * reserved state. If the AIL relog ticket is already initialized, grab a
> > + * reference and return.
> > + */
> > +int
> > +xfs_trans_ail_relog_reserve(
> > +	struct xfs_trans	**tpp)
> > +{
> > +	struct xfs_trans	*tp = *tpp;
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	struct xlog_ticket	*tic;
> > +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> > +
> > +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> > +
> > +	if (xfs_trans_ail_relog_get(mp))
> > +		return 0;
> > +
> > +	/* no active ticket, fall into slow path to allocate one.. */
> > +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> > +	if (!tic)
> > +		return -ENOMEM;
> > +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> > +
> > +	/* check again since we dropped the lock for the allocation */
> > +	spin_lock(&ailp->ail_lock);
> > +	if (ailp->ail_relog_tic) {
> > +		xfs_log_ticket_get(ailp->ail_relog_tic);
> > +		spin_unlock(&ailp->ail_lock);
> > +		xfs_log_ticket_put(tic);
> > +		return 0;
> > +	}
> > +
> > +	/* attach and reserve space for the ->tr_relog ticket */
> > +	ailp->ail_relog_tic = tic;
> > +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +	return xfs_trans_roll(tpp);
> > +}
> > +
> > +/*
> > + * Release a reference to the relog ticket.
> > + */
> > +int
> > +xfs_trans_ail_relog_put(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	struct xlog_ticket	*tic;
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) {
> > +		spin_unlock(&ailp->ail_lock);
> > +		return 0;
> > +	}
> > +
> > +	ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1);
> > +	tic = ailp->ail_relog_tic;
> > +	ailp->ail_relog_tic = NULL;
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +	xfs_log_done(mp, tic, NULL, false);
> > +	return 0;
> > +}
> > +
> >  int
> >  xfs_trans_ail_init(
> >  	xfs_mount_t	*mp)
> > @@ -854,6 +942,7 @@ xfs_trans_ail_destroy(
> >  {
> >  	struct xfs_ail	*ailp = mp->m_ail;
> >  
> > +	ASSERT(ailp->ail_relog_tic == NULL);
> >  	kthread_stop(ailp->ail_task);
> >  	kmem_free(ailp);
> >  }
> > diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> > index 2e073c1c4614..839df6559b9f 100644
> > --- a/fs/xfs/xfs_trans_priv.h
> > +++ b/fs/xfs/xfs_trans_priv.h
> > @@ -61,6 +61,7 @@ struct xfs_ail {
> >  	int			ail_log_flush;
> >  	struct list_head	ail_buf_list;
> >  	wait_queue_head_t	ail_empty;
> > +	struct xlog_ticket	*ail_relog_tic;
> >  };
> >  
> >  /*
> > -- 
> > 2.21.1
> > 
>
Dave Chinner March 2, 2020, 3:07 a.m. UTC | #4
On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> Automatic item relogging will occur from xfsaild context. xfsaild
> cannot acquire log reservation itself because it is also responsible
> for writeback and thus making used log reservation available again.
> Since there is no guarantee log reservation is available by the time
> a relogged item reaches the AIL, this is prone to deadlock.
> 
> To guarantee log reservation for automatic relogging, implement a
> reservation management scheme where a transaction that is capable of
> enabling relogging of an item must contribute the necessary
> reservation to the relog mechanism up front. Use reference counting
> to associate the lifetime of pending relog reservation to the
> lifetime of in-core log items with relogging enabled.
> 
> The basic log reservation sequence for a relog enabled transaction
> is as follows:
> 
> - A transaction that uses relogging specifies XFS_TRANS_RELOG at
>   allocation time.
> - Once initialized, RELOG transactions check for the existence of
>   the global relog log ticket. If it exists, grab a reference and
>   return. If not, allocate an empty ticket and install into the relog
>   subsystem. Seed the relog ticket from reservation of the current
>   transaction. Roll the current transaction to replenish its
>   reservation and return to the caller.
> - The transaction is used as normal. If an item is relogged in the
>   transaction, that item acquires a reference on the global relog
>   ticket currently held open by the transaction. The item's reference
>   persists until relogging is disabled on the item.
> - The RELOG transaction commits and releases its reference to the
>   global relog ticket. The global relog ticket is released once its
>   reference count drops to zero.
> 
> This provides a central relog log ticket that guarantees reservation
> availability for relogged items, avoids log reservation deadlocks
> and is allocated and released on demand.

Hi Brian,

I've held off commenting immediately on this while I tried to get
the concept of dynamic relogging straight in my head. I couldn't put
my finger on what I thought was wrong - just a nagging feeling that
I'd gone down this path before and it ended in something that didn't
work.

It wasn't until a couple of hours ago that a big cog clunked into
place and I realised this roughly mirrored a path I went down 12 or
13 years ago trying to implement what turned into the CIL. I failed
at least 4 times over 5 years trying to implement delayed logging...

There's a couple of simple code comments below, before what was in my
head seemed to gel together into something slightly more coherent
than "it seems inside out"....

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_shared.h |  1 +
>  fs/xfs/xfs_trans.c         | 37 +++++++++++++---
>  fs/xfs/xfs_trans.h         |  3 ++
>  fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans_priv.h    |  1 +
>  5 files changed, 126 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index c45acbd3add9..0a10ca0853ab 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>   * made then this algorithm will eventually find all the space it needs.
>   */
>  #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
>  
>  /*
>   * Field values for xfs_trans_mod_sb.
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..8ac05ed8deda 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -107,9 +107,14 @@ xfs_trans_dup(
>  
>  	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
>  		       (tp->t_flags & XFS_TRANS_RESERVE) |
> -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> -	/* We gave our writer reference to the new transaction */
> +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> +		       (tp->t_flags & XFS_TRANS_RELOG);
> +	/*
> +	 * The writer reference and relog reference transfer to the new
> +	 * transaction.
> +	 */
>  	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
>  	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
>  
>  	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> @@ -284,15 +289,25 @@ xfs_trans_alloc(
>  	tp->t_firstblock = NULLFSBLOCK;
>  
>  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> -	if (error) {
> -		xfs_trans_cancel(tp);
> -		return error;
> +	if (error)
> +		goto error;
> +
> +	if (flags & XFS_TRANS_RELOG) {
> +		error = xfs_trans_ail_relog_reserve(&tp);
> +		if (error)
> +			goto error;
>  	}

Hmmmm. So we are putting the AIL lock directly into the transaction
reserve path? xfs_trans_reserve() goes out of its way to be
lockless in the fast paths, so if you want this to be a generic
mechanism that any transaction can use, the reservation needs to be
completely lockless.

>  
>  	trace_xfs_trans_alloc(tp, _RET_IP_);
>  
>  	*tpp = tp;
>  	return 0;
> +
> +error:
> +	/* clear relog flag if we haven't acquired a ref */
> +	tp->t_flags &= ~XFS_TRANS_RELOG;
> +	xfs_trans_cancel(tp);
> +	return error;

seems like a case of "only set the flags once you have a reference"?
Then xfs_trans_cancel() can clean up without special cases being
needed anywhere...

>  }
>  
>  /*
> @@ -973,6 +988,10 @@ __xfs_trans_commit(
>  
>  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>  
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);
> +

That looks ... interesting. xfs_trans_ail_relog_put() can call
xfs_log_done(), which means it can do a log write, which means the
commit lsn for this transaction could change. Hence to make a relog
permanent as a result of a sync transaction, we'd need the
commit_lsn of the AIL relog ticket write here, not that of the
original transaction that was written.


>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
>  
> @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
>  			error = -EIO;
>  		tp->t_ticket = NULL;
>  	}
> +	/* release the relog ticket reference if this transaction holds one */
> +	/* XXX: handle RELOG items on transaction abort */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);

This has the potential for log writes to be issued from a
transaction abort context, right?

>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free_items(tp, !!error);
>  	xfs_trans_free(tp);
> @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
>  		tp->t_ticket = NULL;
>  	}
>  
> +	/* release the relog ticket reference if this transaction holds one */
> +	if (tp->t_flags & XFS_TRANS_RELOG)
> +		xfs_trans_ail_relog_put(mp);

And log writes from a cancel seems possible here, too. I don't think
we do this at all right now, so this could have some interesting
unexpected side-effects during error handling. (As if that wasn't
complex enough to begin with!)

> @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
>  		xfs_log_space_wake(ailp->ail_mount);
>  }
>  
> +bool
> +xfs_trans_ail_relog_get(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	bool			ret = false;
> +
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		ret = true;
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +	return ret;
> +}
> +
> +/*
> + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> + * requires a clean, permanent transaction from the caller. Pull reservation
> + * for the relog ticket and roll the caller's transaction back to its fully
> + * reserved state. If the AIL relog ticket is already initialized, grab a
> + * reference and return.
> + */
> +int
> +xfs_trans_ail_relog_reserve(
> +	struct xfs_trans	**tpp)
> +{
> +	struct xfs_trans	*tp = *tpp;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_ail		*ailp = mp->m_ail;
> +	struct xlog_ticket	*tic;
> +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> +
> +	if (xfs_trans_ail_relog_get(mp))
> +		return 0;
> +
> +	/* no active ticket, fall into slow path to allocate one.. */
> +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> +	if (!tic)
> +		return -ENOMEM;
> +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> +
> +	/* check again since we dropped the lock for the allocation */
> +	spin_lock(&ailp->ail_lock);
> +	if (ailp->ail_relog_tic) {
> +		xfs_log_ticket_get(ailp->ail_relog_tic);
> +		spin_unlock(&ailp->ail_lock);
> +		xfs_log_ticket_put(tic);
> +		return 0;
> +	}
> +
> +	/* attach and reserve space for the ->tr_relog ticket */
> +	ailp->ail_relog_tic = tic;
> +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> +	spin_unlock(&ailp->ail_lock);
> +
> +	return xfs_trans_roll(tpp);
> +}

Hmmm.

So before we've even returned from xfs_trans_alloc(), we may have
committed the transaction and burnt an entire transaction's worth of
log space reservation?

IOWs, to support relogging of a single item in a permanent
transaction, we have to increase the log count of the transaction so
that we don't end up running out of reservation space one
transaction before the end of the normal fast path behaviour of the
change being made?

And we do that up front because we're not exactly sure of how much
space the item that needs relogging is going to require, but the AIL
is going to need to hold on to it from as long as it needs to relog
the item.

<GRRRRIIINNNNDDDD>

<CLUNK>

Ahhhhh.

This is the reason why the CIL ended up using a reservation stealing
mechanism for its ticket. Every log reservation takes into account
the worst case log overhead for that single transaction - that means
there are no special up front hooks or changes to the log
reservation to support the CIL, and it also means that the CIL can
steal what it needs from the current ticket at commit time. Then the
CIL can commit at any time, knowing it has enough space to write all
the items it tracks to the log.

This avoided the transaction reservation mechanism from needing to
know anything about the CIL or that it was stealing space from
committing transactions for its own private ticket. It greatly
simplified everything, and it's the reason that the CIL succeeded
where several other attempts to relog items in memory failed....

So....

Why can't the AIL steal the reservation it needs from the current
transaction when the item that needs relogging is being committed?
i.e. we add the "relog overhead" to the permanent transaction that
requires items to be relogged by the AIL, and when that item is
formatted into the CIL we also check to see if it is currently
marked as "reloggable". If it's not, we steal the relogging
reservation from the transaction (essentially the item's formatted
size) and pass it to the AIL's private relog ticket, setting the log
item to be "reloggable" so the reservation isn't stolen over and
over again as the object is relogged in the CIL.

Hence as we commit reloggable items to the CIL, the AIL ticket
reservation grows with each of those items marked as reloggable.
As we relog items to the CIL and the CIL grows its reservation via
the size delta, the AIL reservation can also be updated with the
same delta.

Hence the AIL will always know exactly how much space it needs to relog all
the items it holds for relogging, and because it's been stolen from
the original transaction it is, like the CIL ticket reservation,
considered used space in the log. Hence the log space required for
relogging items via the AIL is correctly accounted for without
needing up front static, per-item reservations.

When the item is logged the final time and the reloggable flag is
removed (e.g. when the quotaoff completes) then we can remove the
reservation from AIL ticket and add it back to the current
transaction. Hence when the final transaction is committed and the
ticket for that transaction is released via xfs_log_done(), the
space the AIL held for relogging the item is also released.

This doesn't require any modifications to the transaction
reservation subsystem, nor the transaction commit/cancel code,
no changes to the log space accounting, etc. All we need to do is
tag items when we log them as reloggable, and in the CIL formatting
of the item pass the formatted size differences to the AIL so it can
steal the reservation it needs.

Then the AIL doesn't need a dynamic relogging ticket. It can just
hold a single ticket from init to teardown by taking an extra
reference to the ticket. When it is running a relog, it can build it
as though it's just a normal permanent transaction without needing
to get a new reservation, and when it rolls the write space consumed
is regranted automatically.

AFAICT, that gets rid of the need for all this ticket reference counting,
null pointer checking, lock juggling and rechecking, etc. It does
not require modification to the transaction reservation path, and we
can account for the relogging on an item by item basis in each
individual transaction reservation....

Thoughts?

-Dave.
Brian Foster March 2, 2020, 6:06 p.m. UTC | #5
On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote:
> On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> > Automatic item relogging will occur from xfsaild context. xfsaild
> > cannot acquire log reservation itself because it is also responsible
> > for writeback and thus making used log reservation available again.
> > Since there is no guarantee log reservation is available by the time
> > a relogged item reaches the AIL, this is prone to deadlock.
> > 
> > To guarantee log reservation for automatic relogging, implement a
> > reservation management scheme where a transaction that is capable of
> > enabling relogging of an item must contribute the necessary
> > reservation to the relog mechanism up front. Use reference counting
> > to associate the lifetime of pending relog reservation to the
> > lifetime of in-core log items with relogging enabled.
> > 
> > The basic log reservation sequence for a relog enabled transaction
> > is as follows:
> > 
> > - A transaction that uses relogging specifies XFS_TRANS_RELOG at
> >   allocation time.
> > - Once initialized, RELOG transactions check for the existence of
> >   the global relog log ticket. If it exists, grab a reference and
> >   return. If not, allocate an empty ticket and install into the relog
> >   subsystem. Seed the relog ticket from reservation of the current
> >   transaction. Roll the current transaction to replenish its
> >   reservation and return to the caller.
> > - The transaction is used as normal. If an item is relogged in the
> >   transaction, that item acquires a reference on the global relog
> >   ticket currently held open by the transaction. The item's reference
> >   persists until relogging is disabled on the item.
> > - The RELOG transaction commits and releases its reference to the
> >   global relog ticket. The global relog ticket is released once its
> >   reference count drops to zero.
> > 
> > This provides a central relog log ticket that guarantees reservation
> > availability for relogged items, avoids log reservation deadlocks
> > and is allocated and released on demand.
> 
> Hi Brian,
> 
> I've held off commenting immediately on this while I tried to get
> the concept of dynamic relogging straight in my head. I couldn't put
> my finger on what I thought was wrong - just a nagging feeling that
> I'd gone down this path before and it ended in something that didn't
> work.
> 

No problem..

> It wasn't until a couple of hours ago that a big cog clunked into
> place and I realised this roughly mirrored a path I went down 12 or
> 13 years ago trying to implement what turned into the CIL. I failed
> at least 4 times over 5 years trying to implement delayed logging...
> 
> There's a couple of simple code comments below, before what was in my
> head seemed to gel together into something slightly more coherent
> than "it seems inside out"....
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_shared.h |  1 +
> >  fs/xfs/xfs_trans.c         | 37 +++++++++++++---
> >  fs/xfs/xfs_trans.h         |  3 ++
> >  fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trans_priv.h    |  1 +
> >  5 files changed, 126 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> > index c45acbd3add9..0a10ca0853ab 100644
> > --- a/fs/xfs/libxfs/xfs_shared.h
> > +++ b/fs/xfs/libxfs/xfs_shared.h
> > @@ -77,6 +77,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
> >   * made then this algorithm will eventually find all the space it needs.
> >   */
> >  #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
> > +#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
> >  
> >  /*
> >   * Field values for xfs_trans_mod_sb.
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 3b208f9a865c..8ac05ed8deda 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -107,9 +107,14 @@ xfs_trans_dup(
> >  
> >  	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
> >  		       (tp->t_flags & XFS_TRANS_RESERVE) |
> > -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
> > -	/* We gave our writer reference to the new transaction */
> > +		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> > +		       (tp->t_flags & XFS_TRANS_RELOG);
> > +	/*
> > +	 * The writer reference and relog reference transfer to the new
> > +	 * transaction.
> > +	 */
> >  	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
> > +	tp->t_flags &= ~XFS_TRANS_RELOG;
> >  	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
> >  
> >  	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
> > @@ -284,15 +289,25 @@ xfs_trans_alloc(
> >  	tp->t_firstblock = NULLFSBLOCK;
> >  
> >  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> > -	if (error) {
> > -		xfs_trans_cancel(tp);
> > -		return error;
> > +	if (error)
> > +		goto error;
> > +
> > +	if (flags & XFS_TRANS_RELOG) {
> > +		error = xfs_trans_ail_relog_reserve(&tp);
> > +		if (error)
> > +			goto error;
> >  	}
> 
> Hmmmm. So we are putting the AIL lock directly into the transaction
> reserve path? xfs_trans_reserve() goes out of its way to be
> lockless in the fast paths, so if you want this to be a generic
> mechanism that any transaction can use, the reservation needs to be
> completely lockless.
> 

Yeah, this is one of those warts mentioned in the cover letter. I wasn't
planning to make it lockless as much as using an independent lock, but I
can take a closer look at that if it still looks like a contention point
after getting through the bigger picture feedback..

> >  
> >  	trace_xfs_trans_alloc(tp, _RET_IP_);
> >  
> >  	*tpp = tp;
> >  	return 0;
> > +
> > +error:
> > +	/* clear relog flag if we haven't acquired a ref */
> > +	tp->t_flags &= ~XFS_TRANS_RELOG;
> > +	xfs_trans_cancel(tp);
> > +	return error;
> 
> seems like a case of "only set the flags once you have a reference"?
> Then xfs_trans_cancel() can clean up without special cases being
> needed anywhere...
> 

Yeah, I suppose that might be a bit more readable.

> >  }
> >  
> >  /*
> > @@ -973,6 +988,10 @@ __xfs_trans_commit(
> >  
> >  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> >  
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> > +
> 
> That looks ... interesting. xfs_trans_ail_relog_put() can call
> xfs_log_done(), which means it can do a log write, which means the
> commit lsn for this transaction could change. Hence to make a relog
> permanent as a result of a sync transaction, we'd need the
> commit_lsn of the AIL relog ticket write here, not that of the
> original transaction that was written.
> 

Hmm.. xfs_log_done() is eventually called for every transaction ticket,
so I don't think it should be doing log writes based on that. The
_ail_relog_put() path is basically just intended to properly terminate
the relog ticket based on the existing transaction completion path. I'd
have to dig a bit further into the code here, but e.g. we don't pass an
iclog via this path so if we were attempting to write I'd probably have
seen assert failures (or worse) by now. Indeed, this is similar to a
cancel of a clean transaction, the difference being this is open-coded
because we only have the relog ticket in this context.

> 
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	xfs_trans_free(tp);
> >  
> > @@ -1004,6 +1023,10 @@ __xfs_trans_commit(
> >  			error = -EIO;
> >  		tp->t_ticket = NULL;
> >  	}
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	/* XXX: handle RELOG items on transaction abort */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> 
> This has the potential for log writes to be issued from a
> transaction abort context, right?
> 
> >  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >  	xfs_trans_free_items(tp, !!error);
> >  	xfs_trans_free(tp);
> > @@ -1064,6 +1087,10 @@ xfs_trans_cancel(
> >  		tp->t_ticket = NULL;
> >  	}
> >  
> > +	/* release the relog ticket reference if this transaction holds one */
> > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > +		xfs_trans_ail_relog_put(mp);
> 
> And log writes from a cancel seems possible here, too. I don't think
> we do this at all right now, so this could have some interesting
> unexpected side-effects during error handling. (As if that wasn't
> complex enough to begin with!)
> 

I'm not quite sure I follow your comments here, though I do still need
to work through all of the error paths and make sure everything is
correct. I've been putting details like that off in favor of getting a
high level design approach worked out first.

> > @@ -818,6 +819,93 @@ xfs_trans_ail_delete(
> >  		xfs_log_space_wake(ailp->ail_mount);
> >  }
> >  
> > +bool
> > +xfs_trans_ail_relog_get(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	bool			ret = false;
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	if (ailp->ail_relog_tic) {
> > +		xfs_log_ticket_get(ailp->ail_relog_tic);
> > +		ret = true;
> > +	}
> > +	spin_unlock(&ailp->ail_lock);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Reserve log space for the automatic relogging ->tr_relog ticket. This
> > + * requires a clean, permanent transaction from the caller. Pull reservation
> > + * for the relog ticket and roll the caller's transaction back to its fully
> > + * reserved state. If the AIL relog ticket is already initialized, grab a
> > + * reference and return.
> > + */
> > +int
> > +xfs_trans_ail_relog_reserve(
> > +	struct xfs_trans	**tpp)
> > +{
> > +	struct xfs_trans	*tp = *tpp;
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	struct xfs_ail		*ailp = mp->m_ail;
> > +	struct xlog_ticket	*tic;
> > +	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
> > +
> > +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> > +
> > +	if (xfs_trans_ail_relog_get(mp))
> > +		return 0;
> > +
> > +	/* no active ticket, fall into slow path to allocate one.. */
> > +	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
> > +	if (!tic)
> > +		return -ENOMEM;
> > +	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
> > +
> > +	/* check again since we dropped the lock for the allocation */
> > +	spin_lock(&ailp->ail_lock);
> > +	if (ailp->ail_relog_tic) {
> > +		xfs_log_ticket_get(ailp->ail_relog_tic);
> > +		spin_unlock(&ailp->ail_lock);
> > +		xfs_log_ticket_put(tic);
> > +		return 0;
> > +	}
> > +
> > +	/* attach and reserve space for the ->tr_relog ticket */
> > +	ailp->ail_relog_tic = tic;
> > +	tp->t_ticket->t_curr_res -= tic->t_curr_res;
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +	return xfs_trans_roll(tpp);
> > +}
> 
> Hmmm.
> 
> So before we've even returned from xfs_trans_alloc(), we may have
> committed the transaction and burnt an entire transaction's worth
> of log space reservation?
> 

Yep.

> IOWs, to support relogging of a single item in a permanent
> transaction, we have to increase the log count of the transaction so
> that we don't end up running out of reservation space one
> transaction before the end of the normal fast path behaviour of the
> change being made?
> 

Right.

> And we do that up front because we're not exactly sure of how much
> space the item that needs relogging is going to require, but the AIL
> is going to need to hold on to it from as long as it needs to relog
> the item.
> 

*nod*

> <GRRRRIIINNNNDDDD>
> 
> <CLUNK>
> 
> Ahhhhh.
> 

Heh. :P

> This is the reason why the CIL ended up using a reservation stealing
> mechanism for its ticket. Every log reservation takes into account
> the worst case log overhead for that single transaction - that means
> there are no special up front hooks or changes to the log
> reservation to support the CIL, and it also means that the CIL can
> steal what it needs from the current ticket at commit time. Then the
> CIL can commit at any time, knowing it has enough space to write all
> the items it tracks to the log.
> 
> This avoided the transaction reservation mechanism from needing to
> know anything about the CIL or that it was stealing space from
> committing transactions for its own private ticket. It greatly
> simplified everything, and it's the reason that the CIL succeeded
> where several other attempts to relog items in memory failed....
> 
> So....
> 
> Why can't the AIL steal the reservation it needs from the current
> transaction when the item that needs relogging is being committed?

I think we can. My first proposal[1] implemented something like this.
The broader design was much less fleshed out, so it was crudely
implemented as more of a commit time hook to relog pending items based
on the currently committing transaction as opposed to reservation
stealing for an independent relog ticket. This was actually based on the
observation that the CIL already did something similar (though I didn't
have all of the background as to why that approach was used for the
CIL).

The feedback at the time was to consider moving in a different direction
that didn't involve stealing reservation from transactions, which gave
me the impression that approach wasn't going to be acceptable. Granted,
the first few variations of this were widely different in approach and I
don't think there was any explicit objection to reservation stealing
itself; it just seemed a better development approach to work out an
ideal design based on fundamentals as opposed to limiting the scope to
contexts that might facilitate reservation stealing. Despite the fact
that this code is hairy, the fundamental approach is rather
straightforward and simple. The hairiness mostly comes from
attempting to make things generic and dynamic and could certainly use
some cleanup.

Now that this is at a point where I'm reasonably confident it is
fundamentally correct, I'm quite happy to revisit a reservation stealing
approach if that improves functional operation. This can serve as a decent
reference/baseline implementation for that, I think.

[1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/

> i.e. we add the "relog overhead" to the permanent transaction that
> requires items to be relogged by the AIL, and when that item is
> formatted into the CIL we also check to see if it is currently
> marked as "reloggable". If it's not, we steal the relogging
> reservation from the transaction (essentially the item's formatted
> size) and pass it to the AIL's private relog ticket, setting the log
> item to be "reloggable" so the reservation isn't stolen over and
> over again as the object is relogged in the CIL.
> 

Ok, so this implies to me we'd update the per-transaction reservation
calculations to incorporate worst case relog overhead for each
transaction. E.g., the quotaoff transaction would x2 the intent size,
a transaction that might relog a buffer adds a max buffer size overhead,
etc. Right?

> Hence as we commit reloggable items to the CIL, the AIL ticket
> reservation grows with each of those items marked as reloggable.
> As we relog items to the CIL and the CIL grows its reservation via
> the size delta, the AIL reservation can also be updated with the
> same delta.
> 
> Hence the AIL will always know exactly how much space it needs to relog all
> the items it holds for relogging, and because it's been stolen from
> the original transaction it is, like the CIL ticket reservation,
> considered used space in the log. Hence the log space required for
> relogging items via the AIL is correctly accounted for without
> needing up front static, per-item reservations.
> 

By this I assume you're referring to avoiding the need to reserve ->
donate -> roll in the current scheme. Instead, we'd acquire the larger
reservation up front and only steal if necessary, which is less
overhead because we don't need to always replenish the full transaction.

Do note that I don't consider the current approach high overhead overall
because the roll only needs to happen when (and if) the first
transaction enables the relog ticket. There is no additional overhead
from that point forward. The "relog overhead" approach sounds fine to me
as well, I'm just noting that there are still some tradeoffs to either
approach.

> When the item is logged the final time and the reloggable flag is
> removed (e.g. when the quotaoff completes) then we can remove the
> reservation from AIL ticket and add it back to the current
> transaction. Hence when the final transaction is committed and the
> ticket for that transaction is released via xfs_log_done(), the
> space the AIL held for relogging the item is also released.
> 

Makes sense.

> This doesn't require any modifications to the transaction
> reservation subsystem, nor the transaction commit/cancel code,
> no changes to the log space accounting, etc. All we need to do is
> tag items when we log them as reloggable, and in the CIL formatting
> of the item pass the formatted size differences to the AIL so it can
> steal the reservation it needs.
> 
> Then the AIL doesn't need a dynamic relogging ticket. It can just
> hold a single ticket from init to teardown by taking an extra
> reference to the ticket. When it is running a relog, it can build it
> as though it's just a normal permanent transaction without needing
> to get a new reservation, and when it rolls the write space consumed
> is regranted automatically.
> 
> AFAICT, that gets rid of the need for all this ticket reference counting,
> null pointer checking, lock juggling and rechecking, etc. It does
> not require modification to the transaction reservation path, and we
> can account for the relogging on an item by item basis in each
> individual transaction reservation....
> 

Yep, I think there's potential for some nice cleanup there. I'd be very
happy to rip out some of guts of the current patch, particularly the
reference counting bits.

> Thoughts?
> 

<thinking out loud from here on..>

One thing that comes to mind thinking about this is dealing with
batching (relog multiple items per roll). This code doesn't handle that
yet, but I anticipate it being a requirement and it's fairly easy to
update the current scheme to support a fixed item count per relog-roll.

A stealing approach potentially complicates things when it comes to
batching because we have per-item reservation granularity to consider.
For example, consider if we had a variety of different relog item types
active at once, a subset are being relogged while another subset are
being disabled (before ever needing to be relogged). For one, we'd have
to be careful to not release reservation while it might be accounted to
an active relog transaction that is currently rolling with some other
items, etc. There's also a potential quirk in handling reservation of a
relogged item that is cancelled while it's being relogged, but that
might be more of an implementation detail.

I don't think that's a show stopper, but rather just something I'd like
to have factored into the design from the start. One option could be to
maintain a separate counter of active relog reservation aside from the
actual relog ticket. That way the relog ticket could just pull from this
relog reservation pool based on the current item(s) being relogged
asynchronously from different tasks that might add or remove reservation
from the pool for separate items. That might get a little wonky when we
consider the relog ticket needs to pull from the pool and then put
something back if the item is still reloggable after the relog
transaction rolls.

Another problem is that reloggable items are still otherwise usable once
they are unlocked. So for example we'd have to account for a situation
where one transaction dirties a buffer, enables relogging, commits and
then some other transaction dirties more of the buffer and commits
without caring whether the buffer was relog enabled or not. Unless
runtime relog reservation is always worst case, that's a subtle path to
reservation overrun in the relog transaction.

I'm actually wondering if a simpler approach to tracking stolen
reservation is to add a new field that tracks active relog reservation
to each supported log item. Then the initial transaction enables
relogging with a simple transfer from the transaction to the item. The
relog transaction knows how much reservation it can assign based on the
current population of items and requires no further reservation
accounting because it rolls and thus automatically reacquires relog
reservation for each associated item. The path that clears relog state
transfers res from the item back to a transaction simply so the
reservation can be released back to the pool of unused log space. Note
that clearing relog state doesn't require a transaction in the current
implementation, but we could easily define a helper to allocate an empty
transaction, clear relog and reclaim relog reservation and then cancel
for those contexts.

Thoughts on any of the above appreciated. I need to think about this
some more but otherwise I'll attempt some form of a res stealing
approach for the next iteration...

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 2, 2020, 11:25 p.m. UTC | #6
On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote:
> > On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> > > Automatic item relogging will occur from xfsaild context. xfsaild
> > > cannot acquire log reservation itself because it is also responsible
> > > for writeback and thus making used log reservation available again.
> > > Since there is no guarantee log reservation is available by the time
> > > a relogged item reaches the AIL, this is prone to deadlock.

.....

> > > @@ -284,15 +289,25 @@ xfs_trans_alloc(
> > >  	tp->t_firstblock = NULLFSBLOCK;
> > >  
> > >  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> > > -	if (error) {
> > > -		xfs_trans_cancel(tp);
> > > -		return error;
> > > +	if (error)
> > > +		goto error;
> > > +
> > > +	if (flags & XFS_TRANS_RELOG) {
> > > +		error = xfs_trans_ail_relog_reserve(&tp);
> > > +		if (error)
> > > +			goto error;
> > >  	}
> > 
> > Hmmmm. So we are putting the AIL lock directly into the transaction
> reserve path? xfs_trans_reserve() goes out of its way to be
> > lockless in the fast paths, so if you want this to be a generic
> > mechanism that any transaction can use, the reservation needs to be
> > completely lockless.
> > 
> 
> Yeah, this is one of those warts mentioned in the cover letter. I wasn't
> planning to make it lockless as much as using an independent lock, but I
> can take a closer look at that if it still looks like a contention point
> after getting through the bigger picture feedback..

If relogging ends up getting widely used, then a filesystem global
lock in the transaction reserve path is guaranteed to be a
contention point. :)

> > > +	/* release the relog ticket reference if this transaction holds one */
> > > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > > +		xfs_trans_ail_relog_put(mp);
> > > +
> > 
> > That looks ... interesting. xfs_trans_ail_relog_put() can call
> > xfs_log_done(), which means it can do a log write, which means the
> > commit lsn for this transaction could change. Hence to make a relog
> > permanent as a result of a sync transaction, we'd need the
> > commit_lsn of the AIL relog ticket write here, not that of the
> > original transaction that was written.
> > 
> 
> Hmm.. xfs_log_done() is eventually called for every transaction ticket,
> so I don't think it should be doing log writes based on that.

  xfs_log_done()
    xlog_commit_record()
      xlog_write(XLOG_COMMIT_TRANS)
        xlog_state_release_iclog()
	  xlog_sync()
	    xlog_write_iclog()
	      submit_bio()
>
> The
> _ail_relog_put() path is basically just intended to properly terminate
> the relog ticket based on the existing transaction completion path. I'd
> have to dig a bit further into the code here, but e.g. we don't pass an
> iclog via this path so if we were attempting to write I'd probably have
> seen assert failures (or worse) by now.

xlog_write() code handles a missing commit iclog pointer just fine.
Indeed, that's the reason xlog_write() may issue log IO itself:

....
        spin_lock(&log->l_icloglock);
        xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
        if (commit_iclog) {
                ASSERT(flags & XLOG_COMMIT_TRANS);
                *commit_iclog = iclog;
        } else {
                error = xlog_state_release_iclog(log, iclog);
        }
        spin_unlock(&log->l_icloglock);

i.e. if you don't provide xlog_write with a commit_iclog pointer
then you are saying "caller does not call xfs_log_release_iclog()
itself, so please release the iclog once the log iovec has been
written to the iclog." Hence the way you've written this code
explicitly tells xlog_write() to issue log IO on your behalf if it is
necessary....

> Indeed, this is similar to a
> cancel of a clean transaction, the difference being this is open-coded
> because we only have the relog ticket in this context.

A clean transaction has XLOG_TIC_INITED set, so xfs_log_done() will
never call xlog_commit_record() on it. XLOG_TIC_INITED is used to
indicate a start record has been written to the log for a permanent
transaction so it doesn't get written again when it is relogged.....

.... Hmmmm .....

You might be right.

It looks like delayed logging made XLOG_TIC_INITED completely
redundant. xfs_trans_commit() no longer writes directly to iclogs,
so the normal transaction tickets will never have XLOG_TIC_INITED
cleared. Hence on any kernel since delayed logging was the only
option, we will never get log writes from the xfs_trans* interfaces.

And we no longer do an xlog_write() call for every item in the
transaction after we format them into a log iovec. Instead, the CIL
just passes one big list of iovecs to a single xlog_write() call,
so all callers only make a single xlog_write call per physical log
transaction.

OK, XLOG_TIC_INITED is redundant, and should be removed. And
xfs_log_done() needs to be split into two, one for releasing the
ticket, one for completing the xlog_write() call. Compile tested
only patch below for you :P

> > <GRRRRIIINNNNDDDD>
> > 
> > <CLUNK>
> > 
> > Ahhhhh.
> > 
> 
> Heh. :P
> 
> > This is the reason why the CIL ended up using a reservation stealing
> > mechanism for its ticket. Every log reservation takes into account
> > the worst case log overhead for that single transaction - that means
> > there are no special up front hooks or changes to the log
> > reservation to support the CIL, and it also means that the CIL can
> > steal what it needs from the current ticket at commit time. Then the
> > CIL can commit at any time, knowing it has enough space to write all
> > the items it tracks to the log.
> > 
> > This avoided the transaction reservation mechanism from needing to
> > know anything about the CIL or that it was stealing space from
> > committing transactions for its own private ticket. It greatly
> > simplified everything, and it's the reason that the CIL succeeded
> > where several other attempts to relog items in memory failed....
> > 
> > So....
> > 
> > Why can't the AIL steal the reservation it needs from the current
> > transaction when the item that needs relogging is being committed?
> 
> I think we can. My first proposal[1] implemented something like this.
> The broader design was much less fleshed out, so it was crudely
> implemented as more of a commit time hook to relog pending items based
> on the currently committing transaction as opposed to reservation
> stealing for an independent relog ticket. This was actually based on the
> observation that the CIL already did something similar (though I didn't
> have all of the background as to why that approach was used for the
> CIL).
> 
> The feedback at the time was to consider moving in a different direction
> that didn't involve stealing reservation from transactions, which gave
> me the impression that approach wasn't going to be acceptable. Granted,

I don't recall that discussion steering away from stealing
reservations; it was more about mechanics like using transaction
callbacks and other possible mechanisms for enabling relogging.

> the first few variations of this were widely different in approach and I
> don't think there was any explicit objection to reservation stealing
> itself, it just seemed a better development approach to work out an
> ideal design based on fundamentals as opposed to limiting the scope to
> contexts that might facilitate reservation stealing.

Right, it wasn't clear how to best implement this at the time,
because we were all just struggling to get our heads around the
problem scope. :)

> Despite the fact
> that this code is hairy, the fundamental approach is rather
> straightforward and simplistic. The hairiness mostly comes from
> attempting to make things generic and dynamic and could certainly use
> some cleanup.
> 
> Now that this is at a point where I'm reasonably confident it is
> fundamentally correct, I'm quite happy to revisit a reservation stealing
> approach if that improves functional operation. This can serve as a decent
> reference/baseline implementation for that, I think.

*nod*

> 
> [1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/
> 
> > i.e. we add the "relog overhead" to the permanent transaction that
> > requires items to be relogged by the AIL, and when that item is
> > formatted into the CIL we also check to see if it is currently
> > marked as "reloggable". If it's not, we steal the relogging
> > reservation from the transaction (essentially the item's formatted
> > size) and pass it to the AIL's private relog ticket, setting the log
> > item to be "reloggable" so the reservation isn't stolen over and
> > over again as the object is relogged in the CIL.
> > 
> 
> Ok, so this implies to me we'd update the per-transaction reservation
> calculations to incorporate worst case relog overhead for each
> transaction. E.g., the quotaoff transaction would x2 the intent size,
> a transaction that might relog a buffer adds a max buffer size overhead,
> etc. Right?

Essentially, yes.

And the reloggable items and reservation can be clearly documented
in the per-transaction reservation explanation comments like we
already do for all the structures that are modified in a
transaction.

> > Hence as we commit reloggable items to the CIL, the AIL ticket
> > reservation grows with each of those items marked as reloggable.
> > As we relog items to the CIL and the CIL grows its reservation via
> > the size delta, the AIL reservation can also be updated with the
> > same delta.
> > 
> > Hence the AIL will always know exactly how much space it needs to relog all
> > the items it holds for relogging, and because it's been stolen from
> > the original transaction it is, like the CIL ticket reservation,
> > considered used space in the log. Hence the log space required for
> > relogging items via the AIL is correctly accounted for without
> > needing up front static, per-item reservations.
> > 
> 
> By this I assume you're referring to avoiding the need to reserve ->
> donate -> roll in the current scheme. Instead, we'd acquire the larger
> reservation up front and only steal if necessary, which is less
> overhead because we don't need to always replenish the full transaction.

Yes - we acquire the initial relog reservation by stealing it
from the current transaction. This gives the AIL ticket the reserved
space (both reserve and write grant space) it needs to relog the
item once.

When the item is relogged by the AIL, the transaction commit
immediately regrants the reservation space that was just consumed,
and the trans_roll regrants the write grant space the commit just
consumed from the ticket via its call to xfs_trans_reserve(). Hence
we only need to steal the space necessary to do the first relogging
of the item as the AIL will hold that reservation until the high
level code turns off relogging for that log item.

> > Thoughts?
> 
> <thinking out loud from here on..>
> 
> One thing that comes to mind thinking about this is dealing with
> batching (relog multiple items per roll). This code doesn't handle that
> yet, but I anticipate it being a requirement and it's fairly easy to
> update the current scheme to support a fixed item count per relog-roll.
> 
> A stealing approach potentially complicates things when it comes to
> batching because we have per-item reservation granularity to consider.
> For example, consider if we had a variety of different relog item types
> active at once, a subset are being relogged while another subset are
> being disabled (before ever needing to be relogged).

So concurrent "active relogging" + "start relogging" + "stop
relogging"?

> For one, we'd have
> to be careful to not release reservation while it might be accounted to
> an active relog transaction that is currently rolling with some other
> items, etc. There's also a potential quirk in handling reservation of a
> relogged item that is cancelled while it's being relogged, but that
> might be more of an implementation detail.

Right, did you notice the ail->ail_relog_lock rwsem that I wrapped
my example relog transaction item add loop + commit function in?

i.e. while we are building and committing a relog transaction, we
hold off the transactions that are trying to add/remove their items
to/from the relog list. Hence the reservation stealing accounting in
the ticket can be serialised against the transactional use of the
ticket.

It's basically the same method we use for serialising addition to
the CIL in transaction commit against CIL pushes draining the
current list for log writes (rwsem for add/push serialisation, spin
lock for concurrent add serialisation under the rwsem).

> I don't think that's a show stopper, but rather just something I'd like
> to have factored into the design from the start. One option could be to

*nod*

> maintain a separate counter of active relog reservation aside from the
> actual relog ticket. That way the relog ticket could just pull from this
> relog reservation pool based on the current item(s) being relogged
> asynchronously from different tasks that might add or remove reservation
> from the pool for separate items. That might get a little wonky when we
> consider the relog ticket needs to pull from the pool and then put
> something back if the item is still reloggable after the relog
> transaction rolls.

Right, that's the whole problem that we solve via a) stealing the
initial reserve/write grant space at commit time and b) serialising
stealing vs transactional use of the ticket.

That is, after the roll, the ticket has a full reserve and grant
space reservation for all the items accounted to the relog ticket.
Every new relog item added to the ticket (or relogged in the CIL
at a larger size) adds the full required reserve/write grant
space to the relog ticket. Hence the relog ticket always has
current log space reserved to commit the entire set of items tagged
as reloggable. And by avoiding modifying the ticket while we are
actively processing the relog transaction, we don't screw up the
ticket accounting in the middle of the transaction....

> Another problem is that reloggable items are still otherwise usable once
> they are unlocked. So for example we'd have to account for a situation
> where one transaction dirties a buffer, enables relogging, commits and
> then some other transaction dirties more of the buffer and commits
> without caring whether the buffer was relog enabled or not.

yup, that's the delta size updates from the CIL commit. i.e. if we
relog an item to the CIL that has the XFS_LI_RELOG flag already set
on it, the change in size that we steal for the CIL ticket also
needs to be stolen for the AIL ticket. i.e. we already do almost all
the work we need to handle this.

> Unless
> runtime relog reservation is always worst case, that's a subtle path to
> reservation overrun in the relog transaction.

Yes, but it's a problem the CIL already solves for us :P

> I'm actually wondering if a more simple approach to tracking stolen
> reservation is to add a new field that tracks active relog reservation
> to each supported log item.  Then the initial transaction enables
> relogging with a simple transfer from the transaction to the item. The
> relog transaction knows how much reservation it can assign based on the
> current population of items and requires no further reservation
> accounting because it rolls and thus automatically reacquires relog
> reservation for each associated item. The path that clears relog state
> transfers res from the item back to a transaction simply so the
> reservation can be released back to the pool of unused log space. Note
> that clearing relog state doesn't require a transaction in the current
> implementation, but we could easily define a helper to allocate an empty
> transaction, clear relog and reclaim relog reservation and then cancel
> for those contexts.

I don't think we need any of that - the AIL ticket only needs to be
kept up to date with the changes to the formatted size of the item
marked for relogging. It's no different to the CIL ticket
reservation accounting from that perspective.

Cheers,

Dave.
Dave Chinner March 3, 2020, 4:07 a.m. UTC | #7
On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> OK, XLOG_TIC_INITED is redundant, and should be removed. And
> xfs_log_done() needs to be split into two, one for releasing the
> ticket, one for completing the xlog_write() call. Compile tested
> only patch below for you :P

And now with sample patch.

-Dave.
Brian Foster March 3, 2020, 2:13 p.m. UTC | #8
On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote:
> > > On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote:
> > > > Automatic item relogging will occur from xfsaild context. xfsaild
> > > > cannot acquire log reservation itself because it is also responsible
> > > > for writeback and thus making used log reservation available again.
> > > > Since there is no guarantee log reservation is available by the time
> > > > a relogged item reaches the AIL, this is prone to deadlock.
> 
> .....
> 
> > > > @@ -284,15 +289,25 @@ xfs_trans_alloc(
> > > >  	tp->t_firstblock = NULLFSBLOCK;
> > > >  
> > > >  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
> > > > -	if (error) {
> > > > -		xfs_trans_cancel(tp);
> > > > -		return error;
> > > > +	if (error)
> > > > +		goto error;
> > > > +
> > > > +	if (flags & XFS_TRANS_RELOG) {
> > > > +		error = xfs_trans_ail_relog_reserve(&tp);
> > > > +		if (error)
> > > > +			goto error;
> > > >  	}
> > > 
> > > Hmmmm. So we are putting the AIL lock directly into the transaction
> > > reserve path? xfs_trans_reserve() goes out of its way to be
> > > lockless in the fast paths, so if you want this to be a generic
> > > mechanism that any transaction can use, the reservation needs to be
> > > completely lockless.
> > > 
> > 
> > Yeah, this is one of those warts mentioned in the cover letter. I wasn't
> > planning to make it lockless as much as using an independent lock, but I
> > can take a closer look at that if it still looks like a contention point
> > after getting through the bigger picture feedback..
> 
> If relogging ends up getting widely used, then a filesystem global
> lock in the transaction reserve path is guaranteed to be a
> contention point. :)
> 

I was thinking yesterday that I had implemented a fast path there.  I
intended to, but looking at the code I realize I hadn't got to that yet.
So... I see your point and this will need to be addressed if it doesn't
fall away via the relog res rework.

> > > > +	/* release the relog ticket reference if this transaction holds one */
> > > > +	if (tp->t_flags & XFS_TRANS_RELOG)
> > > > +		xfs_trans_ail_relog_put(mp);
> > > > +
> > > 
> > > That looks ... interesting. xfs_trans_ail_relog_put() can call
> > > xfs_log_done(), which means it can do a log write, which means the
> > > commit lsn for this transaction could change. Hence to make a relog
> > > permanent as a result of a sync transaction, we'd need the
> > > commit_lsn of the AIL relog ticket write here, not that of the
> > > original transaction that was written.
> > > 
> > 
> > Hmm.. xfs_log_done() is eventually called for every transaction ticket,
> > so I don't think it should be doing log writes based on that.
> 
>   xfs_log_done()
>     xlog_commit_record()
>       xlog_write(XLOG_COMMIT_TRANS)
>         xlog_state_release_iclog()
> 	  xlog_sync()
> 	    xlog_write_iclog()
> 	      submit_bio()

Yeah, I know that we _can_ issue writes from this path because it's
shared between the CIL log ticket path and the per-transaction log
ticket path. I'm just not convinced that happens from the transaction
path simply because if it did, it would seem like a potentially
noticeable bug.

> >
> > The
> > _ail_relog_put() path is basically just intended to properly terminate
> > the relog ticket based on the existing transaction completion path. I'd
> > have to dig a bit further into the code here, but e.g. we don't pass an
> > iclog via this path so if we were attempting to write I'd probably have
> > seen assert failures (or worse) by now.
> 
> xlog_write() code handles a missing commit iclog pointer just fine.
> Indeed, that's the reason xlog_write() may issue log IO itself:
> 
> ....
>         spin_lock(&log->l_icloglock);
>         xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
>         if (commit_iclog) {
>                 ASSERT(flags & XLOG_COMMIT_TRANS);
>                 *commit_iclog = iclog;
>         } else {
>                 error = xlog_state_release_iclog(log, iclog);
>         }
>         spin_unlock(&log->l_icloglock);
> 
> i.e. if you don't provide xlog_write with a commit_iclog pointer
> then you are saying "caller does not call xfs_log_release_iclog()
> itself, so please release the iclog once the log iovec has been
> written to the iclog." Hence the way you've written this code
> explicitly tells xlog_write() to issue log IO on your behalf if it is
> necessary....
> 

xlog_commit_record(), which is the xlog_write() caller in this case,
expects a valid iclog and (unconditionally) asserts otherwise. I would
expect to have seen that assert had this happened.

> > Indeed, this is similar to a
> > cancel of a clean transaction, the difference being this is open-coded
> > because we only have the relog ticket in this context.
> 
> A clean transaction has XLOG_TIC_INITED set, so xfs_log_done() will
> never call xlog_commit_record() on it. XLOG_TIC_INITED is used to
> indicate a start record has been written to the log for a permanent
> transaction so it doesn't get written again when it is relogged.....
> 
> .... Hmmmm .....
> 
> You might be right.
> 

I figured it was related to the _INITED flag since that obviously gates
the call to xlog_commit_record(), but when I've taken a quick look at
that code in the past I wasn't really able to make much sense of its
purpose (i.e. why the behavior differs between CIL and transaction log
tickets) and didn't dig any further than that.

> It looks like delayed logging made XLOG_TIC_INITED completely
> redundant. xfs_trans_commit() no longer writes directly to iclogs,
> so the normal transaction tickets will never have XLOG_TIC_INITED
> cleared. Hence on any kernel since delayed logging was the only
> option, we will never get log writes from the xfs_trans* interfaces.
> 
> And we no longer do an xlog_write() call for every item in the
> transaction after we format them into a log iovec. Instead, the CIL
> just passes one big list of iovecs to a single xlog_write() call,
> so all callers only make a single xlog_write call per physical log
> transaction.
> 
> OK, XLOG_TIC_INITED is redundant, and should be removed. And
> xfs_log_done() needs to be split into two, one for releasing the
> ticket, one for completing the xlog_write() call. Compile tested
> only patch below for you :P
> 

Heh, Ok. Will take a look.

> > > <GRRRRIIINNNNDDDD>
> > > 
> > > <CLUNK>
> > > 
> > > Ahhhhh.
> > > 
> > 
> > Heh. :P
> > 
> > > This is the reason why the CIL ended up using a reservation stealing
> > > mechanism for its ticket. Every log reservation takes into account
> > > the worst case log overhead for that single transaction - that means
> > > there are no special up front hooks or changes to the log
> > > reservation to support the CIL, and it also means that the CIL can
> > > steal what it needs from the current ticket at commit time. Then the
> > > CIL can commit at any time, knowing it has enough space to write all
> > > the items it tracks to the log.
> > > 
> > > This avoided the transaction reservation mechanism from needing to
> > > know anything about the CIL or that it was stealing space from
> > > committing transactions for its own private ticket. It greatly
> > > simplified everything, and it's the reason that the CIL succeeded
> > > where several other attempts to relog items in memory failed....
> > > 
> > > So....
> > > 
> > > Why can't the AIL steal the reservation it needs from the current
> > > transaction when the item that needs relogging is being committed?
> > 
> > I think we can. My first proposal[1] implemented something like this.
> > The broader design was much less fleshed out, so it was crudely
> > implemented as more of a commit time hook to relog pending items based
> > on the currently committing transaction as opposed to reservation
> > stealing for an independent relog ticket. This was actually based on the
> > observation that the CIL already did something similar (though I didn't
> > have all of the background as to why that approach was used for the
> > CIL).
> > 
> > The feedback at the time was to consider moving in a different direction
> > that didn't involve stealing reservation from transactions, which gave
> > me the impression that approach wasn't going to be acceptable. Granted,
> 
> I don't recall that discussion steering away from stealing
> reservations; it was more about mechanics like using transaction
> callbacks and other possible mechanisms for enabling relogging.
> 
> > the first few variations of this were widely different in approach and I
> > don't think there was any explicit objection to reservation stealing
> > itself, it just seemed a better development approach to work out an
> > ideal design based on fundamentals as opposed to limiting the scope to
> > contexts that might facilitate reservation stealing.
> 
> Right, it wasn't clear how to best implement this at the time,
> because we were all just struggling to get our heads around the
> problem scope. :)
> 
> > Despite the fact
> > that this code is hairy, the fundamental approach is rather
> > straightforward and simplistic. The hairiness mostly comes from
> > attempting to make things generic and dynamic and could certainly use
> > some cleanup.
> > 
> > Now that this is at a point where I'm reasonably confident it is
> > fundamentally correct, I'm quite happy to revisit a reservation stealing
> > approach if that improves functional operation. This can serve a decent
> > reference/baseline implementation for that, I think.
> 
> *nod*
> 
> > 
> > [1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/
> > 
> > > i.e. we add the "relog overhead" to the permanent transaction that
> > > requires items to be relogged by the AIL, and when that item is
> > > formatted into the CIL we also check to see if it is currently
> > > marked as "reloggable". If it's not, we steal the relogging
> > > reservation from the transaction (essentially the item's formatted
> > > size) and pass it to the AIL's private relog ticket, setting the log
> > > item to be "reloggable" so the reservation isn't stolen over and
> > > over again as the object is relogged in the CIL.
> > > 
> > 
> > Ok, so this implies to me we'd update the per-transaction reservation
> > calculations to incorporate worst case relog overhead for each
> > transaction. E.g., the quotaoff transaction would x2 the intent size,
> > a transaction that might relog a buffer adds a max buffer size overhead,
> > etc. Right?
> 
> Essentially, yes.
> 
> And the reloggable items and reservation can be clearly documented
> in the per-transaction reservation explanation comments like we
> already do for all the structures that are modified in a
> transaction.
> 

Ok.
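To make sure we mean the same thing by "add the relog overhead to the
reservation calculation", here's roughly what I'd picture, as a toy
userspace sketch. The tr_* helper names mimic the kernel convention
but are entirely hypothetical, not actual XFS code:

```c
/* Hypothetical reservation helpers: double the intent size for
 * quotaoff (the intent itself plus one AIL relog of it), and add one
 * worst-case buffer for a transaction that may relog a buffer. */
static unsigned int tr_relog_quotaoff(unsigned int intent_size)
{
	/* the intent itself + space for the AIL to relog it once */
	return 2 * intent_size;
}

static unsigned int tr_relog_buf(unsigned int base_res,
				 unsigned int max_buf_size)
{
	/* base reservation + worst-case relog of one buffer */
	return base_res + max_buf_size;
}
```

If that matches your thinking, the per-transaction comments would just
document which reloggable items contribute which terms.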

> > > Hence as we commit reloggable items to the CIL, the AIL ticket
> > > reservation grows with each of those items marked as reloggable.
> > > As we relog items to the CIL and the CIL grows its reservation via
> > > the size delta, the AIL reservation can also be updated with the
> > > same delta.
> > > 
> > > Hence the AIL will always know exactly how much space it needs to relog all
> > > the items it holds for relogging, and because it's been stolen from
> > > the original transaction it is, like the CIL ticket reservation,
> > > considered used space in the log. Hence the log space required for
> > > relogging items via the AIL is correctly accounted for without
> > > needing up front static, per-item reservations.
> > > 
> > 
> > By this I assume you're referring to avoiding the need to reserve ->
> > donate -> roll in the current scheme. Instead, we'd acquire the larger
> > reservation up front and only steal if necessary, which is less
> > overhead because we don't need to always replenish the full transaction.
> 
> Yes - we acquire the initial relog reservation by stealing it
> from the current transaction. This gives the AIL ticket the reserved
> space (both reserve and write grant space) it needs to relog the
> item once.
> 
> When the item is relogged by the AIL, the transaction commit
> immediately regrants the reservation space that was just consumed,
> and the trans_roll regrants the write grant space the commit just
> consumed from the ticket via its call to xfs_trans_reserve(). Hence
> we only need to steal the space necessary to do the first relogging
> of the item as the AIL will hold that reservation until the high
> level code turns off relogging for that log item.
> 

Yep.
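Just to double check I'm reading the grant accounting right, a toy
userspace model of that "steal once, roll regrants" flow (all names
are mine, not actual XFS code or grant head arithmetic):

```c
/* Toy model: the steal moves the item's formatted size from the
 * committing transaction's ticket to the AIL relog ticket once;
 * after that, each relog cycle's commit consumes write grant space
 * and the roll regrants it, so the relog ticket stays fully
 * reserved for the items it tracks. */
struct toy_ticket {
	int reserve_space;	/* reserve grant space */
	int write_space;	/* write grant space */
};

/* Commit-time steal from the enabling transaction. */
static void relog_steal(struct toy_ticket *trans, struct toy_ticket *relog,
			int size)
{
	trans->reserve_space -= size;
	trans->write_space -= size;
	relog->reserve_space += size;
	relog->write_space += size;
}

/* One relog cycle: commit consumes write grant space for the item,
 * then the roll (xfs_trans_reserve) regrants it. */
static void relog_cycle(struct toy_ticket *relog, int size)
{
	relog->write_space -= size;	/* commit writes the item */
	relog->write_space += size;	/* roll regrants the space */
}
```

i.e. the steal is a one-time cost and the cycles are self-sustaining
until relogging is turned off for the item.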

> > > Thoughts?
> > 
> > <thinking out loud from here on..>
> > 
> > One thing that comes to mind thinking about this is dealing with
> > batching (relog multiple items per roll). This code doesn't handle that
> > yet, but I anticipate it being a requirement and it's fairly easy to
> > update the current scheme to support a fixed item count per relog-roll.
> > 
> > A stealing approach potentially complicates things when it comes to
> > batching because we have per-item reservation granularity to consider.
> > For example, consider if we had a variety of different relog item types
> > active at once, a subset are being relogged while another subset are
> > being disabled (before ever needing to be relogged).
> 
> So concurrent "active relogging" + "start relogging" + "stop
> relogging"?
> 

Yeah..

> > For one, we'd have
> > to be careful to not release reservation while it might be accounted to
> > an active relog transaction that is currently rolling with some other
> > items, etc. There's also a potential quirk in handling reservation of a
> > relogged item that is cancelled while it's being relogged, but that
> > might be more of an implementation detail.
> 
> Right, did you notice the ail->ail_relog_lock rwsem that I wrapped
> my example relog transaction item add loop + commit function in?
> 

Yes, but the side effects didn't quite register..

> i.e. while we are building and committing a relog transaction, we
> hold off the transactions that are trying to add/remove their items
> to/from the relog list. Hence the reservation stealing accounting in
> the ticket can be serialised against the transactional use of the
> ticket.
> 

Hmm, so this changes relog item/state management as well as the
reservation management. That's probably why it wasn't clear to me from
the code example.

> It's basically the same method we use for serialising addition to
> the CIL in transaction commit against CIL pushes draining the
> current list for log writes (rwsem for add/push serialisation, spin
> lock for concurrent add serialisation under the rwsem).
> 

Ok. The difference with the CIL is that in that context we're processing
a single, constantly maintained list of items. As of right now, there is
no such list for relog enabled items. If I'm following correctly, that's
still not necessary here, we're just talking about wrapping a rw lock
around the state management (+ res accounting) and the actual res
consumption such that a relog cancel doesn't steal reservation from an
active relog transaction, for example.

> > I don't think that's a show stopper, but rather just something I'd like
> > to have factored into the design from the start. One option could be to
> 
> *nod*
> 
> > maintain a separate counter of active relog reservation aside from the
> > actual relog ticket. That way the relog ticket could just pull from this
> > relog reservation pool based on the current item(s) being relogged
> > asynchronously from different tasks that might add or remove reservation
> > from the pool for separate items. That might get a little wonky when we
> > consider the relog ticket needs to pull from the pool and then put
> > something back if the item is still reloggable after the relog
> > transaction rolls.
> 
> Right, that's the whole problem that we solve via a) stealing the
> initial reserve/write grant space at commit time and b) serialising
> stealing vs transactional use of the ticket.
> 
> That is, after the roll, the ticket has a full reserve and grant
> space reservation for all the items accounted to the relog ticket.
> Every new relog item added to the ticket (or is relogged in the CIL
> and uses more space) adds the full required reserve/write grant
> space to the relog ticket. Hence the relog ticket always has
> current log space reserved to commit the entire set of items tagged
> as reloggable. And by avoiding modifying the ticket while we are
> actively processing the relog transaction, we don't screw up the
> ticket accounting in the middle of the transaction....
> 

Yep, Ok.

> > Another problem is that reloggable items are still otherwise usable once
> > they are unlocked. So for example we'd have to account for a situation
> > where one transaction dirties a buffer, enables relogging, commits and
> > then some other transaction dirties more of the buffer and commits
> > without caring whether the buffer was relog enabled or not.
> 
> yup, that's the delta size updates from the CIL commit. i.e. if we
> relog an item to the CIL that has the XFS_LI_RELOG flag already set
> on it, the change in size that we steal for the CIL ticket also
> needs to be stolen for the AIL ticket. i.e. we already do almost all
> the work we need to handle this.
> 
> > Unless
> > runtime relog reservation is always worst case, that's a subtle path to
> > reservation overrun in the relog transaction.
> 
> Yes, but it's a problem the CIL already solves for us :P
> 

Ok, I think we're pretty close to the same page here. I was thinking
about the worst case relog reservation being pulled off the committing
transaction unconditionally, where I think you're thinking about it as
the transaction (i.e. reservation calculation) would have the worst case
reservation, but we'd only pull off the delta as needed at commit time
(i.e. exactly how the CIL works wrt transaction reservation
consumption). Let me work through a simple example to try and
(in)validate my concern:

- Transaction A is relog enabled, dirties 50% of a buffer and enables
  auto relogging. On commit, the CIL takes buffer/2 reservation for the
  log vector and the relog mechanism takes the same amount for
  subsequent relogs.
- Transaction B is not relog enabled (so no extra relog reservation),
  dirties another 40% of the (already relog enabled) buffer and commits.
  The CIL takes 90% of the transaction buffer reservation. The relog
  ticket now needs an additional 40% (since the original 50% is
  maintained by the relog system), but afaict there is no guarantee that
  res is available if trans B is not relog enabled.

So IIUC, the CIL scheme works because every transaction automatically
includes worst case reservation for every possible item supported by the
transaction. Relog transactions are presumably more selective, however,
so we need to either have some rule where only relog enabled
transactions can touch relog enabled items (it's not clear to me how
realistic that is for things like buffers etc., but it sounds
potentially delicate) or we simplify the relog reservation consumption
calculation to consume worst case reservation for the relog ticket in
anticipation of unrelated transactions dirtying more of the associated
object since it was originally committed for relog. Thoughts?
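The example above reduces to trivial arithmetic, but to be concrete
about where I think the hole is (numbers and names illustrative, not
XFS code):

```c
/* Toy model of the trans A/B example: delta stealing only happens in
 * relog-enabled transactions, so a non-relog transaction dirtying
 * more of an already relog-enabled item grows the item's formatted
 * size without growing the relog ticket. Returns the shortfall. */
static int relog_shortfall(int buf_size)
{
	int relog_res;	/* space held by the relog ticket */
	int logged;	/* formatted size of the item in the CIL */

	/* Trans A: relog enabled, dirties 50% -> steals that much. */
	logged = buf_size / 2;
	relog_res = buf_size / 2;

	/* Trans B: not relog enabled, dirties another 40%. The CIL
	 * delta grows, but B carried no relog overhead to steal. */
	logged += (buf_size * 2) / 5;

	return logged - relog_res;	/* > 0 means overrun risk */
}
```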

Note that carrying worst case reservation also potentially simplifies
accounting between multiple dirty+relog increments and a single relog
cancel, particularly if a relog cancelling transaction also happens to
dirty more of a buffer before it commits. I'd prefer to avoid subtle
usage landmines like that as much as possible. Hmm.. even if we do
ultimately want a more granular and dynamic relog res accounting, I
might start with the simple worst-case approach since 1.) it probably
doesn't matter for intents, 2.) it's easier to reason about the isolated
reservation calculation changes in an independent patch from the rest of
the mechanism and 3.) I'd rather get the basics nailed down and solid
before potentially fighting with granular reservation accounting
accuracy bugs. ;)

Brian

> > I'm actually wondering if a more simple approach to tracking stolen
> > reservation is to add a new field that tracks active relog reservation
> > to each supported log item.  Then the initial transaction enables
> > relogging with a simple transfer from the transaction to the item. The
> > relog transaction knows how much reservation it can assign based on the
> > current population of items and requires no further reservation
> > accounting because it rolls and thus automatically reacquires relog
> > reservation for each associated item. The path that clears relog state
> > transfers res from the item back to a transaction simply so the
> > reservation can be released back to the pool of unused log space. Note
> > that clearing relog state doesn't require a transaction in the current
> > implementation, but we could easily define a helper to allocate an empty
> > transaction, clear relog and reclaim relog reservation and then cancel
> > for those contexts.
> 
> I don't think we need any of that - the AIL ticket only needs to be
> kept up to date with the changes to the formatted size of the item
> marked for relogging. It's no different to the CIL ticket
> reservation accounting from that perspective.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Brian Foster March 3, 2020, 3:12 p.m. UTC | #9
On Tue, Mar 03, 2020 at 03:07:35PM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > OK, XLOG_TIC_INITED is redundant, and should be removed. And
> > xfs_log_done() needs to be split into two, one for releasing the
> > ticket, one for completing the xlog_write() call. Compile tested
> > only patch below for you :P
> 
> And now with sample patch.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: kill XLOG_TIC_INITED
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Delayed logging made this redundant as we never directly write
> transactions to the log anymore. Hence we no longer make multiple
> xlog_write() calls for a transaction as we format individual items
> in a transaction, and hence don't need to keep track of whether we
> should be writing a start record for every xlog_write call.
> 

FWIW the commit log could use a bit more context, perhaps from your
previous description, about the original semantics of _INITED flag.
E.g., it's always been rather vague to me, probably because it seems to
be a remnant of some no longer fully in place functionality.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 79 ++++++++++++++++++---------------------------------
>  fs/xfs/xfs_log.h      |  4 ---
>  fs/xfs/xfs_log_cil.c  | 13 +++++----
>  fs/xfs/xfs_log_priv.h | 18 ++++++------
>  fs/xfs/xfs_trans.c    | 24 ++++++++--------
>  5 files changed, 55 insertions(+), 83 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f6006d94a581..a45f3eefee39 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -496,8 +496,8 @@ xfs_log_reserve(
>   * This routine is called when a user of a log manager ticket is done with
>   * the reservation.  If the ticket was ever used, then a commit record for
>   * the associated transaction is written out as a log operation header with
> - * no data.  The flag XLOG_TIC_INITED is set when the first write occurs with
> - * a given ticket.  If the ticket was one with a permanent reservation, then
> + * no data. 

	      ^ trailing whitespace

> + * If the ticket was one with a permanent reservation, then
>   * a few operations are done differently.  Permanent reservation tickets by
>   * default don't release the reservation.  They just commit the current
>   * transaction with the belief that the reservation is still needed.  A flag
> @@ -506,49 +506,38 @@ xfs_log_reserve(
>   * the inited state again.  By doing this, a start record will be written
>   * out when the next write occurs.
>   */
> -xfs_lsn_t
> -xfs_log_done(
> -	struct xfs_mount	*mp,
> +int
> +xlog_write_done(
> +	struct xlog		*log,
>  	struct xlog_ticket	*ticket,
>  	struct xlog_in_core	**iclog,
> -	bool			regrant)
> +	xfs_lsn_t		*lsn)
>  {
> -	struct xlog		*log = mp->m_log;
> -	xfs_lsn_t		lsn = 0;
> -
> -	if (XLOG_FORCED_SHUTDOWN(log) ||
> -	    /*
> -	     * If nothing was ever written, don't write out commit record.
> -	     * If we get an error, just continue and give back the log ticket.
> -	     */
> -	    (((ticket->t_flags & XLOG_TIC_INITED) == 0) &&
> -	     (xlog_commit_record(log, ticket, iclog, &lsn)))) {
> -		lsn = (xfs_lsn_t) -1;
> -		regrant = false;
> -	}
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		return -EIO;
>  
> +	return xlog_commit_record(log, ticket, iclog, lsn);
> +}
>  
> +/*
> + * Release or regrant the ticket reservation now the transaction is done with
> + * it depending on caller context. Rolling transactions need the ticket
> + * regranted, otherwise we release it completely.
> + */
> +void
> +xlog_ticket_done(
> +	struct xlog		*log,
> +	struct xlog_ticket	*ticket,
> +	bool			regrant)
> +{
>  	if (!regrant) {
>  		trace_xfs_log_done_nonperm(log, ticket);
> -
> -		/*
> -		 * Release ticket if not permanent reservation or a specific
> -		 * request has been made to release a permanent reservation.
> -		 */
>  		xlog_ungrant_log_space(log, ticket);
>  	} else {
>  		trace_xfs_log_done_perm(log, ticket);
> -
>  		xlog_regrant_reserve_log_space(log, ticket);
> -		/* If this ticket was a permanent reservation and we aren't
> -		 * trying to release it, reset the inited flags; so next time
> -		 * we write, a start record will be written out.
> -		 */
> -		ticket->t_flags |= XLOG_TIC_INITED;
>  	}
> -
>  	xfs_log_ticket_put(ticket);
> -	return lsn;
>  }

In general it would be nicer to split off as much refactoring as
possible into separate patches, even though it's not yet clear to me
what granularity is possible with this patch...

>  
>  static bool
> @@ -2148,8 +2137,9 @@ xlog_print_trans(
>  }
>  
>  /*
> - * Calculate the potential space needed by the log vector.  Each region gets
> - * its own xlog_op_header_t and may need to be double word aligned.
> + * Calculate the potential space needed by the log vector.  We always write a
> + * start record, and each region gets its own xlog_op_header_t and may need to
> + * be double word aligned.
>   */
>  static int
>  xlog_write_calc_vec_length(
> @@ -2157,14 +2147,10 @@ xlog_write_calc_vec_length(
>  	struct xfs_log_vec	*log_vector)
>  {
>  	struct xfs_log_vec	*lv;
> -	int			headers = 0;
> +	int			headers = 1;
>  	int			len = 0;
>  	int			i;
>  
> -	/* acct for start rec of xact */
> -	if (ticket->t_flags & XLOG_TIC_INITED)
> -		headers++;
> -
>  	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/* we don't write ordered log vectors */
>  		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> @@ -2195,17 +2181,11 @@ xlog_write_start_rec(
>  	struct xlog_op_header	*ophdr,
>  	struct xlog_ticket	*ticket)
>  {
> -	if (!(ticket->t_flags & XLOG_TIC_INITED))
> -		return 0;
> -
>  	ophdr->oh_tid	= cpu_to_be32(ticket->t_tid);
>  	ophdr->oh_clientid = ticket->t_clientid;
>  	ophdr->oh_len = 0;
>  	ophdr->oh_flags = XLOG_START_TRANS;
>  	ophdr->oh_res2 = 0;
> -
> -	ticket->t_flags &= ~XLOG_TIC_INITED;
> -
>  	return sizeof(struct xlog_op_header);
>  }

The header comment to this function refers to the "inited" state.

Also note that there's a similar reference in
xfs_log_write_unmount_record(), but that instance sets ->t_flags to zero
so might be fine outside of the stale comment.

>  
> @@ -2410,12 +2390,10 @@ xlog_write(
>  	len = xlog_write_calc_vec_length(ticket, log_vector);
>  
>  	/*
> -	 * Region headers and bytes are already accounted for.
> -	 * We only need to take into account start records and
> -	 * split regions in this function.
> +	 * Region headers and bytes are already accounted for.  We only need to
> +	 * take into account start records and split regions in this function.
>  	 */
> -	if (ticket->t_flags & XLOG_TIC_INITED)
> -		ticket->t_curr_res -= sizeof(xlog_op_header_t);
> +	ticket->t_curr_res -= sizeof(xlog_op_header_t);
>  

So AFAICT the CIL allocates a ticket and up to this point only mucks
around with the reservation value. That means _INITED is still in place
once we get to xlog_write(). xlog_write() immediately calls
xlog_write_calc_vec_length() and makes the ->t_curr_res adjustment
before touching ->t_flags, so those bits all seem fine.

We then get into the log vector loops, where it looks like we call
xlog_write_start_rec() for each log vector region and rely on the
_INITED flag to only write a start record once per associated ticket.
Unless I'm missing something, this looks like it would change behavior
to perhaps write a start record per-region..? Note that this might not
preclude the broader change to kill off _INITED since we're using the
same ticket throughout the call, but some initial refactoring might be
required to remove this dependency first.

>  	/*
>  	 * Commit record headers need to be accounted for. These
> @@ -3609,7 +3587,6 @@ xlog_ticket_alloc(
>  	tic->t_ocnt		= cnt;
>  	tic->t_tid		= prandom_u32();
>  	tic->t_clientid		= client;
> -	tic->t_flags		= XLOG_TIC_INITED;
>  	if (permanent)
>  		tic->t_flags |= XLOG_TIC_PERM_RESERV;
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 84e06805160f..85f8d0966811 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -105,10 +105,6 @@ struct xfs_log_item;
>  struct xfs_item_ops;
>  struct xfs_trans;
>  
> -xfs_lsn_t xfs_log_done(struct xfs_mount *mp,
> -		       struct xlog_ticket *ticket,
> -		       struct xlog_in_core **iclog,
> -		       bool regrant);
>  int	  xfs_log_force(struct xfs_mount *mp, uint flags);
>  int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
>  		int *log_forced);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 48435cf2aa16..255065d276fc 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -841,10 +841,11 @@ xlog_cil_push(
>  	}
>  	spin_unlock(&cil->xc_push_lock);
>  
> -	/* xfs_log_done always frees the ticket on error. */
> -	commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, false);
> -	if (commit_lsn == -1)
> -		goto out_abort;
> +	error = xlog_write_done(log, tic, &commit_iclog, &commit_lsn);
> +	if (error)
> +		goto out_abort_free_ticket;
> +
> +	xlog_ticket_done(log, tic, false);
>  
>  	spin_lock(&commit_iclog->ic_callback_lock);
>  	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
> @@ -876,7 +877,7 @@ xlog_cil_push(
>  	return 0;
>  
>  out_abort_free_ticket:
> -	xfs_log_ticket_put(tic);
> +	xlog_ticket_done(log, tic, false);
>  out_abort:
>  	xlog_cil_committed(ctx, true);
>  	return -EIO;
> @@ -1017,7 +1018,7 @@ xfs_log_commit_cil(
>  	if (commit_lsn)
>  		*commit_lsn = xc_commit_lsn;
>  
> -	xfs_log_done(mp, tp->t_ticket, NULL, regrant);
> +	xlog_ticket_done(log, tp->t_ticket, regrant);
>  	tp->t_ticket = NULL;
>  	xfs_trans_unreserve_and_mod_sb(tp);
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b192c5a9f9fd..6965d164ff45 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -53,11 +53,9 @@ enum xlog_iclog_state {
>  /*
>   * Flags to log ticket
>   */
> -#define XLOG_TIC_INITED		0x1	/* has been initialized */
>  #define XLOG_TIC_PERM_RESERV	0x2	/* permanent reservation */

These values don't end up on disk, right? If not, it might be worth
resetting the _PERM_RESERV value to 1. Otherwise the rest looks like
mostly straightforward refactoring. 

Brian

>  
>  #define XLOG_TIC_FLAGS \
> -	{ XLOG_TIC_INITED,	"XLOG_TIC_INITED" }, \
>  	{ XLOG_TIC_PERM_RESERV,	"XLOG_TIC_PERM_RESERV" }
>  
>  /*
> @@ -438,14 +436,14 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
>  
>  void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
> -int
> -xlog_write(
> -	struct xlog		*log,
> -	struct xfs_log_vec	*log_vector,
> -	struct xlog_ticket	*tic,
> -	xfs_lsn_t		*start_lsn,
> -	struct xlog_in_core	**commit_iclog,
> -	uint			flags);
> +
> +int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
> +			struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> +			struct xlog_in_core **commit_iclog, uint flags);
> +int xlog_write_done(struct xlog *log, struct xlog_ticket *ticket,
> +			struct xlog_in_core **iclog, xfs_lsn_t *lsn);
> +void xlog_ticket_done(struct xlog *log, struct xlog_ticket *ticket,
> +			bool regrant);
>  
>  /*
>   * When we crack an atomic LSN, we sample it first so that the value will not
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..85ea3727878b 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -9,6 +9,7 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
>  #include "xfs_extent_busy.h"
> @@ -150,8 +151,9 @@ xfs_trans_reserve(
>  	uint			blocks,
>  	uint			rtextents)
>  {
> -	int		error = 0;
> -	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	int			error = 0;
> +	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>  
>  	/* Mark this thread as being in a transaction */
>  	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> @@ -162,7 +164,7 @@ xfs_trans_reserve(
>  	 * fail if the count would go below zero.
>  	 */
>  	if (blocks > 0) {
> -		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
> +		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
>  		if (error != 0) {
>  			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  			return -ENOSPC;
> @@ -191,9 +193,9 @@ xfs_trans_reserve(
>  
>  		if (tp->t_ticket != NULL) {
>  			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
> -			error = xfs_log_regrant(tp->t_mountp, tp->t_ticket);
> +			error = xfs_log_regrant(mp, tp->t_ticket);
>  		} else {
> -			error = xfs_log_reserve(tp->t_mountp,
> +			error = xfs_log_reserve(mp,
>  						resp->tr_logres,
>  						resp->tr_logcount,
>  						&tp->t_ticket, XFS_TRANSACTION,
> @@ -213,7 +215,7 @@ xfs_trans_reserve(
>  	 * fail if the count would go below zero.
>  	 */
>  	if (rtextents > 0) {
> -		error = xfs_mod_frextents(tp->t_mountp, -((int64_t)rtextents));
> +		error = xfs_mod_frextents(mp, -((int64_t)rtextents));
>  		if (error) {
>  			error = -ENOSPC;
>  			goto undo_log;
> @@ -229,7 +231,7 @@ xfs_trans_reserve(
>  	 */
>  undo_log:
>  	if (resp->tr_logres > 0) {
> -		xfs_log_done(tp->t_mountp, tp->t_ticket, NULL, false);
> +		xlog_ticket_done(mp->m_log, tp->t_ticket, false);
>  		tp->t_ticket = NULL;
>  		tp->t_log_res = 0;
>  		tp->t_flags &= ~XFS_TRANS_PERM_LOG_RES;
> @@ -237,7 +239,7 @@ xfs_trans_reserve(
>  
>  undo_blocks:
>  	if (blocks > 0) {
> -		xfs_mod_fdblocks(tp->t_mountp, (int64_t)blocks, rsvd);
> +		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
>  		tp->t_blk_res = 0;
>  	}
>  
> @@ -999,9 +1001,7 @@ __xfs_trans_commit(
>  	 */
>  	xfs_trans_unreserve_and_mod_dquots(tp);
>  	if (tp->t_ticket) {
> -		commit_lsn = xfs_log_done(mp, tp->t_ticket, NULL, regrant);
> -		if (commit_lsn == -1 && !error)
> -			error = -EIO;
> +		xlog_ticket_done(mp->m_log, tp->t_ticket, regrant);
>  		tp->t_ticket = NULL;
>  	}
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> @@ -1060,7 +1060,7 @@ xfs_trans_cancel(
>  	xfs_trans_unreserve_and_mod_dquots(tp);
>  
>  	if (tp->t_ticket) {
> -		xfs_log_done(mp, tp->t_ticket, NULL, false);
> +		xlog_ticket_done(mp->m_log, tp->t_ticket, false);
>  		tp->t_ticket = NULL;
>  	}
>  
>
Dave Chinner March 3, 2020, 9:26 p.m. UTC | #10
On Tue, Mar 03, 2020 at 09:13:16AM -0500, Brian Foster wrote:
> On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > > <thinking out loud from here on..>
> > > 
> > > One thing that comes to mind thinking about this is dealing with
> > > batching (relog multiple items per roll). This code doesn't handle that
> > > yet, but I anticipate it being a requirement and it's fairly easy to
> > > update the current scheme to support a fixed item count per relog-roll.
> > > 
> > > A stealing approach potentially complicates things when it comes to
> > > batching because we have per-item reservation granularity to consider.
> > > For example, consider if we had a variety of different relog item types
> > > active at once, a subset are being relogged while another subset are
> > > being disabled (before ever needing to be relogged).
> > 
> > So concurrent "active relogging" + "start relogging" + "stop
> > relogging"?
> > 
> 
> Yeah..
> 
> > > For one, we'd have
> > > to be careful to not release reservation while it might be accounted to
> > > an active relog transaction that is currently rolling with some other
> > > items, etc. There's also a potential quirk in handling reservation of a
> > > relogged item that is cancelled while it's being relogged, but that
> > > might be more of an implementation detail.
> > 
> > Right, did you notice the ail->ail_relog_lock rwsem that I wrapped
> > my example relog transaction item add loop + commit function in?
> > 
> 
> Yes, but the side effects didn't quite register..
> 
> > i.e. while we are building and committing a relog transaction, we
> > hold off the transactions that are trying to add/remove their items
> > to/from the relog list. Hence the reservation stealing accounting in
> > the ticket can be serialised against the transactional use of the
> > ticket.
> > 
> 
> Hmm, so this changes relog item/state management as well as the
> reservation management. That's probably why it wasn't clear to me from
> the code example.

Yeah, I left a lot out :P

> > It's basically the same method we use for serialising addition to
> > the CIL in transaction commit against CIL pushes draining the
> > current list for log writes (rwsem for add/push serialisation, spin
> > lock for concurrent add serialisation under the rwsem).
> > 
> 
> Ok. The difference with the CIL is that in that context we're processing
> a single, constantly maintained list of items. As of right now, there is
> no such list for relog enabled items. If I'm following correctly, that's
> still not necessary here, we're just talking about wrapping a rw lock
> around the state management (+ res accounting) and the actual res
> consumption such that a relog cancel doesn't steal reservation from an
> active relog transaction, for example.

That is correct. I don't think it actually matters because we can't
remove an item that is locked and being relogged (so its
reservation is in use). Except for the fact that we can't serialise
the internal ticket accounting updates the transaction might be doing
with external ticket accounting modifications any other way.

> > > I don't think that's a show stopper, but rather just something I'd like
> > > to have factored into the design from the start. One option could be to
> > 
> > *nod*
> > 
> > > maintain a separate counter of active relog reservation aside from the
> > > actual relog ticket. That way the relog ticket could just pull from this
> > > relog reservation pool based on the current item(s) being relogged
> > > asynchronously from different tasks that might add or remove reservation
> > > from the pool for separate items. That might get a little wonky when we
> > > consider the relog ticket needs to pull from the pool and then put
> > > something back if the item is still reloggable after the relog
> > > transaction rolls.
> > 
> > Right, that's the whole problem that we solve via a) stealing the
> > initial reserve/write grant space at commit time and b) serialising
> > stealing vs transactional use of the ticket.
> > 
> > That is, after the roll, the ticket has a full reserve and grant
> > space reservation for all the items accounted to the relog ticket.
> > Every new relog item added to the ticket (or is relogged in the CIL
> > and uses more space) adds the full required reserve/write grant
> > space to the relog ticket. Hence the relog ticket always has
> > current log space reserved to commit the entire set of items tagged
> > as reloggable. And by avoiding modifying the ticket while we are
> > actively processing the relog transaction, we don't screw up the
> > ticket accounting in the middle of the transaction....
> > 
> 
> Yep, Ok.
> 
> > > Another problem is that reloggable items are still otherwise usable once
> > > they are unlocked. So for example we'd have to account for a situation
> > > where one transaction dirties a buffer, enables relogging, commits and
> > > then some other transaction dirties more of the buffer and commits
> > > without caring whether the buffer was relog enabled or not.
> > 
> > yup, that's the delta size updates from the CIL commit. i.e. if we
> > relog an item to the CIL that has the XFS_LI_RELOG flag already set
> > on it, the change in size that we steal for the CIL ticket also
> > needs to be stolen for the AIL ticket. i.e. we already do almost all
> > the work we need to handle this.
> > 
> > > Unless
> > > runtime relog reservation is always worst case, that's a subtle path to
> > > reservation overrun in the relog transaction.
> > 
> > Yes, but it's a problem the CIL already solves for us :P
> 
> Ok, I think we're pretty close to the same page here. I was thinking
> about the worst case relog reservation being pulled off the committing
> transaction unconditionally, where I think you're thinking about it as
> the transaction (i.e. reservation calculation) would have the worst case
> reservation, but we'd only pull off the delta as needed at commit time
> (i.e. exactly how the CIL works wrt to transaction reservation
> consumption). Let me work through a simple example to try and
> (in)validate my concern:
> 
> - Transaction A is relog enabled, dirties 50% of a buffer and enables
>   auto relogging. On commit, the CIL takes buffer/2 reservation for the
>   log vector and the relog mechanism takes the same amount for
>   subsequent relogs.
> - Transaction B is not relog enabled (so no extra relog reservation),
>   dirties another 40% of the (already relog enabled) buffer and commits.
>   The CIL takes 90% of the transaction buffer reservation. The relog
>   ticket now needs an additional 40% (since the original 50% is
>   maintained by the relog system), but afaict there is no guarantee that
>   res is available if trans B is not relog enabled.

Yes, I can see that would be an issue - very well spotted, Brian.

Without reading your further comments: off the top of my head that
means we would probably have to declare particular types of objects
as reloggable, and explicitly include that object in each
reservation rather than use a generic "buffer" or "inode"
reservation for it. We are likely to only be relogging "container"
objects such as intents or high level structures such as
inodes, AG headers, etc, so this probably isn't a huge increase
in transaction size or complexity, and it will be largely self
documenting.

But it still adds more complexity, and we're trying to avoid that
....

.....

Oh, I'm a bit stupid only half way through my first coffee: just get
rid of delta updates altogether and steal the entire relog
reservation for the object up front.  We really don't need delta
updates, it was just something the CIL does so I didn't think past
"we can use that because it is already there"....

/me reads on...

> So IIUC, the CIL scheme works because every transaction automatically
> includes worst case reservation for every possible item supported by the
> transaction. Relog transactions are presumably more selective, however,
> so we need to either have some rule where only relog enabled
> transactions can touch relog enabled items (it's not clear to me how
> realistic that is for things like buffers etc., but it sounds
> potentially delicate) or we simplify the relog reservation consumption
> calculation to consume worst case reservation for the relog ticket in
> anticipation of unrelated transactions dirtying more of the associated
> object since it was originally committed for relog. Thoughts?

Yep, I think we've both come to the same conclusion :)

> Note that carrying worst case reservation also potentially simplifies
> accounting between multiple dirty+relog increments and a single relog
> cancel, particularly if a relog cancelling transaction also happens to
> dirty more of a buffer before it commits. I'd prefer to avoid subtle
> usage landmines like that as much as possible. Hmm.. even if we do
> ultimately want a more granular and dynamic relog res accounting, I
> might start with the simple worst-case approach since 1.) it probably
> doesn't matter for intents, 2.) it's easier to reason about the isolated
> reservation calculation changes in an independent patch from the rest of
> the mechanism and 3.) I'd rather get the basics nailed down and solid
> before potentially fighting with granular reservation accounting
> accuracy bugs. ;)

Absolutely. I don't think we'll ever need anything more dynamic
or complex. And I think just taking the whole reservation will scale
a lot better if we are constantly modifying the object - we only
need to take the relog lock when we add or remove a relog
reservation now, not on every delta change to the object....

Again, well spotted and good thinking!

Cheers,

Dave.
Dave Chinner March 3, 2020, 9:47 p.m. UTC | #11
On Tue, Mar 03, 2020 at 10:12:17AM -0500, Brian Foster wrote:
> On Tue, Mar 03, 2020 at 03:07:35PM +1100, Dave Chinner wrote:
> > On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > > OK, XLOG_TIC_INITED is redundant, and should be removed. And
> > > xfs_log_done() needs to be split into two, one for releasing the
> > > ticket, one for completing the xlog_write() call. Compile tested
> > > only patch below for you :P
> > 
> > And now with sample patch.
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > xfs: kill XLOG_TIC_INITED
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Delayed logging made this redundant as we never directly write
> > transactions to the log anymore. Hence we no longer make multiple
> > xlog_write() calls for a transaction as we format individual items
> > in a transaction, and hence don't need to keep track of whether we
> > should be writing a start record for every xlog_write call.
> > 
> 
> FWIW the commit log could use a bit more context, perhaps from your
> previous description, about the original semantics of _INITED flag.
> E.g., it's always been rather vague to me, probably because it seems to
> be a remnant of some no longer fully in place functionality.

Yup, it was a quick "here's what it looks like" patch.

> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c      | 79 ++++++++++++++++++---------------------------------
> >  fs/xfs/xfs_log.h      |  4 ---
> >  fs/xfs/xfs_log_cil.c  | 13 +++++----
> >  fs/xfs/xfs_log_priv.h | 18 ++++++------
> >  fs/xfs/xfs_trans.c    | 24 ++++++++--------
> >  5 files changed, 55 insertions(+), 83 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index f6006d94a581..a45f3eefee39 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -496,8 +496,8 @@ xfs_log_reserve(
> >   * This routine is called when a user of a log manager ticket is done with
> >   * the reservation.  If the ticket was ever used, then a commit record for
> >   * the associated transaction is written out as a log operation header with
> > - * no data.  The flag XLOG_TIC_INITED is set when the first write occurs with
> > - * a given ticket.  If the ticket was one with a permanent reservation, then
> > + * no data. 
> 
> 	      ^ trailing whitespace

Forgot to <ctrl-j> to join the lines back together :P

> 
> > + * If the ticket was one with a permanent reservation, then
> >   * a few operations are done differently.  Permanent reservation tickets by
> >   * default don't release the reservation.  They just commit the current
> >   * transaction with the belief that the reservation is still needed.  A flag
> > @@ -506,49 +506,38 @@ xfs_log_reserve(
> >   * the inited state again.  By doing this, a start record will be written
> >   * out when the next write occurs.
> >   */
> > -xfs_lsn_t
> > -xfs_log_done(
> > -	struct xfs_mount	*mp,
> > +int
> > +xlog_write_done(
> > +	struct xlog		*log,
> >  	struct xlog_ticket	*ticket,
> >  	struct xlog_in_core	**iclog,
> > -	bool			regrant)
> > +	xfs_lsn_t		*lsn)
> >  {
> > -	struct xlog		*log = mp->m_log;
> > -	xfs_lsn_t		lsn = 0;
> > -
> > -	if (XLOG_FORCED_SHUTDOWN(log) ||
> > -	    /*
> > -	     * If nothing was ever written, don't write out commit record.
> > -	     * If we get an error, just continue and give back the log ticket.
> > -	     */
> > -	    (((ticket->t_flags & XLOG_TIC_INITED) == 0) &&
> > -	     (xlog_commit_record(log, ticket, iclog, &lsn)))) {
> > -		lsn = (xfs_lsn_t) -1;
> > -		regrant = false;
> > -	}
> > +	if (XLOG_FORCED_SHUTDOWN(log))
> > +		return -EIO;
> >  
> > +	return xlog_commit_record(log, ticket, iclog, lsn);
> > +}
> >  
> > +/*
> > + * Release or regrant the ticket reservation now the transaction is done with
> > + * it depending on caller context. Rolling transactions need the ticket
> > + * regranted, otherwise we release it completely.
> > + */
> > +void
> > +xlog_ticket_done(
> > +	struct xlog		*log,
> > +	struct xlog_ticket	*ticket,
> > +	bool			regrant)
> > +{
> >  	if (!regrant) {
> >  		trace_xfs_log_done_nonperm(log, ticket);
> > -
> > -		/*
> > -		 * Release ticket if not permanent reservation or a specific
> > -		 * request has been made to release a permanent reservation.
> > -		 */
> >  		xlog_ungrant_log_space(log, ticket);
> >  	} else {
> >  		trace_xfs_log_done_perm(log, ticket);
> > -
> >  		xlog_regrant_reserve_log_space(log, ticket);
> > -		/* If this ticket was a permanent reservation and we aren't
> > -		 * trying to release it, reset the inited flags; so next time
> > -		 * we write, a start record will be written out.
> > -		 */
> > -		ticket->t_flags |= XLOG_TIC_INITED;
> >  	}
> > -
> >  	xfs_log_ticket_put(ticket);
> > -	return lsn;
> >  }
> 
> In general it would be nicer to split off as much refactoring as
> possible into separate patches, even though it's not yet clear to me
> what granularity is possible with this patch...

Yeah, there's heaps more cleanups that can be done as a result of
this - e.g. xlog_write_done() and xlog_commit_record() should be
merged. The one caller of xlog_write() that does not provide a
commit_iclog variable should call xlog_commit_record()
itself, then xlog_write() can just assume it always returns the
last iclog, etc....


> >  static bool
> > @@ -2148,8 +2137,9 @@ xlog_print_trans(
> >  }
> >  
> >  /*
> > - * Calculate the potential space needed by the log vector.  Each region gets
> > - * its own xlog_op_header_t and may need to be double word aligned.
> > + * Calculate the potential space needed by the log vector.  We always write a
> > + * start record, and each region gets its own xlog_op_header_t and may need to
> > + * be double word aligned.
> >   */
> >  static int
> >  xlog_write_calc_vec_length(
> > @@ -2157,14 +2147,10 @@ xlog_write_calc_vec_length(
> >  	struct xfs_log_vec	*log_vector)
> >  {
> >  	struct xfs_log_vec	*lv;
> > -	int			headers = 0;
> > +	int			headers = 1;
> >  	int			len = 0;
> >  	int			i;
> >  
> > -	/* acct for start rec of xact */
> > -	if (ticket->t_flags & XLOG_TIC_INITED)
> > -		headers++;
> > -
> >  	for (lv = log_vector; lv; lv = lv->lv_next) {
> >  		/* we don't write ordered log vectors */
> >  		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> > @@ -2195,17 +2181,11 @@ xlog_write_start_rec(
> >  	struct xlog_op_header	*ophdr,
> >  	struct xlog_ticket	*ticket)
> >  {
> > -	if (!(ticket->t_flags & XLOG_TIC_INITED))
> > -		return 0;
> > -
> >  	ophdr->oh_tid	= cpu_to_be32(ticket->t_tid);
> >  	ophdr->oh_clientid = ticket->t_clientid;
> >  	ophdr->oh_len = 0;
> >  	ophdr->oh_flags = XLOG_START_TRANS;
> >  	ophdr->oh_res2 = 0;
> > -
> > -	ticket->t_flags &= ~XLOG_TIC_INITED;
> > -
> >  	return sizeof(struct xlog_op_header);
> >  }
> 
> The header comment to this function refers to the "inited" state.

Missed that...

> Also note that there's a similar reference in
> xfs_log_write_unmount_record(), but that instance sets ->t_flags to zero
> so might be fine outside of the stale comment.

More cleanups!

> > @@ -2410,12 +2390,10 @@ xlog_write(
> >  	len = xlog_write_calc_vec_length(ticket, log_vector);
> >  
> >  	/*
> > -	 * Region headers and bytes are already accounted for.
> > -	 * We only need to take into account start records and
> > -	 * split regions in this function.
> > +	 * Region headers and bytes are already accounted for.  We only need to
> > +	 * take into account start records and split regions in this function.
> >  	 */
> > -	if (ticket->t_flags & XLOG_TIC_INITED)
> > -		ticket->t_curr_res -= sizeof(xlog_op_header_t);
> > +	ticket->t_curr_res -= sizeof(xlog_op_header_t);
> >  
> 
> So AFAICT the CIL allocates a ticket and up to this point only mucks
> around with the reservation value. That means _INITED is still in place
> once we get to xlog_write(). xlog_write() immediately calls
> xlog_write_calc_vec_length() and makes the ->t_curr_res adjustment
> before touching ->t_flags, so those bits all seems fine.
> 
> We then get into the log vector loops, where it looks like we call
> xlog_write_start_rec() for each log vector region and rely on the
> _INITED flag to only write a start record once per associated ticket.
> Unless I'm missing something, this looks like it would change behavior
> to perhaps write a start record per-region..? Note that this might not
> preclude the broader change to kill off _INITED since we're using the
> same ticket throughout the call, but some initial refactoring might be
> required to remove this dependency first.

Ah, yes, well spotted.

I need to move the call to xlog_write_start_rec() outside
the loop - it only needs to be written once per ticket, and we only
ever supply one ticket to xlog_write() now, and it is never reused
to call back into xlog_write again for the same transaction context.

I did say "compile tested only " :)

> > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > index b192c5a9f9fd..6965d164ff45 100644
> > --- a/fs/xfs/xfs_log_priv.h
> > +++ b/fs/xfs/xfs_log_priv.h
> > @@ -53,11 +53,9 @@ enum xlog_iclog_state {
> >  /*
> >   * Flags to log ticket
> >   */
> > -#define XLOG_TIC_INITED		0x1	/* has been initialized */
> >  #define XLOG_TIC_PERM_RESERV	0x2	/* permanent reservation */
> 
> These values don't end up on disk, right? If not, it might be worth
> resetting the _PERM_RESERV value to 1. Otherwise the rest looks like
> mostly straightforward refactoring. 

*nod*

-Dave.
Brian Foster March 4, 2020, 2:03 p.m. UTC | #12
On Wed, Mar 04, 2020 at 08:26:26AM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2020 at 09:13:16AM -0500, Brian Foster wrote:
> > On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote:
> > > > <thinking out loud from here on..>
> > > > 
> > > > One thing that comes to mind thinking about this is dealing with
> > > > batching (relog multiple items per roll). This code doesn't handle that
> > > > yet, but I anticipate it being a requirement and it's fairly easy to
> > > > update the current scheme to support a fixed item count per relog-roll.
> > > > 
> > > > A stealing approach potentially complicates things when it comes to
> > > > batching because we have per-item reservation granularity to consider.
> > > > For example, consider if we had a variety of different relog item types
> > > > active at once, a subset are being relogged while another subset are
> > > > being disabled (before ever needing to be relogged).
> > > 
> > > So concurrent "active relogging" + "start relogging" + "stop
> > > relogging"?
> > > 
> > 
> > Yeah..
> > 
> > > > For one, we'd have
> > > > to be careful to not release reservation while it might be accounted to
> > > > an active relog transaction that is currently rolling with some other
> > > > items, etc. There's also a potential quirk in handling reservation of a
> > > > relogged item that is cancelled while it's being relogged, but that
> > > > might be more of an implementation detail.
> > > 
> > > Right, did you notice the ail->ail_relog_lock rwsem that I wrapped
> > > my example relog transaction item add loop + commit function in?
> > > 
> > 
> > Yes, but the side effects didn't quite register..
> > 
> > > i.e. while we are building and committing a relog transaction, we
> > > hold off the transactions that are trying to add/remove their items
> > > to/from the relog list. Hence the reservation stealing accounting in
> > > the ticket can be serialised against the transactional use of the
> > > ticket.
> > > 
> > 
> > Hmm, so this changes relog item/state management as well as the
> > reservation management. That's probably why it wasn't clear to me from
> > the code example.
> 
> Yeah, I left a lot out :P
> 
> > > It's basically the same method we use for serialising addition to
> > > the CIL in transaction commit against CIL pushes draining the
> > > current list for log writes (rwsem for add/push serialisation, spin
> > > lock for concurrent add serialisation under the rwsem).
> > > 
> > 
> > Ok. The difference with the CIL is that in that context we're processing
> > a single, constantly maintained list of items. As of right now, there is
> > no such list for relog enabled items. If I'm following correctly, that's
> > still not necessary here, we're just talking about wrapping a rw lock
> > around the state management (+ res accounting) and the actual res
> > consumption such that a relog cancel doesn't steal reservation from an
> > active relog transaction, for example.
> 
> That is correct. I don't think it actually matters because we can't
> remove an item that is locked and being relogged (so its
> reservation is in use). Except for the fact we can't serialise
> internal ticket accounting updates the transaction might be doing
> with external ticket accounting modifications any other way.
> 

We do still need some form of serialization for objects that aren't
currently lockable, like certain intents. I suppose we could add locks
where necessary, but right now I'm thinking a slight alteration of
the res accounting strategy to distinguish relog enabled items currently
resident in the relog transaction (i.e. awaiting the AIL processing to
complete and commit/roll the transaction) from all other relog enabled
items might be good enough.

Without something like that (or IOW if the relog transaction is the
central holder of all outstanding relog reservation), then we always
have to serialize relog cancel on an item against xfsaild populating a
relog transaction, regardless of whether the affected item is being
relogged at the time. That's again probably not something that's a huge
deal for the current use case, but it also _seems_ like something that
can be addressed via implementation tweaks without changing the
fundamental design. I guess we'll see once I have a chance to actually
play around with it... :P

> > > > I don't think that's a show stopper, but rather just something I'd like
> > > > to have factored into the design from the start. One option could be to
> > > 
> > > *nod*
> > > 
> > > > maintain a separate counter of active relog reservation aside from the
> > > > actual relog ticket. That way the relog ticket could just pull from this
> > > > relog reservation pool based on the current item(s) being relogged
> > > > asynchronously from different tasks that might add or remove reservation
> > > > from the pool for separate items. That might get a little wonky when we
> > > > consider the relog ticket needs to pull from the pool and then put
> > > > something back if the item is still reloggable after the relog
> > > > transaction rolls.
> > > 
> > > Right, that's the whole problem that we solve via a) stealing the
> > > initial reserve/write grant space at commit time and b) serialising
> > > stealing vs transactional use of the ticket.
> > > 
> > > That is, after the roll, the ticket has a full reserve and grant
> > > space reservation for all the items accounted to the relog ticket.
> > > Every new relog item added to the ticket (or is relogged in the CIL
> > > and uses more space) adds the full required reserve/write grant
> > > space to the relog ticket. Hence the relog ticket always has
> > > current log space reserved to commit the entire set of items tagged
> > > as reloggable. And by avoiding modifying the ticket while we are
> > > actively processing the relog transaction, we don't screw up the
> > > ticket accounting in the middle of the transaction....
> > > 
> > 
> > Yep, Ok.
> > 
> > > > Another problem is that reloggable items are still otherwise usable once
> > > > they are unlocked. So for example we'd have to account for a situation
> > > > where one transaction dirties a buffer, enables relogging, commits and
> > > > then some other transaction dirties more of the buffer and commits
> > > > without caring whether the buffer was relog enabled or not.
> > > 
> > > yup, that's the delta size updates from the CIL commit. i.e. if we
> > > relog an item to the CIL that has the XFS_LI_RELOG flag already set
> > > on it, the change in size that we steal for the CIL ticket also
> > > needs to be stolen for the AIL ticket. i.e. we already do almost all
> > > the work we need to handle this.
> > > 
> > > > Unless
> > > > runtime relog reservation is always worst case, that's a subtle path to
> > > > reservation overrun in the relog transaction.
> > > 
> > > Yes, but it's a problem the CIL already solves for us :P
> > 
> > Ok, I think we're pretty close to the same page here. I was thinking
> > about the worst case relog reservation being pulled off the committing
> > transaction unconditionally, where I think you're thinking about it as
> > the transaction (i.e. reservation calculation) would have the worst case
> > reservation, but we'd only pull off the delta as needed at commit time
> > (i.e. exactly how the CIL works wrt to transaction reservation
> > consumption). Let me work through a simple example to try and
> > (in)validate my concern:
> > 
> > - Transaction A is relog enabled, dirties 50% of a buffer and enables
> >   auto relogging. On commit, the CIL takes buffer/2 reservation for the
> >   log vector and the relog mechanism takes the same amount for
> >   subsequent relogs.
> > - Transaction B is not relog enabled (so no extra relog reservation),
> >   dirties another 40% of the (already relog enabled) buffer and commits.
> >   The CIL takes 90% of the transaction buffer reservation. The relog
> >   ticket now needs an additional 40% (since the original 50% is
> >   maintained by the relog system), but afaict there is no guarantee that
> >   res is available if trans B is not relog enabled.
> 
> Yes, I can see that would be an issue - very well spotted, Brian.
> 
> Without reading your further comments: off the top of my head that
> means we would probably have to declare particular types of objects
> as reloggable, and explicitly include that object in each
> reservation rather than use a generic "buffer" or "inode"
> reservation for it. We are likely to only be relogging "container"
> objects such as intents or high level structures such as
> inodes, AG headers, etc, so this probably isn't a huge increase
> in transaction size or complexity, and it will be largely self
> documenting.
> 
> But it still adds more complexity, and we're trying to avoid that
> ....
> 
> .....
> 
> Oh, I'm a bit stupid only half way through my first coffee: just get
> rid of delta updates altogether and steal the entire relog
> reservation for the object up front.  We really don't need delta
> updates, it was just something the CIL does so I didn't think past
> "we can use that because it is already there"....
> 
> /me reads on...
> 
> > So IIUC, the CIL scheme works because every transaction automatically
> > includes worst case reservation for every possible item supported by the
> > transaction. Relog transactions are presumably more selective, however,
> > so we need to either have some rule where only relog enabled
> > transactions can touch relog enabled items (it's not clear to me how
> > realistic that is for things like buffers etc., but it sounds
> > potentially delicate) or we simplify the relog reservation consumption
> > calculation to consume worst case reservation for the relog ticket in
> > anticipation of unrelated transactions dirtying more of the associated
> > object since it was originally committed for relog. Thoughts?
> 
> Yep, I think we've both come to the same conclusion :)
> 

Indeed. Account for the max res. size of the associated object that the
relog enabled transaction already acquired for us anyways and we
shouldn't have to worry about further modifications of the object.

> > Note that carrying worst case reservation also potentially simplifies
> > accounting between multiple dirty+relog increments and a single relog
> > cancel, particularly if a relog cancelling transaction also happens to
> > dirty more of a buffer before it commits. I'd prefer to avoid subtle
> > usage landmines like that as much as possible. Hmm.. even if we do
> > ultimately want a more granular and dynamic relog res accounting, I
> > might start with the simple worst-case approach since 1.) it probably
> > doesn't matter for intents, 2.) it's easier to reason about the isolated
> > reservation calculation changes in an independent patch from the rest of
> > the mechanism and 3.) I'd rather get the basics nailed down and solid
> > before potentially fighting with granular reservation accounting
> > accuracy bugs. ;)
> 
> Absolutely. I don't think we'll ever need anything more dynamic
> or complex. And I think just taking the whole reservation will scale
> a lot better if we are constantly modifying the object - we only
> need to take the relog lock when we add or remove a relog
> reservation now, not on every delta change to the object....
> 

Yep, Ok. I think there's still some potential quirks to work through
with this approach (re: the per-item granularity
accounting/serialization bits mentioned above), but at this point it's
easier for me to reason about it by hacking on it a bit. If there are
still any issues to work out, best to have something concrete on the
list to guide discussion. Thanks Dave, appreciate the design input!

Brian

> Again, well spotted and good thinking!
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Patch

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index c45acbd3add9..0a10ca0853ab 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -77,6 +77,7 @@  void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
  * made then this algorithm will eventually find all the space it needs.
  */
 #define XFS_TRANS_LOWMODE	0x100	/* allocate in low space mode */
+#define XFS_TRANS_RELOG		0x200	/* enable automatic relogging */
 
 /*
  * Field values for xfs_trans_mod_sb.
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3b208f9a865c..8ac05ed8deda 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -107,9 +107,14 @@  xfs_trans_dup(
 
 	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
 		       (tp->t_flags & XFS_TRANS_RESERVE) |
-		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT);
-	/* We gave our writer reference to the new transaction */
+		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
+		       (tp->t_flags & XFS_TRANS_RELOG);
+	/*
+	 * The writer reference and relog reference transfer to the new
+	 * transaction.
+	 */
 	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
+	tp->t_flags &= ~XFS_TRANS_RELOG;
 	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
 
 	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
@@ -284,15 +289,25 @@  xfs_trans_alloc(
 	tp->t_firstblock = NULLFSBLOCK;
 
 	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
-	if (error) {
-		xfs_trans_cancel(tp);
-		return error;
+	if (error)
+		goto error;
+
+	if (flags & XFS_TRANS_RELOG) {
+		error = xfs_trans_ail_relog_reserve(&tp);
+		if (error)
+			goto error;
 	}
 
 	trace_xfs_trans_alloc(tp, _RET_IP_);
 
 	*tpp = tp;
 	return 0;
+
+error:
+	/* clear relog flag if we haven't acquired a ref */
+	tp->t_flags &= ~XFS_TRANS_RELOG;
+	xfs_trans_cancel(tp);
+	return error;
 }
 
 /*
@@ -973,6 +988,10 @@  __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
+	/* release the relog ticket reference if this transaction holds one */
+	if (tp->t_flags & XFS_TRANS_RELOG)
+		xfs_trans_ail_relog_put(mp);
+
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
@@ -1004,6 +1023,10 @@  __xfs_trans_commit(
 			error = -EIO;
 		tp->t_ticket = NULL;
 	}
+	/* release the relog ticket reference if this transaction holds one */
+	/* XXX: handle RELOG items on transaction abort */
+	if (tp->t_flags & XFS_TRANS_RELOG)
+		xfs_trans_ail_relog_put(mp);
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, !!error);
 	xfs_trans_free(tp);
@@ -1064,6 +1087,10 @@  xfs_trans_cancel(
 		tp->t_ticket = NULL;
 	}
 
+	/* release the relog ticket reference if this transaction holds one */
+	if (tp->t_flags & XFS_TRANS_RELOG)
+		xfs_trans_ail_relog_put(mp);
+
 	/* mark this thread as no longer being in a transaction */
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 752c7fef9de7..a032989943bd 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -236,6 +236,9 @@  int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
 void		xfs_trans_cancel(xfs_trans_t *);
 int		xfs_trans_ail_init(struct xfs_mount *);
 void		xfs_trans_ail_destroy(struct xfs_mount *);
+int		xfs_trans_ail_relog_reserve(struct xfs_trans **);
+bool		xfs_trans_ail_relog_get(struct xfs_mount *);
+int		xfs_trans_ail_relog_put(struct xfs_mount *);
 
 void		xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *,
 				       enum xfs_blft);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 00cc5b8734be..a3fb64275baa 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@ 
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -818,6 +819,93 @@  xfs_trans_ail_delete(
 		xfs_log_space_wake(ailp->ail_mount);
 }
 
+bool
+xfs_trans_ail_relog_get(
+	struct xfs_mount	*mp)
+{
+	struct xfs_ail		*ailp = mp->m_ail;
+	bool			ret = false;
+
+	spin_lock(&ailp->ail_lock);
+	if (ailp->ail_relog_tic) {
+		xfs_log_ticket_get(ailp->ail_relog_tic);
+		ret = true;
+	}
+	spin_unlock(&ailp->ail_lock);
+	return ret;
+}
+
+/*
+ * Reserve log space for the automatic relogging ->tr_relog ticket. This
+ * requires a clean, permanent transaction from the caller. Pull reservation
+ * for the relog ticket and roll the caller's transaction back to its fully
+ * reserved state. If the AIL relog ticket is already initialized, grab a
+ * reference and return.
+ */
+int
+xfs_trans_ail_relog_reserve(
+	struct xfs_trans	**tpp)
+{
+	struct xfs_trans	*tp = *tpp;
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_ail		*ailp = mp->m_ail;
+	struct xlog_ticket	*tic;
+	uint32_t		logres = M_RES(mp)->tr_relog.tr_logres;
+
+	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	if (xfs_trans_ail_relog_get(mp))
+		return 0;
+
+	/* no active ticket, fall into the slow path and allocate one */
+	tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0);
+	if (!tic)
+		return -ENOMEM;
+	ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res);
+
+	/* check again since we dropped the lock for the allocation */
+	spin_lock(&ailp->ail_lock);
+	if (ailp->ail_relog_tic) {
+		xfs_log_ticket_get(ailp->ail_relog_tic);
+		spin_unlock(&ailp->ail_lock);
+		xfs_log_ticket_put(tic);
+		return 0;
+	}
+
+	/* attach and reserve space for the ->tr_relog ticket */
+	ailp->ail_relog_tic = tic;
+	tp->t_ticket->t_curr_res -= tic->t_curr_res;
+	spin_unlock(&ailp->ail_lock);
+
+	return xfs_trans_roll(tpp);
+}
+
+/*
+ * Release a reference to the relog ticket. On the final reference, remove
+ * the ticket from the AIL and release its log reservation.
+ */
+int
+xfs_trans_ail_relog_put(
+	struct xfs_mount	*mp)
+{
+	struct xfs_ail		*ailp = mp->m_ail;
+	struct xlog_ticket	*tic;
+
+	spin_lock(&ailp->ail_lock);
+	if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) {
+		spin_unlock(&ailp->ail_lock);
+		return 0;
+	}
+
+	ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1);
+	tic = ailp->ail_relog_tic;
+	ailp->ail_relog_tic = NULL;
+	spin_unlock(&ailp->ail_lock);
+
+	xfs_log_done(mp, tic, NULL, false);
+	return 0;
+}
+
 int
 xfs_trans_ail_init(
 	xfs_mount_t	*mp)
@@ -854,6 +942,7 @@  xfs_trans_ail_destroy(
 {
 	struct xfs_ail	*ailp = mp->m_ail;
 
+	ASSERT(ailp->ail_relog_tic == NULL);
 	kthread_stop(ailp->ail_task);
 	kmem_free(ailp);
 }
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 2e073c1c4614..839df6559b9f 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -61,6 +61,7 @@  struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
+	struct xlog_ticket	*ail_relog_tic;
 };
 
 /*