
[2/8] xfs: separate CIL commit record IO

Message ID 20210223033442.3267258-3-david@fromorbit.com (mailing list archive)
State Superseded
Series [1/8] xfs: log stripe roundoff is a property of the log

Commit Message

Dave Chinner Feb. 23, 2021, 3:34 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

To allow for iclog IO device cache flush behaviour to be optimised,
we first need to separate out the commit record iclog IO from the
rest of the checkpoint so we can wait for the checkpoint IO to
complete before we issue the commit record.

This separation is only necessary if the commit record is being
written into a different iclog from the one that starts the
checkpoint, as the upcoming cache flushing changes require completion
ordering against the other iclogs submitted by the checkpoint.

If the entire checkpoint and commit record fit in the one iclog, then
they are both covered by the same set of cache flush primitives on
that iclog and hence there is no need to separate them for ordering.

Otherwise, we need to wait for all the previous iclogs to complete so
they are ordered correctly and made stable by the REQ_PREFLUSH that
the commit record iclog IO issues. This guarantees that if a reader
sees the commit record in the journal, they will also see the entire
checkpoint that the commit record closes off.

This also provides the guarantee that when the commit record IO
completes, we can safely unpin all the log items in the checkpoint
so they can be written back because the entire checkpoint is stable
in the journal.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_log_cil.c  |  7 ++++++
 fs/xfs/xfs_log_priv.h |  2 ++
 3 files changed, 64 insertions(+)
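
In outline, the change makes xlog_cil_push_work() do something like the
following (a simplified sketch of the logic in the diff discussed below,
not the verbatim patch):

	/*
	 * After the commit record has been written into commit_iclog:
	 * if the checkpoint spans more than one iclog, wait for the
	 * earlier checkpoint iclogs to complete before the commit
	 * record iclog can be submitted for IO.
	 */
	if (ctx->start_lsn != commit_lsn)
		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
	xfs_log_release_iclog(commit_iclog);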

Comments

Chandan Babu R Feb. 23, 2021, 12:12 p.m. UTC | #1
On 23 Feb 2021 at 09:04, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
>
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes requires completion ordering against
> the other iclogs submitted by the checkpoint.
>
> If the entire checkpoint and commit is in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
>
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
>
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
>

The changes seem to be logically correct.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_cil.c  |  7 ++++++
>  fs/xfs/xfs_log_priv.h |  2 ++
>  3 files changed, 64 insertions(+)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fa284f26d10e..ff26fb46d70f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
>  	return 0;
>  }
>  
> +/*
> + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> + * holds no log locks.
> + *
> + * We walk backwards through the iclogs to find the iclog with the highest lsn
> + * in the range that we need to wait for and then wait for it to complete.
> + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> + * candidate iclog we need to sleep on have been complete by the time our
> + * candidate has completed it's IO.
> + *
> + * Therefore we only need to find the first iclog that isn't clean within the
> + * span of our flush range. If we come across a clean, newly activated iclog
> + * with a lsn of 0, it means IO has completed on this iclog and all previous
> + * iclogs will be have been completed prior to this one. Hence finding a newly
> + * activated iclog indicates that there are no iclogs in the range we need to
> + * wait on and we are done searching.
> + */
> +int
> +xlog_wait_on_iclog_lsn(
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		start_lsn)
> +{
> +	struct xlog		*log = iclog->ic_log;
> +	struct xlog_in_core	*prev;
> +	int			error = -EIO;
> +
> +	spin_lock(&log->l_icloglock);
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		goto out_unlock;
> +
> +	error = 0;
> +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> +
> +		/* Done if the lsn is before our start lsn */
> +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> +				start_lsn) < 0)
> +			break;
> +
> +		/* Don't need to wait on completed, clean iclogs */
> +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> +			continue;
> +		}
> +
> +		/* wait for completion on this iclog */
> +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> +		return 0;
> +	}
> +
> +out_unlock:
> +	spin_unlock(&log->l_icloglock);
> +	return error;
> +}
> +
>  /*
>   * Write out an unmount record using the ticket provided. We have to account for
>   * the data space used in the unmount ticket as this write is not done from a
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..c5cc1b7ad25e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,13 @@ xlog_cil_push_work(
>  	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn)
> +		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
>  	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..a7ac85aaff4e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>  	remove_wait_queue(wq, &wait);
>  }
>  
> +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
> +
>  /*
>   * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>   * means that the next log record that includes this metadata could have a
Darrick J. Wong Feb. 24, 2021, 8:34 p.m. UTC | #2
On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
> 
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes requires completion ordering against
> the other iclogs submitted by the checkpoint.
> 
> If the entire checkpoint and commit is in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
> 
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
> 
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_cil.c  |  7 ++++++
>  fs/xfs/xfs_log_priv.h |  2 ++
>  3 files changed, 64 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fa284f26d10e..ff26fb46d70f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
>  	return 0;
>  }
>  
> +/*
> + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> + * holds no log locks.
> + *
> + * We walk backwards through the iclogs to find the iclog with the highest lsn
> + * in the range that we need to wait for and then wait for it to complete.
> + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> + * candidate iclog we need to sleep on have been complete by the time our
> + * candidate has completed it's IO.

Hmm, I guess this means that iclog header lsns are supposed to increase
as one walks forwards through the list?

> + *
> + * Therefore we only need to find the first iclog that isn't clean within the
> + * span of our flush range. If we come across a clean, newly activated iclog
> + * with a lsn of 0, it means IO has completed on this iclog and all previous
> + * iclogs will be have been completed prior to this one. Hence finding a newly
> + * activated iclog indicates that there are no iclogs in the range we need to
> + * wait on and we are done searching.

I don't see an explicit check for an iclog with a zero lsn?  Is that
implied by XLOG_STATE_ACTIVE?

Also, do you have any idea what Christoph was talking about wrt devices
with no-op flushes the last time this patch was posted?  This change
seems straightforward to me (assuming the answers to my two questions
are 'yes'), but I didn't grok what subtlety he was alluding to...?

--D

> + */
> +int
> +xlog_wait_on_iclog_lsn(
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		start_lsn)
> +{
> +	struct xlog		*log = iclog->ic_log;
> +	struct xlog_in_core	*prev;
> +	int			error = -EIO;
> +
> +	spin_lock(&log->l_icloglock);
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		goto out_unlock;
> +
> +	error = 0;
> +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> +
> +		/* Done if the lsn is before our start lsn */
> +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> +				start_lsn) < 0)
> +			break;
> +
> +		/* Don't need to wait on completed, clean iclogs */
> +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> +			continue;
> +		}
> +
> +		/* wait for completion on this iclog */
> +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> +		return 0;
> +	}
> +
> +out_unlock:
> +	spin_unlock(&log->l_icloglock);
> +	return error;
> +}
> +
>  /*
>   * Write out an unmount record using the ticket provided. We have to account for
>   * the data space used in the unmount ticket as this write is not done from a
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..c5cc1b7ad25e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,13 @@ xlog_cil_push_work(
>  	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn)
> +		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
>  	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..a7ac85aaff4e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>  	remove_wait_queue(wq, &wait);
>  }
>  
> +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
> +
>  /*
>   * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>   * means that the next log record that includes this metadata could have a
> -- 
> 2.28.0
>
Dave Chinner Feb. 24, 2021, 9:44 p.m. UTC | #3
On Wed, Feb 24, 2021 at 12:34:29PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To allow for iclog IO device cache flush behaviour to be optimised,
> > we first need to separate out the commit record iclog IO from the
> > rest of the checkpoint so we can wait for the checkpoint IO to
> > complete before we issue the commit record.
> > 
> > This separation is only necessary if the commit record is being
> > written into a different iclog to the start of the checkpoint as the
> > upcoming cache flushing changes requires completion ordering against
> > the other iclogs submitted by the checkpoint.
> > 
> > If the entire checkpoint and commit is in the one iclog, then they
> > are both covered by the one set of cache flush primitives on the
> > iclog and hence there is no need to separate them for ordering.
> > 
> > Otherwise, we need to wait for all the previous iclogs to complete
> > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > that the commit record iclog IO issues. This guarantees that if a
> > reader sees the commit record in the journal, they will also see the
> > entire checkpoint that commit record closes off.
> > 
> > This also provides the guarantee that when the commit record IO
> > completes, we can safely unpin all the log items in the checkpoint
> > so they can be written back because the entire checkpoint is stable
> > in the journal.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> >  fs/xfs/xfs_log_priv.h |  2 ++
> >  3 files changed, 64 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index fa284f26d10e..ff26fb46d70f 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > + * holds no log locks.
> > + *
> > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > + * in the range that we need to wait for and then wait for it to complete.
> > + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> > + * candidate iclog we need to sleep on have been complete by the time our
> > + * candidate has completed it's IO.
> 
> Hmm, I guess this means that iclog header lsns are supposed to increase
> as one walks forwards through the list?

Yes, the iclogs are written sequentially to the log - we don't
advance the log->l_iclog pointer until the current active iclog is
switched out, and then the next iclog in the ring is physically
located at a higher lsn than the one we just switched out.

> > + *
> > + * Therefore we only need to find the first iclog that isn't clean within the
> > + * span of our flush range. If we come across a clean, newly activated iclog
> > + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > + * iclogs will be have been completed prior to this one. Hence finding a newly
> > + * activated iclog indicates that there are no iclogs in the range we need to
> > + * wait on and we are done searching.
> 
> I don't see an explicit check for an iclog with a zero lsn?  Is that
> implied by XLOG_STATE_ACTIVE?

It's handled by the XFS_LSN_CMP(prev_lsn, start_lsn) < 0 check. If
the prev_lsn is zero because the iclog is clean, then this check
will always be true.
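
To illustrate with a sketch (relying on the fact that a clean, newly
activated iclog has h_lsn == 0 while any valid LSN has a cycle number
of at least 1):

	/* prev was cleaned, so its header LSN was zeroed on activation */
	prev_lsn = be64_to_cpu(prev->ic_header.h_lsn);	/* == 0 */
	if (XFS_LSN_CMP(prev_lsn, start_lsn) < 0)	/* 0 < any valid LSN */
		break;					/* nothing left to wait on */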

> Also, do you have any idea what was Christoph talking about wrt devices
> with no-op flushes the last time this patch was posted?  This change
> seems straightforward to me (assuming the answers to my two question are
> 'yes') but I didn't grok what subtlety he was alluding to...?

He was wondering what devices benefited from this. It has no impact
on high-speed devices that do not require flushes/FUA (e.g. high end
Intel Optane SSDs), but those are not the devices this change is
aimed at. There are no regressions on these high end devices,
either, so they are largely irrelevant to the patch and what it
targets...

Cheers,

Dave.
Darrick J. Wong Feb. 24, 2021, 11:06 p.m. UTC | #4
On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> On Wed, Feb 24, 2021 at 12:34:29PM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > To allow for iclog IO device cache flush behaviour to be optimised,
> > > we first need to separate out the commit record iclog IO from the
> > > rest of the checkpoint so we can wait for the checkpoint IO to
> > > complete before we issue the commit record.
> > > 
> > > This separation is only necessary if the commit record is being
> > > written into a different iclog to the start of the checkpoint as the
> > > upcoming cache flushing changes requires completion ordering against
> > > the other iclogs submitted by the checkpoint.
> > > 
> > > If the entire checkpoint and commit is in the one iclog, then they
> > > are both covered by the one set of cache flush primitives on the
> > > iclog and hence there is no need to separate them for ordering.
> > > 
> > > Otherwise, we need to wait for all the previous iclogs to complete
> > > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > > that the commit record iclog IO issues. This guarantees that if a
> > > reader sees the commit record in the journal, they will also see the
> > > entire checkpoint that commit record closes off.
> > > 
> > > This also provides the guarantee that when the commit record IO
> > > completes, we can safely unpin all the log items in the checkpoint
> > > so they can be written back because the entire checkpoint is stable
> > > in the journal.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> > >  fs/xfs/xfs_log_priv.h |  2 ++
> > >  3 files changed, 64 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index fa284f26d10e..ff26fb46d70f 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > > + * holds no log locks.
> > > + *
> > > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > > + * in the range that we need to wait for and then wait for it to complete.
> > > + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> > > + * candidate iclog we need to sleep on have been complete by the time our
> > > + * candidate has completed it's IO.
> > 
> > Hmm, I guess this means that iclog header lsns are supposed to increase
> > as one walks forwards through the list?
> 
> yes, the iclogs are written sequentially to the log - we don't
> switch the log->l_iclog pointer to the current active iclog until we
> switch it out, and then the next iclog in the loop is physically
> located at a higher lsn to the one we just switched out.
> 
> > > + *
> > > + * Therefore we only need to find the first iclog that isn't clean within the
> > > + * span of our flush range. If we come across a clean, newly activated iclog
> > > + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > > + * iclogs will be have been completed prior to this one. Hence finding a newly
> > > + * activated iclog indicates that there are no iclogs in the range we need to
> > > + * wait on and we are done searching.
> > 
> > I don't see an explicit check for an iclog with a zero lsn?  Is that
> > implied by XLOG_STATE_ACTIVE?
> 
> It's handled by the XFS_LSN_CMP(prev_lsn, start_lsn) < 0 check.  if
> the prev_lsn is zero because the iclog is clean, then this check
> will always be true.
> 
> > Also, do you have any idea what was Christoph talking about wrt devices
> > with no-op flushes the last time this patch was posted?  This change
> > seems straightforward to me (assuming the answers to my two question are
> > 'yes') but I didn't grok what subtlety he was alluding to...?
> 
> He was wondering what devices benefited from this. It has no impact
> on highspeed devices that do not require flushes/FUA (e.g. high end
> intel optane SSDs) but those are not the devices this change is
> aimed at. There are no regressions on these high end devices,
> either, so they are largely irrelevant to the patch and what it
> targets...

Ok, that's what I thought.  It seemed fairly self-evident to me that
high speed devices wouldn't care.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig Feb. 25, 2021, 8:34 a.m. UTC | #5
On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > Also, do you have any idea what was Christoph talking about wrt devices
> > with no-op flushes the last time this patch was posted?  This change
> > seems straightforward to me (assuming the answers to my two question are
> > 'yes') but I didn't grok what subtlety he was alluding to...?
> 
> He was wondering what devices benefited from this. It has no impact
> on highspeed devices that do not require flushes/FUA (e.g. high end
> intel optane SSDs) but those are not the devices this change is
> aimed at. There are no regressions on these high end devices,
> either, so they are largely irrelevant to the patch and what it
> targets...

I don't think it is that simple.  Pretty much every device aimed at
enterprise use does not enable a volatile write cache by default.  That
also includes hard drives, arrays and NAND based SSDs.

Especially for hard drives (or slower arrays) the actual I/O wait might
matter.  What is the argument against making this conditional?
Dave Chinner Feb. 25, 2021, 8:47 p.m. UTC | #6
On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > Also, do you have any idea what was Christoph talking about wrt devices
> > > with no-op flushes the last time this patch was posted?  This change
> > > seems straightforward to me (assuming the answers to my two question are
> > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > 
> > He was wondering what devices benefited from this. It has no impact
> > on highspeed devices that do not require flushes/FUA (e.g. high end
> > intel optane SSDs) but those are not the devices this change is
> > aimed at. There are no regressions on these high end devices,
> > either, so they are largely irrelevant to the patch and what it
> > targets...
> 
> I don't think it is that simple.  Pretty much every device aimed at
> enterprise use does not enable a volatile write cache by default.  That
> also includes hard drives, arrays and NAND based SSDs.
> 
> Especially for hard drives (or slower arrays) the actual I/O wait might
> matter. 

Sorry, I/O wait might matter for what?

I'm really not sure what you're objecting to - you've hand-waved
about hardware that doesn't need cache flushes twice now and implied
that it would be adversely affected by removing cache flushes. That
just doesn't make any sense at all, and I have numbers to back it
up.

You also asked what storage it improved performance on and I told
you and then also pointed out all the software layers that it
massively helps, too, regardless of the physical storage
characteristics.

https://lore.kernel.org/linux-xfs/20210203212013.GV4662@dread.disaster.area/

I have numbers to back it up. You did not reply to me, so I'm not
going to waste time repeating myself here.

> What is the argument against making this conditional?

There is no argument for making this conditional. You've created an
undefined strawman and are demanding that I prove it wrong. If
you've got anything concrete, then tell us about it directly and
provide numbers.

Cheers,

Dave.
Darrick J. Wong Feb. 26, 2021, 2:48 a.m. UTC | #7
On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > Also, do you have any idea what was Christoph talking about wrt devices
> > > with no-op flushes the last time this patch was posted?  This change
> > > seems straightforward to me (assuming the answers to my two question are
> > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > 
> > He was wondering what devices benefited from this. It has no impact
> > on highspeed devices that do not require flushes/FUA (e.g. high end
> > intel optane SSDs) but those are not the devices this change is
> > aimed at. There are no regressions on these high end devices,
> > either, so they are largely irrelevant to the patch and what it
> > targets...
> 
> I don't think it is that simple.  Pretty much every device aimed at
> enterprise use does not enable a volatile write cache by default.  That
> also includes hard drives, arrays and NAND based SSDs.
> 
> Especially for hard drives (or slower arrays) the actual I/O wait might
> matter.  What is the argument against making this conditional?

I still don't understand what you're asking about here --

AFAICT the net effect of this patchset is that it reduces the number of
preflushes and FUA log writes.  To my knowledge, on a high end device
with no volatile write cache, flushes are a no-op (because all writes
are persisted somewhere immediately) and a FUA write should be the exact
same thing as a non-FUA write.  Because XFS will now issue fewer no-op
persistence commands to the device, there should be no effect at all.
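
Roughly speaking, the log bio submission boils down to something like
the sketch below (illustrative only - the flag variables are
placeholders, not the exact code in the series):

	bio->bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC;
	if (need_flush)				/* placeholder condition */
		bio->bi_opf |= REQ_PREFLUSH;	/* flush device cache first */
	if (need_fua)				/* placeholder condition */
		bio->bi_opf |= REQ_FUA;		/* write through to stable media */
	/* with no volatile write cache, the block layer drops both flags */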

In contrast, a dumb stone tablet with a write cache hooked up to SATA
will have agonizingly slow cache flushes.  XFS will issue fewer
persistence commands to the rock, which in turn makes things faster
because we're calling the engravers less often.

What am I missing here?  Are you saying that the cost of a cache flush
goes up much faster than the amount of data that has to be flushed?

--D
Brian Foster Feb. 28, 2021, 4:36 p.m. UTC | #8
On Thu, Feb 25, 2021 at 06:48:28PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > > Also, do you have any idea what was Christoph talking about wrt devices
> > > > with no-op flushes the last time this patch was posted?  This change
> > > > seems straightforward to me (assuming the answers to my two question are
> > > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > > 
> > > He was wondering what devices benefited from this. It has no impact
> > > on highspeed devices that do not require flushes/FUA (e.g. high end
> > > intel optane SSDs) but those are not the devices this change is
> > > aimed at. There are no regressions on these high end devices,
> > > either, so they are largely irrelevant to the patch and what it
> > > targets...
> > 
> > I don't think it is that simple.  Pretty much every device aimed at
> > enterprise use does not enable a volatile write cache by default.  That
> > also includes hard drives, arrays and NAND based SSDs.
> > 
> > Especially for hard drives (or slower arrays) the actual I/O wait might
> > matter.  What is the argument against making this conditional?
> 
> I still don't understand what you're asking about here --
> 
> AFAICT the net effect of this patchset is that it reduces the number of
> preflushes and FUA log writes.  To my knowledge, on a high end device
> with no volatile write cache, flushes are a no-op (because all writes
> are persisted somewhere immediately) and a FUA write should be the exact
> same thing as a non-FUA write.  Because XFS will now issue fewer no-op
> persistence commands to the device, there should be no effect at all.
> 

Except the cost of the new iowaits used to implement iclog ordering...
which I think is what Christoph has been asking about...?

IOW, considering the storage configuration noted above where the impact
of the flush/fua optimizations is neutral, the net effect of this change
is whatever impact is introduced by intra-checkpoint iowaits and iclog
ordering. What is that impact?

Note that it's not clear enough to me to suggest whether that impact
might be significant or not. Hopefully it's neutral (?), but that seems
like best case scenario so I do think it's a reasonable question.

Brian

> In contrast, a dumb stone tablet with a write cache hooked up to SATA
> will have agonizingly slow cache flushes.  XFS will issue fewer
> persistence commands to the rock, which in turn makes things faster
> because we're calling the engravers less often.
> 
> What am I missing here?  Are you saying that the cost of a cache flush
> goes up much faster than the amount of data that has to be flushed?
> 
> --D
>
Dave Chinner Feb. 28, 2021, 11:46 p.m. UTC | #9
On Sun, Feb 28, 2021 at 11:36:13AM -0500, Brian Foster wrote:
> On Thu, Feb 25, 2021 at 06:48:28PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> > > On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > > > Also, do you have any idea what was Christoph talking about wrt devices
> > > > > with no-op flushes the last time this patch was posted?  This change
> > > > > seems straightforward to me (assuming the answers to my two question are
> > > > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > > > 
> > > > He was wondering what devices benefited from this. It has no impact
> > > > on highspeed devices that do not require flushes/FUA (e.g. high end
> > > > intel optane SSDs) but those are not the devices this change is
> > > > aimed at. There are no regressions on these high end devices,
> > > > either, so they are largely irrelevant to the patch and what it
> > > > targets...
> > > 
> > > I don't think it is that simple.  Pretty much every device aimed at
> > > enterprise use does not enable a volatile write cache by default.  That
> > > also includes hard drives, arrays and NAND based SSDs.
> > > 
> > > Especially for hard drives (or slower arrays) the actual I/O wait might
> > > matter.  What is the argument against making this conditional?
> > 
> > I still don't understand what you're asking about here --
> > 
> > AFAICT the net effect of this patchset is that it reduces the number of
> > preflushes and FUA log writes.  To my knowledge, on a high end device
> > with no volatile write cache, flushes are a no-op (because all writes
> > are persisted somewhere immediately) and a FUA write should be the exact
> > same thing as a non-FUA write.  Because XFS will now issue fewer no-op
> > persistence commands to the device, there should be no effect at all.
> > 
> 
> Except the cost of the new iowaits used to implement iclog ordering...
> which I think is what Christoph has been asking about..?

And I've already answered - it is largely just noise.

> IOW, considering the storage configuration noted above where the impact
> of the flush/fua optimizations is neutral, the net effect of this change
> is whatever impact is introduced by intra-checkpoint iowaits and iclog
> ordering. What is that impact?

All I've really noticed is that long tail latencies on operations go
down a bit. That seems to correlate with spending less time waiting
for log space when the log is full, but it's a marginal improvement
at best.

Otherwise I cannot measure any significant difference in performance
or behaviour across any of the metrics I monitor during performance
testing.

> Note that it's not clear enough to me to suggest whether that impact
> might be significant or not. Hopefully it's neutral (?), but that seems
> like best case scenario so I do think it's a reasonable question.

Yes, it's a reasonable question, but I answered it entirely and in
great detail the first time.  Repeating the same question multiple
times with slightly different phrasing does not change the answer,
nor does it explain to me what the undocumented concern might be...

Cheers,

Dave.
Christoph Hellwig March 1, 2021, 9:09 a.m. UTC | #10
On Fri, Feb 26, 2021 at 07:47:55AM +1100, Dave Chinner wrote:
> Sorry, I/O wait might matter for what?

Say you have a SAS hard drive, WCE=0, and a typical queue depth of a
few dozen commands.

Before this change we'd submit a bunch of iclogs, which are generally
sequential except of course for the log wrap-around case.  The drive
can then easily take all the iclogs and write them in one rotation.

Now, if we wait for the previous iclogs before submitting the
commit_iclog, we need at least one additional full round trip.
Brian Foster March 1, 2021, 3:19 p.m. UTC | #11
On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
> 
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes requires completion ordering against
> the other iclogs submitted by the checkpoint.
> 
> If the entire checkpoint and commit is in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
> 
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
> 
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_cil.c  |  7 ++++++
>  fs/xfs/xfs_log_priv.h |  2 ++
>  3 files changed, 64 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fa284f26d10e..ff26fb46d70f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
>  	return 0;
>  }
>  
> +/*
> + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> + * holds no log locks.
> + *
> + * We walk backwards through the iclogs to find the iclog with the highest lsn
> + * in the range that we need to wait for and then wait for it to complete.
> + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> + * candidate iclog we need to sleep on have been complete by the time our
> + * candidate has completed it's IO.
> + *
> + * Therefore we only need to find the first iclog that isn't clean within the
> + * span of our flush range. If we come across a clean, newly activated iclog
> + * with a lsn of 0, it means IO has completed on this iclog and all previous
> + * iclogs will be have been completed prior to this one. Hence finding a newly
> + * activated iclog indicates that there are no iclogs in the range we need to
> + * wait on and we are done searching.
> + */
> +int
> +xlog_wait_on_iclog_lsn(
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		start_lsn)
> +{
> +	struct xlog		*log = iclog->ic_log;
> +	struct xlog_in_core	*prev;
> +	int			error = -EIO;
> +
> +	spin_lock(&log->l_icloglock);
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		goto out_unlock;
> +
> +	error = 0;
> +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> +
> +		/* Done if the lsn is before our start lsn */
> +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> +				start_lsn) < 0)
> +			break;
> +
> +		/* Don't need to wait on completed, clean iclogs */
> +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> +			continue;
> +		}
> +
> +		/* wait for completion on this iclog */
> +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);

You haven't addressed my feedback from the previous version. In
particular the bit about whether it is safe to block on ->ic_force_wait
from here considering some of our more quirky buffer locking behavior.

That aside, this iteration logic all seems a bit overengineered to me.
We have the commit record iclog of the current checkpoint and thus the
immediately previous iclog in the ring. We know that previous record
isn't earlier than start_lsn because the caller confirmed that start_lsn
!= commit_lsn. We also know that iclog can't become dirty -> active
until it and all previous iclog writes have completed because the
callback ordering implemented by xlog_state_do_callback() won't clean
the iclog until that point. Given that, can't this whole thing be
replaced with a check of iclog->prev to either see if it's been cleaned
or to otherwise xlog_wait() for that condition and return?
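
i.e. something like this minimal, untested sketch (reusing the
variables from xlog_wait_on_iclog_lsn() above) is what I have in mind:

	/* only the immediately previous iclog needs to be checked */
	spin_lock(&log->l_icloglock);
	prev = iclog->ic_prev;
	if (prev->ic_state != XLOG_STATE_ACTIVE &&
	    prev->ic_state != XLOG_STATE_DIRTY) {
		/* xlog_wait() drops l_icloglock once we're on the waitqueue */
		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
		return 0;
	}
	spin_unlock(&log->l_icloglock);
	return 0;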

Brian

> +		return 0;
> +	}
> +
> +out_unlock:
> +	spin_unlock(&log->l_icloglock);
> +	return error;
> +}
> +
>  /*
>   * Write out an unmount record using the ticket provided. We have to account for
>   * the data space used in the unmount ticket as this write is not done from a
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..c5cc1b7ad25e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,13 @@ xlog_cil_push_work(
>  	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn)
> +		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
>  	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..a7ac85aaff4e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>  	remove_wait_queue(wq, &wait);
>  }
>  
> +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
> +
>  /*
>   * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>   * means that the next log record that includes this metadata could have a
> -- 
> 2.28.0
>
Brian Foster March 1, 2021, 3:33 p.m. UTC | #12
On Mon, Mar 01, 2021 at 10:46:42AM +1100, Dave Chinner wrote:
> On Sun, Feb 28, 2021 at 11:36:13AM -0500, Brian Foster wrote:
> > On Thu, Feb 25, 2021 at 06:48:28PM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> > > > On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > > > > Also, do you have any idea what was Christoph talking about wrt devices
> > > > > > with no-op flushes the last time this patch was posted?  This change
> > > > > > seems straightforward to me (assuming the answers to my two question are
> > > > > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > > > > 
> > > > > He was wondering what devices benefited from this. It has no impact
> > > > > on highspeed devices that do not require flushes/FUA (e.g. high end
> > > > > intel optane SSDs) but those are not the devices this change is
> > > > > aimed at. There are no regressions on these high end devices,
> > > > > either, so they are largely irrelevant to the patch and what it
> > > > > targets...
> > > > 
> > > > I don't think it is that simple.  Pretty much every device aimed at
> > > > enterprise use does not enable a volatile write cache by default.  That
> > > > also includes hard drives, arrays and NAND based SSDs.
> > > > 
> > > > Especially for hard drives (or slower arrays) the actual I/O wait might
> > > > matter.  What is the argument against making this conditional?
> > > 
> > > I still don't understand what you're asking about here --
> > > 
> > > AFAICT the net effect of this patchset is that it reduces the number of
> > > preflushes and FUA log writes.  To my knowledge, on a high end device
> > > with no volatile write cache, flushes are a no-op (because all writes
> > > are persisted somewhere immediately) and a FUA write should be the exact
> > > same thing as a non-FUA write.  Because XFS will now issue fewer no-op
> > > persistence commands to the device, there should be no effect at all.
> > > 
> > 
> > Except the cost of the new iowaits used to implement iclog ordering...
> > which I think is what Christoph has been asking about..?
> 
> And I've already answered - it is largely just noise.
> 
> > IOW, considering the storage configuration noted above where the impact
> > of the flush/fua optimizations is neutral, the net effect of this change
> > is whatever impact is introduced by intra-checkpoint iowaits and iclog
> > ordering. What is that impact?
> 
> All I've really noticed is that long tail latencies on operations go
> down a bit. That seems to correlate with spending less time waiting
> for log space when the log is full, but it's a marginal improvement
> at best.
> 
> Otherwise I cannot measure any significant difference in performance
> or behaviour across any of the metrics I monitor during performance
> testing.
> 

Ok.

> > Note that it's not clear enough to me to suggest whether that impact
> > might be significant or not. Hopefully it's neutral (?), but that seems
> > like best case scenario so I do think it's a reasonable question.
> 
> Yes, It's a reasonable question, but I answered it entirely and in
> great detail the first time.  Repeating the same question multiple
> times just with slightly different phrasing does not change the
> answer, nor explain to me what the undocumented concern might be...
> 

Darrick noted he wasn't clear on the question being asked. I rephrased
it to hopefully add some clarity, not change the answer (?).

(FWIW, the response in the previous version of this series didn't
clearly answer the question from my perspective either, so perhaps that
is why you're seeing it repeated by multiple reviewers. Regardless,
Christoph already replied with more detail so I'll just follow along in
that sub-thread..)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 3, 2021, 12:11 a.m. UTC | #13
On Mon, Mar 01, 2021 at 09:09:01AM +0000, Christoph Hellwig wrote:
> On Fri, Feb 26, 2021 at 07:47:55AM +1100, Dave Chinner wrote:
> > Sorry, I/O wait might matter for what?
> 
> Think you have a SAS hard drive, WCE=0, typical queue depth of a few
> dozend commands.

Yup, so typical IO latency of 2-3ms, assuming 10krpm, assuming we
aren't doing metadata writeback which will blow this out.

I've tested on slower iSCSI devices than this (7-8ms typical av.
seek time), and it didn't turn up any performance anomalies.

> Before that we'd submit a bunch of iclogs, which are generally
> sequential except of course for the log wrap around case.  The drive
> can now easily take all the iclogs and write them in one rotation.

Even if we take the best case for your example, this still means we
block once every 8 iclogs, waiting 2-3ms for the spindle to rotate
and complete the IOs. Hence for a checkpoint of 32MB with 256kB
iclogs, we're blocking for 2-3ms at least 16 times before we get to
the commit iclog. With the default iclog size of 32kB, we'll block a
couple of hundred times waiting on iclog IO...

IOWs, we're already talking about a best case checkpoint commit
latency of 30-50ms here.

[ And this isn't even considering media bandwidth there - 32MB on a
drive that can do maybe 200MB/s in the middle of the spindle where
the log is. That's another 150ms of data transfer time to physical
media. So if the drive is actually writing to physical media because
WCE=0, then we're taking *at least* 200ms per 32MB checkpoint. ]
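
Putting rough numbers on that (assuming the default of 8 in-core
iclogs in the ring):

	256kB iclogs: 32MB / 256kB = 128 iclogs -> ~16 waits x 2-3ms = ~30-50ms
	 32kB iclogs: 32MB /  32kB = 1024 iclogs -> ~128 waits
	 media time:  32MB / 200MB/s             = ~160ms of transfer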

> Now if we wait for the previous iclogs before submitting the
> commit_iclog we need at least one more additional full roundtrip.

So we add an average of 2-3ms to what is already taking, in the best
case, 30-50ms.

And these are mostly async commits this overhead is added to, so
there's rarely anything waiting on it and hence the extra small
latency is almost always lost in the noise. Even if the extra delay
is larger, there is rarely anything waiting on it so it's still
noise...

I just don't see anything relevant that stands out from the noise on
my systems.

Cheers,

Dave.
Dave Chinner March 3, 2021, 12:41 a.m. UTC | #14
On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To allow for iclog IO device cache flush behaviour to be optimised,
> > we first need to separate out the commit record iclog IO from the
> > rest of the checkpoint so we can wait for the checkpoint IO to
> > complete before we issue the commit record.
> > 
> > This separation is only necessary if the commit record is being
> > written into a different iclog to the start of the checkpoint as the
> > upcoming cache flushing changes requires completion ordering against
> > the other iclogs submitted by the checkpoint.
> > 
> > If the entire checkpoint and commit is in the one iclog, then they
> > are both covered by the one set of cache flush primitives on the
> > iclog and hence there is no need to separate them for ordering.
> > 
> > Otherwise, we need to wait for all the previous iclogs to complete
> > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > that the commit record iclog IO issues. This guarantees that if a
> > reader sees the commit record in the journal, they will also see the
> > entire checkpoint that commit record closes off.
> > 
> > This also provides the guarantee that when the commit record IO
> > completes, we can safely unpin all the log items in the checkpoint
> > so they can be written back because the entire checkpoint is stable
> > in the journal.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> >  fs/xfs/xfs_log_priv.h |  2 ++
> >  3 files changed, 64 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index fa284f26d10e..ff26fb46d70f 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > + * holds no log locks.
> > + *
> > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > + * in the range that we need to wait for and then wait for it to complete.
> > + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> > + * candidate iclog we need to sleep on have been complete by the time our
> > + * candidate has completed it's IO.
> > + *
> > + * Therefore we only need to find the first iclog that isn't clean within the
> > + * span of our flush range. If we come across a clean, newly activated iclog
> > + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > + * iclogs will be have been completed prior to this one. Hence finding a newly
> > + * activated iclog indicates that there are no iclogs in the range we need to
> > + * wait on and we are done searching.
> > + */
> > +int
> > +xlog_wait_on_iclog_lsn(
> > +	struct xlog_in_core	*iclog,
> > +	xfs_lsn_t		start_lsn)
> > +{
> > +	struct xlog		*log = iclog->ic_log;
> > +	struct xlog_in_core	*prev;
> > +	int			error = -EIO;
> > +
> > +	spin_lock(&log->l_icloglock);
> > +	if (XLOG_FORCED_SHUTDOWN(log))
> > +		goto out_unlock;
> > +
> > +	error = 0;
> > +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> > +
> > +		/* Done if the lsn is before our start lsn */
> > +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> > +				start_lsn) < 0)
> > +			break;
> > +
> > +		/* Don't need to wait on completed, clean iclogs */
> > +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> > +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> > +			continue;
> > +		}
> > +
> > +		/* wait for completion on this iclog */
> > +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> 
> You haven't addressed my feedback from the previous version. In
> particular the bit about whether it is safe to block on ->ic_force_wait
> from here considering some of our more quirky buffer locking behavior.

Sorry, first I've heard about this. I don't have any such email in
my inbox.

I don't know what waiting on an iclog in the middle of a checkpoint
has to do with buffer locking behaviour, because iclogs don't use
buffers and we block waiting on iclog IO completion all the time in
xlog_state_get_iclog_space(). If it's not safe to block on iclog IO
completion here, then it's not safe to block on an iclog in
xlog_state_get_iclog_space(). That's obviously not true, so I'm
really not sure what the concern here is...

> That aside, this iteration logic all seems a bit overengineered to me.
> We have the commit record iclog of the current checkpoint and thus the
> immediately previous iclog in the ring. We know that previous record
> isn't earlier than start_lsn because the caller confirmed that start_lsn
> != commit_lsn. We also know that iclog can't become dirty -> active
> until it and all previous iclog writes have completed because the
> callback ordering implemented by xlog_state_do_callback() won't clean
> the iclog until that point. Given that, can't this whole thing be
> replaced with a check of iclog->prev to either see if it's been cleaned
> or to otherwise xlog_wait() for that condition and return?

Maybe. I was more concerned about ensuring that it did the right
thing, so I checked all the things that came to mind. There was more
than enough complexity in other parts of this patchset filling my
brain that a minimal implementation was not a concern. I'll go take
another look at it.

Cheers,

Dave.
Brian Foster March 3, 2021, 3:22 p.m. UTC | #15
On Wed, Mar 03, 2021 at 11:41:19AM +1100, Dave Chinner wrote:
> On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > To allow for iclog IO device cache flush behaviour to be optimised,
> > > we first need to separate out the commit record iclog IO from the
> > > rest of the checkpoint so we can wait for the checkpoint IO to
> > > complete before we issue the commit record.
> > > 
> > > This separation is only necessary if the commit record is being
> > > written into a different iclog to the start of the checkpoint as the
> > > upcoming cache flushing changes requires completion ordering against
> > > the other iclogs submitted by the checkpoint.
> > > 
> > > If the entire checkpoint and commit is in the one iclog, then they
> > > are both covered by the one set of cache flush primitives on the
> > > iclog and hence there is no need to separate them for ordering.
> > > 
> > > Otherwise, we need to wait for all the previous iclogs to complete
> > > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > > that the commit record iclog IO issues. This guarantees that if a
> > > reader sees the commit record in the journal, they will also see the
> > > entire checkpoint that commit record closes off.
> > > 
> > > This also provides the guarantee that when the commit record IO
> > > completes, we can safely unpin all the log items in the checkpoint
> > > so they can be written back because the entire checkpoint is stable
> > > in the journal.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> > >  fs/xfs/xfs_log_priv.h |  2 ++
> > >  3 files changed, 64 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index fa284f26d10e..ff26fb46d70f 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > > + * holds no log locks.
> > > + *
> > > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > > + * in the range that we need to wait for and then wait for it to complete.
> > > + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> > > + * candidate iclog we need to sleep on have been complete by the time our
> > > + * candidate has completed it's IO.
> > > + *
> > > + * Therefore we only need to find the first iclog that isn't clean within the
> > > + * span of our flush range. If we come across a clean, newly activated iclog
> > > + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > > + * iclogs will be have been completed prior to this one. Hence finding a newly
> > > + * activated iclog indicates that there are no iclogs in the range we need to
> > > + * wait on and we are done searching.
> > > + */
> > > +int
> > > +xlog_wait_on_iclog_lsn(
> > > +	struct xlog_in_core	*iclog,
> > > +	xfs_lsn_t		start_lsn)
> > > +{
> > > +	struct xlog		*log = iclog->ic_log;
> > > +	struct xlog_in_core	*prev;
> > > +	int			error = -EIO;
> > > +
> > > +	spin_lock(&log->l_icloglock);
> > > +	if (XLOG_FORCED_SHUTDOWN(log))
> > > +		goto out_unlock;
> > > +
> > > +	error = 0;
> > > +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> > > +
> > > +		/* Done if the lsn is before our start lsn */
> > > +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> > > +				start_lsn) < 0)
> > > +			break;
> > > +
> > > +		/* Don't need to wait on completed, clean iclogs */
> > > +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> > > +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> > > +			continue;
> > > +		}
> > > +
> > > +		/* wait for completion on this iclog */
> > > +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> > 
> > You haven't addressed my feedback from the previous version. In
> > particular the bit about whether it is safe to block on ->ic_force_wait
> > from here considering some of our more quirky buffer locking behavior.
> 
> Sorry, first I've heard about this. I don't have any such email in
> my inbox.
> 

For reference, the last bit of this mail:

https://lore.kernel.org/linux-xfs/20210201160737.GA3252048@bfoster/

> I don't know what waiting on an iclog in the middle of a checkpoint
> has to do with buffer locking behaviour, because iclogs don't use
> buffers and we block waiting on iclog IO completion all the time in
> xlog_state_get_iclog_space(). If it's not safe to block on iclog IO
> completion here, then it's not safe to block on an iclog in
> xlog_state_get_iclog_space(). That's obviously not true, so I'm
> really not sure what the concern here is...
> 

I think the broader question is not so much whether it's safe to block
here or not, but whether our current use of async log forces might have
a deadlock vector (which may or may not also include the
_get_iclog_space() scenario, I'd need to stare at that one a bit). I
referred to buffer locking because the buffer ->iop_unpin() handler can
attempt to acquire a buffer lock.

Looking again, that is the only place I see that blocks in iclog
completion callbacks and it's actually an abort scenario, which means
shutdown. I am slightly concerned that introducing more regular blocking
in the CIL push might lead to more frequent async log forces that block
on callback iclogs and thus exacerbate that issue (i.e. somebody might
be able to now reproduce yet another shutdown deadlock scenario to track
down that might not have been reproducible before, for whatever reason),
but that's probably not a serious enough problem to block this patch and
the advantages of the series overall.
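
For illustration, a minimal sketch of the kind of unpin-time blocking being
described here; this is not the actual xfs_buf_item_unpin() implementation,
and example_item_to_buf() is a hypothetical helper (only the xfs_buf lock,
ioerror and ioend helpers are assumed to behave as usual):

/*
 * Illustrative sketch only: an ->iop_unpin handler that, on abort
 * (remove == true), must take the buffer lock before it can fail the
 * buffer. If the holder of that buffer lock is itself waiting on iclog
 * completion, the two can deadlock.
 */
static void
example_buf_item_unpin(
	struct xfs_log_item	*lip,
	int			remove)
{
	struct xfs_buf		*bp = example_item_to_buf(lip);	/* hypothetical */

	/* only the final unpin gets to tear the buffer down */
	if (!atomic_dec_and_test(&bp->b_pin_count))
		return;

	if (remove) {
		xfs_buf_lock(bp);		/* may block here */
		xfs_buf_ioerror(bp, -EIO);
		xfs_buf_ioend(bp);		/* fail the buffer */
	}
}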

Brian

> > That aside, this iteration logic all seems a bit overengineered to me.
> > We have the commit record iclog of the current checkpoint and thus the
> > immediately previous iclog in the ring. We know that previous record
> > isn't earlier than start_lsn because the caller confirmed that start_lsn
> > != commit_lsn. We also know that iclog can't become dirty -> active
> > until it and all previous iclog writes have completed because the
> > callback ordering implemented by xlog_state_do_callback() won't clean
> > the iclog until that point. Given that, can't this whole thing be
> > replaced with a check of iclog->prev to either see if it's been cleaned
> > or to otherwise xlog_wait() for that condition and return?
> 
> Maybe. I was more concerned about ensuring that it did the right
> thing so I checked all the things that came to mind. There was more
> than enough complexity in other parts of this patchset to fill my
> brain that a minimal implementation was not a concern. I'll go take
> another look at it.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 4, 2021, 10:57 p.m. UTC | #16
On Wed, Mar 03, 2021 at 10:22:05AM -0500, Brian Foster wrote:
> On Wed, Mar 03, 2021 at 11:41:19AM +1100, Dave Chinner wrote:
> > On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> > > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > > You haven't addressed my feedback from the previous version. In
> > > particular the bit about whether it is safe to block on ->ic_force_wait
> > > from here considering some of our more quirky buffer locking behavior.
> > 
> > Sorry, first I've heard about this. I don't have any such email in
> > my inbox.
> > 
> 
> For reference, the last bit of this mail:
> 
> https://lore.kernel.org/linux-xfs/20210201160737.GA3252048@bfoster/
> 
> > I don't know what waiting on an iclog in the middle of a checkpoint
> > has to do with buffer locking behaviour, because iclogs don't use
> > buffers and we block waiting on iclog IO completion all the time in
> > xlog_state_get_iclog_space(). If it's not safe to block on iclog IO
> > completion here, then it's not safe to block on an iclog in
> > xlog_state_get_iclog_space(). That's obviously not true, so I'm
> > really not sure what the concern here is...
> > 
> 
> I think the broader question is not so much whether it's safe to block
> here or not, but whether our current use of async log forces might have
> a deadlock vector (which may or may not also include the
> _get_iclog_space() scenario, I'd need to stare at that one a bit). I
> referred to buffer locking because the buffer ->iop_unpin() handler can
> attempt to acquire a buffer lock.

There are none that I know of, and I'm not changing any of the log
write blocking rules. Hence if there is a problem, it's a zero-day
that we have never triggered nor have any awareness about at all.
Hence for the purposes of development and review, we can assume such
unknown design problems don't actually exist because there's
absolutely zero evidence to indicate there is a problem here...

> Looking again, that is the only place I see that blocks in iclog
> completion callbacks and it's actually an abort scenario, which means
> shutdown.

Yup. The AIL simply needs to abort writeback of such locked, pinned
buffers and then everything works just fine.
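
For context, a minimal illustrative sketch of how an AIL push handler backs
off from pinned or already-locked buffers rather than blocking on them; this
is not the actual xfs_buf_item_push() code, example_item_to_buf() is a
hypothetical helper, and the usual xfs_buf and XFS_ITEM_* helpers are assumed:

/*
 * Illustrative sketch only: back off from pinned or locked buffers so
 * the AIL never blocks on them; writeback is simply retried on a later
 * push, or the buffer is failed once the unpin/abort path runs.
 */
static uint
example_buf_item_push(
	struct xfs_log_item	*lip,
	struct list_head	*buffer_list)
{
	struct xfs_buf		*bp = example_item_to_buf(lip);	/* hypothetical */
	uint			rval = XFS_ITEM_SUCCESS;

	if (xfs_buf_ispinned(bp))
		return XFS_ITEM_PINNED;
	if (!xfs_buf_trylock(bp))
		return XFS_ITEM_LOCKED;

	/* queue for delayed write; if already queued, it is being flushed */
	if (!xfs_buf_delwri_queue(bp, buffer_list))
		rval = XFS_ITEM_FLUSHING;
	xfs_buf_unlock(bp);
	return rval;
}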

> I am slightly concerned that introducing more regular blocking in
> the CIL push might lead to more frequent async log forces that
> block on callback iclogs and thus exacerbate that issue (i.e.
> somebody might be able to now reproduce yet another shutdown
> deadlock scenario to track down that might not have been
> reproducible before, for whatever reason), but that's probably not
> a serious enough problem to block this patch and the advantages of
> the series overall.

And that's why I updated the log force stats accounting to capture
the async log forces and how we account log forces that block. That
gives me direct visibility into the blocking behaviour while I'm
running tests. And even with this new visibility, I can't see any
change in the metrics that are above the noise floor...

Cheers,

Dave.
Dave Chinner March 5, 2021, 12:44 a.m. UTC | #17
On Wed, Mar 03, 2021 at 11:41:19AM +1100, Dave Chinner wrote:
> On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > That aside, this iteration logic all seems a bit overengineered to me.
> > We have the commit record iclog of the current checkpoint and thus the
> > immediately previous iclog in the ring. We know that previous record
> > isn't earlier than start_lsn because the caller confirmed that start_lsn
> > != commit_lsn. We also know that iclog can't become dirty -> active
> > until it and all previous iclog writes have completed because the
> > callback ordering implemented by xlog_state_do_callback() won't clean
> > the iclog until that point. Given that, can't this whole thing be
> > replaced with a check of iclog->prev to either see if it's been cleaned
> > or to otherwise xlog_wait() for that condition and return?
> 
> Maybe. I was more concerned about ensuring that it did the right
> thing so I checked all the things that came to mind. There was more
> than enough complexity in other parts of this patchset to fill my
> brain that a minimal implementation was not a concern. I'll go take
> another look at it.

Ok, so we can just use xlog_wait_on_iclog() here. I didn't look too
closely at the implementation of that function, just took the
comment above it at face value that it only waited for an iclog to
hit the disk.

We actually have two different iclog IO completion wait points - one
to wait for an iclog to hit the disk, and one to wait for it to hit
the disk and run completion callbacks. i.e. one is not ordered against
other iclogs and the other is strictly ordered.

The ordered version runs completion callbacks before waking waiters,
thereby guaranteeing all previous iclogs have been completed before
completing the current iclog and waking waiters.

The CIL code needs the latter, so yes, this can be simplified down
to a single xlog_wait_on_iclog(commit_iclog->ic_prev); call from the
CIL.
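
Concretely, a minimal sketch of that simplification in xlog_cil_push_work(),
assuming the ctx, commit_lsn and commit_iclog locals shown in the patch below
and that xlog_wait_on_iclog() keeps its current ordered-wait behaviour:

	/*
	 * If the checkpoint spans multiple iclogs, wait for all previous
	 * iclogs to complete before we submit the commit_iclog. The wait
	 * is ordered - completion callbacks run before waiters are woken -
	 * so waiting on commit_iclog->ic_prev covers every earlier iclog
	 * in this checkpoint.
	 */
	if (ctx->start_lsn != commit_lsn)
		xlog_wait_on_iclog(commit_iclog->ic_prev);

	/* release the hounds! */
	xfs_log_release_iclog(commit_iclog);

With that, the separate xlog_wait_on_iclog_lsn() helper added by the patch
below would no longer be needed.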

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fa284f26d10e..ff26fb46d70f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -808,6 +808,61 @@  xlog_wait_on_iclog(
 	return 0;
 }
 
+/*
+ * Wait on any iclogs that are still flushing in the range of start_lsn to the
+ * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
+ * holds no log locks.
+ *
+ * We walk backwards through the iclogs to find the iclog with the highest lsn
+ * in the range that we need to wait for and then wait for it to complete.
+ * Completion ordering of iclog IOs ensures that all prior iclogs to the
+ * candidate iclog we need to sleep on have completed by the time our
+ * candidate has completed its IO.
+ *
+ * Therefore we only need to find the first iclog that isn't clean within the
+ * span of our flush range. If we come across a clean, newly activated iclog
+ * with a lsn of 0, it means IO has completed on this iclog and all previous
+ * iclogs will have been completed prior to this one. Hence finding a newly
+ * activated iclog indicates that there are no iclogs in the range we need to
+ * wait on and we are done searching.
+ */
+int
+xlog_wait_on_iclog_lsn(
+	struct xlog_in_core	*iclog,
+	xfs_lsn_t		start_lsn)
+{
+	struct xlog		*log = iclog->ic_log;
+	struct xlog_in_core	*prev;
+	int			error = -EIO;
+
+	spin_lock(&log->l_icloglock);
+	if (XLOG_FORCED_SHUTDOWN(log))
+		goto out_unlock;
+
+	error = 0;
+	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
+
+		/* Done if the lsn is before our start lsn */
+		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
+				start_lsn) < 0)
+			break;
+
+		/* Don't need to wait on completed, clean iclogs */
+		if (prev->ic_state == XLOG_STATE_DIRTY ||
+		    prev->ic_state == XLOG_STATE_ACTIVE) {
+			continue;
+		}
+
+		/* wait for completion on this iclog */
+		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
+		return 0;
+	}
+
+out_unlock:
+	spin_unlock(&log->l_icloglock);
+	return error;
+}
+
 /*
  * Write out an unmount record using the ticket provided. We have to account for
  * the data space used in the unmount ticket as this write is not done from a
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b0ef071b3cb5..c5cc1b7ad25e 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -870,6 +870,13 @@  xlog_cil_push_work(
 	wake_up_all(&cil->xc_commit_wait);
 	spin_unlock(&cil->xc_push_lock);
 
+	/*
+	 * If the checkpoint spans multiple iclogs, wait for all previous
+	 * iclogs to complete before we submit the commit_iclog.
+	 */
+	if (ctx->start_lsn != commit_lsn)
+		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
+
 	/* release the hounds! */
 	xfs_log_release_iclog(commit_iclog);
 	return;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 037950cf1061..a7ac85aaff4e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -584,6 +584,8 @@  xlog_wait(
 	remove_wait_queue(wq, &wait);
 }
 
+int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
+
 /*
  * The LSN is valid so long as it is behind the current LSN. If it isn't, this
  * means that the next log record that includes this metadata could have a