diff mbox series

[2/3,v2] xfs: AIL needs asynchronous CIL forcing

Message ID 20210224232600.GH4662@dread.disaster.area (mailing list archive)
State Superseded
Headers show
Series None | expand

Commit Message

Dave Chinner Feb. 24, 2021, 11:26 p.m. UTC
From: Dave Chinner <dchinner@redhat.com>

The AIL pushing is stalling on log forces when it comes across
pinned items. This is happening on removal workloads where the AIL
is dominated by stale items that are removed from AIL when the
checkpoint that marks the items stale is committed to the journal.
This results is relatively few items in the AIL, but those that are
are often pinned as directories items are being removed from are
still being logged.

As a result, many push cycles through the CIL will first issue a
blocking log force to unpin the items. This can take some time to
complete, with tracing regularly showing push delays of half a
second and sometimes up into the range of several seconds. Sequences
like this aren't uncommon:

....
 399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 270ms delay>
 400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
 400.099623:  xfsaild: first lsn 0x11002f3600
 400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
<wanted 50ms, got 500ms delay>
 400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
 400.589349:  xfsaild: first lsn 0x1100305000
 400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
<wanted 50ms, got 460ms delay>
 400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
 400.950343:  xfsaild: first lsn 0x1100317c00
 400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
<wanted 20ms, got 200ms delay>
 401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
 401.142334:  xfsaild: first lsn 0x110032e600
 401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
<wanted 10ms, got 10ms delay>
 401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
 401.154328:  xfsaild: first lsn 0x1100353000
 401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 300ms delay>
 401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
 401.451526:  xfsaild: first lsn 0x1100353000
 401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
<wanted 50ms, got 500ms delay>
 401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
....

In each of these cases, every AIL pass saw 101 log items stuck on
the AIL (pinned) with very few other items being found. Each pass, a
log force was issued, and delay between last/first is the sleep time
+ the sync log force time.

Some of these 101 items pinned the tail of the log. The tail of the
log does slowly creep forward (first lsn), but the problem is that
the log is actually out of reservation space because it's been
running so many transactions that stale items that never reach the
AIL but consume log space. Hence we have a largely empty AIL, with
long term pins on items that pin the tail of the log that don't get
pushed frequently enough to keep log space available.

The problem is the hundreds of milliseconds that we block in the log
force pushing the CIL out to disk. The AIL should not be stalled
like this - it needs to run and flush items that are at the tail of
the log with minimal latency. What we really need to do is trigger a
log flush, but then not wait for it at all - we've already done our
waiting for stuff to complete when we backed off prior to the log
force being issued.

Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
still do a blocking flush of the CIL and that is what is causing the
issue. Hence we need a new interface for the CIL to trigger an
immediate background push of the CIL to get it moving faster but not
to wait on that to occur. While the CIL is pushing, the AIL can also
be pushing.

We already have an internal interface to do this -
xlog_cil_push_now() - but we need a wrapper for it to be used
externally. xlog_cil_force_seq() can easily be extended to do what
we need as it already implements the synchronous CIL push via
xlog_cil_push_now(). Add the necessary flags and "push current
sequence" semantics to xlog_cil_force_seq() and convert the AIL
pushing to use it.

One of the complexities here is that the CIL push does not guarantee
that the commit record for the CIL checkpoint is written to disk.
The current log force ensures this by submitting the current ACTIVE
iclog that the commit record was written to. We need the CIL to
actually write this commit record to disk for an async push to
ensure that the checkpoint actually makes it to disk and unpins the
pinned items in the checkpoint on completion. Hence we need to pass
down to the CIL push that we are doing an async flush so that it can
switch out the commit_iclog if necessary to get written to disk when
the commit iclog is finally released.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- ensure the CIL checkpoint issues the commit record to disk for an
  async push. Fixes generic/530 hang on small logs.
- increment log force stats when the CIL is forced and also when it
  sleeps to give insight into the amount of blocking being done when
  the CIL is forced.

 fs/xfs/xfs_log.c       | 59 +++++++++++++++++++---------------------------
 fs/xfs/xfs_log.h       |  2 +-
 fs/xfs/xfs_log_cil.c   | 64 +++++++++++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_log_priv.h  | 10 ++++++--
 fs/xfs/xfs_sysfs.c     |  1 +
 fs/xfs/xfs_trace.c     |  1 +
 fs/xfs/xfs_trans.c     |  2 +-
 fs/xfs/xfs_trans_ail.c | 11 ++++++---
 8 files changed, 97 insertions(+), 53 deletions(-)

Comments

Dave Chinner Feb. 25, 2021, 2:15 a.m. UTC | #1
On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results is relatively few items in the AIL, but those that are
> are often pinned as directories items are being removed from are
> still being logged.
.....
> One of the complexities here is that the CIL push does not guarantee
> that the commit record for the CIL checkpoint is written to disk.
> The current log force ensures this by submitting the current ACTIVE
> iclog that the commit record was written to. We need the CIL to
> actually write this commit record to disk for an async push to
> ensure that the checkpoint actually makes it to disk and unpins the
> pinned items in the checkpoint on completion. Hence we need to pass
> down to the CIL push that we are doing an async flush so that it can
> switch out the commit_iclog if necessary to get written to disk when
> the commit iclog is finally released.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - ensure the CIL checkpoint issues the commit record to disk for an
>   async push. Fixes generic/530 hang on small logs.
> - increment log force stats when the CIL is forced and also when it
>   sleeps to give insight into the amount of blocking being done when
>   the CIL is forced.

Oops, looks like I forgot to strip debug trace_printk()s out of the
patch before sending it. They are gone now, so I'll wait for review
comments before resending again...

Cheers,

Dave.
Brian Foster March 2, 2021, 9:44 p.m. UTC | #2
On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results is relatively few items in the AIL, but those that are
> are often pinned as directories items are being removed from are
> still being logged.
> 
> As a result, many push cycles through the CIL will first issue a
> blocking log force to unpin the items. This can take some time to
> complete, with tracing regularly showing push delays of half a
> second and sometimes up into the range of several seconds. Sequences
> like this aren't uncommon:
> 
...
> 
> In each of these cases, every AIL pass saw 101 log items stuck on
> the AIL (pinned) with very few other items being found. Each pass, a
> log force was issued, and delay between last/first is the sleep time
> + the sync log force time.
> 
> Some of these 101 items pinned the tail of the log. The tail of the
> log does slowly creep forward (first lsn), but the problem is that
> the log is actually out of reservation space because it's been
> running so many transactions that stale items that never reach the
> AIL but consume log space. Hence we have a largely empty AIL, with
> long term pins on items that pin the tail of the log that don't get
> pushed frequently enough to keep log space available.
> 
> The problem is the hundreds of milliseconds that we block in the log
> force pushing the CIL out to disk. The AIL should not be stalled
> like this - it needs to run and flush items that are at the tail of
> the log with minimal latency. What we really need to do is trigger a
> log flush, but then not wait for it at all - we've already done our
> waiting for stuff to complete when we backed off prior to the log
> force being issued.
> 
> Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> still do a blocking flush of the CIL and that is what is causing the
> issue. Hence we need a new interface for the CIL to trigger an
> immediate background push of the CIL to get it moving faster but not
> to wait on that to occur. While the CIL is pushing, the AIL can also
> be pushing.
> 

So I think I follow what you're describing wrt to the xfsaild delay, but
what I'm not clear on is what changing this behavior actually fixes.
IOW, if the xfsaild is cycling through and finding mostly stuck items,
then we see a delay because of the time spent in a log force (because of
those stuck items), what's the advantage to issue an async force over a
sync force? Won't xfsaild effectively continue to spin on these stuck
items until the log force completes, regardless?

Is the issue not so much the throughput of xfsaild, but just the fact
that the task stalls are large enough to cause other problems? I.e., the
comment you add down in the xfsaild code refers to tail pinning causing
reservation stalls. IOW, the purpose of this patch is then to keep
xfsaild sleeps to a predictable latency to ensure we come around again
quickly enough to keep the tail moving forward (even though the AIL
itself might remain in a state where the majority of items are stuck).
Yes?

> We already have an internal interface to do this -
> xlog_cil_push_now() - but we need a wrapper for it to be used
> externally. xlog_cil_force_seq() can easily be extended to do what
> we need as it already implements the synchronous CIL push via
> xlog_cil_push_now(). Add the necessary flags and "push current
> sequence" semantics to xlog_cil_force_seq() and convert the AIL
> pushing to use it.
> 
> One of the complexities here is that the CIL push does not guarantee
> that the commit record for the CIL checkpoint is written to disk.
> The current log force ensures this by submitting the current ACTIVE
> iclog that the commit record was written to. We need the CIL to
> actually write this commit record to disk for an async push to
> ensure that the checkpoint actually makes it to disk and unpins the
> pinned items in the checkpoint on completion. Hence we need to pass
> down to the CIL push that we are doing an async flush so that it can
> switch out the commit_iclog if necessary to get written to disk when
> the commit iclog is finally released.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - ensure the CIL checkpoint issues the commit record to disk for an
>   async push. Fixes generic/530 hang on small logs.
> - increment log force stats when the CIL is forced and also when it
>   sleeps to give insight into the amount of blocking being done when
>   the CIL is forced.
> 
>  fs/xfs/xfs_log.c       | 59 +++++++++++++++++++---------------------------
>  fs/xfs/xfs_log.h       |  2 +-
>  fs/xfs/xfs_log_cil.c   | 64 +++++++++++++++++++++++++++++++++++++++++---------
>  fs/xfs/xfs_log_priv.h  | 10 ++++++--
>  fs/xfs/xfs_sysfs.c     |  1 +
>  fs/xfs/xfs_trace.c     |  1 +
>  fs/xfs/xfs_trans.c     |  2 +-
>  fs/xfs/xfs_trans_ail.c | 11 ++++++---
>  8 files changed, 97 insertions(+), 53 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 2fda8c4513b2..1b46559ef64a 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
...
> @@ -1127,13 +1159,22 @@ xlog_cil_force_seq(
>  
>  	ASSERT(sequence <= cil->xc_current_sequence);
>  
> +	if (!sequence)
> +		sequence = cil->xc_current_sequence;
> +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x",
> +		sequence, cil->xc_current_sequence, cil->xc_ctx->sequence,
> +		cil->xc_push_seq, flags);
> +
> +	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
>  	/*
>  	 * check to see if we need to force out the current context.
>  	 * xlog_cil_push() handles racing pushes for the same sequence,
>  	 * so no need to deal with it here.
>  	 */
>  restart:
> -	xlog_cil_push_now(log, sequence);
> +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> +	if (!(flags & XFS_LOG_SYNC))
> +		return commit_lsn;

Hm, so now we have sync and async log force and sync and async CIL push.
A log force always requires a sync CIL push and commit record submit;
the difference is simply whether or not we wait on commit record I/O
completion. The sync CIL push drains the CIL of log items but does not
switch out the commit record iclog, while the async CIL push executes in
the workqueue context so the drain is async, but it does switch out the
commit record iclog before it completes. IOW, the async CIL push is
basically a "more async" async log force.

I can see the need for the behavior of the async CIL push here, but that
leaves a mess of interfaces and behavior matrix. Is there any reason we
couldn't just make async log forces unconditionally behave equivalent to
the async CIL push as defined by this patch? There's only a handful of
existing users and I don't see any obvious reason why they might care
whether the underlying CIL push is synchronous or not...

Brian

>  
>  	/*
>  	 * See if we can find a previous sequence still committing.
> @@ -1157,6 +1198,7 @@ xlog_cil_force_seq(
>  			 * It is still being pushed! Wait for the push to
>  			 * complete, then start again from the beginning.
>  			 */
> +			XFS_STATS_INC(log->l_mp, xs_log_force_sleep);
>  			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
>  			goto restart;
>  		}
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 7e8ec77bc6e6..6b1a251dd013 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -273,6 +273,7 @@ struct xfs_cil {
>  
>  	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
>  	uint64_t		xc_push_seq;
> +	bool			xc_push_async;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
>  	uint64_t		xc_current_sequence;
> @@ -487,6 +488,10 @@ int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_in_core **commit_iclog, uint optype);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
> +void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
> +		int eventual_size);
> +int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog);
> +
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>  
> @@ -558,12 +563,13 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
>  /*
>   * CIL force routines
>   */
> -xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
> +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
> +				uint64_t sequence);
>  
>  static inline void
>  xlog_cil_force(struct xlog *log)
>  {
> -	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
> +	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index f1bc88f4367c..18dc5eca6c04 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -10,6 +10,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_sysfs.h"
> +#include "xfs_log.h"
>  #include "xfs_log_priv.h"
>  #include "xfs_mount.h"
>  
> diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
> index 9b8d703dc9fd..d111a994b7b6 100644
> --- a/fs/xfs/xfs_trace.c
> +++ b/fs/xfs/xfs_trace.c
> @@ -20,6 +20,7 @@
>  #include "xfs_bmap.h"
>  #include "xfs_attr.h"
>  #include "xfs_trans.h"
> +#include "xfs_log.h"
>  #include "xfs_log_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_quota.h"
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 697703f3be48..7d05694681e3 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -9,7 +9,6 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> -#include "xfs_log_priv.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
>  #include "xfs_extent_busy.h"
> @@ -17,6 +16,7 @@
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_error.h"
>  #include "xfs_defer.h"
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index dbb69b4bf3ed..dfc0206c0d36 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -17,6 +17,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  
>  #ifdef DEBUG
>  /*
> @@ -429,8 +430,12 @@ xfsaild_push(
>  
>  	/*
>  	 * If we encountered pinned items or did not finish writing out all
> -	 * buffers the last time we ran, force the log first and wait for it
> -	 * before pushing again.
> +	 * buffers the last time we ran, force a background CIL push to get the
> +	 * items unpinned in the near future. We do not wait on the CIL push as
> +	 * that could stall us for seconds if there is enough background IO
> +	 * load. Stalling for that long when the tail of the log is pinned and
> +	 * needs flushing will hard stop the transaction subsystem when log
> +	 * space runs out.
>  	 */
>  	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
>  	    (!list_empty_careful(&ailp->ail_buf_list) ||
> @@ -438,7 +443,7 @@ xfsaild_push(
>  		ailp->ail_log_flush = 0;
>  
>  		XFS_STATS_INC(mp, xs_push_ail_flush);
> -		xfs_log_force(mp, XFS_LOG_SYNC);
> +		xlog_cil_force_seq(mp->m_log, 0, 0);
>  	}
>  
>  	spin_lock(&ailp->ail_lock);
>
Dave Chinner March 3, 2021, 12:57 a.m. UTC | #3
On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The AIL pushing is stalling on log forces when it comes across
> > pinned items. This is happening on removal workloads where the AIL
> > is dominated by stale items that are removed from AIL when the
> > checkpoint that marks the items stale is committed to the journal.
> > This results is relatively few items in the AIL, but those that are
> > are often pinned as directories items are being removed from are
> > still being logged.
> > 
> > As a result, many push cycles through the CIL will first issue a
> > blocking log force to unpin the items. This can take some time to
> > complete, with tracing regularly showing push delays of half a
> > second and sometimes up into the range of several seconds. Sequences
> > like this aren't uncommon:
> > 
> ...
> > 
> > In each of these cases, every AIL pass saw 101 log items stuck on
> > the AIL (pinned) with very few other items being found. Each pass, a
> > log force was issued, and delay between last/first is the sleep time
> > + the sync log force time.
> > 
> > Some of these 101 items pinned the tail of the log. The tail of the
> > log does slowly creep forward (first lsn), but the problem is that
> > the log is actually out of reservation space because it's been
> > running so many transactions that stale items that never reach the
> > AIL but consume log space. Hence we have a largely empty AIL, with
> > long term pins on items that pin the tail of the log that don't get
> > pushed frequently enough to keep log space available.
> > 
> > The problem is the hundreds of milliseconds that we block in the log
> > force pushing the CIL out to disk. The AIL should not be stalled
> > like this - it needs to run and flush items that are at the tail of
> > the log with minimal latency. What we really need to do is trigger a
> > log flush, but then not wait for it at all - we've already done our
> > waiting for stuff to complete when we backed off prior to the log
> > force being issued.
> > 
> > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> > still do a blocking flush of the CIL and that is what is causing the
> > issue. Hence we need a new interface for the CIL to trigger an
> > immediate background push of the CIL to get it moving faster but not
> > to wait on that to occur. While the CIL is pushing, the AIL can also
> > be pushing.
> > 
> 
> So I think I follow what you're describing wrt to the xfsaild delay, but
> what I'm not clear on is what changing this behavior actually fixes.
> IOW, if the xfsaild is cycling through and finding mostly stuck items,
> then we see a delay because of the time spent in a log force (because of
> those stuck items), what's the advantage to issue an async force over a
> sync force? Won't xfsaild effectively continue to spin on these stuck
> items until the log force completes, regardless?

When buffers get flushed and then pinned and can't be written back,
we issue sync log forces from the AIL because the delwri list is not
empty and the force count is non-zero.

So it's not just pinned items in the AIL that trigger this - it's
stuff that gets pinned after it has been flushed but before it gets
written that causes problems, too.

And if it's pinned buffers in the delwri list that are the issue,
then we stop flushing other items from the list while we wait for
the log force. hence if the tail pinning items are not actually
pinned, but other buffers are, then we block on the log force when
we could have been flushing and writing the items that pin the tail
of the log.

> Is the issue not so much the throughput of xfsaild, but just the fact
> that the task stalls are large enough to cause other problems? I.e., the
> comment you add down in the xfsaild code refers to tail pinning causing
> reservation stalls.  IOW, the purpose of this patch is then to keep
> xfsaild sleeps to a predictable latency to ensure we come around again
> quickly enough to keep the tail moving forward (even though the AIL
> itself might remain in a state where the majority of items are stuck).
> Yes?

Yup, largely because there are multiple triggers for log forces. 
There's a difference between buffers that can't be flushed because
they are pinned and flushed buffers that can't be written because
they are pinned. The prior prevents the AIL from making push progress,
the latter doesn't.

> > +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x",
> > +		sequence, cil->xc_current_sequence, cil->xc_ctx->sequence,
> > +		cil->xc_push_seq, flags);
> > +
> > +	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
> >  	/*
> >  	 * check to see if we need to force out the current context.
> >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> >  	 * so no need to deal with it here.
> >  	 */
> >  restart:
> > -	xlog_cil_push_now(log, sequence);
> > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > +	if (!(flags & XFS_LOG_SYNC))
> > +		return commit_lsn;
> 
> Hm, so now we have sync and async log force and sync and async CIL push.
> A log force always requires a sync CIL push and commit record submit;
> the difference is simply whether or not we wait on commit record I/O
> completion. The sync CIL push drains the CIL of log items but does not
> switch out the commit record iclog, while the async CIL push executes in
> the workqueue context so the drain is async, but it does switch out the
> commit record iclog before it completes. IOW, the async CIL push is
> basically a "more async" async log force.

Yes.

> I can see the need for the behavior of the async CIL push here, but that
> leaves a mess of interfaces and behavior matrix. Is there any reason we
> couldn't just make async log forces unconditionally behave equivalent to
> the async CIL push as defined by this patch? There's only a handful of
> existing users and I don't see any obvious reason why they might care
> whether the underlying CIL push is synchronous or not...

I'm not touching the rest of the log force code in this series - it
is out of scope of this bug fix and the rest of the series that it
is part of.

Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
changing things like log force behaviour and implementation is for
a future series.

Cheers,

Dave.
Brian Foster March 3, 2021, 5:32 p.m. UTC | #4
On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The AIL pushing is stalling on log forces when it comes across
> > > pinned items. This is happening on removal workloads where the AIL
> > > is dominated by stale items that are removed from AIL when the
> > > checkpoint that marks the items stale is committed to the journal.
> > > This results is relatively few items in the AIL, but those that are
> > > are often pinned as directories items are being removed from are
> > > still being logged.
> > > 
> > > As a result, many push cycles through the CIL will first issue a
> > > blocking log force to unpin the items. This can take some time to
> > > complete, with tracing regularly showing push delays of half a
> > > second and sometimes up into the range of several seconds. Sequences
> > > like this aren't uncommon:
> > > 
> > ...
> > > 
> > > In each of these cases, every AIL pass saw 101 log items stuck on
> > > the AIL (pinned) with very few other items being found. Each pass, a
> > > log force was issued, and delay between last/first is the sleep time
> > > + the sync log force time.
> > > 
> > > Some of these 101 items pinned the tail of the log. The tail of the
> > > log does slowly creep forward (first lsn), but the problem is that
> > > the log is actually out of reservation space because it's been
> > > running so many transactions that stale items that never reach the
> > > AIL but consume log space. Hence we have a largely empty AIL, with
> > > long term pins on items that pin the tail of the log that don't get
> > > pushed frequently enough to keep log space available.
> > > 
> > > The problem is the hundreds of milliseconds that we block in the log
> > > force pushing the CIL out to disk. The AIL should not be stalled
> > > like this - it needs to run and flush items that are at the tail of
> > > the log with minimal latency. What we really need to do is trigger a
> > > log flush, but then not wait for it at all - we've already done our
> > > waiting for stuff to complete when we backed off prior to the log
> > > force being issued.
> > > 
> > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> > > still do a blocking flush of the CIL and that is what is causing the
> > > issue. Hence we need a new interface for the CIL to trigger an
> > > immediate background push of the CIL to get it moving faster but not
> > > to wait on that to occur. While the CIL is pushing, the AIL can also
> > > be pushing.
> > > 
> > 
> > So I think I follow what you're describing wrt to the xfsaild delay, but
> > what I'm not clear on is what changing this behavior actually fixes.
> > IOW, if the xfsaild is cycling through and finding mostly stuck items,
> > then we see a delay because of the time spent in a log force (because of
> > those stuck items), what's the advantage to issue an async force over a
> > sync force? Won't xfsaild effectively continue to spin on these stuck
> > items until the log force completes, regardless?
> 
> When buffers get flushed and then pinned and can't be written back,
> we issue sync log forces from the AIL because the delwri list is not
> empty and the force count is non-zero.
> 
> So it's not just pinned items in the AIL that trigger this - it's
> stuff that gets pinned after it has been flushed but before it gets
> written that causes problems, too.
> 
> And if it's pinned buffers in the delwri list that are the issue,
> then we stop flushing other items from the list while we wait for
> the log force. hence if the tail pinning items are not actually
> pinned, but other buffers are, then we block on the log force when
> we could have been flushing and writing the items that pin the tail
> of the log.
> 
> > Is the issue not so much the throughput of xfsaild, but just the fact
> > that the task stalls are large enough to cause other problems? I.e., the
> > comment you add down in the xfsaild code refers to tail pinning causing
> > reservation stalls.  IOW, the purpose of this patch is then to keep
> > xfsaild sleeps to a predictable latency to ensure we come around again
> > quickly enough to keep the tail moving forward (even though the AIL
> > itself might remain in a state where the majority of items are stuck).
> > Yes?
> 
> Yup, largely because there are multiple triggers for log forces. 
> There's a difference between buffers that can't be flushed because
> they are pinned and flushed buffers that can't be written because
> they are pinned. The prior prevents the AIL from making push progress,
> the latter doesn't.
> 
> > > +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x",
> > > +		sequence, cil->xc_current_sequence, cil->xc_ctx->sequence,
> > > +		cil->xc_push_seq, flags);
> > > +
> > > +	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
> > >  	/*
> > >  	 * check to see if we need to force out the current context.
> > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > >  	 * so no need to deal with it here.
> > >  	 */
> > >  restart:
> > > -	xlog_cil_push_now(log, sequence);
> > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > +	if (!(flags & XFS_LOG_SYNC))
> > > +		return commit_lsn;
> > 
> > Hm, so now we have sync and async log force and sync and async CIL push.
> > A log force always requires a sync CIL push and commit record submit;
> > the difference is simply whether or not we wait on commit record I/O
> > completion. The sync CIL push drains the CIL of log items but does not
> > switch out the commit record iclog, while the async CIL push executes in
> > the workqueue context so the drain is async, but it does switch out the
> > commit record iclog before it completes. IOW, the async CIL push is
> > basically a "more async" async log force.
> 
> Yes.
> 
> > I can see the need for the behavior of the async CIL push here, but that
> > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > couldn't just make async log forces unconditionally behave equivalent to
> > the async CIL push as defined by this patch? There's only a handful of
> > existing users and I don't see any obvious reason why they might care
> > whether the underlying CIL push is synchronous or not...
> 
> I'm not touching the rest of the log force code in this series - it
> is out of scope of this bug fix and the rest of the series that it
> is part of.
> 

But you already have altered the log force code by changing
xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
behaves. So not only does this patch expose subsystem internals to
external layers and create more log forcing interfaces/behaviors to
maintain, it invokes one or the other from the preexisting async log
force variants purely based on whether the caller had a target sequence
number or not. This doesn't make a whole lot of sense to me.

> Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> changing things like log force behaviour and implementation is for
> a future series.
> 

TBH I think this patch is kind of a mess on its own. I think the
mechanism it wants to provide is sane, but I've not even got to the
point of reviewing _that_ yet because of the seeming dismissal of higher
level feedback. I'd rather not go around in circles on this so I'll just
offer my summarized feedback to this patch...

I think this patch should be broken down into two, possibly three
sub-patches that generally do the following (not necessarily in this
order):

1. Implement the new underlying async CIL push mechanism (i.e. the
   ->xc_push_async bits and new commit iclog switching behavior).
2. Change the behavior of existing async log forces to use the new
   mechanism consistently. Document the change in behavior for historical
   reference.
3. Switch xfsaild to use an async log force instead of the current sync
   log force. Document why this might depend on the updated async
   behavior.

Further log subsystem cleanup or consolidation that this might
facilitate between the historical async log force and the newer CIL
based async log force can occur separately in future patches..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 4, 2021, 1:59 a.m. UTC | #5
On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote:
> On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > > >  	 * so no need to deal with it here.
> > > >  	 */
> > > >  restart:
> > > > -	xlog_cil_push_now(log, sequence);
> > > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > > +	if (!(flags & XFS_LOG_SYNC))
> > > > +		return commit_lsn;
> > > 
> > > Hm, so now we have sync and async log force and sync and async CIL push.
> > > A log force always requires a sync CIL push and commit record submit;
> > > the difference is simply whether or not we wait on commit record I/O
> > > completion. The sync CIL push drains the CIL of log items but does not
> > > switch out the commit record iclog, while the async CIL push executes in
> > > the workqueue context so the drain is async, but it does switch out the
> > > commit record iclog before it completes. IOW, the async CIL push is
> > > basically a "more async" async log force.
> > 
> > Yes.
> > 
> > > I can see the need for the behavior of the async CIL push here, but that
> > > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > > couldn't just make async log forces unconditionally behave equivalent to
> > > the async CIL push as defined by this patch? There's only a handful of
> > > existing users and I don't see any obvious reason why they might care
> > > whether the underlying CIL push is synchronous or not...
> > 
> > I'm not touching the rest of the log force code in this series - it
> > is out of scope of this bug fix and the rest of the series that it
> > is part of.
> > 
> 
> But you already have altered the log force code by changing
> xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
> behaves.

The behaviour of the function when called from xfs_log_force*()
should be unchanged. Can you be specific about exactly what
behaviour did I change for non-synchronous xfs_log_force*() callers
so I can fix it? I have not intended to change it at all, so
whatever you are refering is an issue I need to fix...

> So not only does this patch expose subsystem internals to
> external layers and create more log forcing interfaces/behaviors to

Sorry, I don't follow. What "subsystem internals" are being exposed
and what external layer are they being exposed to?

> > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > changing things like log force behaviour and implementation is for
> > a future series.
> > 
> 
> TBH I think this patch is kind of a mess on its own. I think the
> mechanism it wants to provide is sane, but I've not even got to the
> point of reviewing _that_ yet because of the seeming dismissal of higher
> level feedback. I'd rather not go around in circles on this so I'll just
> offer my summarized feedback to this patch...

I'm not dismissing review nor am I saying the API cannot or should
not be improved. I'm simply telling you that I think the additional
changes you are proposing are outside the scope of the problem I am
addressing here. I already plan to rework the log force API (and
others) but doing so it not something that this patchset needs to
address, or indeed should address.

There are already enough subtle changes being made to core code and
algorithms that mixing them with unrelated high level external
behavioural changes that it greatly increases the risk of unexpected
regressions in the patchset. The log force are paths are used in
data integrity paths, so I want to limit the scope of behavioural
change to just the AIL where the log force has no data integrity
requirement associcated with it.

Cheers,

Dave.
Brian Foster March 4, 2021, 1:13 p.m. UTC | #6
On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote:
> On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote:
> > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > > > >  	 * so no need to deal with it here.
> > > > >  	 */
> > > > >  restart:
> > > > > -	xlog_cil_push_now(log, sequence);
> > > > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > > > +	if (!(flags & XFS_LOG_SYNC))
> > > > > +		return commit_lsn;
> > > > 
> > > > Hm, so now we have sync and async log force and sync and async CIL push.
> > > > A log force always requires a sync CIL push and commit record submit;
> > > > the difference is simply whether or not we wait on commit record I/O
> > > > completion. The sync CIL push drains the CIL of log items but does not
> > > > switch out the commit record iclog, while the async CIL push executes in
> > > > the workqueue context so the drain is async, but it does switch out the
> > > > commit record iclog before it completes. IOW, the async CIL push is
> > > > basically a "more async" async log force.
> > > 
> > > Yes.
> > > 
> > > > I can see the need for the behavior of the async CIL push here, but that
> > > > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > > > couldn't just make async log forces unconditionally behave equivalent to
> > > > the async CIL push as defined by this patch? There's only a handful of
> > > > existing users and I don't see any obvious reason why they might care
> > > > whether the underlying CIL push is synchronous or not...
> > > 
> > > I'm not touching the rest of the log force code in this series - it
> > > is out of scope of this bug fix and the rest of the series that it
> > > is part of.
> > > 
> > 
> > But you already have altered the log force code by changing
> > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
> > behaves.
> 
> The behaviour of the function when called from xfs_log_force*()
> should be unchanged. Can you be specific about exactly what
> behaviour did I change for non-synchronous xfs_log_force*() callers
> so I can fix it? I have not intended to change it at all, so
> whatever you are refering is an issue I need to fix...
> 

xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq()
(whereas this patch presumably wants it to pass XFS_LOG_SYNC
unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from
xfs_log_force_seq(mp, seq, 0, ...).

> > So not only does this patch expose subsystem internals to
> > external layers and create more log forcing interfaces/behaviors to
> 
> Sorry, I don't follow. What "subsystem internals" are being exposed
> and what external layer are they being exposed to?
> 
> > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > changing things like log force behaviour and implementation is for
> > > a future series.
> > > 
> > 
> > TBH I think this patch is kind of a mess on its own. I think the
> > mechanism it wants to provide is sane, but I've not even got to the
> > point of reviewing _that_ yet because of the seeming dismissal of higher
> > level feedback. I'd rather not go around in circles on this so I'll just
> > offer my summarized feedback to this patch...
> 
> I'm not dismissing review nor am I saying the API cannot or should
> not be improved. I'm simply telling you that I think the additional
> changes you are proposing are outside the scope of the problem I am
> addressing here. I already plan to rework the log force API (and
> others) but doing so it not something that this patchset needs to
> address, or indeed should address.
> 

I'm not proposing additional changes nor to rework the log force API.
I'm pointing out that I find this implementation to be extremely and
unnecessarily confusing. To improve it, I'm suggesting to either coopt
the existing async log force API...

> There are already enough subtle changes being made to core code and
> algorithms that mixing them with unrelated high level external
> behavioural changes that it greatly increases the risk of unexpected
> regressions in the patchset. The log force are paths are used in
> data integrity paths, so I want to limit the scope of behavioural
> change to just the AIL where the log force has no data integrity
> requirement associcated with it.
> 

... or if we really want a special async log force just for xfsaild (why
is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag
or some such, pass that to the existing log force call, and document the
purpose/behavior of the new mode in detail. That at least won't require
a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean
depending on the particular log force function.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 4, 2021, 10:48 p.m. UTC | #7
On Thu, Mar 04, 2021 at 08:13:07AM -0500, Brian Foster wrote:
> On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote:
> > On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote:
> > > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> > > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > > > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > > > > >  	 * so no need to deal with it here.
> > > > > >  	 */
> > > > > >  restart:
> > > > > > -	xlog_cil_push_now(log, sequence);
> > > > > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > > > > +	if (!(flags & XFS_LOG_SYNC))
> > > > > > +		return commit_lsn;
> > > > > 
> > > > > Hm, so now we have sync and async log force and sync and async CIL push.
> > > > > A log force always requires a sync CIL push and commit record submit;
> > > > > the difference is simply whether or not we wait on commit record I/O
> > > > > completion. The sync CIL push drains the CIL of log items but does not
> > > > > switch out the commit record iclog, while the async CIL push executes in
> > > > > the workqueue context so the drain is async, but it does switch out the
> > > > > commit record iclog before it completes. IOW, the async CIL push is
> > > > > basically a "more async" async log force.
> > > > 
> > > > Yes.
> > > > 
> > > > > I can see the need for the behavior of the async CIL push here, but that
> > > > > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > > > > couldn't just make async log forces unconditionally behave equivalent to
> > > > > the async CIL push as defined by this patch? There's only a handful of
> > > > > existing users and I don't see any obvious reason why they might care
> > > > > whether the underlying CIL push is synchronous or not...
> > > > 
> > > > I'm not touching the rest of the log force code in this series - it
> > > > is out of scope of this bug fix and the rest of the series that it
> > > > is part of.
> > > > 
> > > 
> > > But you already have altered the log force code by changing
> > > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
> > > behaves.
> > 
> > The behaviour of the function when called from xfs_log_force*()
> > should be unchanged. Can you be specific about exactly what
> > behaviour did I change for non-synchronous xfs_log_force*() callers
> > so I can fix it? I have not intended to change it at all, so
> > whatever you are refering is an issue I need to fix...
> > 
> 
> xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq()
> (whereas this patch presumably wants it to pass XFS_LOG_SYNC
> unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from
> xfs_log_force_seq(mp, seq, 0, ...).

Fixed.

> > > So not only does this patch expose subsystem internals to
> > > external layers and create more log forcing interfaces/behaviors to
> > 
> > Sorry, I don't follow. What "subsystem internals" are being exposed
> > and what external layer are they being exposed to?
> > 
> > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > changing things like log force behaviour and implementation is for
> > > > a future series.
> > > > 
> > > 
> > > TBH I think this patch is kind of a mess on its own. I think the
> > > mechanism it wants to provide is sane, but I've not even got to the
> > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > level feedback. I'd rather not go around in circles on this so I'll just
> > > offer my summarized feedback to this patch...
> > 
> > I'm not dismissing review nor am I saying the API cannot or should
> > not be improved. I'm simply telling you that I think the additional
> > changes you are proposing are outside the scope of the problem I am
> > addressing here. I already plan to rework the log force API (and
> > others) but doing so it not something that this patchset needs to
> > address, or indeed should address.
> > 
> 
> I'm not proposing additional changes nor to rework the log force API.
> I'm pointing out that I find this implementation to be extremely and
> unnecessarily confusing.

And that's just fine - not everyone has to understand all of the
code all of the time. That's why we have a team....

> To improve it, I'm suggesting to either coopt
> the existing async log force API...

And I'm telling you that this is *future work*. As the person
actually doing the work, that's my decision and you need to accept
that. I understand your concerns and have a plan to address them -
you just have to accept that the plan to address them isn't going to
be exactly to your liking.

> > There are already enough subtle changes being made to core code and
> > algorithms that mixing them with unrelated high level external
> > behavioural changes that it greatly increases the risk of unexpected
> > regressions in the patchset. The log force are paths are used in
> > data integrity paths, so I want to limit the scope of behavioural
> > change to just the AIL where the log force has no data integrity
> > requirement associcated with it.
> > 
> 
> ... or if we really want a special async log force just for xfsaild (why
> is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag
> or some such, pass that to the existing log force call, and document the
> purpose/behavior of the new mode in detail. That at least won't require
> a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean
> depending on the particular log force function.

No. That's just obfuscation. And it is extremely likely that the
first thing Christoph will ask me to do with one-shot flag that just
calls out to a single function is to get rid of the flag and call
the function directly.

This will get cleaned up. Just not in this patchset.

-Dave.
Brian Foster March 5, 2021, 2:58 p.m. UTC | #8
On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> On Thu, Mar 04, 2021 at 08:13:07AM -0500, Brian Foster wrote:
> > On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote:
> > > > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote:
> > > > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote:
> > > > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote:
> > > > > > >  	 * xlog_cil_push() handles racing pushes for the same sequence,
> > > > > > >  	 * so no need to deal with it here.
> > > > > > >  	 */
> > > > > > >  restart:
> > > > > > > -	xlog_cil_push_now(log, sequence);
> > > > > > > +	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
> > > > > > > +	if (!(flags & XFS_LOG_SYNC))
> > > > > > > +		return commit_lsn;
> > > > > > 
> > > > > > Hm, so now we have sync and async log force and sync and async CIL push.
> > > > > > A log force always requires a sync CIL push and commit record submit;
> > > > > > the difference is simply whether or not we wait on commit record I/O
> > > > > > completion. The sync CIL push drains the CIL of log items but does not
> > > > > > switch out the commit record iclog, while the async CIL push executes in
> > > > > > the workqueue context so the drain is async, but it does switch out the
> > > > > > commit record iclog before it completes. IOW, the async CIL push is
> > > > > > basically a "more async" async log force.
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > I can see the need for the behavior of the async CIL push here, but that
> > > > > > leaves a mess of interfaces and behavior matrix. Is there any reason we
> > > > > > couldn't just make async log forces unconditionally behave equivalent to
> > > > > > the async CIL push as defined by this patch? There's only a handful of
> > > > > > existing users and I don't see any obvious reason why they might care
> > > > > > whether the underlying CIL push is synchronous or not...
> > > > > 
> > > > > I'm not touching the rest of the log force code in this series - it
> > > > > is out of scope of this bug fix and the rest of the series that it
> > > > > is part of.
> > > > > 
> > > > 
> > > > But you already have altered the log force code by changing
> > > > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq()
> > > > behaves.
> > > 
> > > The behaviour of the function when called from xfs_log_force*()
> > > should be unchanged. Can you be specific about exactly what
> > > behaviour did I change for non-synchronous xfs_log_force*() callers
> > > so I can fix it? I have not intended to change it at all, so
> > > whatever you are refering is an issue I need to fix...
> > > 
> > 
> > xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq()
> > (whereas this patch presumably wants it to pass XFS_LOG_SYNC
> > unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from
> > xfs_log_force_seq(mp, seq, 0, ...).
> 
> Fixed.
> 
> > > > So not only does this patch expose subsystem internals to
> > > > external layers and create more log forcing interfaces/behaviors to
> > > 
> > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > and what external layer are they being exposed to?
> > > 
> > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > changing things like log force behaviour and implementation is for
> > > > > a future series.
> > > > > 
> > > > 
> > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > offer my summarized feedback to this patch...
> > > 
> > > I'm not dismissing review nor am I saying the API cannot or should
> > > not be improved. I'm simply telling you that I think the additional
> > > changes you are proposing are outside the scope of the problem I am
> > > addressing here. I already plan to rework the log force API (and
> > > others) but doing so it not something that this patchset needs to
> > > address, or indeed should address.
> > > 
> > 
> > I'm not proposing additional changes nor to rework the log force API.
> > I'm pointing out that I find this implementation to be extremely and
> > unnecessarily confusing.
> 
> And that's just fine - not everyone has to understand all of the
> code all of the time. That's why we have a team....
> 
> > To improve it, I'm suggesting to either coopt
> > the existing async log force API...
> 
> And I'm telling you that this is *future work*. As the person
> actually doing the work, that's my decision and you need to accept
> that. I understand your concerns and have a plan to address them -
> you just have to accept that the plan to address them isn't going to
> be exactly to your liking.
> 

As a reviewer, I am pointing out that I object to how this patch is
implemented and offered several fairly simple suggestions on how to
address my concerns. Those suggestions have been rejected pretty much
out of hand on the basis of the existence of some future plans.

Future rework plans do not justify or change my view of this patch and
the mess it adds (for other developers, for historical context and
downstreams, etc.). Therefore, if the only option you're willing to
consider is "make this mess now and clean it up later in some broader
rework," then I'd rather we just defer this patch until after that
rework is available and avoid the mess that way.

Brian

> > > There are already enough subtle changes being made to core code and
> > > algorithms that mixing them with unrelated high level external
> > > behavioural changes that it greatly increases the risk of unexpected
> > > regressions in the patchset. The log force are paths are used in
> > > data integrity paths, so I want to limit the scope of behavioural
> > > change to just the AIL where the log force has no data integrity
> > > requirement associcated with it.
> > > 
> > 
> > ... or if we really want a special async log force just for xfsaild (why
> > is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag
> > or some such, pass that to the existing log force call, and document the
> > purpose/behavior of the new mode in detail. That at least won't require
> > a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean
> > depending on the particular log force function.
> 
> No. That's just obfuscation. And it is extremely likely that the
> first thing Christoph will ask me to do with one-shot flag that just
> calls out to a single function is to get rid of the flag and call
> the function directly.
> 
> This will get cleaned up. Just not in this patchset.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 9, 2021, 12:44 a.m. UTC | #9
On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> > > > > So not only does this patch expose subsystem internals to
> > > > > external layers and create more log forcing interfaces/behaviors to
> > > > 
> > > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > > and what external layer are they being exposed to?
> > > > 
> > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > > changing things like log force behaviour and implementation is for
> > > > > > a future series.
> > > > > > 
> > > > > 
> > > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > > offer my summarized feedback to this patch...
> > > > 
> > > > I'm not dismissing review nor am I saying the API cannot or should
> > > > not be improved. I'm simply telling you that I think the additional
> > > > changes you are proposing are outside the scope of the problem I am
> > > > addressing here. I already plan to rework the log force API (and
> > > > others) but doing so it not something that this patchset needs to
> > > > address, or indeed should address.
> > > > 
> > > 
> > > I'm not proposing additional changes nor to rework the log force API.
> > > I'm pointing out that I find this implementation to be extremely and
> > > unnecessarily confusing.
> > 
> > And that's just fine - not everyone has to understand all of the
> > code all of the time. That's why we have a team....
> > 
> > > To improve it, I'm suggesting to either coopt
> > > the existing async log force API...
> > 
> > And I'm telling you that this is *future work*. As the person
> > actually doing the work, that's my decision and you need to accept
> > that. I understand your concerns and have a plan to address them -
> > you just have to accept that the plan to address them isn't going to
> > be exactly to your liking.
> > 
> 
> As a reviewer, I am pointing out that I object to how this patch is
> implemented and offered several fairly simple suggestions on how to
> address my concerns. Those suggestions have been rejected pretty much
> out of hand on the basis of the existence of some future plans.

We're allowed to disagree on the best approach. But to find a way
forward means that both the summitter and the reviewer need to
compromise. I've fixed up the obvious warts and bugs you've pointed
out, and agreed that it needs cleaning up and have committed to
cleaning it up in the near future.

> Future rework plans do not justify or change my view of this patch
> and the mess it adds (for other developers, for historical context
> and downstreams, etc.).  Therefore, if the only option you're
> willing to consider is "make this mess now and clean it up later
> in some broader rework," then I'd rather we just defer this patch
> until after that rework is available and avoid the mess that way.

So you won't review it until I have 100 outstanding patches in this
series and it's completely and utterly unreviewable? Then you'll ask
me to split it up and re-order it into digestable chunks, and so
we'll go back to having to merge this because any of the API rework
that depends on the mechanism that this patch introduces.

I'm not going to play that "now jump through this hoop" game.  We
add flags for on-off behaviours in internal functions -all the
time-. If this makes the interface so complex and confusing that you
don't understand it, then the interface was already too complex and
confusing. And fixing that is *not in the scope of this patchset*.

Demanding that code be made perfect before it can be merged is
really not very helpful. Especially when there are already plans to
rework the API but that rework is dependent on a bunch of other
changes than need to be done first.

iclogs are something that need to be moved behind the CIL, not sit
in front of CIL. The CIL abstracts the journal and writing to the
journal completely away from the transaction subsystem, yet the log
force code punches straight through that abstraction and walks
iclogs directly. The whole log force implementation needs to change,
and I plan for the API that wraps the log forces to get reworked at
that time.

For example, if we want to direct map storage for log writes, then
iclog-based log force synchronisation needs to go away because we
don't need iclogs for buffering journal writes. Hence the log foce
code should interface only with the CIL, and only the CIL should
manage whatever mechanism it is using to write to stable storage.
The code is currently the way it is because the CIL, when first
implemented, had to co-exist with the old way of writing to the log.
We haven't used that old way for a decade now and we have very
different storage performance characteristics these days, so it's
about time we got rid of the mechanism designed to be optimal for
spinning disks and actually integrated the CIL and the log
efficiently.

There are a *lot* of steps to do this, and reworking the log force
implementation and API is part of that. But reworking that API is
premature because we haven't done all the necessary pre-work in
place to make such a change yet. This patch is actually part of that
pre-work to get the mechanisms that the log force rework will rely
on.

I have very good reasons for pushing back against your suggestions,
Brian. Your suggestions have merit but this patch is not the time to
be making the changes you suggest. Code does not need to be perfect
to be merged, nor does the entire larger change they will be part of
need to be complete and 100% tested and reviewed before preparation
and infrastructure patches can be merged. This is how we've done big
changes in the past - they've been staged across multiple kernel
cycles - and not everything needs to be done in the first
patchset of a larger body of work....

Cheers,

Dave.
Darrick J. Wong March 9, 2021, 4:35 a.m. UTC | #10
On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> > > > > > So not only does this patch expose subsystem internals to
> > > > > > external layers and create more log forcing interfaces/behaviors to
> > > > > 
> > > > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > > > and what external layer are they being exposed to?
> > > > > 
> > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > > > changing things like log force behaviour and implementation is for
> > > > > > > a future series.
> > > > > > > 
> > > > > > 
> > > > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > > > offer my summarized feedback to this patch...
> > > > > 
> > > > > I'm not dismissing review nor am I saying the API cannot or should
> > > > > not be improved. I'm simply telling you that I think the additional
> > > > > changes you are proposing are outside the scope of the problem I am
> > > > > addressing here. I already plan to rework the log force API (and
> > > > > others) but doing so it not something that this patchset needs to
> > > > > address, or indeed should address.
> > > > > 
> > > > 
> > > > I'm not proposing additional changes nor to rework the log force API.
> > > > I'm pointing out that I find this implementation to be extremely and
> > > > unnecessarily confusing.
> > > 
> > > And that's just fine - not everyone has to understand all of the
> > > code all of the time. That's why we have a team....
> > > 
> > > > To improve it, I'm suggesting to either coopt
> > > > the existing async log force API...
> > > 
> > > And I'm telling you that this is *future work*. As the person
> > > actually doing the work, that's my decision and you need to accept
> > > that. I understand your concerns and have a plan to address them -
> > > you just have to accept that the plan to address them isn't going to
> > > be exactly to your liking.
> > > 
> > 
> > As a reviewer, I am pointing out that I object to how this patch is
> > implemented and offered several fairly simple suggestions on how to
> > address my concerns. Those suggestions have been rejected pretty much
> > out of hand on the basis of the existence of some future plans.
> 
> We're allowed to disagree on the best approach. But to find a way
> forward means that both the summitter and the reviewer need to
> compromise. I've fixed up the obvious warts and bugs you've pointed
> out, and agreed that it needs cleaning up and have committed to
> cleaning it up in the near future.
> 
> > Future rework plans do not justify or change my view of this patch
> > and the mess it adds (for other developers, for historical context
> > and downstreams, etc.).  Therefore, if the only option you're
> > willing to consider is "make this mess now and clean it up later
> > in some broader rework," then I'd rather we just defer this patch
> > until after that rework is available and avoid the mess that way.
> 
> So you won't review it until I have 100 outstanding patches in this
> series and it's completely and utterly unreviewable?

Already unreviewable at 45, and I've only gotten through 2/3 of it.

> Then you'll ask me to split it up and re-order it into digestable
> chunks, and so we'll go back to having to merge this because any of
> the API rework that depends on the mechanism that this patch
> introduces.

Here's something I haven't previously shared with all of you: Last cycle
when we were going around and around on the ENOSPC/EDQUOT retry loop
patches (which exploded from 13 to 41 patches) it was /very/ stressful
to have to rework this and that part every day and a half for almost
three weeks.

When I get a patchset ready for submission, I export the patches from my
development branch into a topic branch based off the latest -rc.  Every
time anyone makes a suggestion, I update the topic branch and send that
to the fstests "cloud" to see if it broke anything.  If it succeeds, I
push it to k.org and resubmit.

The part that's less obvious is the fact that rebasing is /not/ that
simple.  Next I integrate the changes into my development tree and
rebase everything that comes after /that/ to make sure it doesn't
interfere with my longer term development goals.  Sometimes this is
easy, sometimes this takes an entire day to work out the warts.
Then I send that to the fstests "cloud".  This rebase is particularly
painful because every change that everyone makes to inode
initialization collides with a rework of inode initialization that I've
been working on in preparation for metadata directory trees.

The part that's even /less/ obvious than that is that once the
development tree tests ok, then I do the same to my xfsprogs dev tree to
make sure that nothing broke.  Frequently there are ABI or cognitive
mismatches between kernel and userspace such that the build breaks, and
then I have to patch the two kernel trees and re-run everything.

So, that's really stressful for me because a minor tweak to an interface
can result in an enormous amount of work.  And I reject the argument
that I could just rebase less frequently -- Dave does that, which means
that any time he hears that one of his topic branches is being asked
about, he has to spend weeks rebasing the whole mess to the latest
upstream.  Maybe that works for him, but for me, I would hate that even
more than doing a daily rebase.

Add to that the fact that vger has been a total delivery sh*tshow all
year.  Now in addition to feeling disconnected from my family and
friends, I also feel disconnected from work people too.  This really did
nearly push me over the edge three weeks ago.

Please remember, one of the big advantages of our open development
processes is that we /do/ accept code with warty (but functional)
interfaces now, and we can clean them up later.  This is (IMHO) a good
stress-reduction tactic, because each of us (ideally) should concentrate
on getting the core algorithms right, and not focusing on rebasing code
and smoothing over the same d*** merge conflicts over and over.

Yes, it's true that people think that a maintainer's only real power is
to say 'no' in the hopes of forcing developers to fix everything now
because they can't trust that a dev will ever come back with the
promised updates, but I reject that 110%.  I'm not going anywhere, and I
/do/ trust that when the rest of you say that you'll be back with wart
remover, you will.

> I'm not going to play that "now jump through this hoop" game.  We
> add flags for on-off behaviours in internal functions -all the
> time-. If this makes the interface so complex and confusing that you
> don't understand it, then the interface was already too complex and
> confusing. And fixing that is *not in the scope of this patchset*.
> 
> Demanding that code be made perfect before it can be merged is
> really not very helpful. Especially when there are already plans to
> rework the API but that rework is dependent on a bunch of other
> changes than need to be done first.
> 
> iclogs are something that need to be moved behind the CIL, not sit
> in front of CIL. The CIL abstracts the journal and writing to the
> journal completely away from the transaction subsystem, yet the log
> force code punches straight through that abstraction and walks
> iclogs directly. The whole log force implementation needs to change,
> and I plan for the API that wraps the log forces to get reworked at
> that time.

So here's what I want to know: Do Dave's changes to the log force APIs
introduce broken behavior?  If the interfaces are so confusing that
/none/ of us understand it, can we introduce the appropriate wrappers
and documentation so that the rest of us plugging away at the rest of
the system can only call it the supported ways to achieve any of the
supported outcomes?

I'm willing to accept a bunch of documentation and "trivial" wrappers
for the rest of us as a shoofly to enable the rest of the xfs developers
to keep moving around a messy slow-moving log restructuring without
falling into a work pit.

However, it's been difficult for me to check that without being able to
reference a branch to see that at least the end result looks sane.  That
was immensely helpful for reviewing Allison's deferred xattrs series.

(TLDR: git branch plz)

The other way to ease my fears, of course, would be to submit a ton of
fstests to examine the log behavior for correctness, but that's
difficult to pull off when the control surface is the userspace ABI.

> For example, if we want to direct map storage for log writes, then
> iclog-based log force synchronisation needs to go away because we
> don't need iclogs for buffering journal writes. Hence the log foce
> code should interface only with the CIL, and only the CIL should
> manage whatever mechanism it is using to write to stable storage.
> The code is currently the way it is because the CIL, when first
> implemented, had to co-exist with the old way of writing to the log.
> We haven't used that old way for a decade now and we have very
> different storage performance characteristics these days, so it's
> about time we got rid of the mechanism designed to be optimal for
> spinning disks and actually integrated the CIL and the log
> efficiently.
> 
> There are a *lot* of steps to do this, and reworking the log force
> implementation and API is part of that. But reworking that API is
> premature because we haven't done all the necessary pre-work in
> place to make such a change yet. This patch is actually part of that
> pre-work to get the mechanisms that the log force rework will rely
> on.
> 
> I have very good reasons for pushing back against your suggestions,
> Brian. Your suggestions have merit but this patch is not the time to
> be making the changes you suggest. Code does not need to be perfect
> to be merged, nor does the entire larger change they will be part of
> need to be complete and 100% tested and reviewed before preparation
> and infrastructure patches can be merged. This is how we've done big
> changes in the past - they've been staged across multiple kernel
> cycles - and not everything needs to be done in the first
> patchset of a larger body of work....

You mean there's even more beyond the 45 already on the list? /groan/

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster March 10, 2021, 2:10 a.m. UTC | #11
On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote:
> On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> > > > > > > So not only does this patch expose subsystem internals to
> > > > > > > external layers and create more log forcing interfaces/behaviors to
> > > > > > 
> > > > > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > > > > and what external layer are they being exposed to?
> > > > > > 
> > > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > > > > changing things like log force behaviour and implementation is for
> > > > > > > > a future series.
> > > > > > > > 
> > > > > > > 
> > > > > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > > > > offer my summarized feedback to this patch...
> > > > > > 
> > > > > > I'm not dismissing review nor am I saying the API cannot or should
> > > > > > not be improved. I'm simply telling you that I think the additional
> > > > > > changes you are proposing are outside the scope of the problem I am
> > > > > > addressing here. I already plan to rework the log force API (and
> > > > > > others) but doing so it not something that this patchset needs to
> > > > > > address, or indeed should address.
> > > > > > 
> > > > > 
> > > > > I'm not proposing additional changes nor to rework the log force API.
> > > > > I'm pointing out that I find this implementation to be extremely and
> > > > > unnecessarily confusing.
> > > > 
> > > > And that's just fine - not everyone has to understand all of the
> > > > code all of the time. That's why we have a team....
> > > > 
> > > > > To improve it, I'm suggesting to either coopt
> > > > > the existing async log force API...
> > > > 
> > > > And I'm telling you that this is *future work*. As the person
> > > > actually doing the work, that's my decision and you need to accept
> > > > that. I understand your concerns and have a plan to address them -
> > > > you just have to accept that the plan to address them isn't going to
> > > > be exactly to your liking.
> > > > 
> > > 
> > > As a reviewer, I am pointing out that I object to how this patch is
> > > implemented and offered several fairly simple suggestions on how to
> > > address my concerns. Those suggestions have been rejected pretty much
> > > out of hand on the basis of the existence of some future plans.
> > 
> > We're allowed to disagree on the best approach. But to find a way
> > forward means that both the summitter and the reviewer need to
> > compromise. I've fixed up the obvious warts and bugs you've pointed
> > out, and agreed that it needs cleaning up and have committed to
> > cleaning it up in the near future.
> > 
> > > Future rework plans do not justify or change my view of this patch
> > > and the mess it adds (for other developers, for historical context
> > > and downstreams, etc.).  Therefore, if the only option you're
> > > willing to consider is "make this mess now and clean it up later
> > > in some broader rework," then I'd rather we just defer this patch
> > > until after that rework is available and avoid the mess that way.
> > 
> > So you won't review it until I have 100 outstanding patches in this
> > series and it's completely and utterly unreviewable?
> 
> Already unreviewable at 45, and I've only gotten through 2/3 of it.
> 
> > Then you'll ask me to split it up and re-order it into digestable
> > chunks, and so we'll go back to having to merge this because any of
> > the API rework that depends on the mechanism that this patch
> > introduces.
> 
> Here's something I haven't previously shared with all of you: Last cycle
> when we were going around and around on the ENOSPC/EDQUOT retry loop
> patches (which exploded from 13 to 41 patches) it was /very/ stressful
> to have to rework this and that part every day and a half for almost
> three weeks.
> 
> When I get a patchset ready for submission, I export the patches from my
> development branch into a topic branch based off the latest -rc.  Every
> time anyone makes a suggestion, I update the topic branch and send that
> to the fstests "cloud" to see if it broke anything.  If it succeeds, I
> push it to k.org and resubmit.
> 
> The part that's less obvious is the fact that rebasing is /not/ that
> simple.  Next I integrate the changes into my development tree and
> rebase everything that comes after /that/ to make sure it doesn't
> interfere with my longer term development goals.  Sometimes this is
> easy, sometimes this takes an entire day to work out the warts.
> Then I send that to the fstests "cloud".  This rebase is particularly
> painful because every change that everyone makes to inode
> initialization collides with a rework of inode initialization that I've
> been working on in preparation for metadata directory trees.
> 
> The part that's even /less/ obvious than that is that once the
> development tree tests ok, then I do the same to my xfsprogs dev tree to
> make sure that nothing broke.  Frequently there are ABI or cognitive
> mismatches between kernel and userspace such that the build breaks, and
> then I have to patch the two kernel trees and re-run everything.
> 
> So, that's really stressful for me because a minor tweak to an interface
> can result in an enormous amount of work.  And I reject the argument
> that I could just rebase less frequently -- Dave does that, which means
> that any time he hears that one of his topic branches is being asked
> about, he has to spend weeks rebasing the whole mess to the latest
> upstream.  Maybe that works for him, but for me, I would hate that even
> more than doing a daily rebase.
> 
> Add to that the fact that vger has been a total delivery sh*tshow all
> year.  Now in addition to feeling disconnected from my family and
> friends, I also feel disconnected from work people too.  This really did
> nearly push me over the edge three weeks ago.
> 
> Please remember, one of the big advantages of our open development
> processes is that we /do/ accept code with warty (but functional)
> interfaces now, and we can clean them up later.  This is (IMHO) a good
> stress-reduction tactic, because each of us (ideally) should concentrate
> on getting the core algorithms right, and not focusing on rebasing code
> and smoothing over the same d*** merge conflicts over and over.
> 
> Yes, it's true that people think that a maintainer's only real power is
> to say 'no' in the hopes of forcing developers to fix everything now
> because they can't trust that a dev will ever come back with the
> promised updates, but I reject that 110%.  I'm not going anywhere, and I
> /do/ trust that when the rest of you say that you'll be back with wart
> remover, you will.
> 
> > I'm not going to play that "now jump through this hoop" game.  We
> > add flags for on-off behaviours in internal functions -all the
> > time-. If this makes the interface so complex and confusing that you
> > don't understand it, then the interface was already too complex and
> > confusing. And fixing that is *not in the scope of this patchset*.
> > 
> > Demanding that code be made perfect before it can be merged is
> > really not very helpful. Especially when there are already plans to
> > rework the API but that rework is dependent on a bunch of other
> > changes than need to be done first.
> > 
> > iclogs are something that need to be moved behind the CIL, not sit
> > in front of CIL. The CIL abstracts the journal and writing to the
> > journal completely away from the transaction subsystem, yet the log
> > force code punches straight through that abstraction and walks
> > iclogs directly. The whole log force implementation needs to change,
> > and I plan for the API that wraps the log forces to get reworked at
> > that time.
> 
> So here's what I want to know: Do Dave's changes to the log force APIs
> introduce broken behavior?  If the interfaces are so confusing that
> /none/ of us understand it, can we introduce the appropriate wrappers
> and documentation so that the rest of us plugging away at the rest of
> the system can only call it the supported ways to achieve any of the
> supported outcomes?
> 

It looks to me that the changes to xlog_cil_force_seq() could
essentially be replaced with something like:

/*
 * Behold the magical async CIL force
 * ... <explain what this does> ...
 */
static inline void
xfs_cil_force_async(
	struct xlog	*log)
{
	struct xfs_cil	*cil = log->l_cilp;

	/* true for magic async CIL force */
	xlog_cil_push_now(log, cil->xc_current_sequence, true);
}

If so, we wouldn't have to hack up xlog_cil_force_seq() with a flags
parameter that provides inconsistent async semantics (and thus show
mercy on the log force call stack above it) and otherwise only serves to
bypass 95% of that function anyways. I also think that inverting the cil
push bool param (and calling it async_push or some such) is a bit more
clear because it separates it from XFS_LOG_SYNC (or !XFS_LOG_SYNC)
semantics and simultaneously maps directly to the underlying
->xc_push_async control.

Brian

> I'm willing to accept a bunch of documentation and "trivial" wrappers
> for the rest of us as a shoofly to enable the rest of the xfs developers
> to keep moving around a messy slow-moving log restructuring without
> falling into a work pit.
> 
> However, it's been difficult for me to check that without being able to
> reference a branch to see that at least the end result looks sane.  That
> was immensely helpful for reviewing Allison's deferred xattrs series.
> 
> (TLDR: git branch plz)
> 
> The other way to ease my fears, of course, would be to submit a ton of
> fstests to examine the log behavior for correctness, but that's
> difficult to pull off when the control surface is the userspace ABI.
> 
> > For example, if we want to direct map storage for log writes, then
> > iclog-based log force synchronisation needs to go away because we
> > don't need iclogs for buffering journal writes. Hence the log foce
> > code should interface only with the CIL, and only the CIL should
> > manage whatever mechanism it is using to write to stable storage.
> > The code is currently the way it is because the CIL, when first
> > implemented, had to co-exist with the old way of writing to the log.
> > We haven't used that old way for a decade now and we have very
> > different storage performance characteristics these days, so it's
> > about time we got rid of the mechanism designed to be optimal for
> > spinning disks and actually integrated the CIL and the log
> > efficiently.
> > 
> > There are a *lot* of steps to do this, and reworking the log force
> > implementation and API is part of that. But reworking that API is
> > premature because we haven't done all the necessary pre-work in
> > place to make such a change yet. This patch is actually part of that
> > pre-work to get the mechanisms that the log force rework will rely
> > on.
> > 
> > I have very good reasons for pushing back against your suggestions,
> > Brian. Your suggestions have merit but this patch is not the time to
> > be making the changes you suggest. Code does not need to be perfect
> > to be merged, nor does the entire larger change they will be part of
> > need to be complete and 100% tested and reviewed before preparation
> > and infrastructure patches can be merged. This is how we've done big
> > changes in the past - they've been staged across multiple kernel
> > cycles - and not everything needs to be done in the first
> > patchset of a larger body of work....
> 
> You mean there's even more beyond the 45 already on the list? /groan/
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
>
Brian Foster March 10, 2021, 2:49 p.m. UTC | #12
On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
> > > > > > So not only does this patch expose subsystem internals to
> > > > > > external layers and create more log forcing interfaces/behaviors to
> > > > > 
> > > > > Sorry, I don't follow. What "subsystem internals" are being exposed
> > > > > and what external layer are they being exposed to?
> > > > > 
> > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and
> > > > > > > changing things like log force behaviour and implementation is for
> > > > > > > a future series.
> > > > > > > 
> > > > > > 
> > > > > > TBH I think this patch is kind of a mess on its own. I think the
> > > > > > mechanism it wants to provide is sane, but I've not even got to the
> > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher
> > > > > > level feedback. I'd rather not go around in circles on this so I'll just
> > > > > > offer my summarized feedback to this patch...
> > > > > 
> > > > > I'm not dismissing review nor am I saying the API cannot or should
> > > > > not be improved. I'm simply telling you that I think the additional
> > > > > changes you are proposing are outside the scope of the problem I am
> > > > > addressing here. I already plan to rework the log force API (and
> > > > > others) but doing so it not something that this patchset needs to
> > > > > address, or indeed should address.
> > > > > 
> > > > 
> > > > I'm not proposing additional changes nor to rework the log force API.
> > > > I'm pointing out that I find this implementation to be extremely and
> > > > unnecessarily confusing.
> > > 
> > > And that's just fine - not everyone has to understand all of the
> > > code all of the time. That's why we have a team....
> > > 
> > > > To improve it, I'm suggesting to either coopt
> > > > the existing async log force API...
> > > 
> > > And I'm telling you that this is *future work*. As the person
> > > actually doing the work, that's my decision and you need to accept
> > > that. I understand your concerns and have a plan to address them -
> > > you just have to accept that the plan to address them isn't going to
> > > be exactly to your liking.
> > > 
> > 
> > As a reviewer, I am pointing out that I object to how this patch is
> > implemented and offered several fairly simple suggestions on how to
> > address my concerns. Those suggestions have been rejected pretty much
> > out of hand on the basis of the existence of some future plans.
> 
> We're allowed to disagree on the best approach. But to find a way
> forward means that both the summitter and the reviewer need to
> compromise. I've fixed up the obvious warts and bugs you've pointed
> out, and agreed that it needs cleaning up and have committed to
> cleaning it up in the near future.
> 
> > Future rework plans do not justify or change my view of this patch
> > and the mess it adds (for other developers, for historical context
> > and downstreams, etc.).  Therefore, if the only option you're
> > willing to consider is "make this mess now and clean it up later
> > in some broader rework," then I'd rather we just defer this patch
> > until after that rework is available and avoid the mess that way.
> 
> So you won't review it until I have 100 outstanding patches in this
> series and it's completely and utterly unreviewable? Then you'll ask
> me to split it up and re-order it into digestable chunks, and so
> we'll go back to having to merge this because any of the API rework
> that depends on the mechanism that this patch introduces.
> 

(Well, I'm not reviewing it now because I'm on PTO this week...)

> I'm not going to play that "now jump through this hoop" game.  We
> add flags for on-off behaviours in internal functions -all the
> time-. If this makes the interface so complex and confusing that you
> don't understand it, then the interface was already too complex and
> confusing. And fixing that is *not in the scope of this patchset*.
> 

I'm pretty sure creating a flag for the one-off behavior here was one of
my earlier suggestions. Based on the above, what's the problem with
that? I think Darrick's suggestion (re: my previous reply) around a
wrapper approach is quite reasonable as well.

> Demanding that code be made perfect before it can be merged is
> really not very helpful. Especially when there are already plans to
> rework the API but that rework is dependent on a bunch of other
> changes than need to be done first.
> 

"Jumping through hoops" on 100 patch series? Demanding perfection?
That's a little extreme for first pass feedback on v1 of a 3 patch
series, don't you think?

My position on this patch in its current form is that regardless of any
pending rework, it is unnecessarily gross. As stated numerous times, I
think there are various options for small tweaks that make it passable
for the time being. Let me know if you become willing to entertain the
notion that this patch might stand to improve independent from some
broader rework and I'm happy to revisit any of those options in further
detail.

In the meantime, I'm growing a little tired of the hyperbolic replies
around playing games, jumping through hoops, demanding perfection, or
whatever other nonsense. I've invested the better part of a week (and
now into PTO) in review of this patch alone because I feel strongly
about the result and am willing to contribute toward a solution, not
because I want to waste time or play games. I don't feel in kind about
the replies.

Brian

P.S.,

FWIW, I certainly don't expect all feedback to be incorporated during
review. On balance, I'd say a reasonable breakdown is that probably half
makes sense, another half might be bogus, and then some percentage of
that comes down to preference between reviewer and author where I'll try
to defer to the patch author when I feel there's at least been some good
faith discussion or justification for the current approach. "No, that's
obfuscation" or "no, I'll do it later" are not productive arguments if
you're looking for compromise. Those leave no room for discussion.
There's no counter suggestion to try and address the immediate concern
and work towards something mutually agreeable, or even any opportunity
to arrive at the conclusion that the original patch might be the ideal
outcome after all.

> iclogs are something that need to be moved behind the CIL, not sit
> in front of CIL. The CIL abstracts the journal and writing to the
> journal completely away from the transaction subsystem, yet the log
> force code punches straight through that abstraction and walks
> iclogs directly. The whole log force implementation needs to change,
> and I plan for the API that wraps the log forces to get reworked at
> that time.
> 
> For example, if we want to direct map storage for log writes, then
> iclog-based log force synchronisation needs to go away because we
> don't need iclogs for buffering journal writes. Hence the log foce
> code should interface only with the CIL, and only the CIL should
> manage whatever mechanism it is using to write to stable storage.
> The code is currently the way it is because the CIL, when first
> implemented, had to co-exist with the old way of writing to the log.
> We haven't used that old way for a decade now and we have very
> different storage performance characteristics these days, so it's
> about time we got rid of the mechanism designed to be optimal for
> spinning disks and actually integrated the CIL and the log
> efficiently.
> 
> There are a *lot* of steps to do this, and reworking the log force
> implementation and API is part of that. But reworking that API is
> premature because we haven't done all the necessary pre-work in
> place to make such a change yet. This patch is actually part of that
> pre-work to get the mechanisms that the log force rework will rely
> on.
> 
> I have very good reasons for pushing back against your suggestions,
> Brian. Your suggestions have merit but this patch is not the time to
> be making the changes you suggest. Code does not need to be perfect
> to be merged, nor does the entire larger change they will be part of
> need to be complete and 100% tested and reviewed before preparation
> and infrastructure patches can be merged. This is how we've done big
> changes in the past - they've been staged across multiple kernel
> cycles - and not everything needs to be done in the first
> patchset of a larger body of work....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Brian Foster March 10, 2021, 3:13 p.m. UTC | #13
(Replying separately to the logistical bits...)

On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote:
> On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote:
...
> > > 
> > > As a reviewer, I am pointing out that I object to how this patch is
> > > implemented and offered several fairly simple suggestions on how to
> > > address my concerns. Those suggestions have been rejected pretty much
> > > out of hand on the basis of the existence of some future plans.
> > 
> > We're allowed to disagree on the best approach. But to find a way
> > forward means that both the summitter and the reviewer need to
> > compromise. I've fixed up the obvious warts and bugs you've pointed
> > out, and agreed that it needs cleaning up and have committed to
> > cleaning it up in the near future.
> > 
> > > Future rework plans do not justify or change my view of this patch
> > > and the mess it adds (for other developers, for historical context
> > > and downstreams, etc.).  Therefore, if the only option you're
> > > willing to consider is "make this mess now and clean it up later
> > > in some broader rework," then I'd rather we just defer this patch
> > > until after that rework is available and avoid the mess that way.
> > 
> > So you won't review it until I have 100 outstanding patches in this
> > series and it's completely and utterly unreviewable?
> 
> Already unreviewable at 45, and I've only gotten through 2/3 of it.
> 
> > Then you'll ask me to split it up and re-order it into digestable
> > chunks, and so we'll go back to having to merge this because any of
> > the API rework that depends on the mechanism that this patch
> > introduces.
> 
> Here's something I haven't previously shared with all of you: Last cycle
> when we were going around and around on the ENOSPC/EDQUOT retry loop
> patches (which exploded from 13 to 41 patches) it was /very/ stressful
> to have to rework this and that part every day and a half for almost
> three weeks.
> 

I sympathize. There was at least a week in there that I spent pretty
much on nothing but review. :/

> When I get a patchset ready for submission, I export the patches from my
> development branch into a topic branch based off the latest -rc.  Every
> time anyone makes a suggestion, I update the topic branch and send that
> to the fstests "cloud" to see if it broke anything.  If it succeeds, I
> push it to k.org and resubmit.
> 
> The part that's less obvious is the fact that rebasing is /not/ that
> simple.  Next I integrate the changes into my development tree and
> rebase everything that comes after /that/ to make sure it doesn't
> interfere with my longer term development goals.  Sometimes this is
> easy, sometimes this takes an entire day to work out the warts.
> Then I send that to the fstests "cloud".  This rebase is particularly
> painful because every change that everyone makes to inode
> initialization collides with a rework of inode initialization that I've
> been working on in preparation for metadata directory trees.
> 
> The part that's even /less/ obvious than that is that once the
> development tree tests ok, then I do the same to my xfsprogs dev tree to
> make sure that nothing broke.  Frequently there are ABI or cognitive
> mismatches between kernel and userspace such that the build breaks, and
> then I have to patch the two kernel trees and re-run everything.
> 
> So, that's really stressful for me because a minor tweak to an interface
> can result in an enormous amount of work.  And I reject the argument
> that I could just rebase less frequently -- Dave does that, which means
> that any time he hears that one of his topic branches is being asked
> about, he has to spend weeks rebasing the whole mess to the latest
> upstream.  Maybe that works for him, but for me, I would hate that even
> more than doing a daily rebase.
> 
> Add to that the fact that vger has been a total delivery sh*tshow all
> year.  Now in addition to feeling disconnected from my family and
> friends, I also feel disconnected from work people too.  This really did
> nearly push me over the edge three weeks ago.
> 

Sorry. :(

> Please remember, one of the big advantages of our open development
> processes is that we /do/ accept code with warty (but functional)
> interfaces now, and we can clean them up later.  This is (IMHO) a good
> stress-reduction tactic, because each of us (ideally) should concentrate
> on getting the core algorithms right, and not focusing on rebasing code
> and smoothing over the same d*** merge conflicts over and over.
> 

I agree that we can and do accept code as such in certain cases, but not
always or by default. Usually, only when it makes sense due to
complexity, cross subsystem issues, etc. If the situation is the
submitter has sent a large patch series at a point where it's already so
unwieldy to change because of non-upstream reasons that some restricted
form of review is required, then I think that's a bit unfair to the
upstream community.

Using the whole block reclaim/retry loop thing as an example (just
because it was the most recent example of this kind of thing causing you
grief), perhaps those foundational changes should have been sent to the
list earlier before they became "too unwieldy to change?" I think it's
perfectly reasonable to send lightly tested or compile tested only
patches for initial thoughts, factoring feedback, etc., particularly if
a change proposes to add retry loops to various transaction allocation
callers. That's a fairly significant change in some core areas.

ISTM that on one hand review is fairly open ended and it's not really up
to the submitter to declare the scope of reviewer feedback (the
submitter is then certainly free to change certain things or not, but
the goal should be to cooperate toward some common ground). OTOH I
suppose the open development model allows for a community to determine a
standard for if/how they prefer to review patches.

That said, I personally don't like the "accept ugly code now, fix it
later over time" model by default because for one, I think it leads to a
lower quality mainline. I also think there's an implied assumption in
there that a reviewer wants to review endless series of cleanup patches
after providing such feedback on early versions of functionality. I know
if I take the time to review a 30+ patch series and had some
non-negligible amount of aesthetic feedback, I'm not really looking to
review another 20+ patches some time later just to clean up a mess that
the first series might have (unnecessarily) created, particularly when
the feedback has already been provided and so iterative review is more
time efficient (I find review much more time consuming than
development).

I dunno. That's just my .02 I guess. I do agree with Dave's previous
statements that not everything has to be 100% mutually agreeable. I
generally try to accommodate that by incorporating suggestions into my
own patches that I might not necessarily agree with, but don't feel
strongly enough about, to save time and keep things moving. Conversely
on the review side, I generally try to defer to other reviewers to see
whether my particular position might be in the majority or minority, and
go with that (That doesn't always mean I'm going to put reviewed-by tags
on patches I really don't like. In some cases, I think reviewing for
function/bugs and not nacking a patch is a form of compromise. ;).

> Yes, it's true that people think that a maintainer's only real power is
> to say 'no' in the hopes of forcing developers to fix everything now
> because they can't trust that a dev will ever come back with the
> promised updates, but I reject that 110%.  I'm not going anywhere, and I
> /do/ trust that when the rest of you say that you'll be back with wart
> remover, you will.
> 

I don't distrust that any XFS developers would make promised fixes as
such. I don't really see that as a problem at all. I think the whole
subsystem rework thing is pretty much a distraction from the original
feedback. E.g., I think your wrapper idea (explored in my previous
reply) is yet another reasonable solution that doesn't involve or
require any sort of broader rework.

Brian

> > I'm not going to play that "now jump through this hoop" game.  We
> > add flags for on-off behaviours in internal functions -all the
> > time-. If this makes the interface so complex and confusing that you
> > don't understand it, then the interface was already too complex and
> > confusing. And fixing that is *not in the scope of this patchset*.
> > 
> > Demanding that code be made perfect before it can be merged is
> > really not very helpful. Especially when there are already plans to
> > rework the API but that rework is dependent on a bunch of other
> > changes than need to be done first.
> > 
> > iclogs are something that need to be moved behind the CIL, not sit
> > in front of CIL. The CIL abstracts the journal and writing to the
> > journal completely away from the transaction subsystem, yet the log
> > force code punches straight through that abstraction and walks
> > iclogs directly. The whole log force implementation needs to change,
> > and I plan for the API that wraps the log forces to get reworked at
> > that time.
> 
> So here's what I want to know: Do Dave's changes to the log force APIs
> introduce broken behavior?  If the interfaces are so confusing that
> /none/ of us understand it, can we introduce the appropriate wrappers
> and documentation so that the rest of us plugging away at the rest of
> the system can only call it the supported ways to achieve any of the
> supported outcomes?
> 
> I'm willing to accept a bunch of documentation and "trivial" wrappers
> for the rest of us as a shoofly to enable the rest of the xfs developers
> to keep moving around a messy slow-moving log restructuring without
> falling into a work pit.
> 
> However, it's been difficult for me to check that without being able to
> reference a branch to see that at least the end result looks sane.  That
> was immensely helpful for reviewing Allison's deferred xattrs series.
> 
> (TLDR: git branch plz)
> 
> The other way to ease my fears, of course, would be to submit a ton of
> fstests to examine the log behavior for correctness, but that's
> difficult to pull off when the control surface is the userspace ABI.
> 
> > For example, if we want to direct map storage for log writes, then
> > iclog-based log force synchronisation needs to go away because we
> > don't need iclogs for buffering journal writes. Hence the log foce
> > code should interface only with the CIL, and only the CIL should
> > manage whatever mechanism it is using to write to stable storage.
> > The code is currently the way it is because the CIL, when first
> > implemented, had to co-exist with the old way of writing to the log.
> > We haven't used that old way for a decade now and we have very
> > different storage performance characteristics these days, so it's
> > about time we got rid of the mechanism designed to be optimal for
> > spinning disks and actually integrated the CIL and the log
> > efficiently.
> > 
> > There are a *lot* of steps to do this, and reworking the log force
> > implementation and API is part of that. But reworking that API is
> > premature because we haven't done all the necessary pre-work in
> > place to make such a change yet. This patch is actually part of that
> > pre-work to get the mechanisms that the log force rework will rely
> > on.
> > 
> > I have very good reasons for pushing back against your suggestions,
> > Brian. Your suggestions have merit but this patch is not the time to
> > be making the changes you suggest. Code does not need to be perfect
> > to be merged, nor does the entire larger change they will be part of
> > need to be complete and 100% tested and reviewed before preparation
> > and infrastructure patches can be merged. This is how we've done big
> > changes in the past - they've been staged across multiple kernel
> > cycles - and not everything needs to be done in the first
> > patchset of a larger body of work....
> 
> You mean there's even more beyond the 45 already on the list? /groan/
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
>
Dave Chinner March 10, 2021, 10 p.m. UTC | #14
On Tue, Mar 09, 2021 at 09:10:47PM -0500, Brian Foster wrote:
> On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote:
> > On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote:
> > > I'm not going to play that "now jump through this hoop" game.  We
> > > add flags for on-off behaviours in internal functions -all the
> > > time-. If this makes the interface so complex and confusing that you
> > > don't understand it, then the interface was already too complex and
> > > confusing. And fixing that is *not in the scope of this patchset*.
> > > 
> > > Demanding that code be made perfect before it can be merged is
> > > really not very helpful. Especially when there are already plans to
> > > rework the API but that rework is dependent on a bunch of other
> > > changes than need to be done first.
> > > 
> > > iclogs are something that need to be moved behind the CIL, not sit
> > > in front of CIL. The CIL abstracts the journal and writing to the
> > > journal completely away from the transaction subsystem, yet the log
> > > force code punches straight through that abstraction and walks
> > > iclogs directly. The whole log force implementation needs to change,
> > > and I plan for the API that wraps the log forces to get reworked at
> > > that time.
> > 
> > So here's what I want to know: Do Dave's changes to the log force APIs
> > introduce broken behavior?  If the interfaces are so confusing that
> > /none/ of us understand it, can we introduce the appropriate wrappers
> > and documentation so that the rest of us plugging away at the rest of
> > the system can only call it the supported ways to achieve any of the
> > supported outcomes?
> > 
> 
> It looks to me that the changes to xlog_cil_force_seq() could
> essentially be replaced with something like:
> 
> /*
>  * Behold the magical async CIL force
>  * ... <explain what this does> ...
>  */
> static inline void
> xfs_cil_force_async(
> 	struct xlog	*log)
> {
> 	struct xfs_cil	*cil = log->l_cilp;
> 
> 	/* true for magic async CIL force */
> 	xlog_cil_push_now(log, cil->xc_current_sequence, true);
> }

Oh, is that what you are asking for? You were talking about changing
the API for all the callers that didn't use XFS_LOG_SYNC to make
them all async, not about a one-off, single use wrapper for the AIL.

If all you want is a single line wrapper function, then don't
suggest that the whole API should be reworked and callers changed to
use a new API. *Ask for a one-line wrapper to be added*.

This wrapper will still need to go away in the near future, but at
least it doesn't involve changing lots of unrelated code...

Cheers,

Dave.
Christoph Hellwig March 11, 2021, 12:41 p.m. UTC | #15
On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote:
> > So you won't review it until I have 100 outstanding patches in this
> > series and it's completely and utterly unreviewable?
> 
> Already unreviewable at 45, and I've only gotten through 2/3 of it.


Yes.  For patches that don't just repetitively apply similar changes
to a few places, about 20 patches is the max that is digestable.

> Here's something I haven't previously shared with all of you: Last cycle
> when we were going around and around on the ENOSPC/EDQUOT retry loop
> patches (which exploded from 13 to 41 patches) it was /very/ stressful
> to have to rework this and that part every day and a half for almost
> three weeks.

As someone part of the loop I was a little surprised how quickly you
did respin the patches.  In general if I have feedback that requires
a major rework of a non-trivial series, I do not want to touch it
instantly.  Let the discussion continue a bit, becaue it can easily
turn into another direction and create more work.  I think waiting a few
days before doing anything that involves a lot of work generally helpsto
make everyones life a little easier.

> (TLDR: git branch plz)

Yes, for any non-trivial series that really, really helps.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 84cd9b6c6d1f..51acc0075699 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -50,11 +50,6 @@  xlog_state_get_iclog_space(
 	int			*continued_write,
 	int			*logoffsetp);
 STATIC void
-xlog_state_switch_iclogs(
-	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	int			eventual_size);
-STATIC void
 xlog_grant_push_ail(
 	struct xlog		*log,
 	int			need_bytes);
@@ -511,7 +506,7 @@  __xlog_state_release_iclog(
  * Flush iclog to disk if this is the last reference to the given iclog and the
  * it is in the WANT_SYNC state.
  */
-static int
+int
 xlog_state_release_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog)
@@ -531,23 +526,6 @@  xlog_state_release_iclog(
 	return 0;
 }
 
-void
-xfs_log_release_iclog(
-	struct xlog_in_core	*iclog)
-{
-	struct xlog		*log = iclog->ic_log;
-	bool			sync = false;
-
-	if (atomic_dec_and_lock(&iclog->ic_refcnt, &log->l_icloglock)) {
-		if (iclog->ic_state != XLOG_STATE_IOERROR)
-			sync = __xlog_state_release_iclog(log, iclog);
-		spin_unlock(&log->l_icloglock);
-	}
-
-	if (sync)
-		xlog_sync(log, iclog);
-}
-
 /*
  * Mount a log filesystem
  *
@@ -3188,7 +3166,7 @@  xfs_log_ticket_ungrant(
  * This routine will mark the current iclog in the ring as WANT_SYNC and move
  * the current iclog pointer to the next iclog in the ring.
  */
-STATIC void
+void
 xlog_state_switch_iclogs(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog,
@@ -3335,6 +3313,20 @@  xfs_log_force(
 	return -EIO;
 }
 
+/*
+ * Force the log to a specific LSN.
+ *
+ * If an iclog with that lsn can be found:
+ *	If it is in the DIRTY state, just return.
+ *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
+ *		state and go to sleep or return.
+ *	If it is in any other state, go to sleep or return.
+ *
+ * Synchronous forces are implemented with a wait queue.  All callers trying
+ * to force a given lsn to disk must wait on the queue attached to the
+ * specific in-core log.  When given in-core log finally completes its write
+ * to disk, that thread will wake up all threads waiting on the queue.
+ */
 static int
 xlog_force_lsn(
 	struct xlog		*log,
@@ -3398,18 +3390,15 @@  xlog_force_lsn(
 }
 
 /*
- * Force the in-core log to disk for a specific LSN.
+ * Force the log to a specific checkpoint sequence.
  *
- * Find in-core log with lsn.
- *	If it is in the DIRTY state, just return.
- *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
- *		state and go to sleep or return.
- *	If it is in any other state, go to sleep or return.
+ * First force the CIL so that all the required changes have been flushed to the
+ * iclogs. If the CIL force completed it will return a commit LSN that indicates
+ * the iclog that needs to be flushed to stable storage.
  *
- * Synchronous forces are implemented with a wait queue.  All callers trying
- * to force a given lsn to disk must wait on the queue attached to the
- * specific in-core log.  When given in-core log finally completes its write
- * to disk, that thread will wake up all threads waiting on the queue.
+ * If the XFS_LOG_SYNC flag is not set, we only trigger a background CIL force
+ * and do not wait for it to complete, nor do we attempt to check/flush iclogs
+ * as the CIL will not have committed when xlog_cil_force_seq() returns.
  */
 int
 xfs_log_force_seq(
@@ -3426,7 +3415,7 @@  xfs_log_force_seq(
 	XFS_STATS_INC(mp, xs_log_force);
 	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_seq(log, seq);
+	lsn = xlog_cil_force_seq(log, flags, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 33ae53401060..ff480d372dcf 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -104,6 +104,7 @@  struct xlog_ticket;
 struct xfs_log_item;
 struct xfs_item_ops;
 struct xfs_trans;
+struct xlog;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
 int	  xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
@@ -117,7 +118,6 @@  void	xfs_log_mount_cancel(struct xfs_mount *);
 xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
 xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
 void	  xfs_log_space_wake(struct xfs_mount *mp);
-void	  xfs_log_release_iclog(struct xlog_in_core *iclog);
 int	  xfs_log_reserve(struct xfs_mount *mp,
 			  int		   length,
 			  int		   count,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 2fda8c4513b2..1b46559ef64a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -657,6 +657,7 @@  xlog_cil_push_work(
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
+	bool			commit_iclog_sync = false;
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -667,7 +668,12 @@  xlog_cil_push_work(
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
+	commit_iclog_sync = cil->xc_push_async;
+	cil->xc_push_async = false;
 
+trace_printk("curseq %llu, ctxseq %llu pushseq %llu empty %d",
+		cil->xc_current_sequence, cil->xc_ctx->sequence,
+		cil->xc_push_seq, list_empty(&cil->xc_cil));
 	/*
 	 * As we are about to switch to a new, empty CIL context, we no longer
 	 * need to throttle tasks on CIL space overruns. Wake any waiters that
@@ -910,7 +916,11 @@  xlog_cil_push_work(
 		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
 
 	/* release the hounds! */
-	xfs_log_release_iclog(commit_iclog);
+	spin_lock(&log->l_icloglock);
+	if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
+		xlog_state_switch_iclogs(log, commit_iclog, 0);
+	xlog_state_release_iclog(log, commit_iclog);
+	spin_unlock(&log->l_icloglock);
 	return;
 
 out_skip:
@@ -993,13 +1003,26 @@  xlog_cil_push_background(
 /*
  * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
  * number that is passed. When it returns, the work will be queued for
- * @push_seq, but it won't be completed. The caller is expected to do any
- * waiting for push_seq to complete if it is required.
+ * @push_seq, but it won't be completed.
+ *
+ * If the caller is performing a synchronous force, we will flush the workqueue
+ * to get previously queued work moving to minimise the wait time they will
+ * undergo waiting for all outstanding pushes to complete. The caller is
+ * expected to do the required waiting for push_seq to complete.
+ *
+ * If the caller is performing an async push, we need to ensure that the
+ * checkpoint is fully flushed out of the iclogs when we finish the push. If we
+ * don't do this, then the commit record may remain sitting in memory in an
+ * ACTIVE iclog. This then requires another full log force to push to disk,
+ * which defeats the purpose of having an async, non-blocking CIL force
+ * mechanism. Hence in this case we need to pass a flag to the push work to
+ * indicate it needs to flush the commit record itself.
  */
 static void
 xlog_cil_push_now(
 	struct xlog	*log,
-	xfs_lsn_t	push_seq)
+	xfs_lsn_t	push_seq,
+	bool		sync)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -1009,7 +1032,8 @@  xlog_cil_push_now(
 	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
 
 	/* start on any pending background push to minimise wait time on it */
-	flush_work(&cil->xc_push_work);
+	if (sync)
+		flush_work(&cil->xc_push_work);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1022,6 +1046,8 @@  xlog_cil_push_now(
 	}
 
 	cil->xc_push_seq = push_seq;
+	if (!sync)
+		cil->xc_push_async = true;
 	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
@@ -1109,16 +1135,22 @@  xlog_cil_commit(
 /*
  * Conditionally push the CIL based on the sequence passed in.
  *
- * We only need to push if we haven't already pushed the sequence
- * number given. Hence the only time we will trigger a push here is
- * if the push sequence is the same as the current context.
+ * We only need to push if we haven't already pushed the sequence number given.
+ * Hence the only time we will trigger a push here is if the push sequence is
+ * the same as the current context.
  *
- * We return the current commit lsn to allow the callers to determine if a
- * iclog flush is necessary following this call.
+ * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set in
+ * the flags wait for it to complete, otherwise jsut return NULLCOMMITLSN to
+ * indicate we didn't wait for a commit lsn.
+ *
+ * If we waited for the push to complete, then we return the current commit lsn
+ * to allow the callers to determine if a iclog flush is necessary following
+ * this call.
  */
 xfs_lsn_t
 xlog_cil_force_seq(
 	struct xlog	*log,
+	uint32_t	flags,
 	uint64_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
@@ -1127,13 +1159,22 @@  xlog_cil_force_seq(
 
 	ASSERT(sequence <= cil->xc_current_sequence);
 
+	if (!sequence)
+		sequence = cil->xc_current_sequence;
+trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x",
+		sequence, cil->xc_current_sequence, cil->xc_ctx->sequence,
+		cil->xc_push_seq, flags);
+
+	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
 	/*
 	 * check to see if we need to force out the current context.
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
 restart:
-	xlog_cil_push_now(log, sequence);
+	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
+	if (!(flags & XFS_LOG_SYNC))
+		return commit_lsn;
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -1157,6 +1198,7 @@  xlog_cil_force_seq(
 			 * It is still being pushed! Wait for the push to
 			 * complete, then start again from the beginning.
 			 */
+			XFS_STATS_INC(log->l_mp, xs_log_force_sleep);
 			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
 			goto restart;
 		}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 7e8ec77bc6e6..6b1a251dd013 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -273,6 +273,7 @@  struct xfs_cil {
 
 	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
 	uint64_t		xc_push_seq;
+	bool			xc_push_async;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	uint64_t		xc_current_sequence;
@@ -487,6 +488,10 @@  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_in_core **commit_iclog, uint optype);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
+void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
+		int eventual_size);
+int	xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog);
+
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
@@ -558,12 +563,13 @@  void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
 /*
  * CIL force routines
  */
-xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
+				uint64_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index f1bc88f4367c..18dc5eca6c04 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -10,6 +10,7 @@ 
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_sysfs.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_mount.h"
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9b8d703dc9fd..d111a994b7b6 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -20,6 +20,7 @@ 
 #include "xfs_bmap.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_quota.h"
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 697703f3be48..7d05694681e3 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -9,7 +9,6 @@ 
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
-#include "xfs_log_priv.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_extent_busy.h"
@@ -17,6 +16,7 @@ 
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..dfc0206c0d36 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@ 
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -429,8 +430,12 @@  xfsaild_push(
 
 	/*
 	 * If we encountered pinned items or did not finish writing out all
-	 * buffers the last time we ran, force the log first and wait for it
-	 * before pushing again.
+	 * buffers the last time we ran, force a background CIL push to get the
+	 * items unpinned in the near future. We do not wait on the CIL push as
+	 * that could stall us for seconds if there is enough background IO
+	 * load. Stalling for that long when the tail of the log is pinned and
+	 * needs flushing will hard stop the transaction subsystem when log
+	 * space runs out.
 	 */
 	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
 	    (!list_empty_careful(&ailp->ail_buf_list) ||
@@ -438,7 +443,7 @@  xfsaild_push(
 		ailp->ail_log_flush = 0;
 
 		XFS_STATS_INC(mp, xs_push_ail_flush);
-		xfs_log_force(mp, XFS_LOG_SYNC);
+		xlog_cil_force_seq(mp->m_log, 0, 0);
 	}
 
 	spin_lock(&ailp->ail_lock);