[2/2] xfs: don't wait on future iclogs when pushing the CIL

Message ID	20210615064658.854029-3-david@fromorbit.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-xfs-owner@kernel.org> From: Dave Chinner <david@fromorbit.com> To: linux-xfs@vger.kernel.org Subject: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Date: Tue, 15 Jun 2021 16:46:58 +1000 Message-Id: <20210615064658.854029-3-david@fromorbit.com> In-Reply-To: <20210615064658.854029-1-david@fromorbit.com> References: <20210615064658.854029-1-david@fromorbit.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[1/2] xfs: add iclog state trace events \| expand [1/2] xfs: add iclog state trace events [2/2] xfs: don't wait on future iclogs when pushing the CIL

Dave Chinner June 15, 2021, 6:46 a.m. UTC

From: Dave Chinner <dchinner@redhat.com>

The iclogbuf ring attached to the struct xlog is circular, hence the
first and last iclogs in the ring can only be determined by
comparing them against the log->l_iclog pointer.

In xfs_cil_push_work(), we want to wait on previous iclogs that were
issued so that we can flush them to stable storage with the commit
record write, and it simply waits on the previous iclog in the ring.
This, however, leads to CIL push hangs in generic/019 like so:

task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
Workqueue: xfs-cil/pmem1 xlog_cil_push_work
Call Trace:
 __schedule+0x30b/0x9f0
 schedule+0x68/0xe0
 xlog_wait_on_iclog+0x121/0x190
 ? wake_up_q+0xa0/0xa0
 xlog_cil_push_work+0x994/0xa10
 ? _raw_spin_lock+0x15/0x20
 ? xfs_swap_extents+0x920/0x920
 process_one_work+0x1ab/0x390
 worker_thread+0x56/0x3d0
 ? rescuer_thread+0x3c0/0x3c0
 kthread+0x14d/0x170
 ? __kthread_bind_mask+0x70/0x70
 ret_from_fork+0x1f/0x30

With other threads blocking in either xlog_state_get_iclog_space()
waiting for iclog space or xlog_grant_head_wait() waiting for log
reservation space.

The problem here is that the previous iclog on the ring might
actually be a future iclog. That is, if log->l_iclog points at
commit_iclog, commit_iclog is the first (oldest) iclog in the ring
and there are no previous iclogs pending as they have all completed
their IO and been activated again. IOWs, commit_iclog->ic_prev
points to an iclog that will be written in the future, not one that
has been written in the past.

Hence, in this case, waiting on the ->ic_prev iclog is incorrect
behaviour, and depending on the state of the future iclog, we can
end up with a circular ABA wait cycle and we hang.

Fix this by only waiting on the previous iclog when the commit_iclog
is not the oldest iclog in the ring.

Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

Brian Foster June 15, 2021, 3:38 p.m. UTC | #1

On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The iclogbuf ring attached to the struct xlog is circular, hence the
> first and last iclogs in the ring can only be determined by
> comparing them against the log->l_iclog pointer.
> 
> In xfs_cil_push_work(), we want to wait on previous iclogs that were
> issued so that we can flush them to stable storage with the commit
> record write, and it simply waits on the previous iclog in the ring.
> This, however, leads to CIL push hangs in generic/019 like so:
> 
> task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> Call Trace:
>  __schedule+0x30b/0x9f0
>  schedule+0x68/0xe0
>  xlog_wait_on_iclog+0x121/0x190
>  ? wake_up_q+0xa0/0xa0
>  xlog_cil_push_work+0x994/0xa10
>  ? _raw_spin_lock+0x15/0x20
>  ? xfs_swap_extents+0x920/0x920
>  process_one_work+0x1ab/0x390
>  worker_thread+0x56/0x3d0
>  ? rescuer_thread+0x3c0/0x3c0
>  kthread+0x14d/0x170
>  ? __kthread_bind_mask+0x70/0x70
>  ret_from_fork+0x1f/0x30
> 
> With other threads blocking in either xlog_state_get_iclog_space()
> waiting for iclog space or xlog_grant_head_wait() waiting for log
> reservation space.
> 
> The problem here is that the previous iclog on the ring might
> actually be a future iclog. That is, if log->l_iclog points at
> commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> and there are no previous iclogs pending as they have all completed
> their IO and been activated again. IOWs, commit_iclog->ic_prev
> points to an iclog that will be written in the future, not one that
> has been written in the past.
> 
> Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> behaviour, and depending on the state of the future iclog, we can
> end up with a circular ABA wait cycle and we hang.
> 
> Fix this by only waiting on the previous iclog when the commit_iclog
> is not the oldest iclog in the ring.
> 
> Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 705619e9dab4..398f00cf9cbf 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -1075,15 +1075,21 @@ xlog_cil_push_work(
>  	ticket = ctx->ticket;
>  
>  	/*
> -	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog. In this case,
> -	 * the commit_iclog write needs to issue a pre-flush so that the
> -	 * ordering is correctly preserved down to stable storage.
> +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> +	 * to complete before we submit the commit_iclog. If the commit iclog is
> +	 * at the head of the iclog ring, then all other iclogs have completed
> +	 * and are waiting on this one and hence we don't need to wait.
> +	 *
> +	 * Regardless of whether we need to wait or not, the the commit_iclog
> +	 * write needs to issue a pre-flush so that the ordering for this
> +	 * checkpoint is correctly preserved down to stable storage.
>  	 */
>  	spin_lock(&log->l_icloglock);
>  	if (ctx->start_lsn != commit_lsn) {
> -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> -		spin_lock(&log->l_icloglock);
> +		if (commit_iclog != log->l_iclog) {
> +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> +			spin_lock(&log->l_icloglock);
> +		}

I'm confused at what this is attempting to accomplish. If we have a
single CIL force and that happens to span several iclogs, isn't the goal
to wait on the previous iclog to ensure the previously written content
of the current checkpoint is on-disk? If so, it looks to me that
commit_iclog will always match log->l_iclog in that scenario because in
the commit record path we don't (potentially) switch it out until toward
the end of this function. Hmm.. did you mean to check ->ic_prev against
->l_iclog perhaps, or am I just missing something else?

Brian

>  		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>  	}
>  
> -- 
> 2.31.1
>

Darrick J. Wong June 15, 2021, 4:19 p.m. UTC | #2

On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The iclogbuf ring attached to the struct xlog is circular, hence the
> first and last iclogs in the ring can only be determined by
> comparing them against the log->l_iclog pointer.
> 
> In xfs_cil_push_work(), we want to wait on previous iclogs that were
> issued so that we can flush them to stable storage with the commit
> record write, and it simply waits on the previous iclog in the ring.
> This, however, leads to CIL push hangs in generic/019 like so:
> 
> task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> Call Trace:
>  __schedule+0x30b/0x9f0
>  schedule+0x68/0xe0
>  xlog_wait_on_iclog+0x121/0x190
>  ? wake_up_q+0xa0/0xa0
>  xlog_cil_push_work+0x994/0xa10
>  ? _raw_spin_lock+0x15/0x20
>  ? xfs_swap_extents+0x920/0x920
>  process_one_work+0x1ab/0x390
>  worker_thread+0x56/0x3d0
>  ? rescuer_thread+0x3c0/0x3c0
>  kthread+0x14d/0x170
>  ? __kthread_bind_mask+0x70/0x70
>  ret_from_fork+0x1f/0x30
> 
> With other threads blocking in either xlog_state_get_iclog_space()
> waiting for iclog space or xlog_grant_head_wait() waiting for log
> reservation space.
> 
> The problem here is that the previous iclog on the ring might
> actually be a future iclog. That is, if log->l_iclog points at
> commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> and there are no previous iclogs pending as they have all completed
> their IO and been activated again. IOWs, commit_iclog->ic_prev
> points to an iclog that will be written in the future, not one that
> has been written in the past.
> 
> Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> behaviour, and depending on the state of the future iclog, we can
> end up with a circular ABA wait cycle and we hang.
> 
> Fix this by only waiting on the previous iclog when the commit_iclog
> is not the oldest iclog in the ring.
> 
> Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 705619e9dab4..398f00cf9cbf 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -1075,15 +1075,21 @@ xlog_cil_push_work(
>  	ticket = ctx->ticket;
>  
>  	/*
> -	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog. In this case,
> -	 * the commit_iclog write needs to issue a pre-flush so that the
> -	 * ordering is correctly preserved down to stable storage.
> +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> +	 * to complete before we submit the commit_iclog. If the commit iclog is
> +	 * at the head of the iclog ring, then all other iclogs have completed
> +	 * and are waiting on this one and hence we don't need to wait.
> +	 *
> +	 * Regardless of whether we need to wait or not, the the commit_iclog
> +	 * write needs to issue a pre-flush so that the ordering for this
> +	 * checkpoint is correctly preserved down to stable storage.
>  	 */
>  	spin_lock(&log->l_icloglock);
>  	if (ctx->start_lsn != commit_lsn) {
> -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> -		spin_lock(&log->l_icloglock);
> +		if (commit_iclog != log->l_iclog) {
> +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> +			spin_lock(&log->l_icloglock);
> +		}

I'm confused.  How can you tell that we need to wait for
commit_iclog->ic_prev to be written out by comparing commit_iclog to
log->l_iclog?  Can't you determine this by checking ic_prev for DIRTY
state?

--D


>  		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>  	}
>  
> -- 
> 2.31.1
>

Brian Foster June 16, 2021, 2:57 p.m. UTC | #3

On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote:
> > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The iclogbuf ring attached to the struct xlog is circular, hence the
> > > first and last iclogs in the ring can only be determined by
> > > comparing them against the log->l_iclog pointer.
> > > 
> > > In xfs_cil_push_work(), we want to wait on previous iclogs that were
> > > issued so that we can flush them to stable storage with the commit
> > > record write, and it simply waits on the previous iclog in the ring.
> > > This, however, leads to CIL push hangs in generic/019 like so:
> > > 
> > > task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> > > Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> > > Call Trace:
> > >  __schedule+0x30b/0x9f0
> > >  schedule+0x68/0xe0
> > >  xlog_wait_on_iclog+0x121/0x190
> > >  ? wake_up_q+0xa0/0xa0
> > >  xlog_cil_push_work+0x994/0xa10
> > >  ? _raw_spin_lock+0x15/0x20
> > >  ? xfs_swap_extents+0x920/0x920
> > >  process_one_work+0x1ab/0x390
> > >  worker_thread+0x56/0x3d0
> > >  ? rescuer_thread+0x3c0/0x3c0
> > >  kthread+0x14d/0x170
> > >  ? __kthread_bind_mask+0x70/0x70
> > >  ret_from_fork+0x1f/0x30
> > > 
> > > With other threads blocking in either xlog_state_get_iclog_space()
> > > waiting for iclog space or xlog_grant_head_wait() waiting for log
> > > reservation space.
> > > 
> > > The problem here is that the previous iclog on the ring might
> > > actually be a future iclog. That is, if log->l_iclog points at
> > > commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> > > and there are no previous iclogs pending as they have all completed
> > > their IO and been activated again. IOWs, commit_iclog->ic_prev
> > > points to an iclog that will be written in the future, not one that
> > > has been written in the past.
> > > 
> > > Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> > > behaviour, and depending on the state of the future iclog, we can
> > > end up with a circular ABA wait cycle and we hang.
> > > 
> > > Fix this by only waiting on the previous iclog when the commit_iclog
> > > is not the oldest iclog in the ring.
> > > 
> > > Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> > > Reported-by: Brian Foster <bfoster@redhat.com>
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
> > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index 705619e9dab4..398f00cf9cbf 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -1075,15 +1075,21 @@ xlog_cil_push_work(
> > >  	ticket = ctx->ticket;
> > >  
> > >  	/*
> > > -	 * If the checkpoint spans multiple iclogs, wait for all previous
> > > -	 * iclogs to complete before we submit the commit_iclog. In this case,
> > > -	 * the commit_iclog write needs to issue a pre-flush so that the
> > > -	 * ordering is correctly preserved down to stable storage.
> > > +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> > > +	 * to complete before we submit the commit_iclog. If the commit iclog is
> > > +	 * at the head of the iclog ring, then all other iclogs have completed
> > > +	 * and are waiting on this one and hence we don't need to wait.
> > > +	 *
> > > +	 * Regardless of whether we need to wait or not, the the commit_iclog
> > > +	 * write needs to issue a pre-flush so that the ordering for this
> > > +	 * checkpoint is correctly preserved down to stable storage.
> > >  	 */
> > >  	spin_lock(&log->l_icloglock);
> > >  	if (ctx->start_lsn != commit_lsn) {
> > > -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> > > -		spin_lock(&log->l_icloglock);
> > > +		if (commit_iclog != log->l_iclog) {
> > > +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> > > +			spin_lock(&log->l_icloglock);
> > > +		}
> 
> Brian says:
> 
> > I'm confused at what this is attempting to accomplish. If we have a
> > single CIL force and that happens to span several iclogs, isn't the goal
> > to wait on the previous iclog to ensure the previously written content
> > of the current checkpoint is on-disk?
> 
> Yes. I wouldn't be surprised if I broke something, though, because I
> was only looking at situation where the log hang was occurring.
> 
> I think I should have added a state check on the commit_iclog being
> in ACTIVE state, too, because if it is active then there are no
> future iclogs in flight and we should definitely wait. But that
> doesn't solve the problem, either, because if it is in WANT_SYNC
> there are future iclogs in flight (because
> xlog_state_get_iclog_space() has switched iclogs) and then we're in
> trouble because....
> 
> 
> Darrick says:
> 
> > I'm confused.  How can you tell that we need to wait for
> > commit_iclog->ic_prev to be written out by comparing commit_iclog to
> > log->l_iclog?  Can't you determine this by checking ic_prev for DIRTY
> > state?
> 
> .... the problem is that we can't tell if ->ic_prev is a future or
> past iclog based on it's state. If it is in the past the ic_prev
> iclog could be in any state from WANT_SYNC to ACTIVE, depending on
> where the IO completion is. And if it's in the future, it could be
> in any state from ACTIVE to SYNC_DONE. It's the overlap in
> past/future states that causes the problem here.
> 
> Being in DIRTY state just means that IO completion is running,
> callbacks are complete and we're about to run a ic_force_wait queue
> wakeup and activate the iclog again. When the hang occurs, the
> future iclogs are stuck in the DONE_SYNC state and can't progress
> through to the callback state, let alone dirty...
> 
> Even an iclog in ACTIVE state doesn't mean it is in the past - it
> just means that it's ready to be written to. It could be being
> filled at this point in time by another checkpoint via xlog_write(),
> and it, at any point in time, can transition to WANT_SYNC, SYNCING,
> or SYNC_DONE.
> 
> xlog_wait_on_iclog() will ignore ACTIVE/DIRTY iclogs, anyway, but
> the problem is that a future iclog can still transition at any time
> to WANT_SYNC or any other state indicating IO is in progress. Hence
> we just can't tell if it is a past or future iclog and so we can't
> blindly wait on it. In the common case, ic_prev is a past iclog and
> it won't get stuck on the commit_iclog being ACTIVE/WANT_SYNC at the
> head of the log->l_iclog ring in xlog_state_do_callback() and so
> will be processed and go back to ACTIVE state. However, in the rare
> case it is a future iclog, then it gets stuck waiting for IO
> completion of the commit_iclog to reach xlog_state_do_callback()....
> 
> ----
> 
> What I was trying to take into account via the (commit_iclog !=
> log->l_iclog) check was a measure of past/future tense. The hang
> always occurs where commit_iclog == log->l_iclog and all ic_next
> iclogs are future iclogs rather than being past iclogs. As Brian
> notes, this probably just broke the single commit case, but because
> the failure stress tests load up the log greatly, they don't
> exercise that case.
> 

That's kind of why I was wondering if you meant to check the state of
->ic_prev against ->l_iclog, but thinking about it again I'm not sure
that is valid either because I don't think anything prevents an iclog
from landing at the head before it might have completed pending I/O.
Maybe that doesn't make sense, I'm still wrapping my head around the
whole multi-push CIL model the original patch creates..

> So I went and looked at all the cases where we check
> ic_prev->ic_state to make decisions. There's only one place that
> does that: log forces. The sync transaction/log force aggregation
> code is the only thing that checks the state of the previous iclog,
> and it appears to avoid the "future iclog" problem in two ways.
> First, it only checks the previous iclog state when the current head
> iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it
> ic-prev is a future iclog, it only waits on the ic_write_wait queue,
> which means the log force gets woken when IO completion occurs and
> the iclog transitions to DONE_SYNC, hence it never waits for
> ordering of callbacks (other log force code does that).
> 
> IOWs, there's only one place that actually tries to use the previous
> iclog state for making decisions, and it does doesn't really handle
> the case we need to handle here which is determining if the previous
> iclog is a past or future iclog. I does, however, handle the case of
> incorrect waiting safely.
> 
> That is, what we need to wait for in the CIL push is the completion
> of the physical IO to be signalled, not for the callbacks associated
> with that iclog to run. IOWs, neat as it may be,
> xlog_wait_on_iclog() is not the right function to be using here. We
> should be waiting on the ic_write_wait queue to be woken when iclog
> IO completion occurs, not the ic_force_wait queue that waits on
> ordered completion of entire journal checkpoints. We just want to
> ensure that all previous iclogs are physically completed before we
> do the pre-flush on the commit record IO, we don't need software
> processing of those IOs to be complete.
> 

Ok, but doesn't that leave a gap in scenarios where iclog completion
occurs out of order? I thought the reason for waiting on the force of
the previous iclog was to encapsulate that the entire content of the
checkpoint has been written to disk, not just the content in the
immediately previous iclog.

If the issue is past/future tense, is there any reason we can't filter
against the record header lsn of the previous iclog? E.g., assuming the
iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block
on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0?

Hmm.. that also has me wondering if it's possible that the previous
iclog might be some other future checkpoint, but then skipping the wait
entirely means the checkpoint might not actually be on-disk [1]. I
wonder if we need to do something like track the last lsn and/or iclog
touched by the current checkpoint and make sure we wait on that,
regardless of where it is in the ring (but as above, I'm still wrapping
my head around all of this).

Brian

[1] Once we work out the functional behavior one way or another, how do
we actually test/confirm that the flush behavior works as expected? I
feel as though the discussions going all the way to the initial review
feedback on this change have provided constant reminders that things
like the I/O flush / ordering dependencies are very easy to get wrong.
If that is ever the case, it's not clear to me we have any way to detect
it. Not that I'd expect to have some explicit 5 second fstest for that
problem, but have we analyzed whether the few decent crash recovery
tests we do have are sensitive enough to fail? I wonder if we could come
up with something clever and more explicit via error injection..

> So, I think the right thing to do here is change this code to wait
> on the ic_write_wait queue for completion of the previous iclog and
> then we just don't care if commit_iclog->ic_prev is a past or future
> iclog....
> 
> 
> 
> -- 
> Dave Chinner
> david@fromorbit.com
>

Dave Chinner June 16, 2021, 10:20 p.m. UTC | #4

On Wed, Jun 16, 2021 at 10:57:29AM -0400, Brian Foster wrote:
> On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote:
> > On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote:
> > > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> > So I went and looked at all the cases where we check
> > ic_prev->ic_state to make decisions. There's only one place that
> > does that: log forces. The sync transaction/log force aggregation
> > code is the only thing that checks the state of the previous iclog,
> > and it appears to avoid the "future iclog" problem in two ways.
> > First, it only checks the previous iclog state when the current head
> > iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it
> > ic-prev is a future iclog, it only waits on the ic_write_wait queue,
> > which means the log force gets woken when IO completion occurs and
> > the iclog transitions to DONE_SYNC, hence it never waits for
> > ordering of callbacks (other log force code does that).
> > 
> > IOWs, there's only one place that actually tries to use the previous
> > iclog state for making decisions, and it does doesn't really handle
> > the case we need to handle here which is determining if the previous
> > iclog is a past or future iclog. I does, however, handle the case of
> > incorrect waiting safely.
> > 
> > That is, what we need to wait for in the CIL push is the completion
> > of the physical IO to be signalled, not for the callbacks associated
> > with that iclog to run. IOWs, neat as it may be,
> > xlog_wait_on_iclog() is not the right function to be using here. We
> > should be waiting on the ic_write_wait queue to be woken when iclog
> > IO completion occurs, not the ic_force_wait queue that waits on
> > ordered completion of entire journal checkpoints. We just want to
> > ensure that all previous iclogs are physically completed before we
> > do the pre-flush on the commit record IO, we don't need software
> > processing of those IOs to be complete.
> > 
> 
> Ok, but doesn't that leave a gap in scenarios where iclog completion
> occurs out of order? I thought the reason for waiting on the force of
> the previous iclog was to encapsulate that the entire content of the
> checkpoint has been written to disk, not just the content in the
> immediately previous iclog.

Yes. We either have to wait for all the iclogs to be written or
we have to use ic_force_wait for sequencing.

> If the issue is past/future tense, is there any reason we can't filter
> against the record header lsn of the previous iclog? E.g., assuming the
> iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block
> on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0?

That's what the patch I left running overnight does:

        /*
         * If the checkpoint spans multiple iclogs, wait for all previous iclogs
         * to complete before we submit the commit_iclog. We can't use state
         * checks for this - ACTIVE can be either a past completed iclog or a
         * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a
         * past or future iclog awaiting IO or ordered IO completion to be run.
         * In the latter case, if it's a future iclog and we wait on it, the we
         * will hang because it won't get processed through to ic_force_wait
         * wakeup until this commit_iclog is written to disk.  Hence we use the
         * iclog header lsn and compare it to the commit lsn to determine if we
         * need to wait on iclogs or not.
         */
        spin_lock(&log->l_icloglock);
        if (ctx->start_lsn != ctx->commit_lsn) {
                struct xlog_in_core     *iclog;

                for (iclog = commit_iclog->ic_prev;
                     iclog != commit_iclog;
                     iclog = iclog->ic_prev) {
                        xfs_lsn_t       hlsn;

                        /*
                         * If the LSN of the iclog is zero or in the future it
                         * means it has passed through IO completion and
                         * activation and hence all previous iclogs have also
                         * done so. We do not need to wait at all in this case.
                         */
                        hlsn = be64_to_cpu(iclog->ic_header.h_lsn);
                        if (!hlsn || XFS_LSN_CMP(hlsn, ctx->commit_lsn) > 0)
                                break;

                        /*
                         * If the LSN of the iclog is older than the commit lsn,
                         * we have to wait on it. Waiting on this via the
                         * ic_force_wait should also order the completion of all
                         * older iclogs, too, but we leave checking that to the
                         * next loop iteration.
                         */
                        ASSERT(XFS_LSN_CMP(hlsn, ctx->commit_lsn) < 0);
                        xlog_wait_on_iclog(iclog);
                        spin_lock(&log->l_icloglock);
                }

                /*
                 * Regardless of whether we need to wait or not, the the
                 * commit_iclog write needs to issue a pre-flush so that the
                 * ordering for this checkpoint is correctly preserved down to
                 * stable storage.
                 */
                commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
        }


> Hmm.. that also has me wondering if it's possible that the previous
> iclog might be some other future checkpoint, but then skipping the wait
> entirely means the checkpoint might not actually be on-disk [1]. I

If the previous iclog is a future iclog, then all iclogs are future
iclogs and this iclog is the oldest. The iclog ring only ages in one
direction....

> wonder if we need to do something like track the last lsn and/or iclog
> touched by the current checkpoint and make sure we wait on that,
> regardless of where it is in the ring (but as above, I'm still wrapping
> my head around all of this).

We have to guarantee the log is completely stable up to the commit
record, regardless of which checkpoint the iclogs belong to.  If we
don't do that, then log recovery will consider that the checkpoint
in the log is not complete before it finds the commit record. i.e.
there will be a hole in the log where the previous cycle numbers
show through between the end of the checkpoint and the commit
record, and hence the head of the log will be pointed at the last
fully completed checkpoint, not this one.

Hence we have to ensure all prior iclogs have completed to stable
storage before writing the commit record, regardless of whether the
previous iclogs are part of the current checkpoint or not.

> [1] Once we work out the functional behavior one way or another, how do
> we actually test/confirm that the flush behavior works as expected? I
> feel as though the discussions going all the way to the initial review
> feedback on this change have provided constant reminders that things
> like the I/O flush / ordering dependencies are very easy to get wrong.
> If that is ever the case, it's not clear to me we have any way to detect
> it. Not that I'd expect to have some explicit 5 second fstest for that
> problem, but have we analyzed whether the few decent crash recovery
> tests we do have are sensitive enough to fail? I wonder if we could come
> up with something clever and more explicit via error injection..

Given that we've found these problems soon after the code was
integrated and made available for wider testing, I think it shows
that, at the integration level, we have sufficiently sensitive crash
and crash recovery tests to discover issues like this. It's not
surprising that we may not find things like this at the
individual developer level because they only have so much time and
resources at their disposal.

IOWs, I think the process is working exactly the way it should be
working. It is expected that complex, subtle problems like these
will be found (if they exist) once more people are actively using
and testing the code in a variety of different environments. THis is
exactly the function that integration testing is supposed to
provide...

So while it's possible that we could have more reliable test to
exercise this stuff (no real idea how, though), the way we are
developing and merging and finding much harder to trigger problems
once the code is available for wider testing shows that our
engineering process is working as it should be....

Cheers,

Dave.

Brian Foster June 17, 2021, 1:15 p.m. UTC | #5

On Thu, Jun 17, 2021 at 08:20:54AM +1000, Dave Chinner wrote:
> On Wed, Jun 16, 2021 at 10:57:29AM -0400, Brian Foster wrote:
> > On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote:
> > > > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> > > So I went and looked at all the cases where we check
> > > ic_prev->ic_state to make decisions. There's only one place that
> > > does that: log forces. The sync transaction/log force aggregation
> > > code is the only thing that checks the state of the previous iclog,
> > > and it appears to avoid the "future iclog" problem in two ways.
> > > First, it only checks the previous iclog state when the current head
> > > iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it
> > > ic-prev is a future iclog, it only waits on the ic_write_wait queue,
> > > which means the log force gets woken when IO completion occurs and
> > > the iclog transitions to DONE_SYNC, hence it never waits for
> > > ordering of callbacks (other log force code does that).
> > > 
> > > IOWs, there's only one place that actually tries to use the previous
> > > iclog state for making decisions, and it does doesn't really handle
> > > the case we need to handle here which is determining if the previous
> > > iclog is a past or future iclog. I does, however, handle the case of
> > > incorrect waiting safely.
> > > 
> > > That is, what we need to wait for in the CIL push is the completion
> > > of the physical IO to be signalled, not for the callbacks associated
> > > with that iclog to run. IOWs, neat as it may be,
> > > xlog_wait_on_iclog() is not the right function to be using here. We
> > > should be waiting on the ic_write_wait queue to be woken when iclog
> > > IO completion occurs, not the ic_force_wait queue that waits on
> > > ordered completion of entire journal checkpoints. We just want to
> > > ensure that all previous iclogs are physically completed before we
> > > do the pre-flush on the commit record IO, we don't need software
> > > processing of those IOs to be complete.
> > > 
> > 
> > Ok, but doesn't that leave a gap in scenarios where iclog completion
> > occurs out of order? I thought the reason for waiting on the force of
> > the previous iclog was to encapsulate that the entire content of the
> > checkpoint has been written to disk, not just the content in the
> > immediately previous iclog.
> 
> Yes. We either have to wait for all the iclogs to be written or
> we have to use ic_force_wait for sequencing.
> 

Right, Ok. That wasn't stated clearly in your earlier comment.

> > If the issue is past/future tense, is there any reason we can't filter
> > against the record header lsn of the previous iclog? E.g., assuming the
> > iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block
> > on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0?
> 
> That's what the patch I left running overnight does:
> 

Ok.

>         /*
>          * If the checkpoint spans multiple iclogs, wait for all previous iclogs
>          * to complete before we submit the commit_iclog. We can't use state
>          * checks for this - ACTIVE can be either a past completed iclog or a
>          * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a
>          * past or future iclog awaiting IO or ordered IO completion to be run.
>          * In the latter case, if it's a future iclog and we wait on it, the we
>          * will hang because it won't get processed through to ic_force_wait
>          * wakeup until this commit_iclog is written to disk.  Hence we use the
>          * iclog header lsn and compare it to the commit lsn to determine if we
>          * need to wait on iclogs or not.
>          */
>         spin_lock(&log->l_icloglock);
>         if (ctx->start_lsn != ctx->commit_lsn) {
>                 struct xlog_in_core     *iclog;
> 
>                 for (iclog = commit_iclog->ic_prev;
>                      iclog != commit_iclog;
>                      iclog = iclog->ic_prev) {
>                         xfs_lsn_t       hlsn;
> 
>                         /*
>                          * If the LSN of the iclog is zero or in the future it
>                          * means it has passed through IO completion and
>                          * activation and hence all previous iclogs have also
>                          * done so. We do not need to wait at all in this case.
>                          */
>                         hlsn = be64_to_cpu(iclog->ic_header.h_lsn);
>                         if (!hlsn || XFS_LSN_CMP(hlsn, ctx->commit_lsn) > 0)
>                                 break;
> 
>                         /*
>                          * If the LSN of the iclog is older than the commit lsn,
>                          * we have to wait on it. Waiting on this via the
>                          * ic_force_wait should also order the completion of all
>                          * older iclogs, too, but we leave checking that to the
>                          * next loop iteration.
>                          */

What is the point of the loop? ISTM the functional change only requires
some minimal tweaking from the first patch you posted.

>                         ASSERT(XFS_LSN_CMP(hlsn, ctx->commit_lsn) < 0);
>                         xlog_wait_on_iclog(iclog);
>                         spin_lock(&log->l_icloglock);
>                 }
> 
>                 /*
>                  * Regardless of whether we need to wait or not, the the
>                  * commit_iclog write needs to issue a pre-flush so that the
>                  * ordering for this checkpoint is correctly preserved down to
>                  * stable storage.
>                  */
>                 commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>         }
> 
> 
> > Hmm.. that also has me wondering if it's possible that the previous
> > iclog might be some other future checkpoint, but then skipping the wait
> > entirely means the checkpoint might not actually be on-disk [1]. I
> 
> If the previous iclog is a future iclog, then all iclogs are future
> iclogs and this iclog is the oldest. The iclog ring only ages in one
> direction....
> 

Hm, yes that makes sense. I might have been confusing the new push
concurrency with the fact that all log writers still must contend over
the head iclog.

> > wonder if we need to do something like track the last lsn and/or iclog
> > touched by the current checkpoint and make sure we wait on that,
> > regardless of where it is in the ring (but as above, I'm still wrapping
> > my head around all of this).
> 
> We have to guarantee the log is completely stable up to the commit
> record, regardless of which checkpoint the iclogs belong to.  If we
> don't do that, then log recovery will consider that the checkpoint
> in the log is not complete before it finds the commit record. i.e.
> there will be a hole in the log where the previous cycle numbers
> show through between the end of the checkpoint and the commit
> record, and hence the head of the log will be pointed at the last
> fully completed checkpoint, not this one.
> 

Yet another thus far undocumented side effect of concurrent pushes..

Yes, I know that recovery has always stamped over incomplete record
blocks out beyond the prospective head block. I'm referring to the fact
that recovery clearly has some historical warts that conflict with this
change in behavior to the runtime logging code and suggesting that the
upstream merge cycle should not be the avenue to identify and resolve
these problems.

It's rather evident to me at this point that this change should be
deferred, tested much more thoroughly (not because it wasn't tested or
anything of that nature, but because our log/crash recovery testing is
historically lacking and crude) and reposted with these issues clearly
documented and addressed, at the very least in a manner that allows for
proper consideration of potential alternatives, etc., as opposed to
racing with an arbitrary upstream release deadline.

(This is of course out of my hands as it's not my tree, not my call,
etc. Just my opinion. I've thrown your latest series into my testbed
regardless.)

> Hence we have to ensure all prior iclogs have completed to stable
> storage before writing the commit record, regardless of whether the
> previous iclogs are part of the current checkpoint or not.
> 
> > [1] Once we work out the functional behavior one way or another, how do
> > we actually test/confirm that the flush behavior works as expected? I
> > feel as though the discussions going all the way to the initial review
> > feedback on this change have provided constant reminders that things
> > like the I/O flush / ordering dependencies are very easy to get wrong.
> > If that is ever the case, it's not clear to me we have any way to detect
> > it. Not that I'd expect to have some explicit 5 second fstest for that
> > problem, but have we analyzed whether the few decent crash recovery
> > tests we do have are sensitive enough to fail? I wonder if we could come
> > up with something clever and more explicit via error injection..
> 
> Given that we've found these problems soon after the code was
> integrated and made available for wider testing, I think it shows
> that, at the integration level, we have sufficiently sensitive crash
> and crash recovery tests to discover issues like this. It's not
> surprising that we may not find things like this at the
> individual developer level because they only have so much time and
> resources at their disposal.
> 

What do the bugs that were found have to do with test coverage for the
updated flush semantics? I called that out separately precisely because
recent patches demonstrate how the logic is subtle and easy to break,
but I don't see any real coverage for it at all. So it would be nice to
think about how we might improve on that.

> IOWs, I think the process is working exactly the way it should be
> working. It is expected that complex, subtle problems like these
> will be found (if they exist) once more people are actively using
> and testing the code in a variety of different environments. THis is
> exactly the function that integration testing is supposed to
> provide...
> 
> So while it's possible that we could have more reliable test to
> exercise this stuff (no real idea how, though), the way we are
> developing and merging and finding much harder to trigger problems
> once the code is available for wider testing shows that our
> engineering process is working as it should be....
> 

Yeah, that's wishful thinking. I made a point to run some tests I've
recently been hammering on (for that bli patch I still can't seem to
progress) because 1.) I know we have a bunch of logging changes in the
pipeline that I otherwise haven't been able to keep up with and 2.) I'm
about to disappear for a few months on leave at any moment and so wanted
to prioritize some testing to at least help verify we maintain
correctness or otherwise get any issues reported to the list. And TBH if
it weren't for 2., this is probably not something I would have
prioritized over some other longer term development work in the first
place.

I'd like to hope that we would have caught these issues before release
regardless, but let's be realistic.. if developer testing doesn't catch
it, upstream review/merge cycle testing doesn't catch it, then for crash
and recovery issues (no regular users are experimenting with failure and
recovery of their filesystems to see if we've found new ways to lose
their data), the odds go way down from there that somebody is going to
catch it before it's released to the wild.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

[2/2] xfs: don't wait on future iclogs when pushing the CIL

Commit Message

Comments

Patch