
[v2] xfs: fix incorrect log_flushed on fsync

Message ID 1504280365-25354-1-git-send-email-amir73il@gmail.com (mailing list archive)
State New, archived

Commit Message

Amir Goldstein Sept. 1, 2017, 3:39 p.m. UTC
When calling into _xfs_log_force{,_lsn}() with a pointer
to the log_flushed variable, log_flushed will be set to 1 if:
1. xlog_sync() is called to flush the active log buffer
AND/OR
2. xlog_wait() is called to wait on a syncing log buffer

xfs_file_fsync() checks the value of log_flushed after
_xfs_log_force_lsn() call to optimize away an explicit
PREFLUSH request to the data block device after writing
out all the file's pages to disk.

This optimization is incorrect in the following sequence of events:

 Task A                    Task B
 -------------------------------------------------------
 xfs_file_fsync()
   _xfs_log_force_lsn()
     xlog_sync()
        [submit PREFLUSH]
                           xfs_file_fsync()
                             file_write_and_wait_range()
                               [submit WRITE X]
                               [endio  WRITE X]
                             _xfs_log_force_lsn()
                               xlog_wait()
        [endio  PREFLUSH]

The write X is not guaranteed to be on persistent storage
when the PREFLUSH request is completed, because write X was
submitted after the PREFLUSH request, but xfs_file_fsync() of
task A will be notified of log_flushed=1 and will skip the
explicit flush.

If the system crashes after fsync of task A, write X may not be
present on disk after reboot.
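The ordering argument generalizes: a PREFLUSH only persists writes that completed before the flush itself was submitted, so a flush already in flight when an fsync starts proves nothing about that fsync's data. A toy userspace model of the race (all names here are hypothetical, nothing is kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the race (hypothetical names, not kernel code).  Events
 * get a monotonically increasing sequence number; a PREFLUSH persists
 * only writes submitted before the flush itself was submitted.
 */
struct disk_model {
	unsigned long next_seq;		/* next event sequence number */
	unsigned long preflush_seq;	/* seq at PREFLUSH submission */
};

/* Task B: submit a data write, returning its sequence number */
static unsigned long submit_write(struct disk_model *d)
{
	return d->next_seq++;
}

/* Task A: the log force submits the PREFLUSH */
static void submit_preflush(struct disk_model *d)
{
	d->preflush_seq = d->next_seq++;
}

/* At PREFLUSH endio: is a given write covered by that flush? */
static bool preflush_covers(const struct disk_model *d, unsigned long write_seq)
{
	return write_seq < d->preflush_seq;
}
```

Replaying the sequence above: submit_preflush() (task A), then submit_write() (task B's WRITE X); preflush_covers() is false for write X, so skipping the explicit flush on seeing log_flushed=1 is unsafe.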

This bug was discovered and demonstrated using Josef Bacik's
dm-log-writes target, which can be used to record block io operations
and then replay a subset of these operations onto the target device.
The test goes something like this:
- Use fsx to execute ops on a file and record ops on log device
- Every now and then fsync the file, store md5 of file and mark
  the location in the log
- Then replay log onto device for each mark, mount fs and compare
  md5 of file to stored value

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Christoph,

Here is another, more subtle, version of the fix.
It keeps a lot more use cases optimized (e.g. many threads doing fsync
and setting WANT_SYNC may still be optimized).
It also addresses your concern that xlog_state_release_iclog()
may not actually start the buffer sync.

I tried to keep the logic as simple as possible:
If we see a buffer that is not yet SYNCING and we later
see that l_last_sync_lsn is >= the lsn of that buffer,
we can safely set log_flushed.
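That rule boils down to a snapshot plus one signed lsn comparison. A minimal userspace sketch of it (state flag values are illustrative, names hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative state bits standing in for the XLOG_STATE_* flags. */
#define MODEL_STATE_SYNCING	0x08
#define MODEL_STATE_DONE_SYNC	0x10

/* Signed 64-bit lsn comparison in the spirit of XFS_LSN_CMP. */
static int lsn_cmp(int64_t a, int64_t b)
{
	return (a > b) - (a < b);
}

/*
 * Snapshot an iclog's lsn only while no flush can have been issued for
 * it yet, i.e. before it reaches SYNCING.  Returns 0 (no snapshot)
 * when it is too late to tell.
 */
static int64_t snapshot_lsn(int ic_state, int64_t ic_lsn)
{
	if (ic_state & (MODEL_STATE_SYNCING | MODEL_STATE_DONE_SYNC))
		return 0;
	return ic_lsn;
}

/*
 * After waiting: a flush covering the snapshot has started and
 * completed iff l_last_sync_lsn has caught up with it.
 */
static bool log_was_flushed(int64_t lsn_flushed, int64_t last_sync_lsn)
{
	return lsn_cmp(lsn_flushed, last_sync_lsn) <= 0;
}
```

One weakness is visible directly in the model: when snapshot_lsn() declines and lsn_flushed stays 0, log_was_flushed(0, ...) is trivially true for any valid last_sync_lsn.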

I only tested this patch through a few crash test runs, not even full
xfstests cycle, so I am not sure it is correct, just posting to get
your feedback.

Is it worth something over the simpler v1 patch?
I can't really say.

Amir.

 fs/xfs/xfs_log.c | 76 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 58 insertions(+), 18 deletions(-)

Comments

Brian Foster Sept. 2, 2017, 1:19 p.m. UTC | #1
On Fri, Sep 01, 2017 at 06:39:25PM +0300, Amir Goldstein wrote:
> When calling into _xfs_log_force{,_lsn}() with a pointer
> to the log_flushed variable, log_flushed will be set to 1 if:
> 1. xlog_sync() is called to flush the active log buffer
> AND/OR
> 2. xlog_wait() is called to wait on a syncing log buffer
> 
> xfs_file_fsync() checks the value of log_flushed after
> _xfs_log_force_lsn() call to optimize away an explicit
> PREFLUSH request to the data block device after writing
> out all the file's pages to disk.
> 
> This optimization is incorrect in the following sequence of events:
> 
>  Task A                    Task B
>  -------------------------------------------------------
>  xfs_file_fsync()
>    _xfs_log_force_lsn()
>      xlog_sync()
>         [submit PREFLUSH]
>                            xfs_file_fsync()
>                              file_write_and_wait_range()
>                                [submit WRITE X]
>                                [endio  WRITE X]
>                              _xfs_log_force_lsn()
>                                xlog_wait()
>         [endio  PREFLUSH]
> 
> The write X is not guaranteed to be on persistent storage
> when the PREFLUSH request is completed, because write X was
> submitted after the PREFLUSH request, but xfs_file_fsync() of
> task A will be notified of log_flushed=1 and will skip the
> explicit flush.
> 
> If the system crashes after fsync of task A, write X may not be
> present on disk after reboot.
> 
> This bug was discovered and demonstrated using Josef Bacik's
> dm-log-writes target, which can be used to record block io operations
> and then replay a subset of these operations onto the target device.
> The test goes something like this:
> - Use fsx to execute ops on a file and record ops on log device
> - Every now and then fsync the file, store md5 of file and mark
>   the location in the log
> - Then replay log onto device for each mark, mount fs and compare
>   md5 of file to stored value
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> Christoph,
> 
> Here is another, more subtle, version of the fix.
> It keeps a lot more use cases optimized (e.g. many threads doing fsync
> and setting WANT_SYNC may still be optimized).
> It also addresses your concern that xlog_state_release_iclog()
> may not actually start the buffer sync.
> 
> I tried to keep the logic as simple as possible:
> If we see a buffer that is not yet SYNCING and we later
> see that l_last_sync_lsn is >= the lsn of that buffer,
> we can safely set log_flushed.
> 
> I only tested this patch through a few crash test runs, not even full
> xfstests cycle, so I am not sure it is correct, just posting to get
> your feedback.
> 
> Is it worth something over the simpler v1 patch?
> I can't really say.
> 

This looks like it has a couple potential problems on a very quick look
(so I could definitely be mistaken). E.g., the lsn could be zero at the
time we set log_flushed in _xfs_log_force(). It also looks like waiting
on an iclog that is waiting to run callbacks due to out of order
completion could be interpreted as a log flush having occurred, but I
haven't stared at this long enough to say whether that is actually
possible.

Stepping back from the details.. this seems like it could be done
correctly in general. IIUC what you want to know is whether any iclog
went from a pre-SYNCING state to a post-SYNCING state during the log
force, right? The drawbacks to this are that the log states do not by
design tell us whether devices have been flushed (landmine alert).
AFAICS, the last tail lsn isn't necessarily updated on every I/O
completion either.

I'm really confused by the preoccupation with finding a way to keep this
fix localized to xfs_log_force(), as if that provides some inherent
advantage over fundamentally more simple code. If anything, the fact
that this has been broken for so long suggests that is not the case.

I'll reiterate my previous comment.. if we want to track device flush
submits+completes, please just track them directly in the buftarg using
the existing buffer flush flag and the common buffer
submission/completion paths that we already use for this kind of generic
mechanism (e.g., in-flight async I/O accounting, read/write verifiers).
I don't really see any benefit to this, at least until/unless we find
some reason to rule out the other approach.
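Sketched in userspace, that buftarg-level tracking could look like the following (hypothetical names; it assumes flushes complete in submission order, which a real implementation would have to verify against the block layer):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of per-buftarg flush tracking (hypothetical, not kernel code).
 * Two generation counters: one bumped when a flush is submitted, one
 * recording the submit generation of the newest completed flush.
 */
struct buftarg_model {
	unsigned long flush_submit_gen;
	unsigned long flush_complete_gen;
};

/* Submission path, when a buffer carries the flush flag. */
static unsigned long model_flush_submit(struct buftarg_model *bt)
{
	return ++bt->flush_submit_gen;
}

/* Completion path; assumes flushes complete in submission order. */
static void model_flush_complete(struct buftarg_model *bt, unsigned long gen)
{
	bt->flush_complete_gen = gen;
}

/* fsync: snapshot once data writeback has completed... */
static unsigned long model_flush_snapshot(const struct buftarg_model *bt)
{
	return bt->flush_submit_gen;
}

/*
 * ...and elide the explicit flush only if a flush submitted after the
 * snapshot (hence after our data hit the device) has since completed.
 */
static bool model_flush_done_since(const struct buftarg_model *bt,
				   unsigned long snap)
{
	return bt->flush_complete_gen > snap;
}
```

A flush already in flight at snapshot time has gen <= snap, so its completion never satisfies the check; only a flush submitted after our writes does, which is exactly the property the log-state heuristics struggle to prove.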

Brian

> Amir.
> 
>  fs/xfs/xfs_log.c | 76 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 58 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 0053bcf2b10a..65e867ec00f9 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3246,7 +3246,7 @@ _xfs_log_force(
>  {
>  	struct xlog		*log = mp->m_log;
>  	struct xlog_in_core	*iclog;
> -	xfs_lsn_t		lsn;
> +	xfs_lsn_t		lsn_flushed = 0;
>  
>  	XFS_STATS_INC(mp, xs_log_force);
>  
> @@ -3290,17 +3290,22 @@ _xfs_log_force(
>  				 * the previous sync.
>  				 */
>  				atomic_inc(&iclog->ic_refcnt);
> -				lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> +				lsn_flushed = be64_to_cpu(iclog->ic_header.h_lsn);
>  				xlog_state_switch_iclogs(log, iclog, 0);
>  				spin_unlock(&log->l_icloglock);
>  
>  				if (xlog_state_release_iclog(log, iclog))
>  					return -EIO;
>  
> -				if (log_flushed)
> -					*log_flushed = 1;
>  				spin_lock(&log->l_icloglock);
> -				if (be64_to_cpu(iclog->ic_header.h_lsn) == lsn &&
> +				/*
> +				 * By the time we come around again, the iclog
> +				 * could've been filled which would give it
> +				 * another lsn.  If we have a new lsn, just
> +				 * return because the relevant data has been
> +				 * flushed.
> +				 */
> +				if (be64_to_cpu(iclog->ic_header.h_lsn) == lsn_flushed &&
>  				    iclog->ic_state != XLOG_STATE_DIRTY)
>  					goto maybe_sleep;
>  				else
> @@ -3317,10 +3322,6 @@ _xfs_log_force(
>  		}
>  	}
>  
> -	/* By the time we come around again, the iclog could've been filled
> -	 * which would give it another lsn.  If we have a new lsn, just
> -	 * return because the relevant data has been flushed.
> -	 */
>  maybe_sleep:
>  	if (flags & XFS_LOG_SYNC) {
>  		/*
> @@ -3333,6 +3334,15 @@ _xfs_log_force(
>  			spin_unlock(&log->l_icloglock);
>  			return -EIO;
>  		}
> +		/*
> +		 * Snapshot lsn of iclog if there is no risk that flush has been
> +		 * issued already for that iclog.
> +		 */
> +		if (!lsn_flushed &&
> +		    !(iclog->ic_state &
> +		      (XLOG_STATE_DONE_SYNC | XLOG_STATE_SYNCING)))
> +			lsn_flushed = be64_to_cpu(iclog->ic_header.h_lsn);
> +
>  		XFS_STATS_INC(mp, xs_log_force_sleep);
>  		xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
>  		/*
> @@ -3342,13 +3352,22 @@ _xfs_log_force(
>  		 */
>  		if (iclog->ic_state & XLOG_STATE_IOERROR)
>  			return -EIO;
> -		if (log_flushed)
> -			*log_flushed = 1;
>  	} else {
>  
>  no_sleep:
>  		spin_unlock(&log->l_icloglock);
>  	}
> +
> +	if (log_flushed) {
> +		/*
> +		 * We only report log_flushed if flush has been issued after
> +		 * entering this function (for lsn) and has already completed.
> +		 * It could be that we issued the flush or another thread, as
> +		 * long as we have seen this lsn in pre SYNCING state.
> +		 */
> +		if (XFS_LSN_CMP(lsn_flushed, atomic64_read(&log->l_last_sync_lsn)) <= 0)
> +			*log_flushed = 1;
> +	}
>  	return 0;
>  }
>  
> @@ -3391,6 +3410,7 @@ _xfs_log_force_lsn(
>  	struct xlog		*log = mp->m_log;
>  	struct xlog_in_core	*iclog;
>  	int			already_slept = 0;
> +	xfs_lsn_t		lsn_flushed = 0;
>  
>  	ASSERT(lsn != 0);
>  
> @@ -3445,10 +3465,15 @@ _xfs_log_force_lsn(
>  
>  				XFS_STATS_INC(mp, xs_log_force_sleep);
>  
> +				/*
> +				 * prev iclog is not SYNCING so flush is going
> +				 * to be issued.
> +				 */
> +				if (iclog->ic_prev->ic_state & XLOG_STATE_WANT_SYNC)
> +					lsn_flushed = be64_to_cpu(iclog->ic_prev->ic_header.h_lsn);
> +
>  				xlog_wait(&iclog->ic_prev->ic_write_wait,
>  							&log->l_icloglock);
> -				if (log_flushed)
> -					*log_flushed = 1;
>  				already_slept = 1;
>  				goto try_again;
>  			}
> @@ -3457,8 +3482,6 @@ _xfs_log_force_lsn(
>  			spin_unlock(&log->l_icloglock);
>  			if (xlog_state_release_iclog(log, iclog))
>  				return -EIO;
> -			if (log_flushed)
> -				*log_flushed = 1;
>  			spin_lock(&log->l_icloglock);
>  		}
>  
> @@ -3473,6 +3496,15 @@ _xfs_log_force_lsn(
>  				spin_unlock(&log->l_icloglock);
>  				return -EIO;
>  			}
> +			/*
> +			 * Snapshot lsn of iclog if there is no risk that flush
> +			 * has been issued already for that iclog.
> +			 */
> +			if (!lsn_flushed &&
> +			    !(iclog->ic_state &
> +			      (XLOG_STATE_DONE_SYNC | XLOG_STATE_SYNCING)))
> +			lsn_flushed = be64_to_cpu(iclog->ic_header.h_lsn);
> +
>  			XFS_STATS_INC(mp, xs_log_force_sleep);
>  			xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
>  			/*
> @@ -3482,9 +3514,6 @@ _xfs_log_force_lsn(
>  			 */
>  			if (iclog->ic_state & XLOG_STATE_IOERROR)
>  				return -EIO;
> -
> -			if (log_flushed)
> -				*log_flushed = 1;
>  		} else {		/* just return */
>  			spin_unlock(&log->l_icloglock);
>  		}
> @@ -3493,6 +3522,17 @@ _xfs_log_force_lsn(
>  	} while (iclog != log->l_iclog);
>  
>  	spin_unlock(&log->l_icloglock);
> +
> +	if (log_flushed) {
> +		/*
> +		 * We only report log_flushed if flush has been issued after
> +		 * entering this function (for lsn) and has already completed.
> +		 * It could be that we issued the flush or another thread, as
> +		 * long as we have seen this lsn in pre SYNCING state.
> +		 */
> +		if (XFS_LSN_CMP(lsn_flushed, atomic64_read(&log->l_last_sync_lsn)) <= 0)
> +			*log_flushed = 1;
> +	}
>  	return 0;
>  }
>  
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Sept. 2, 2017, 3:47 p.m. UTC | #2
On Sat, Sep 2, 2017 at 4:19 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Fri, Sep 01, 2017 at 06:39:25PM +0300, Amir Goldstein wrote:
>> When calling into _xfs_log_force{,_lsn}() with a pointer
>> to the log_flushed variable, log_flushed will be set to 1 if:
>> 1. xlog_sync() is called to flush the active log buffer
>> AND/OR
>> 2. xlog_wait() is called to wait on a syncing log buffer
>>
>> xfs_file_fsync() checks the value of log_flushed after
>> _xfs_log_force_lsn() call to optimize away an explicit
>> PREFLUSH request to the data block device after writing
>> out all the file's pages to disk.
>>
>> This optimization is incorrect in the following sequence of events:
>>
>>  Task A                    Task B
>>  -------------------------------------------------------
>>  xfs_file_fsync()
>>    _xfs_log_force_lsn()
>>      xlog_sync()
>>         [submit PREFLUSH]
>>                            xfs_file_fsync()
>>                              file_write_and_wait_range()
>>                                [submit WRITE X]
>>                                [endio  WRITE X]
>>                              _xfs_log_force_lsn()
>>                                xlog_wait()
>>         [endio  PREFLUSH]
>>
>> The write X is not guaranteed to be on persistent storage
>> when the PREFLUSH request is completed, because write X was
>> submitted after the PREFLUSH request, but xfs_file_fsync() of
>> task A will be notified of log_flushed=1 and will skip the
>> explicit flush.
>>
>> If the system crashes after fsync of task A, write X may not be
>> present on disk after reboot.
>>
>> This bug was discovered and demonstrated using Josef Bacik's
>> dm-log-writes target, which can be used to record block io operations
>> and then replay a subset of these operations onto the target device.
>> The test goes something like this:
>> - Use fsx to execute ops on a file and record ops on log device
>> - Every now and then fsync the file, store md5 of file and mark
>>   the location in the log
>> - Then replay log onto device for each mark, mount fs and compare
>>   md5 of file to stored value
>>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Josef Bacik <jbacik@fb.com>
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> ---
>>
>> Christoph,
>>
>> Here is another, more subtle, version of the fix.
>> It keeps a lot more use cases optimized (e.g. many threads doing fsync
>> and setting WANT_SYNC may still be optimized).
>> It also addresses your concern that xlog_state_release_iclog()
>> may not actually start the buffer sync.
>>
>> I tried to keep the logic as simple as possible:
>> If we see a buffer that is not yet SYNCING and we later
>> see that l_last_sync_lsn is >= the lsn of that buffer,
>> we can safely set log_flushed.
>>
>> I only tested this patch through a few crash test runs, not even full
>> xfstests cycle, so I am not sure it is correct, just posting to get
>> your feedback.
>>
>> Is it worth something over the simpler v1 patch?
>> I can't really say.
>>
>
> This looks like it has a couple potential problems on a very quick look
> (so I could definitely be mistaken). E.g., the lsn could be zero at the
> time we set log_flushed in _xfs_log_force(). It also looks like waiting
> on an iclog that is waiting to run callbacks due to out of order
> completion could be interpreted as a log flush having occurred, but I
> haven't stared at this long enough to say whether that is actually
> possible.
>
> Stepping back from the details.. this seems like it could be done
> correctly in general. IIUC what you want to know is whether any iclog
> went from a pre-SYNCING state to a post-SYNCING state during the log
> force, right? The drawbacks to this are that the log states do not by
> design tell us whether devices have been flushed (landmine alert).
> AFAICS, the last tail lsn isn't necessarily updated on every I/O
> completion either.
>
> I'm really confused by the preoccupation with finding a way to keep this
> fix localized to xfs_log_force(), as if that provides some inherent
> advantage over fundamentally more simple code. If anything, the fact
> that this has been broken for so long suggests that is not the case.
>

Brian,

Don't let my motives confuse you, the localized approach has two reasons:
1. I thought there may be a low-hanging fix, because of already existing
    lsn counters
2. I lack the experience and time to make the 'correct' fix you suggested

> I'll reiterate my previous comment.. if we want to track device flush
> submits+completes, please just track them directly in the buftarg using
> the existing buffer flush flag and the common buffer
> submission/completion paths that we already use for this kind of generic
> mechanism (e.g., in-flight async I/O accounting, read/write verifiers).
> I don't really see any benefit to this, at least until/unless we find
> some reason to rule out the other approach.
>

If I wasn't clear, my patch was not meant to object to your comment,
just to display the alternative. If someone else posts a 'proper' patch
I will test it with the crash simulator. I reckon that's not going to be
for rc1 anyway.

Cheers,
Amir.
Brian Foster Sept. 5, 2017, 2:40 p.m. UTC | #3
On Sat, Sep 02, 2017 at 06:47:03PM +0300, Amir Goldstein wrote:
> On Sat, Sep 2, 2017 at 4:19 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Fri, Sep 01, 2017 at 06:39:25PM +0300, Amir Goldstein wrote:
> >> When calling into _xfs_log_force{,_lsn}() with a pointer
> >> to the log_flushed variable, log_flushed will be set to 1 if:
> >> 1. xlog_sync() is called to flush the active log buffer
> >> AND/OR
> >> 2. xlog_wait() is called to wait on a syncing log buffer
> >>
> >> xfs_file_fsync() checks the value of log_flushed after
> >> _xfs_log_force_lsn() call to optimize away an explicit
> >> PREFLUSH request to the data block device after writing
> >> out all the file's pages to disk.
> >>
> >> This optimization is incorrect in the following sequence of events:
> >>
> >>  Task A                    Task B
> >>  -------------------------------------------------------
> >>  xfs_file_fsync()
> >>    _xfs_log_force_lsn()
> >>      xlog_sync()
> >>         [submit PREFLUSH]
> >>                            xfs_file_fsync()
> >>                              file_write_and_wait_range()
> >>                                [submit WRITE X]
> >>                                [endio  WRITE X]
> >>                              _xfs_log_force_lsn()
> >>                                xlog_wait()
> >>         [endio  PREFLUSH]
> >>
> >> The write X is not guaranteed to be on persistent storage
> >> when the PREFLUSH request is completed, because write X was
> >> submitted after the PREFLUSH request, but xfs_file_fsync() of
> >> task A will be notified of log_flushed=1 and will skip the
> >> explicit flush.
> >>
> >> If the system crashes after fsync of task A, write X may not be
> >> present on disk after reboot.
> >>
> >> This bug was discovered and demonstrated using Josef Bacik's
> >> dm-log-writes target, which can be used to record block io operations
> >> and then replay a subset of these operations onto the target device.
> >> The test goes something like this:
> >> - Use fsx to execute ops on a file and record ops on log device
> >> - Every now and then fsync the file, store md5 of file and mark
> >>   the location in the log
> >> - Then replay log onto device for each mark, mount fs and compare
> >>   md5 of file to stored value
> >>
> >> Cc: Christoph Hellwig <hch@lst.de>
> >> Cc: Josef Bacik <jbacik@fb.com>
> >> Cc: <stable@vger.kernel.org>
> >> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >> ---
> >>
> >> Christoph,
> >>
> >> Here is another, more subtle, version of the fix.
> >> It keeps a lot more use cases optimized (e.g. many threads doing fsync
> >> and setting WANT_SYNC may still be optimized).
> >> It also addresses your concern that xlog_state_release_iclog()
> >> may not actually start the buffer sync.
> >>
> >> I tried to keep the logic as simple as possible:
> >> If we see a buffer that is not yet SYNCING and we later
> >> see that l_last_sync_lsn is >= the lsn of that buffer,
> >> we can safely set log_flushed.
> >>
> >> I only tested this patch through a few crash test runs, not even full
> >> xfstests cycle, so I am not sure it is correct, just posting to get
> >> your feedback.
> >>
> >> Is it worth something over the simpler v1 patch?
> >> I can't really say.
> >>
> >
> > This looks like it has a couple potential problems on a very quick look
> > (so I could definitely be mistaken). E.g., the lsn could be zero at the
> > time we set log_flushed in _xfs_log_force(). It also looks like waiting
> > on an iclog that is waiting to run callbacks due to out of order
> > completion could be interpreted as a log flush having occurred, but I
> > haven't stared at this long enough to say whether that is actually
> > possible.
> >
> > Stepping back from the details.. this seems like it could be done
> > correctly in general. IIUC what you want to know is whether any iclog
> > went from a pre-SYNCING state to a post-SYNCING state during the log
> > force, right? The drawbacks to this are that the log states do not by
> > design tell us whether devices have been flushed (landmine alert).
> > AFAICS, the last tail lsn isn't necessarily updated on every I/O
> > completion either.
> >
> > I'm really confused by the preoccupation with finding a way to keep this
> > fix localized to xfs_log_force(), as if that provides some inherent
> > advantage over fundamentally more simple code. If anything, the fact
> > that this has been broken for so long suggests that is not the case.
> >
> 
> Brian,
> 
> Don't let my motives confuse you, the localized approach has two reasons:
> 1. I thought there may be a low-hanging fix, because of already existing
>     lsn counters
> 2. I lack the experience and time to make the 'correct' fix you suggested
> 

Fair enough, but note that the "correct" fix was your idea (based on
hch's patch). :) I just suggested refactoring it out of the logging code
because there's no reason it needs to be buried there.

> > I'll reiterate my previous comment.. if we want to track device flush
> > submits+completes, please just track them directly in the buftarg using
> > the existing buffer flush flag and the common buffer
> > submission/completion paths that we already use for this kind of generic
> > mechanism (e.g., in-flight async I/O accounting, read/write verifiers).
> > I don't really see any benefit to this, at least until/unless we find
> > some reason to rule out the other approach.
> >
> 
> If I wasn't clear, my patch was not meant to object to your comment,
> just to display the alternative. If someone else posts a 'proper' patch
> I will test it with the crash simulator. I reckon that's not going to be
> for rc1 anyway.
> 

Thanks.. I agree that a rework of the optimization can come later now
that the bug is fixed.

Christoph, are you planning to continue with your flushseq based patch?

Brian

> Cheers,
> Amir.
Christoph Hellwig Sept. 5, 2017, 2:54 p.m. UTC | #4
On Tue, Sep 05, 2017 at 10:40:07AM -0400, Brian Foster wrote:
> Thanks.. I agree that a rework of the optimization can come later now
> that the bug is fixed.

I agree in principle, but I don't think the patch in for-next is enough,
given that it doesn't take ic_refcnt into account.

> Christoph, are you planning to continue with your flushseq based patch?

I think it is fundamentally the right thing to do, but given how
much I've got on my plate I wouldn't mind someone else looking into it :)
Brian Foster Sept. 5, 2017, 3:34 p.m. UTC | #5
On Tue, Sep 05, 2017 at 04:54:42PM +0200, Christoph Hellwig wrote:
> On Tue, Sep 05, 2017 at 10:40:07AM -0400, Brian Foster wrote:
> > Thanks.. I agree that a rework of the optimization can come later now
> > that the bug is fixed.
> 
> I agree in principle, but I don't think the patch in for-next is enough,
> given that it doesn't take ic_refcnt into account.
> 

Hmm.. Ok, I might be missing something here. In xfs_log_force() at least
it looks like we only set log_flushed if the refcnt has reached zero.

In xfs_log_force_lsn(), the iclog has to be ACTIVE. We switch it to
WANT_SYNC, bump the refcnt and call release_iclog(). The latter may or
may not do anything depending on the refcount. We set log_flushed
immediately after the release before we know whether an xlog_sync() has
actually occurred (is that your concern?). If XFS_LOG_SYNC is set, we
wait on the iclog before _force_lsn() returns, which means the log
buffer I/O has to have completed before we return either way. If the
iclog is still WANT_SYNC, the release didn't do anything, but we wait on
ic_force_wait which is triggered during I/O completion handling. If the
iclog is already DIRTY/ACTIVE, then the I/O has already submitted and
completed.

fsync looks like the only user of log_flushed and it also uses
XFS_LOG_SYNC. ISTM that the code is probably still bogus in principle,
but in practice the fsync code may be safe. Thoughts? Am I missing
something else?

> > Christoph, are you planning to continue with your flushseq based patch?
> 
> I think it is fundamentally the right thing to do, but given how
> much I've got on my plate I wouldn't mind someone else looking into it :)

Ok, I may be able to take a crack at it...

Brian

Christoph Hellwig Sept. 6, 2017, 11:41 a.m. UTC | #6
On Tue, Sep 05, 2017 at 11:34:39AM -0400, Brian Foster wrote:
> In xfs_log_force_lsn(), the iclog has to be ACTIVE. We switch it to
> WANT_SYNC, bump the refcnt and call release_iclog(). The latter may or
> may not do anything depending on the refcount. We set log_flushed
> immediately after the release before we know whether an xlog_sync() has
> actually occurred (is that your concern?). If XFS_LOG_SYNC is set, we
> wait on the iclog before _force_lsn() returns, which means the log
> buffer I/O has to have completed before we return either way. If the
> iclog is still WANT_SYNC, the release didn't do anything, but we wait on
> ic_force_wait which is triggered during I/O completion handling. If the
> iclog is already DIRTY/ACTIVE, then the I/O has already submitted and
> completed.
> 
> fsync looks like the only user of log_flushed and it also uses
> XFS_LOG_SYNC. ISTM that the code is probably still bogus in principle,
> but in practice the fsync code may be safe. Thoughts? Am I missing
> something else?

Yes, I think you're right - the extra XFS_LOG_SYNC code papers over
the issue.  But given that the log_flushed indicator makes very
little sense without XFS_LOG_SYNC it should be ok.
Amir Goldstein Sept. 6, 2017, 12:08 p.m. UTC | #7
[-stable]

On Tue, Sep 5, 2017 at 5:40 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Sat, Sep 02, 2017 at 06:47:03PM +0300, Amir Goldstein wrote:
>> On Sat, Sep 2, 2017 at 4:19 PM, Brian Foster <bfoster@redhat.com> wrote:
>> > On Fri, Sep 01, 2017 at 06:39:25PM +0300, Amir Goldstein wrote:
>> >> When calling into _xfs_log_force{,_lsn}() with a pointer
>> >> to the log_flushed variable, log_flushed will be set to 1 if:
>> >> 1. xlog_sync() is called to flush the active log buffer
>> >> AND/OR
>> >> 2. xlog_wait() is called to wait on a syncing log buffer
>> >>
>> >> xfs_file_fsync() checks the value of log_flushed after
>> >> _xfs_log_force_lsn() call to optimize away an explicit
>> >> PREFLUSH request to the data block device after writing
>> >> out all the file's pages to disk.
>> >>
>> >> This optimization is incorrect in the following sequence of events:
>> >>
>> >>  Task A                    Task B
>> >>  -------------------------------------------------------
>> >>  xfs_file_fsync()
>> >>    _xfs_log_force_lsn()
>> >>      xlog_sync()
>> >>         [submit PREFLUSH]
>> >>                            xfs_file_fsync()
>> >>                              file_write_and_wait_range()
>> >>                                [submit WRITE X]
>> >>                                [endio  WRITE X]
>> >>                              _xfs_log_force_lsn()
>> >>                                xlog_wait()
>> >>         [endio  PREFLUSH]
>> >>
>> >> The write X is not guaranteed to be on persistent storage
>> >> when the PREFLUSH request is completed, because write X was
>> >> submitted after the PREFLUSH request, but xfs_file_fsync() of
>> >> task A will be notified of log_flushed=1 and will skip the
>> >> explicit flush.
>> >>
>> >> If the system crashes after fsync of task A, write X may not be
>> >> present on disk after reboot.
>> >>
>> >> This bug was discovered and demonstrated using Josef Bacik's
>> >> dm-log-writes target, which can be used to record block io operations
>> >> and then replay a subset of these operations onto the target device.
>> >> The test goes something like this:
>> >> - Use fsx to execute ops on a file and record ops on log device
>> >> - Every now and then fsync the file, store md5 of file and mark
>> >>   the location in the log
>> >> - Then replay log onto device for each mark, mount fs and compare
>> >>   md5 of file to stored value
>> >>
>> >> Cc: Christoph Hellwig <hch@lst.de>
>> >> Cc: Josef Bacik <jbacik@fb.com>
>> >> Cc: <stable@vger.kernel.org>
>> >> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> >> ---
>> >>
>> >> Christoph,
>> >>
>> >> Here is another, more subtle, version of the fix.
>> >> It keeps a lot more use cases optimized (e.g. many threads doing fsync
>> >> and setting WANT_SYNC may still be optimized).
>> >> It also addresses your concern that xlog_state_release_iclog()
>> >> may not actually start the buffer sync.
>> >>
>> >> I tried to keep the logic as simple as possible:
>> >> If we see a buffer that is not yet SYNCING and we later
>> >> see that l_last_sync_lsn is >= the lsn of that buffer,
>> >> we can safely set log_flushed.
>> >>
>> >> I only tested this patch through a few crash test runs, not even full
>> >> xfstests cycle, so I am not sure it is correct, just posting to get
>> >> your feedback.
>> >>
>> >> Is it worth something over the simpler v1 patch?
>> >> I can't really say.
>> >>
>> >
>> > This looks like it has a couple potential problems on a very quick look
>> > (so I could definitely be mistaken). E.g., the lsn could be zero at the
>> > time we set log_flushed in _xfs_log_force(). It also looks like waiting
>> > on an iclog that is waiting to run callbacks due to out of order
>> > completion could be interpreted as a log flush having occurred, but I
>> > haven't stared at this long enough to say whether that is actually
>> > possible.
>> >
>> > Stepping back from the details.. this seems like it could be done
>> > correctly in general. IIUC what you want to know is whether any iclog
>> > went from a pre-SYNCING state to a post-SYNCING state during the log
>> > force, right? The drawbacks to this are that the log states do not by
>> > design tell us whether devices have been flushed (landmine alert).
>> > AFAICS, the last tail lsn isn't necessarily updated on every I/O
>> > completion either.
>> >
>> > I'm really confused by the preoccupation with finding a way to keep this
>> > fix localized to xfs_log_force(), as if that provides some inherent
>> > advantage over fundamentally more simple code. If anything, the fact
>> > that this has been broken for so long suggests that is not the case.
>> >
>>
>> Brian,
>>
>> Don't let my motives confuse you; the localized approach has two reasons:
>> 1. I thought there might be a low-hanging fix, because of the already
>>     existing lsn counters
>> 2. I lack the experience and time to make the 'correct' fix you suggested
>>
>
> Fair enough, but note that the "correct" fix was your idea (based on
> hch's patch). :) I just suggested refactoring it out of the logging code
> because there's no reason it needs to be buried there.
>

You could also say that flush sequence counting code doesn't belong
to xfs code at all. There is nothing xfs specific about it.

If we had an API:

flush_seq = blkdev_get_flush_seq(bdev);
blkdev_issue_flush_after(bdev, flush_seq);

I am sure it would have been useful to more fs.

In fact, block drivers that use blk_insert_flush()
already have serialized flush requests, so implementing
the above functionality would be quite trivial for those.

I am not fluent enough in block layer to say if this makes sense.
Christoph? Should we add some block people to this discussion?

Amir.
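The semantics Amir is proposing can be sketched as a single-threaded toy model. Everything here is an assumption for illustration: `blkdev_get_flush_seq()`, `blkdev_issue_flush_after()`, and `struct bdev_model` do not exist in the block layer; in a real implementation the sequence would come from the `blk_insert_flush()` state machine, and completion would be asynchronous rather than instantaneous.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the block device's flush bookkeeping. */
struct bdev_model {
	uint64_t issued_seq;	/* number of flushes issued so far */
	uint64_t completed_seq;	/* issue-seq of the last completed flush */
};

/* Snapshot, taken after the caller's data writes have completed. */
static uint64_t blkdev_get_flush_seq(struct bdev_model *bdev)
{
	return bdev->issued_seq;
}

/*
 * Issue a flush unless one that was issued *after* the snapshot has
 * already completed; only such a flush is guaranteed to cover the
 * caller's writes. Returns 1 if a flush was actually issued.
 */
static int blkdev_issue_flush_after(struct bdev_model *bdev, uint64_t seq)
{
	if (bdev->completed_seq > seq)
		return 0;	/* covered by a later, completed flush */
	bdev->issued_seq++;
	bdev->completed_seq = bdev->issued_seq; /* model: completes at once */
	return 1;
}
```

The key point the model captures is the bug from the commit message: a flush issued *before* our writes landed must not be counted, which is why the elision test compares against the snapshot taken after the writes completed.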
Christoph Hellwig Sept. 6, 2017, 1:19 p.m. UTC | #8
> You could also say that flush sequence counting code doesn't belong
> to xfs code at all. There is nothing xfs specific about it.
> 
> If we had an API:
> 
> flush_seq = blkdev_get_flush_seq(bdev);
> blkdev_issue_flush_after(bdev, flush_seq);
> 
> I am sure it would have been useful to more fs.
> 
> In fact, block drivers that use blk_insert_flush(),
> already have serialized flush requests, so implementing
> the above functionality would be quite trivial for those.
> 
> I am not fluent enough in block layer to say if this makes sense.
> Christoph? Should we add some block people to this discussion?

Note that the interface can't be based on blkdev_issue_flush, as
our most important flushes are submitted asynchronously.

But except for that, tying into the flush state machine sounds
very interesting.  Let me look into that as the designated
xfs <-> block layer liaison.
Brian Foster Sept. 6, 2017, 1:29 p.m. UTC | #9
On Wed, Sep 06, 2017 at 03:19:37PM +0200, Christoph Hellwig wrote:
> > You could also say that flush sequence counting code doesn't belong
> > to xfs code at all. There is nothing xfs specific about it.
> > 
> > If we had an API:
> > 
> > flush_seq = blkdev_get_flush_seq(bdev);
> > blkdev_issue_flush_after(bdev, flush_seq);
> > 
> > I am sure it would have been useful to more fs.
> > 
> > In fact, block drivers that use blk_insert_flush(),
> > already have serialized flush requests, so implementing
> > the above functionality would be quite trivial for those.
> > 
> > I am not fluent enough in block layer to say if this makes sense.
> > Christoph? Should we add some block people to this discussion?
> 
> Note that the interface can't be based on blkdev_issue_flush, as
> our most important flushes are submitted asynchronously.
> 
> But except for that, tying into the flush state machine sounds
> very interesting.  Let me look into that as the designated
> xfs <-> block layer liaison.

That sounds like a nice idea to me, particularly if there is potential
for other users. I'll wait to look into doing anything in XFS until we
see how this turns out. Thanks.

Brian


Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 0053bcf2b10a..65e867ec00f9 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3246,7 +3246,7 @@  _xfs_log_force(
 {
 	struct xlog		*log = mp->m_log;
 	struct xlog_in_core	*iclog;
-	xfs_lsn_t		lsn;
+	xfs_lsn_t		lsn_flushed = 0;
 
 	XFS_STATS_INC(mp, xs_log_force);
 
@@ -3290,17 +3290,22 @@  _xfs_log_force(
 				 * the previous sync.
 				 */
 				atomic_inc(&iclog->ic_refcnt);
-				lsn = be64_to_cpu(iclog->ic_header.h_lsn);
+				lsn_flushed = be64_to_cpu(iclog->ic_header.h_lsn);
 				xlog_state_switch_iclogs(log, iclog, 0);
 				spin_unlock(&log->l_icloglock);
 
 				if (xlog_state_release_iclog(log, iclog))
 					return -EIO;
 
-				if (log_flushed)
-					*log_flushed = 1;
 				spin_lock(&log->l_icloglock);
-				if (be64_to_cpu(iclog->ic_header.h_lsn) == lsn &&
+				/*
+				 * By the time we come around again, the iclog
+				 * could've been filled which would give it
+				 * another lsn.  If we have a new lsn, just
+				 * return because the relevant data has been
+				 * flushed.
+				 */
+				if (be64_to_cpu(iclog->ic_header.h_lsn) == lsn_flushed &&
 				    iclog->ic_state != XLOG_STATE_DIRTY)
 					goto maybe_sleep;
 				else
@@ -3317,10 +3322,6 @@  _xfs_log_force(
 		}
 	}
 
-	/* By the time we come around again, the iclog could've been filled
-	 * which would give it another lsn.  If we have a new lsn, just
-	 * return because the relevant data has been flushed.
-	 */
 maybe_sleep:
 	if (flags & XFS_LOG_SYNC) {
 		/*
@@ -3333,6 +3334,15 @@  _xfs_log_force(
 			spin_unlock(&log->l_icloglock);
 			return -EIO;
 		}
+		/*
+		 * Snapshot the lsn of this iclog only if there is no risk
+		 * that a flush has already been issued for it.
+		 */
+		if (!lsn_flushed &&
+		    !(iclog->ic_state &
+		      (XLOG_STATE_DONE_SYNC | XLOG_STATE_SYNCING)))
+			lsn_flushed = be64_to_cpu(iclog->ic_header.h_lsn);
+
 		XFS_STATS_INC(mp, xs_log_force_sleep);
 		xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
 		/*
@@ -3342,13 +3352,22 @@  _xfs_log_force(
 		 */
 		if (iclog->ic_state & XLOG_STATE_IOERROR)
 			return -EIO;
-		if (log_flushed)
-			*log_flushed = 1;
 	} else {
 
 no_sleep:
 		spin_unlock(&log->l_icloglock);
 	}
+
+	if (log_flushed) {
+		/*
+		 * Only report log_flushed if a flush was issued after we
+		 * entered this function (for our lsn) and has completed.
+		 * The flush may have been issued by us or by another thread,
+		 * as long as we saw this lsn in a pre-SYNCING state.
+		 */
+		if (XFS_LSN_CMP(lsn_flushed, atomic64_read(&log->l_last_sync_lsn)) <= 0)
+			*log_flushed = 1;
+	}
 	return 0;
 }
 
@@ -3391,6 +3410,7 @@  _xfs_log_force_lsn(
 	struct xlog		*log = mp->m_log;
 	struct xlog_in_core	*iclog;
 	int			already_slept = 0;
+	xfs_lsn_t		lsn_flushed = 0;
 
 	ASSERT(lsn != 0);
 
@@ -3445,10 +3465,15 @@  _xfs_log_force_lsn(
 
 				XFS_STATS_INC(mp, xs_log_force_sleep);
 
+				/*
+				 * The prev iclog is not yet SYNCING, so a
+				 * flush is still going to be issued for it.
+				 */
+				if (iclog->ic_prev->ic_state & XLOG_STATE_WANT_SYNC)
+					lsn_flushed = be64_to_cpu(iclog->ic_prev->ic_header.h_lsn);
+
 				xlog_wait(&iclog->ic_prev->ic_write_wait,
 							&log->l_icloglock);
-				if (log_flushed)
-					*log_flushed = 1;
 				already_slept = 1;
 				goto try_again;
 			}
@@ -3457,8 +3482,6 @@  _xfs_log_force_lsn(
 			spin_unlock(&log->l_icloglock);
 			if (xlog_state_release_iclog(log, iclog))
 				return -EIO;
-			if (log_flushed)
-				*log_flushed = 1;
 			spin_lock(&log->l_icloglock);
 		}
 
@@ -3473,6 +3496,15 @@  _xfs_log_force_lsn(
 				spin_unlock(&log->l_icloglock);
 				return -EIO;
 			}
+			/*
+			 * Snapshot the lsn of this iclog only if there is no
+			 * risk that a flush has already been issued for it.
+			 */
+			if (!lsn_flushed &&
+			    !(iclog->ic_state &
+			      (XLOG_STATE_DONE_SYNC | XLOG_STATE_SYNCING)))
+				lsn_flushed = be64_to_cpu(iclog->ic_header.h_lsn);
+
 			XFS_STATS_INC(mp, xs_log_force_sleep);
 			xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
 			/*
@@ -3482,9 +3514,6 @@  _xfs_log_force_lsn(
 			 */
 			if (iclog->ic_state & XLOG_STATE_IOERROR)
 				return -EIO;
-
-			if (log_flushed)
-				*log_flushed = 1;
 		} else {		/* just return */
 			spin_unlock(&log->l_icloglock);
 		}
@@ -3493,6 +3522,17 @@  _xfs_log_force_lsn(
 	} while (iclog != log->l_iclog);
 
 	spin_unlock(&log->l_icloglock);
+
+	if (log_flushed) {
+		/*
+		 * We only report log_flushed if flush has been issued after
+		 * entering this function (for lsn) and has already completed.
+		 * It could be that we issued the flush or another thread, as
+		 * long as we have seen this lsn in pre SYNCING state.
+		 */
+		if (XFS_LSN_CMP(lsn_flushed, atomic64_read(&log->l_last_sync_lsn)) <= 0)
+			*log_flushed = 1;
+	}
 	return 0;
 }
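The invariant the patch relies on can be sketched as a stand-alone toy model. The types below are simplified stand-ins, not the real XFS definitions, and the model adds an explicit guard for the no-snapshot (lsn == 0) case that Brian flagged as a potential problem in the posted patch:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified iclog state, mirroring the names used in the patch. */
enum ic_state { IC_ACTIVE, IC_WANT_SYNC, IC_SYNCING, IC_DONE_SYNC, IC_DIRTY };

struct iclog_model {
	enum ic_state state;
	uint64_t lsn;		/* stand-in for ic_header.h_lsn */
};

struct log_model {
	uint64_t last_sync_lsn;	/* stand-in for l_last_sync_lsn */
};

/*
 * Snapshot the iclog's lsn only while no flush can have been issued
 * for it yet, i.e. before it reaches SYNCING. Returns 0 when it is
 * too late to take a meaningful snapshot.
 */
static uint64_t snapshot_lsn(const struct iclog_model *ic)
{
	if (ic->state == IC_SYNCING || ic->state == IC_DONE_SYNC)
		return 0;	/* a flush may already be in flight */
	return ic->lsn;
}

/*
 * After waiting, a flush covering our data is known to have happened
 * iff the log has completed syncing up to (at least) the snapshot.
 * A zero snapshot means we never saw a pre-SYNCING iclog, so we must
 * not claim a flush happened.
 */
static int log_flushed(const struct log_model *log, uint64_t snap)
{
	return snap != 0 && snap <= log->last_sync_lsn;
}
```

Following the discussion above, this only models the intended logic, not the landmines Brian points out: the real log states do not by design tell us whether the device cache was flushed, and l_last_sync_lsn is not necessarily updated on every I/O completion.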