
xfs: fix incorrect log_flushed on fsync

Message ID 1504100302-3297-1-git-send-email-amir73il@gmail.com (mailing list archive)
State New, archived

Commit Message

Amir Goldstein Aug. 30, 2017, 1:38 p.m. UTC
When calling into _xfs_log_force{,_lsn}() with a pointer
to a log_flushed variable, log_flushed will be set to 1 if:
1. xlog_sync() is called to flush the active log buffer
AND/OR
2. xlog_wait() is called to wait on a syncing log buffer

xfs_file_fsync() checks the value of log_flushed after
_xfs_log_force_lsn() call to optimize away an explicit
PREFLUSH request to the data block device after writing
out all the file's pages to disk.
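
For reference, the relevant logic in xfs_file_fsync() looks roughly
like the following simplified sketch (paraphrased, not the verbatim
kernel code):

	int	log_flushed = 0;

	/* write back and wait on all dirty pages of the file */
	error = file_write_and_wait_range(file, start, end);
	if (error)
		return error;
	...
	/* force the log up to the LSN covering this inode */
	error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
	...
	/*
	 * Issue an explicit cache flush to the data device only if the
	 * log force did not already flush it (single-device case).
	 */
	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
	    mp->m_logdev_targp == mp->m_ddev_targp)
		xfs_blkdev_issue_flush(mp->m_ddev_targp);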

This optimization is incorrect in the following sequence of events:

 Task A                    Task B
 -------------------------------------------------------
 xfs_file_fsync()
   _xfs_log_force_lsn()
     xlog_sync()
        [submit PREFLUSH] 
                           xfs_file_fsync()
                             file_write_and_wait_range()
                               [submit WRITE X]
                               [endio  WRITE X]
                             _xfs_log_force_lsn()
                               xlog_wait()
        [endio  PREFLUSH]

Write X is not guaranteed to be on persistent storage
when the PREFLUSH request completes, because write X was submitted
after the PREFLUSH request, but xfs_file_fsync() of task A will
be notified of log_flushed=1 and will skip the explicit flush.

If the system crashes after fsync of task A, write X may not be
present on disk after reboot.

This bug was discovered and demonstrated using Josef Bacik's
dm-log-writes target, which can be used to record block io operations
and then replay a subset of these operations onto the target device.
The test goes something like this:
- Use fsx to execute ops on a file and record the ops on the log device
- Every now and then fsync the file, store md5 of file and mark
  the location in the log
- Then replay log onto device for each mark, mount fs and compare
  md5 of file to stored value

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Christoph, Dave,

It's hard to believe, but I think the reported bug has been around
since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
not try to test old kernels.

I tripped over this alleged bug a week ago when I started testing
crash consistency with dm-log-writes:
https://marc.info/?l=linux-fsdevel&m=150350333427447&w=2
Since then, I have been trying to prove that dm-log-writes was
making a false accusation, but it turned out that it wasn't.

The reproducer is an xfstest, which I am going to re-post soon
and is also available here:
https://github.com/amir73il/xfstests/commits/dm-log-writes
On a system with a spinning disk it reproduces the issue with close
to 100% probability within less than a minute.

Anyway, with Darrick going on vacation, let's expedite review
of this fix and try to merge some fix for v4.14-rc1 (although this
bug fix may be eligible for later rc's as well).

The change obviously changes behavior for the worse for workloads
of intensive parallel fsyncing tasks, but I don't see another quick
way to fix correctness without hurting this workload.
It might be worth getting feedback from file_write_and_wait_range()
on whether any dirty page writes have been issued at all.

I started running a full xfstests cycle, but wanted to send this one out
early for comments. I can test other patches if needed.

Cheers,
Amir.

 fs/xfs/xfs_log.c | 7 -------
 1 file changed, 7 deletions(-)

Comments

Christoph Hellwig Aug. 30, 2017, 1:46 p.m. UTC | #1
Oops.

Yes, except for the case of calling xlog_state_release_iclog
ourselves we can't claim that the cache has been flushed.  In
fact I'm not even sure that is safe, if ic_refcnt makes
xlog_state_release_iclog exit early, so we might have to
pass additional information around there.
Amir Goldstein Aug. 30, 2017, 2:12 p.m. UTC | #2
On Wed, Aug 30, 2017 at 4:46 PM, Christoph Hellwig <hch@lst.de> wrote:
> Oops.
>
> Yes, except for the case of calling xlog_state_release_iclog
> ourselves we can't claim that the cache has been flushed.  In
> fact I'm not even sure that is save, if ic_refcnt makes
> xlog_state_release_iclog exit early, so we might have to
> pass additional information around there.

With this change, for the multiple fsyncing tasks workload, the majority
won't be optimized, so the remaining code is only really optimizing the
case of a single process doing an occasional fsync.

I wonder if it's worth keeping the optimization around at all?
Christoph Hellwig Aug. 30, 2017, 2:21 p.m. UTC | #3
On Wed, Aug 30, 2017 at 05:12:10PM +0300, Amir Goldstein wrote:
> On Wed, Aug 30, 2017 at 4:46 PM, Christoph Hellwig <hch@lst.de> wrote:
> > Oops.
> >
> > Yes, except for the case of calling xlog_state_release_iclog
> > ourselves we can't claim that the cache has been flushed.  In
> > fact I'm not even sure that is save, if ic_refcnt makes
> > xlog_state_release_iclog exit early, so we might have to
> > pass additional information around there.
> 
> With this change, for the multiple fsyncing tasks workload, the majority won't
> be optimized, so the remaining code is only really optimizing the single
> process doing seldom fsync case.
> 
> I wonder if it's worth keeping the optimization around at all?

I think it is.  And we should get back to optimizing the multi-threaded
cases soon using something like a sequence counter that checks if we
issued a log sync after finishing the data writeback.
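
A minimal sketch of what such a sequence counter scheme could look
like (the field and helper names below are made up for illustration
only, this is not an actual implementation):

	/* log side: tag each flush-carrying log write at submission */
	iclog->ic_flush_seq = atomic64_inc_return(&log->l_flush_submit_seq);

	/*
	 * log side: at I/O completion remember the newest submission whose
	 * flush has finished (iclog completions are processed in order).
	 */
	atomic64_set(&log->l_flush_done_seq, iclog->ic_flush_seq);

	/* fsync side: */
	error = file_write_and_wait_range(file, start, end);
	...
	seq = atomic64_read(&log->l_flush_submit_seq);

	error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);

	/*
	 * Skip the explicit flush only if a log flush that was submitted
	 * after our data writeback completed has itself completed.
	 */
	if (atomic64_read(&log->l_flush_done_seq) <= seq)
		xfs_blkdev_issue_flush(mp->m_ddev_targp);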
Darrick J. Wong Aug. 30, 2017, 5:01 p.m. UTC | #4
On Wed, Aug 30, 2017 at 04:38:22PM +0300, Amir Goldstein wrote:
> When calling into _xfs_log_force{,_lsn}() with a pointer
> to log_flushed variable, log_flushed will be set to 1 if:
> 1. xlog_sync() is called to flush the active log buffer
> AND/OR
> 2. xlog_wait() is called to wait on a syncing log buffers
> 
> xfs_file_fsync() checks the value of log_flushed after
> _xfs_log_force_lsn() call to optimize away an explicit
> PREFLUSH request to the data block device after writing
> out all the file's pages to disk.
> 
> This optimization is incorrect in the following sequence of events:
> 
>  Task A                    Task B
>  -------------------------------------------------------
>  xfs_file_fsync()
>    _xfs_log_force_lsn()
>      xlog_sync()
>         [submit PREFLUSH] 
>                            xfs_file_fsync()
>                              file_write_and_wait_range()
>                                [submit WRITE X]
>                                [endio  WRITE X]
>                              _xfs_log_force_lsn()
>                                xlog_wait()
>         [endio  PREFLUSH]
> 
> The write X is not guarantied to be on persistent storage
> when PREFLUSH request in completed, because write A was submitted
> after the PREFLUSH request, but xfs_file_fsync() of task A will
> be notified of log_flushed=1 and will skip explicit flush.

Yikes.

> If the system crashes after fsync of task A, write X may not be
> present on disk after reboot.
> 
> This bug was discovered and demonstrated using Josef Bacik's
> dm-log-writes target, which can be used to record block io operations
> and then replay a subset of these operations onto the target device.
> The test goes something like this:
> - Use fsx to execute ops of a file and record ops on log device
> - Every now and then fsync the file, store md5 of file and mark
>   the location in the log
> - Then replay log onto device for each mark, mount fs and compare
>   md5 of file to stored value
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Will test, and if I have time try to set up this dm-log-writes thing for
my own edification;

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

> ---
> 
> Christoph, Dave,
> 
> It's hard to believe, but I think the reported bug has been around
> since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
> not try to test old kernels.
> 
> I tripped over this aleged bug a week ago when I started testing
> crash consistecy with dm-log-writes:
> https://marc.info/?l=linux-fsdevel&m=150350333427447&w=2
> Since then, I have been trying to prove that dm-log-writes was
> false accusing, but turned out that it wasn't.
> 
> The reproducer is an xfstest, which I am going to re-post soon
> and is also available here:
> https://github.com/amir73il/xfstests/commits/dm-log-writes
> On a system with spinning disk it reproduces the issue with close
> 100% probability within less than a minute.
> 
> Anyway, with Darrick going on vacation, let's expedite review
> of this fix and try to merge some fix for v4.14-rc1 (although this
> bug fix may be eligible to later rc's as well).

Yes.

> The change obviously changes behavior for the worse for workload
> of intensive parallel fsyncing tasks, but I don't see another quick
> way to fix correctness without hurting this workload.
> It might be worth to get a feedback from file_write_and_wait_range(),
> whether dirty pages writes have been issued at all.

Agreed, but safety first. :)

--D

> I started running full xftests cycle, but wanted to send this one out
> early for comments. I can test other patches if needed.
> 
> Cheers,
> Amir.
> 
>  fs/xfs/xfs_log.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 0053bcf..ec159c5 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3342,8 +3342,6 @@ _xfs_log_force(
>  		 */
>  		if (iclog->ic_state & XLOG_STATE_IOERROR)
>  			return -EIO;
> -		if (log_flushed)
> -			*log_flushed = 1;
>  	} else {
>  
>  no_sleep:
> @@ -3447,8 +3445,6 @@ _xfs_log_force_lsn(
>  
>  				xlog_wait(&iclog->ic_prev->ic_write_wait,
>  							&log->l_icloglock);
> -				if (log_flushed)
> -					*log_flushed = 1;
>  				already_slept = 1;
>  				goto try_again;
>  			}
> @@ -3482,9 +3478,6 @@ _xfs_log_force_lsn(
>  			 */
>  			if (iclog->ic_state & XLOG_STATE_IOERROR)
>  				return -EIO;
> -
> -			if (log_flushed)
> -				*log_flushed = 1;
>  		} else {		/* just return */
>  			spin_unlock(&log->l_icloglock);
>  		}
> -- 
> 2.7.4
> 
Darrick J. Wong Sept. 18, 2017, 5:11 p.m. UTC | #5
On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > [...]
> 
> Forgot to tag commit message with:
> Fixes: f538d4da8d52 ("[XFS] write barrier support")
> 
> Maybe the tag could be added when applying to recent stables,
> so distros and older downstream stables can see the tag.
> 
> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> if possible data loss bug should also be disclosed in some distros forum?
> I bet some users would care more about the latter than the former.
> Coincidentally, both data loss and security bugs fix the same commit..

Yes, the patch ought to get sent on to stable w/ fixes tag.  One
would hope that the distros will pick up the stable fixes from there.
That said, it's been in the kernel for 12 years without widespread
complaints about corruption, so I'm not sure this warrants public
disclosure via CVE/Phoronix vs. just fixing it.

--D

> 
> Cheers,
> Amir.
Amir Goldstein Sept. 18, 2017, 6 p.m. UTC | #6
On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> > [...]
>>
>> Forgot to tag commit message with:
>> Fixes: f538d4da8d52 ("[XFS] write barrier support")
>>
>> Maybe the tag could be added when applying to recent stables,
>> so distros and older downstream stables can see the tag.
>>
>> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> if possible data loss bug should also be disclosed in some distros forum?
>> I bet some users would care more about the latter than the former.
>> Coincidentally, both data loss and security bugs fix the same commit..
>
> Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> would hope that the distros will pick up the stable fixes from there.


Greg, for your consideration, please add
 Fixes: f538d4da8d52 ("[XFS] write barrier support")
If not pushed yet.

> That said, it's been in the kernel for 12 years without widespread
> complaints about corruption, so I'm not sure this warrants public
> disclosure via CVE/Phoronix vs. just fixing it.
>

I'm not sure either.
My intuition tells me that the chances of hitting the data loss bug
given a power failure are not slim, but the chances of users knowing
about the data loss are slim.
Meaning, the chances of users complaining:
"I swear, I did fsync, lost power, and after boot data was not there" are slim
and the chances of developers believing the report on hearsay are
even slimmer.

Oh well. I was just wondering...

Amir.
Greg KH Sept. 18, 2017, 6:35 p.m. UTC | #7
On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> >> > [...]
> >>
> >> Forgot to tag commit message with:
> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >>
> >> Maybe the tag could be added when applying to recent stables,
> >> so distros and older downstream stables can see the tag.
> >>
> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> if possible data loss bug should also be disclosed in some distros forum?
> >> I bet some users would care more about the latter than the former.
> >> Coincidentally, both data loss and security bugs fix the same commit..
> >
> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> > would hope that the distros will pick up the stable fixes from there.
> 
> 
> Greg, for your consideration, please add
>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
> If not pushed yet.

Add it to what?

totally confused,

greg k-h
Amir Goldstein Sept. 18, 2017, 7:29 p.m. UTC | #8
On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote:
> On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
>> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> >> > [...]
>> >>
>> >> Forgot to tag commit message with:
>> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
>> >>
>> >> Maybe the tag could be added when applying to recent stables,
>> >> so distros and older downstream stables can see the tag.
>> >>
>> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> >> if possible data loss bug should also be disclosed in some distros forum?
>> >> I bet some users would care more about the latter than the former.
>> >> Coincidentally, both data loss and security bugs fix the same commit..
>> >
>> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
>> > would hope that the distros will pick up the stable fixes from there.
>>
>>
>> Greg, for your consideration, please add
>>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
>> If not pushed yet.
>
> Add it to what?

Sorry, add that tag when applying commit 47c7d0b1950258312
to stable trees, since I missed adding the tag before it was merged
to master.

Thanks,
Amir.
Dave Chinner Sept. 18, 2017, 9:24 p.m. UTC | #9
On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> if possible data loss bug should also be disclosed in some distros forum?
> >> I bet some users would care more about the latter than the former.
> >> Coincidentally, both data loss and security bugs fix the same commit..
> >
> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> > would hope that the distros will pick up the stable fixes from there.

Yup, that's the normal process for data integrity/fs corruption
bugs.

> > That said, it's been in the kernel for 12 years without widespread
> > complaints about corruption, so I'm not sure this warrants public
> > disclosure via CVE/Phoronix vs. just fixing it.
> >
> 
> I'm not sure either.
> My intuition tells me that the chances of hitting the data loss bug
> given a power failure are not slim, but the chances of users knowing
> about the data loss are slim.

The chances of hitting it are slim. Power-fail vs fsync data
integrity testing is something we do actually run as part of QE and
have for many years.  We've been running such testing for years and
never tripped over this problem, so I think the likelihood that a
user will hit it is extremely small. Compare that to the bug we
issued the CVE for - it's in code we /don't test/, the bug affects
everyone with that support compiled in (potentially millions of
systems) and it is a guaranteed local user triggered denial of
service. There's a big difference in scope and impact between these
two cases.

Further, we have to consider precedence, developer resources and
impact when deciding what we issue disclosures for. e.g. Verifying
the problem, categorising it, finding out what systems it affected,
working out the best fix for backporting, testing the fix, working
through all the disclosure processes, etc took three long days of me
doing nothing else but working on this issue.

If we start issuing disclosures for all data integrity and/or
corruption fixes, then we're going to have no time for anything
else. That is, a significant number of bugs we fix every release are
for corruption and/or data integrity issues. The process we use for
getting them back to stable and distro kernels is to tag them for
stable kernels. Adding some heavyweight disclosure process on top of
that isn't going to provide any extra value to distros or users. All
it does is add significant extra workload to the upstream
development process.

> Meaning, the chances of users complaining:
> "I swear, I did fsync, lost power, and after boot data was not there" are slim
> and the chances of developers believing the report on hear say are
> even slimmer.

A disclosure is not going to change that. All it will do is result
in users seeing ghosts (suggestion bias) and reporting things that
aren't actually filesystem data loss. e.g. "filesystem lost data on
crash" when in reality the problem is that the app doesn't use
fsync...

Cheers,

Dave.
Amir Goldstein Sept. 19, 2017, 5:31 a.m. UTC | #10
On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
>> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> >> if possible data loss bug should also be disclosed in some distros forum?
>> >> I bet some users would care more about the latter than the former.
>> >> Coincidentally, both data loss and security bugs fix the same commit..
>> >
>> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
>> > would hope that the distros will pick up the stable fixes from there.
>
> Yup, that's the normal process for data integrity/fs corruption
> bugs.

Makes sense. I'm convinced that the normal process is sufficient for this
sort of bug fix.

>
>> > That said, it's been in the kernel for 12 years without widespread
>> > complaints about corruption, so I'm not sure this warrants public
>> > disclosure via CVE/Phoronix vs. just fixing it.
>> >
>>
>> I'm not sure either.
>> My intuition tells me that the chances of hitting the data loss bug
>> given a power failure are not slim, but the chances of users knowing
>> about the data loss are slim.
>
> The chances of hitting it are slim. Power-fail vs fsync data
> integrity testing is something we do actually run as part of QE and
> have for many years.  We've been running such testing for years and
> never tripped over this problem, so I think the likelihood that a
> user will hit it is extremely small.

This sentence makes me uneasy.
Who is "we" and what QE testing are you referring to?
Are those tests in xfstests or any other public repository?
My first reaction to the corruption was "no way, I need to check the test".
My second reaction, after checking the test, was "this must be very hard to hit".
But on closer inspection, it looks like it doesn't take more than running
a couple of fsyncs in parallel to get to the "at risk" state, which may persist
for seconds.
Of course the chances of users being that unlucky to also get a power
failure during the "at risk" state are low, but I am puzzled how the
power-fail tests you claim exist didn't catch this sooner.

Anyway, not sure there is much more to discuss, just wanted to see
if there is a post mortem lesson to be learned from this, beyond the fact that
dm-log-writes is a valuable testing tool.

Cheers,
Amir.
Darrick J. Wong Sept. 19, 2017, 5:45 a.m. UTC | #11
On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote:
> On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> >> if possible data loss bug should also be disclosed in some distros forum?
> >> >> I bet some users would care more about the latter than the former.
> >> >> Coincidentally, both data loss and security bugs fix the same commit..
> >> >
> >> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> >> > would hope that the distros will pick up the stable fixes from there.
> >
> > Yup, that's the normal process for data integrity/fs corruption
> > bugs.
> 
> Makes sense. I'm convinced that the normal process is sufficient for this
> sort of bug fix.
> 
> >
> >> > That said, it's been in the kernel for 12 years without widespread
> >> > complaints about corruption, so I'm not sure this warrants public
> >> > disclosure via CVE/Phoronix vs. just fixing it.
> >> >
> >>
> >> I'm not sure either.
> >> My intuition tells me that the chances of hitting the data loss bug
> >> given a power failure are not slim, but the chances of users knowing
> >> about the data loss are slim.
> >
> > The chances of hitting it are slim. Power-fail vs fsync data
> > integrity testing is something we do actually run as part of QE and
> > have for many years.  We've been running such testing for years and
> > never tripped over this problem, so I think the likelihood that a
> > user will hit it is extremely small.
> 
> This sentence make me unease.
> Who is We and what QE testing are you referring to?
> Are those tests in xfstests or any other public repository?
> My first reaction to the corruption was "no way, I need to check the test"
> Second reaction after checking the test was "this must very very hard to hit"

/me prefers to think that we've simply gotten lucky all these years and
nobody actually managed to die before another flush would take care of
the dirty data.

But then I did just spend a week in Las Vegas. :P

> But from closer inspection, it looks like it doesn't take more than running
> a couple of fsync in parallel to get to the "at risk" state, which may persist
> for seconds.
> Of course the chances of users being that unlucky to also get a power
> failure during "at risk" state are low, but I am puzzled how power fail tests
> you claim that exists, didn't catch this sooner.
> 
> Anyway, not sure there is much more to discuss, just wanted to see
> if there is a post mortem lesson to be learned from this, beyond the fact that
> dm-log-writes is a valuable testing tool.

Agreed. :)

--D

> 
> Cheers,
> Amir.
Greg KH Sept. 19, 2017, 6:32 a.m. UTC | #12
On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote:
> On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote:
> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
> >> <darrick.wong@oracle.com> wrote:
> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> >> >> > [...]
> >> >>
> >> >> Forgot to tag commit message with:
> >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >> >>
> >> >> Maybe the tag could be added when applying to recent stables,
> >> >> so distros and older downstream stables can see the tag.
> >> >>
> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> >> if possible data loss bug should also be disclosed in some distros forum?
> >> >> I bet some users would care more about the latter than the former.
> >> >> Coincidentally, both data loss and security bugs fix the same commit..
> >> >
> >> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> >> > would hope that the distros will pick up the stable fixes from there.
> >>
> >>
> >> Greg, for your consideration, please add
> >>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >> If not pushed yet.
> >
> > Add it to what?
> 
> Sorry, add that tag when applying commit 47c7d0b1950258312
> to stable trees, since I missed adding the tag before it was merged
> to master.

Nah, as the tag is just needed to let me know where to backport stuff
to, I don't think it matters when I add it to the stable tree itself, so
I'll leave it as-is.

thanks,

greg k-h
Dave Chinner Sept. 20, 2017, 12:40 a.m. UTC | #13
On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote:
> On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> > That said, it's been in the kernel for 12 years without widespread
> >> > complaints about corruption, so I'm not sure this warrants public
> >> > disclosure via CVE/Phoronix vs. just fixing it.
> >> >
> >>
> >> I'm not sure either.
> >> My intuition tells me that the chances of hitting the data loss bug
> >> given a power failure are not slim, but the chances of users knowing
> >> about the data loss are slim.
> >
> > The chances of hitting it are slim. Power-fail vs fsync data
> > integrity testing is something we do actually run as part of QE and
> > have for many years.  We've been running such testing for years and
> > never tripped over this problem, so I think the likelihood that a
> > user will hit it is extremely small.
> 
> This sentence make me unease.
> Who is We and what QE testing are you referring to?

I've done it in the past myself with a modified crash/xfscrash to
write patterned files (via genstream/checkstream). Unfortunately, I
lost that script when the machine used for that testing suffered a
fatal, completely unrecoverable ext3 root filesystem corruption
during a power fail cycle... :/

RH QE also runs automated power fail cycle tests - we found lots of
ext4 problems with that test rig when it was first put together, but
I don't recall seeing XFS issues reported.  Eryu would have to
confirm, but ISTR that this testing was made part of the regular
RHEL major release testing cycle...

Let's not forget all the other storage vendors and apps out there
that do their own crash/power fail testing that rely on a working
fsync. Apps like ceph, cluster, various databases, etc all have
their own data integrity testing procedures, and so if there's any
obvious or easy to hit fsync bug we would have had people reporting
it long ago.

Then there's all the research tools that have had papers written
about them testing exactly the sort of thing that dm-log writes is
testing. None of these indicated any sort of problem with fsync in
XFS, but we couldn't reproduce or verify the research results,
because none of those fine institutions ever open sourced their
tools despite repeated requests and promises that it would happen.

> Are those tests in xfstests or any other public repository?

crash/xfscrash is, and now dm-log-writes, but nothing else is.

> My first reaction to the corruption was "no way, I need to check the test"
> Second reaction after checking the test was "this must very very hard to hit"
> But from closer inspection, it looks like it doesn't take more than running
> a couple of fsync in parallel to get to the "at risk" state, which may persist
> for seconds.

That may be the case, but the reality is we don't have a body of
evidence to suggest this is a problem anyone is actually hitting. In
fact, we don't have any evidence it's been seen in the wild at all.

> Of course the chances of users being that unlucky to also get a power
> failure during "at risk" state are low, but I am puzzled how power fail tests
> you claim that exists, didn't catch this sooner.

Probably for the same reason app developers and users aren't
reporting fsync data loss problems.  While the bug may "look obvious
in hindsight", the fact is that there are no evidence of data loss
after fsync on XFS in the real world.  Occam's Razor suggests that
there is something that masks the problem that we don't understand
yet....

Cheers,

Dave.
Vijay Chidambaram Sept. 20, 2017, 1:08 a.m. UTC | #14
>> Are those tests in xfstests or any other public repository?
>
> crash/xfscrash is, and now dm-log-write, but nothing else is.
>

There's also CrashMonkey (tool under active development from my
research group at UT Austin) at:
https://github.com/utsaslab/crashmonkey.

>> My first reaction to the corruption was "no way, I need to check the test"
>> Second reaction after checking the test was "this must very very hard to hit"
>> But from closer inspection, it looks like it doesn't take more than running
>> a couple of fsync in parallel to get to the "at risk" state, which may persist
>> for seconds.
>
> That may be the case, but the reality is we don't have a body of
> evidence to suggest this is a problem anyone is actually hitting. In
> fact, we don't have any evidence it's been seen in the wild at all.

Even if people did hit this bug in the wild, as Amir suggested before,
the chances of them connecting it to a crash-consistency bug in the
file system and reporting it as such are extremely low. We do have many
reports of data loss in newer file systems like btrfs, but I don't
think those are connected back to crash-consistency bugs (my suspicion
is there would be many such bugs, simply due to inadequate crash
consistency testing).

When working on CrashMonkey, we find it hard to understand what a
crash consistency bug is, even when we are the ones producing the bug!
I think there is a need for good debugging/investigative tools in this
area (something we are also working on).

Thanks,
Vijay
http://www.cs.utexas.edu/~vijay/
Eryu Guan Sept. 20, 2017, 8:59 a.m. UTC | #15
On Wed, Sep 20, 2017 at 10:40:12AM +1000, Dave Chinner wrote:
> On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote:
> > On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> > >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> > >> > That said, it's been in the kernel for 12 years without widespread
> > >> > complaints about corruption, so I'm not sure this warrants public
> > >> > disclosure via CVE/Phoronix vs. just fixing it.
> > >> >
> > >>
> > >> I'm not sure either.
> > >> My intuition tells me that the chances of hitting the data loss bug
> > >> given a power failure are not slim, but the chances of users knowing
> > >> about the data loss are slim.
> > >
> > > The chances of hitting it are slim. Power-fail vs fsync data
> > > integrity testing is something we do actually run as part of QE and
> > > have for many years.  We've been running such testing for years and
> > > never tripped over this problem, so I think the likelihood that a
> > > user will hit it is extremely small.
> > 
> > This sentence make me unease.
> > Who is We and what QE testing are you referring to?
> 
> I've done it in the past myself with a modified crash/xfscrash to
> write patterned files (via genstream/checkstream). Unfortunately, I
> lost that script when the machine used for that testing suffered a
> fatal, completely unrecoverable ext3 root filesystem corruption
> during a power fail cycle... :/
> 
> RH QE also runs automated power fail cycle tests - we found lots of

Not fully automated yet, but semi-automated :)

> ext4 problems with that test rig when it was first put together, but
> I don't recall seeing XFS issues reported.  Eryu would have to
> confirm, but ISTR that this testing was made part of the regular
> RHEL major release testing cycle...

Generally speaking, we don't find many bugs in our power failure tests.
We filed a few ext3/4 bugs against the RHEL6 kernel, but IIRC there was only
one XFS bug filed, and we couldn't reproduce that bug in later RHEL
minor releases. And it seems to me that we didn't file any such bugs
against the RHEL7 kernel. And yes, this is part of the regular RHEL release
testing only; I haven't done any power failure tests with upstream
kernels yet.

Thanks,
Eryu

> 
> Let's not forget all the other storage vendors and apps out there
> that do their own crash/power fail testing that rely on a working
> fsync. Apps like ceph, cluster, various databases, etc all have
> their own data integrity testing procedures, and so if there's any
> obvious or easy to hit fsync bug we would have had people reporting
> it long ago.
> 
> Then there's all the research tools that have had papers written
> about them testing exactly the sort of thing that dm-log writes is
> testing. None of these indicated any sort of problem with fsync in
> XFS, but we couldn't reproduce or verify the research results of the
> because none of those fine institutions ever open sourced their
> tools despite repeated requests and promises that it would happen.
> 
> > Are those tests in xfstests or any other public repository?
> 
> crash/xfscrash is, and now dm-log-write, but nothing else is.
> 
> > My first reaction to the corruption was "no way, I need to check the test"
> > Second reaction after checking the test was "this must very very hard to hit"
> > But from closer inspection, it looks like it doesn't take more than running
> > a couple of fsync in parallel to get to the "at risk" state, which may persist
> > for seconds.
> 
> That may be the case, but the reality is we don't have a body of
> evidence to suggest this is a problem anyone is actually hitting. In
> fact, we don't have any evidence it's been seen in the wild at all.
> 
> > Of course the chances of users being that unlucky to also get a power
> > failure during "at risk" state are low, but I am puzzled how power fail tests
> > you claim that exists, didn't catch this sooner.
> 
> Probably for the same reason app developers and users aren't
> reporting fsync data loss problems.  While the bug may "look obvious
> in hindsight", the fact is that there are no evidence of data loss
> after fsync on XFS in the real world.  Occam's Razor suggests that
> there is something that masks the problem that we don't understand
> yet....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Amir Goldstein June 9, 2018, 4:44 a.m. UTC | #16
On Tue, Sep 19, 2017 at 9:32 AM, Greg KH <greg@kroah.com> wrote:
> On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote:
>> On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote:
>> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
>> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
>> >> <darrick.wong@oracle.com> wrote:
>> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> >> >> > [...]
>> >> >>
>> >> >> Forgot to tag commit message with:
>> >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
>> >> >>
>> >> >> Maybe the tag could be added when applying to recent stables,
>> >> >> so distros and older downstream stables can see the tag.
>> >> >>
>> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> >> >> if possible data loss bug should also be disclosed in some distros forum?
>> >> >> I bet some users would care more about the latter than the former.
>> >> >> Coincidentally, both data loss and security bugs fix the same commit..
>> >> >
>> >> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
>> >> > would hope that the distros will pick up the stable fixes from there.
>> >>
>> >>
>> >> Greg, for your consideration, please add
>> >>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
>> >> If not pushed yet.
>> >
>> > Add it to what?
>>
>> Sorry, add that tag when applying commit 47c7d0b1950258312
>> to stable trees, since I missed adding the tag before it was merged
>> to master.
>
> Nah, as the tag is just needed to let me know where to backport stuff
> to, I don't think it matters when I add it to the stable tree itself, so
> I'll leave it as-is.
>

Greg,

Related or not to the above Fixes discussion, I now noticed that you never
picked up the patch for kernel 4.4.

Ben did take it to 3.2 and 3.16 BTW.

This is a very critical bug fix IMO.
Were you waiting for an ACK from the xfs maintainers, or was it just an
oversight? Or was it me who had to check up on that?

Thanks,
Amir.
Greg KH June 9, 2018, 7:13 a.m. UTC | #17
On Sat, Jun 09, 2018 at 07:44:22AM +0300, Amir Goldstein wrote:
> On Tue, Sep 19, 2017 at 9:32 AM, Greg KH <greg@kroah.com> wrote:
> > On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote:
> >> On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote:
> >> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> >> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
> >> >> <darrick.wong@oracle.com> wrote:
> >> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> >> >> >> > [...]
> >> >> >>
> >> >> >> Forgot to tag commit message with:
> >> >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >> >> >>
> >> >> >> Maybe the tag could be added when applying to recent stables,
> >> >> >> so distros and older downstream stables can see the tag.
> >> >> >>
> >> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> >> >> if possible data loss bug should also be disclosed in some distros forum?
> >> >> >> I bet some users would care more about the latter than the former.
> >> >> >> Coincidentally, both data loss and security bugs fix the same commit..
> >> >> >
> >> >> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> >> >> > would hope that the distros will pick up the stable fixes from there.
> >> >>
> >> >>
> >> >> Greg, for your consideration, please add
> >> >>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >> >> If not pushed yet.
> >> >
> >> > Add it to what?
> >>
> >> Sorry, add that tag when applying commit 47c7d0b1950258312
> >> to stable trees, since I missed adding the tag before it was merged
> >> to master.
> >
> > Nah, as the tag is just needed to let me know where to backport stuff
> > to, I don't think it matters when I add it to the stable tree itself, so
> > I'll leave it as-is.
> >
> 
> Greg,
> 
> Related or not to above Fixes discussion, I now noticed that you never picked
> the patch for kernel 4.4.
> 
> Ben did take it to 3.2 and 3.16 BTW.
> 
> This is a very critical bug fix IMO.
> Were you waiting for an ACK from xfs maintainers or just an oversight?
> Or was it me who had to check up on that?

It looks to be an oversight, sorry, now queued up.

greg k-h

Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 0053bcf..ec159c5 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3342,8 +3342,6 @@  _xfs_log_force(
 		 */
 		if (iclog->ic_state & XLOG_STATE_IOERROR)
 			return -EIO;
-		if (log_flushed)
-			*log_flushed = 1;
 	} else {
 
 no_sleep:
@@ -3447,8 +3445,6 @@  _xfs_log_force_lsn(
 
 				xlog_wait(&iclog->ic_prev->ic_write_wait,
 							&log->l_icloglock);
-				if (log_flushed)
-					*log_flushed = 1;
 				already_slept = 1;
 				goto try_again;
 			}
@@ -3482,9 +3478,6 @@  _xfs_log_force_lsn(
 			 */
 			if (iclog->ic_state & XLOG_STATE_IOERROR)
 				return -EIO;
-
-			if (log_flushed)
-				*log_flushed = 1;
 		} else {		/* just return */
 			spin_unlock(&log->l_icloglock);
 		}