Message ID | 1504100302-3297-1-git-send-email-amir73il@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Oops. Yes, except for the case where we call xlog_state_release_iclog ourselves, we can't claim that the cache has been flushed. In fact I'm not even sure that is safe if ic_refcnt makes xlog_state_release_iclog exit early, so we might have to pass additional information around there.
On Wed, Aug 30, 2017 at 4:46 PM, Christoph Hellwig <hch@lst.de> wrote: > Oops. > > Yes, except for the case of calling xlog_state_release_iclog > ourselves we can't claim that the cache has been flushed. In > fact I'm not even sure that is save, if ic_refcnt makes > xlog_state_release_iclog exit early, so we might have to > pass additional information around there. With this change, for a workload of multiple fsyncing tasks, the majority of fsyncs won't be optimized, so the remaining code only really optimizes the case of a single process issuing occasional fsyncs. I wonder if it's worth keeping the optimization around at all?
On Wed, Aug 30, 2017 at 05:12:10PM +0300, Amir Goldstein wrote: > On Wed, Aug 30, 2017 at 4:46 PM, Christoph Hellwig <hch@lst.de> wrote: > > Oops. > > > > Yes, except for the case of calling xlog_state_release_iclog > > ourselves we can't claim that the cache has been flushed. In > > fact I'm not even sure that is save, if ic_refcnt makes > > xlog_state_release_iclog exit early, so we might have to > > pass additional information around there. > > With this change, for the multiple fsyncing tasks workload, the majority won't > be optimized, so the remaining code is only really optimizing the single > process doing seldom fsync case. > > I wonder if it's worth keeping the optimization around at all? I think it is. And we should get back to optimizing the multi-threaded cases soon using something like a sequence counter that checks if we issued a log sync after finishing the data writeback.
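To make the sequence-counter suggestion above concrete, here is a rough sketch of the idea as a toy userspace model. This is illustration only, not existing kernel or XFS code: the names (struct flush_seq, log_flush_submitted(), need_explicit_flush(), etc.) are invented, and a real implementation would have to live in the log and fsync paths with proper locking. The point is the ordering rule: fsync may skip its own data-device flush only if a cache flush that was submitted after its data writeback completed has since completed.

```c
/* Toy model of a sequence-counter based fsync flush-skip decision.
 * All names are hypothetical; this is not XFS code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flush_seq {
	atomic_uint_fast64_t issued;	/* bumped when a PREFLUSH is submitted */
	atomic_uint_fast64_t completed;	/* bumped when that PREFLUSH completes */
};

/* Log side: call around the PREFLUSH-carrying log write. */
static void log_flush_submitted(struct flush_seq *fs) { atomic_fetch_add(&fs->issued, 1); }
static void log_flush_completed(struct flush_seq *fs) { atomic_fetch_add(&fs->completed, 1); }

/*
 * fsync side: a flush "covers" our data only if it was submitted after our
 * data writeback completed (submit sequence > the value we sampled) and has
 * completed.  With in-order completion (good enough for this sketch),
 * completed >= sample + 1 means such a flush finished.
 */
static bool need_explicit_flush(struct flush_seq *fs, uint64_t sample)
{
	return atomic_load(&fs->completed) < sample + 1;
}

int main(void)
{
	struct flush_seq fs = { 0, 0 };

	log_flush_submitted(&fs);			/* task A's PREFLUSH submitted */
	uint64_t sample = atomic_load(&fs.issued);	/* task B: data writeback done, sample */
	log_flush_completed(&fs);			/* task A's PREFLUSH completes */

	/* That flush was submitted before task B's writes finished, so task B
	 * must still issue its own flush. */
	printf("task B needs explicit flush: %s\n",
	       need_explicit_flush(&fs, sample) ? "yes" : "no");
	return 0;
}
```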
On Wed, Aug 30, 2017 at 04:38:22PM +0300, Amir Goldstein wrote: > When calling into _xfs_log_force{,_lsn}() with a pointer > to log_flushed variable, log_flushed will be set to 1 if: > 1. xlog_sync() is called to flush the active log buffer > AND/OR > 2. xlog_wait() is called to wait on a syncing log buffers > > xfs_file_fsync() checks the value of log_flushed after > _xfs_log_force_lsn() call to optimize away an explicit > PREFLUSH request to the data block device after writing > out all the file's pages to disk. > > This optimization is incorrect in the following sequence of events: > > Task A Task B > ------------------------------------------------------- > xfs_file_fsync() > _xfs_log_force_lsn() > xlog_sync() > [submit PREFLUSH] > xfs_file_fsync() > file_write_and_wait_range() > [submit WRITE X] > [endio WRITE X] > _xfs_log_force_lsn() > xlog_wait() > [endio PREFLUSH] > > The write X is not guarantied to be on persistent storage > when PREFLUSH request in completed, because write A was submitted > after the PREFLUSH request, but xfs_file_fsync() of task A will > be notified of log_flushed=1 and will skip explicit flush. Yikes. > If the system crashes after fsync of task A, write X may not be > present on disk after reboot. > > This bug was discovered and demonstrated using Josef Bacik's > dm-log-writes target, which can be used to record block io operations > and then replay a subset of these operations onto the target device. > The test goes something like this: > - Use fsx to execute ops of a file and record ops on log device > - Every now and then fsync the file, store md5 of file and mark > the location in the log > - Then replay log onto device for each mark, mount fs and compare > md5 of file to stored value > > Cc: Christoph Hellwig <hch@lst.de> > Cc: Josef Bacik <jbacik@fb.com> > Cc: <stable@vger.kernel.org> > Signed-off-by: Amir Goldstein <amir73il@gmail.com> Will test, and if I have time try to set up this dm-log-writes thing for my own edification; Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > > Christoph, Dave, > > It's hard to believe, but I think the reported bug has been around > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did > not try to test old kernels. > > I tripped over this aleged bug a week ago when I started testing > crash consistecy with dm-log-writes: > https://marc.info/?l=linux-fsdevel&m=150350333427447&w=2 > Since then, I have been trying to prove that dm-log-writes was > false accusing, but turned out that it wasn't. > > The reproducer is an xfstest, which I am going to re-post soon > and is also available here: > https://github.com/amir73il/xfstests/commits/dm-log-writes > On a system with spinning disk it reproduces the issue with close > 100% probability within less than a minute. > > Anyway, with Darrick going on vacation, let's expedite review > of this fix and try to merge some fix for v4.14-rc1 (although this > bug fix may be eligible to later rc's as well). Yes. > The change obviously changes behavior for the worse for workload > of intensive parallel fsyncing tasks, but I don't see another quick > way to fix correctness without hurting this workload. > It might be worth to get a feedback from file_write_and_wait_range(), > whether dirty pages writes have been issued at all. Agreed, but safety first. :) --D > I started running full xftests cycle, but wanted to send this one out > early for comments. I can test other patches if needed. > > Cheers, > Amir. 
> > fs/xfs/xfs_log.c | 7 ------- > 1 file changed, 7 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index 0053bcf..ec159c5 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -3342,8 +3342,6 @@ _xfs_log_force( > */ > if (iclog->ic_state & XLOG_STATE_IOERROR) > return -EIO; > - if (log_flushed) > - *log_flushed = 1; > } else { > > no_sleep: > @@ -3447,8 +3445,6 @@ _xfs_log_force_lsn( > > xlog_wait(&iclog->ic_prev->ic_write_wait, > &log->l_icloglock); > - if (log_flushed) > - *log_flushed = 1; > already_slept = 1; > goto try_again; > } > @@ -3482,9 +3478,6 @@ _xfs_log_force_lsn( > */ > if (iclog->ic_state & XLOG_STATE_IOERROR) > return -EIO; > - > - if (log_flushed) > - *log_flushed = 1; > } else { /* just return */ > spin_unlock(&log->l_icloglock); > } > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
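For readers following the commit message above, this is where log_flushed bites on the caller side. The sketch below is a simplified paraphrase from memory of the 4.13-era xfs_file_fsync(), not a verbatim copy of the kernel source; error handling and the logic that picks the inode's last LSN are elided. The final conditional is the optimization in question: when the log force reports log_flushed, the explicit cache flush of the data device is skipped, which is only safe if that log flush was issued after our own data writeback completed.

```c
/*
 * Simplified sketch of the caller side, paraphrased from memory of the
 * 4.13-era xfs_file_fsync(); not a verbatim copy of the kernel source.
 */
STATIC int
xfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
{
	struct xfs_inode	*ip = XFS_I(file->f_mapping->host);
	struct xfs_mount	*mp = ip->i_mount;
	int			error, log_flushed = 0;
	xfs_lsn_t		lsn = 0;

	/* 1. Write back and wait on the file's dirty pages. */
	error = file_write_and_wait_range(file, start, end);
	if (error)
		return error;

	/*
	 * 2. Force the log up to the last LSN that touched this inode
	 *    (LSN selection elided).  log_flushed is set if a log buffer
	 *    flush happened -- after this patch, only if we caused it.
	 */
	if (lsn)
		error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);

	/*
	 * 3. The log write carries a PREFLUSH, so when log and data share
	 *    one device a log flush also flushes the data device cache.
	 *    Only if no such flush happened do we issue one ourselves.
	 *    A spurious log_flushed=1 therefore silently drops this flush.
	 */
	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
	    mp->m_logdev_targp == mp->m_ddev_targp)
		xfs_blkdev_issue_flush(mp->m_ddev_targp);

	return error;
}
```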
On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: > On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote: > > When calling into _xfs_log_force{,_lsn}() with a pointer > > to log_flushed variable, log_flushed will be set to 1 if: > > 1. xlog_sync() is called to flush the active log buffer > > AND/OR > > 2. xlog_wait() is called to wait on a syncing log buffers > > > > xfs_file_fsync() checks the value of log_flushed after > > _xfs_log_force_lsn() call to optimize away an explicit > > PREFLUSH request to the data block device after writing > > out all the file's pages to disk. > > > > This optimization is incorrect in the following sequence of events: > > > > Task A Task B > > ------------------------------------------------------- > > xfs_file_fsync() > > _xfs_log_force_lsn() > > xlog_sync() > > [submit PREFLUSH] > > xfs_file_fsync() > > file_write_and_wait_range() > > [submit WRITE X] > > [endio WRITE X] > > _xfs_log_force_lsn() > > xlog_wait() > > [endio PREFLUSH] > > > > The write X is not guarantied to be on persistent storage > > when PREFLUSH request in completed, because write A was submitted > > after the PREFLUSH request, but xfs_file_fsync() of task A will > > be notified of log_flushed=1 and will skip explicit flush. > > > > If the system crashes after fsync of task A, write X may not be > > present on disk after reboot. > > > > This bug was discovered and demonstrated using Josef Bacik's > > dm-log-writes target, which can be used to record block io operations > > and then replay a subset of these operations onto the target device. > > The test goes something like this: > > - Use fsx to execute ops of a file and record ops on log device > > - Every now and then fsync the file, store md5 of file and mark > > the location in the log > > - Then replay log onto device for each mark, mount fs and compare > > md5 of file to stored value > > > > Cc: Christoph Hellwig <hch@lst.de> > > Cc: Josef Bacik <jbacik@fb.com> > > Cc: <stable@vger.kernel.org> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > > --- > > > > Christoph, Dave, > > > > It's hard to believe, but I think the reported bug has been around > > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did > > not try to test old kernels. > > Forgot to tag commit message with: > Fixes: f538d4da8d52 ("[XFS] write barrier support") > > Maybe the tag could be added when applying to recent stables, > so distros and older downstream stables can see the tag. > > The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder > if possible data loss bug should also be disclosed in some distros forum? > I bet some users would care more about the latter than the former. > Coincidentally, both data loss and security bugs fix the same commit.. Yes the the patch ought to get sent on to stable w/ fixes tag. One would hope that the distros will pick up the stable fixes from there. That said, it's been in the kernel for 12 years without widespread complaints about corruption, so I'm not sure this warrants public disclosure via CVE/Phoronix vs. just fixing it. --D > > Cheers, > Amir. > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote: >> > When calling into _xfs_log_force{,_lsn}() with a pointer >> > to log_flushed variable, log_flushed will be set to 1 if: >> > 1. xlog_sync() is called to flush the active log buffer >> > AND/OR >> > 2. xlog_wait() is called to wait on a syncing log buffers >> > >> > xfs_file_fsync() checks the value of log_flushed after >> > _xfs_log_force_lsn() call to optimize away an explicit >> > PREFLUSH request to the data block device after writing >> > out all the file's pages to disk. >> > >> > This optimization is incorrect in the following sequence of events: >> > >> > Task A Task B >> > ------------------------------------------------------- >> > xfs_file_fsync() >> > _xfs_log_force_lsn() >> > xlog_sync() >> > [submit PREFLUSH] >> > xfs_file_fsync() >> > file_write_and_wait_range() >> > [submit WRITE X] >> > [endio WRITE X] >> > _xfs_log_force_lsn() >> > xlog_wait() >> > [endio PREFLUSH] >> > >> > The write X is not guarantied to be on persistent storage >> > when PREFLUSH request in completed, because write A was submitted >> > after the PREFLUSH request, but xfs_file_fsync() of task A will >> > be notified of log_flushed=1 and will skip explicit flush. >> > >> > If the system crashes after fsync of task A, write X may not be >> > present on disk after reboot. >> > >> > This bug was discovered and demonstrated using Josef Bacik's >> > dm-log-writes target, which can be used to record block io operations >> > and then replay a subset of these operations onto the target device. >> > The test goes something like this: >> > - Use fsx to execute ops of a file and record ops on log device >> > - Every now and then fsync the file, store md5 of file and mark >> > the location in the log >> > - Then replay log onto device for each mark, mount fs and compare >> > md5 of file to stored value >> > >> > Cc: Christoph Hellwig <hch@lst.de> >> > Cc: Josef Bacik <jbacik@fb.com> >> > Cc: <stable@vger.kernel.org> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com> >> > --- >> > >> > Christoph, Dave, >> > >> > It's hard to believe, but I think the reported bug has been around >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did >> > not try to test old kernels. >> >> Forgot to tag commit message with: >> Fixes: f538d4da8d52 ("[XFS] write barrier support") >> >> Maybe the tag could be added when applying to recent stables, >> so distros and older downstream stables can see the tag. >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder >> if possible data loss bug should also be disclosed in some distros forum? >> I bet some users would care more about the latter than the former. >> Coincidentally, both data loss and security bugs fix the same commit.. > > Yes the the patch ought to get sent on to stable w/ fixes tag. One > would hope that the distros will pick up the stable fixes from there. Greg, for your consideration, please add Fixes: f538d4da8d52 ("[XFS] write barrier support") If not pushed yet. > That said, it's been in the kernel for 12 years without widespread > complaints about corruption, so I'm not sure this warrants public > disclosure via CVE/Phoronix vs. just fixing it. > I'm not sure either. 
My intuition tells me that the chances of hitting the data loss bug given a power failure are not slim, but the chances of users knowing about the data loss are slim. Meaning, the chances of users complaining "I swear, I did fsync, lost power, and after boot the data was not there" are slim, and the chances of developers believing such a report on hearsay are even slimmer. Oh well. I was just wondering... Amir.
On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: > On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong > <darrick.wong@oracle.com> wrote: > > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: > >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote: > >> > When calling into _xfs_log_force{,_lsn}() with a pointer > >> > to log_flushed variable, log_flushed will be set to 1 if: > >> > 1. xlog_sync() is called to flush the active log buffer > >> > AND/OR > >> > 2. xlog_wait() is called to wait on a syncing log buffers > >> > > >> > xfs_file_fsync() checks the value of log_flushed after > >> > _xfs_log_force_lsn() call to optimize away an explicit > >> > PREFLUSH request to the data block device after writing > >> > out all the file's pages to disk. > >> > > >> > This optimization is incorrect in the following sequence of events: > >> > > >> > Task A Task B > >> > ------------------------------------------------------- > >> > xfs_file_fsync() > >> > _xfs_log_force_lsn() > >> > xlog_sync() > >> > [submit PREFLUSH] > >> > xfs_file_fsync() > >> > file_write_and_wait_range() > >> > [submit WRITE X] > >> > [endio WRITE X] > >> > _xfs_log_force_lsn() > >> > xlog_wait() > >> > [endio PREFLUSH] > >> > > >> > The write X is not guarantied to be on persistent storage > >> > when PREFLUSH request in completed, because write A was submitted > >> > after the PREFLUSH request, but xfs_file_fsync() of task A will > >> > be notified of log_flushed=1 and will skip explicit flush. > >> > > >> > If the system crashes after fsync of task A, write X may not be > >> > present on disk after reboot. > >> > > >> > This bug was discovered and demonstrated using Josef Bacik's > >> > dm-log-writes target, which can be used to record block io operations > >> > and then replay a subset of these operations onto the target device. > >> > The test goes something like this: > >> > - Use fsx to execute ops of a file and record ops on log device > >> > - Every now and then fsync the file, store md5 of file and mark > >> > the location in the log > >> > - Then replay log onto device for each mark, mount fs and compare > >> > md5 of file to stored value > >> > > >> > Cc: Christoph Hellwig <hch@lst.de> > >> > Cc: Josef Bacik <jbacik@fb.com> > >> > Cc: <stable@vger.kernel.org> > >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > >> > --- > >> > > >> > Christoph, Dave, > >> > > >> > It's hard to believe, but I think the reported bug has been around > >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did > >> > not try to test old kernels. > >> > >> Forgot to tag commit message with: > >> Fixes: f538d4da8d52 ("[XFS] write barrier support") > >> > >> Maybe the tag could be added when applying to recent stables, > >> so distros and older downstream stables can see the tag. > >> > >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder > >> if possible data loss bug should also be disclosed in some distros forum? > >> I bet some users would care more about the latter than the former. > >> Coincidentally, both data loss and security bugs fix the same commit.. > > > > Yes the the patch ought to get sent on to stable w/ fixes tag. One > > would hope that the distros will pick up the stable fixes from there. > > > Greg, for your consideration, please add > Fixes: f538d4da8d52 ("[XFS] write barrier support") > If not pushed yet. Add it to what? totally confused, greg k-h
On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote: > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong >> <darrick.wong@oracle.com> wrote: >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote: >> >> > When calling into _xfs_log_force{,_lsn}() with a pointer >> >> > to log_flushed variable, log_flushed will be set to 1 if: >> >> > 1. xlog_sync() is called to flush the active log buffer >> >> > AND/OR >> >> > 2. xlog_wait() is called to wait on a syncing log buffers >> >> > >> >> > xfs_file_fsync() checks the value of log_flushed after >> >> > _xfs_log_force_lsn() call to optimize away an explicit >> >> > PREFLUSH request to the data block device after writing >> >> > out all the file's pages to disk. >> >> > >> >> > This optimization is incorrect in the following sequence of events: >> >> > >> >> > Task A Task B >> >> > ------------------------------------------------------- >> >> > xfs_file_fsync() >> >> > _xfs_log_force_lsn() >> >> > xlog_sync() >> >> > [submit PREFLUSH] >> >> > xfs_file_fsync() >> >> > file_write_and_wait_range() >> >> > [submit WRITE X] >> >> > [endio WRITE X] >> >> > _xfs_log_force_lsn() >> >> > xlog_wait() >> >> > [endio PREFLUSH] >> >> > >> >> > The write X is not guarantied to be on persistent storage >> >> > when PREFLUSH request in completed, because write A was submitted >> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will >> >> > be notified of log_flushed=1 and will skip explicit flush. >> >> > >> >> > If the system crashes after fsync of task A, write X may not be >> >> > present on disk after reboot. >> >> > >> >> > This bug was discovered and demonstrated using Josef Bacik's >> >> > dm-log-writes target, which can be used to record block io operations >> >> > and then replay a subset of these operations onto the target device. >> >> > The test goes something like this: >> >> > - Use fsx to execute ops of a file and record ops on log device >> >> > - Every now and then fsync the file, store md5 of file and mark >> >> > the location in the log >> >> > - Then replay log onto device for each mark, mount fs and compare >> >> > md5 of file to stored value >> >> > >> >> > Cc: Christoph Hellwig <hch@lst.de> >> >> > Cc: Josef Bacik <jbacik@fb.com> >> >> > Cc: <stable@vger.kernel.org> >> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com> >> >> > --- >> >> > >> >> > Christoph, Dave, >> >> > >> >> > It's hard to believe, but I think the reported bug has been around >> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did >> >> > not try to test old kernels. >> >> >> >> Forgot to tag commit message with: >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support") >> >> >> >> Maybe the tag could be added when applying to recent stables, >> >> so distros and older downstream stables can see the tag. >> >> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder >> >> if possible data loss bug should also be disclosed in some distros forum? >> >> I bet some users would care more about the latter than the former. >> >> Coincidentally, both data loss and security bugs fix the same commit.. >> > >> > Yes the the patch ought to get sent on to stable w/ fixes tag. One >> > would hope that the distros will pick up the stable fixes from there. 
>> >> >> Greg, for your consideration, please add >> Fixes: f538d4da8d52 ("[XFS] write barrier support") >> If not pushed yet. > > Add it to what? Sorry, add that tag when applying commit 47c7d0b1950258312 to stable trees, since I missed adding the tag before it was merged to master. Thanks, Amir.
On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: > On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: > >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder > >> if possible data loss bug should also be disclosed in some distros forum? > >> I bet some users would care more about the latter than the former. > >> Coincidentally, both data loss and security bugs fix the same commit.. > > > > Yes the the patch ought to get sent on to stable w/ fixes tag. One > > would hope that the distros will pick up the stable fixes from there. Yup, that's the normal process for data integrity/fs corruption bugs. > > That said, it's been in the kernel for 12 years without widespread > > complaints about corruption, so I'm not sure this warrants public > > disclosure via CVE/Phoronix vs. just fixing it. > > > > I'm not sure either. > My intuition tells me that the chances of hitting the data loss bug > given a power failure are not slim, but the chances of users knowing > about the data loss are slim. The chances of hitting it are slim. Power-fail vs fsync data integrity testing is something we do actually run as part of QE and have for many years. We've been running such testing for years and never tripped over this problem, so I think the likelihood that a user will hit it is extremely small. Compare that to the bug we issued the CVE for - it's in code we /don't test/, the bug affects everyone with that support compiled in (potentially millions of systems) and it is a guaranteed local user triggered denial of service. There's a big difference in scope and impact between these two cases. Further, we have to consider precedence, developer resources and impact when deciding what we issue disclosures for. e.g. Verifying the problem, categorising it, finding out what systems it affected, working out the best fix for backporting, testing the fix, working through all the disclosure processes, etc took three long days of me doing nothing else but working on this issue. If we start issuing disclosures for all data integrity and/or corruption fixes, then we're going to have no time for anything else. That is, a significant number of bugs we fix every release are for corruption and/or data integrity issues. The process we use for getting them back to stable and distro kernels is to tag them for stable kernels. Adding some heavyweight disclosure process on top of that isn't going to provide any extra value to distros or users. All it does is add significant extra workload to the upstream development process. > Meaning, the chances of users complaining: > "I swear, I did fsync, lost power, and after boot data was not there" are slim > and the chances of developers believing the report on hear say are > even slimmer. A disclosure is not going to change that. All it will do is result in users seeing ghosts (suggestion bias) and reporting things that aren't actually filesystem data loss. e.g. "filesystem lost data on crash" when in reality the problem is that the app doesn't use fsync... Cheers, Dave.
On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder >> >> if possible data loss bug should also be disclosed in some distros forum? >> >> I bet some users would care more about the latter than the former. >> >> Coincidentally, both data loss and security bugs fix the same commit.. >> > >> > Yes the the patch ought to get sent on to stable w/ fixes tag. One >> > would hope that the distros will pick up the stable fixes from there. > > Yup, that's the normal process for data integrity/fs corruption > bugs. Makes sense. I'm convinced that the normal process is sufficient for this sort of bug fix. > >> > That said, it's been in the kernel for 12 years without widespread >> > complaints about corruption, so I'm not sure this warrants public >> > disclosure via CVE/Phoronix vs. just fixing it. >> > >> >> I'm not sure either. >> My intuition tells me that the chances of hitting the data loss bug >> given a power failure are not slim, but the chances of users knowing >> about the data loss are slim. > > The chances of hitting it are slim. Power-fail vs fsync data > integrity testing is something we do actually run as part of QE and > have for many years. We've been running such testing for years and > never tripped over this problem, so I think the likelihood that a > user will hit it is extremely small. This sentence makes me uneasy. Who is "we", and what QE testing are you referring to? Are those tests in xfstests or any other public repository? My first reaction to the corruption was "no way, I need to check the test". My second reaction, after checking the test, was "this must be very, very hard to hit". But from closer inspection, it looks like it doesn't take more than running a couple of fsyncs in parallel to get to the "at risk" state, which may persist for seconds. Of course the chances of users being that unlucky to also get a power failure during the "at risk" state are low, but I am puzzled how the power-fail tests you claim exist didn't catch this sooner. Anyway, not sure there is much more to discuss, I just wanted to see if there is a post-mortem lesson to be learned from this, beyond the fact that dm-log-writes is a valuable testing tool. Cheers, Amir.
On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote: > On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: > >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: > >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder > >> >> if possible data loss bug should also be disclosed in some distros forum? > >> >> I bet some users would care more about the latter than the former. > >> >> Coincidentally, both data loss and security bugs fix the same commit.. > >> > > >> > Yes the the patch ought to get sent on to stable w/ fixes tag. One > >> > would hope that the distros will pick up the stable fixes from there. > > > > Yup, that's the normal process for data integrity/fs corruption > > bugs. > > Makes sense. I'm convinced that the normal process is sufficient for this > sort of bug fix. > > > > >> > That said, it's been in the kernel for 12 years without widespread > >> > complaints about corruption, so I'm not sure this warrants public > >> > disclosure via CVE/Phoronix vs. just fixing it. > >> > > >> > >> I'm not sure either. > >> My intuition tells me that the chances of hitting the data loss bug > >> given a power failure are not slim, but the chances of users knowing > >> about the data loss are slim. > > > > The chances of hitting it are slim. Power-fail vs fsync data > > integrity testing is something we do actually run as part of QE and > > have for many years. We've been running such testing for years and > > never tripped over this problem, so I think the likelihood that a > > user will hit it is extremely small. > > This sentence make me unease. > Who is We and what QE testing are you referring to? > Are those tests in xfstests or any other public repository? > My first reaction to the corruption was "no way, I need to check the test" > Second reaction after checking the test was "this must very very hard to hit" /me prefers to think that we've simply gotten lucky all these years and nobody actually managed to die before another flush would take care of the dirty data. But then I did just spend a week in Las Vegas. :P > But from closer inspection, it looks like it doesn't take more than running > a couple of fsync in parallel to get to the "at risk" state, which may persist > for seconds. > Of course the chances of users being that unlucky to also get a power > failure during "at risk" state are low, but I am puzzled how power fail tests > you claim that exists, didn't catch this sooner. > > Anyway, not sure there is much more to discuss, just wanted to see > if there is a post mortem lesson to be learned from this, beyond the fact that > dm-log-writes is a valuable testing tool. Agreed. :) --D > > Cheers, > Amir. > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote: > On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote: > > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: > >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong > >> <darrick.wong@oracle.com> wrote: > >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: > >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote: > >> >> > When calling into _xfs_log_force{,_lsn}() with a pointer > >> >> > to log_flushed variable, log_flushed will be set to 1 if: > >> >> > 1. xlog_sync() is called to flush the active log buffer > >> >> > AND/OR > >> >> > 2. xlog_wait() is called to wait on a syncing log buffers > >> >> > > >> >> > xfs_file_fsync() checks the value of log_flushed after > >> >> > _xfs_log_force_lsn() call to optimize away an explicit > >> >> > PREFLUSH request to the data block device after writing > >> >> > out all the file's pages to disk. > >> >> > > >> >> > This optimization is incorrect in the following sequence of events: > >> >> > > >> >> > Task A Task B > >> >> > ------------------------------------------------------- > >> >> > xfs_file_fsync() > >> >> > _xfs_log_force_lsn() > >> >> > xlog_sync() > >> >> > [submit PREFLUSH] > >> >> > xfs_file_fsync() > >> >> > file_write_and_wait_range() > >> >> > [submit WRITE X] > >> >> > [endio WRITE X] > >> >> > _xfs_log_force_lsn() > >> >> > xlog_wait() > >> >> > [endio PREFLUSH] > >> >> > > >> >> > The write X is not guarantied to be on persistent storage > >> >> > when PREFLUSH request in completed, because write A was submitted > >> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will > >> >> > be notified of log_flushed=1 and will skip explicit flush. > >> >> > > >> >> > If the system crashes after fsync of task A, write X may not be > >> >> > present on disk after reboot. > >> >> > > >> >> > This bug was discovered and demonstrated using Josef Bacik's > >> >> > dm-log-writes target, which can be used to record block io operations > >> >> > and then replay a subset of these operations onto the target device. > >> >> > The test goes something like this: > >> >> > - Use fsx to execute ops of a file and record ops on log device > >> >> > - Every now and then fsync the file, store md5 of file and mark > >> >> > the location in the log > >> >> > - Then replay log onto device for each mark, mount fs and compare > >> >> > md5 of file to stored value > >> >> > > >> >> > Cc: Christoph Hellwig <hch@lst.de> > >> >> > Cc: Josef Bacik <jbacik@fb.com> > >> >> > Cc: <stable@vger.kernel.org> > >> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > >> >> > --- > >> >> > > >> >> > Christoph, Dave, > >> >> > > >> >> > It's hard to believe, but I think the reported bug has been around > >> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did > >> >> > not try to test old kernels. > >> >> > >> >> Forgot to tag commit message with: > >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support") > >> >> > >> >> Maybe the tag could be added when applying to recent stables, > >> >> so distros and older downstream stables can see the tag. > >> >> > >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder > >> >> if possible data loss bug should also be disclosed in some distros forum? > >> >> I bet some users would care more about the latter than the former. > >> >> Coincidentally, both data loss and security bugs fix the same commit.. 
> >> > > >> > Yes the the patch ought to get sent on to stable w/ fixes tag. One > >> > would hope that the distros will pick up the stable fixes from there. > >> > >> > >> Greg, for your consideration, please add > >> Fixes: f538d4da8d52 ("[XFS] write barrier support") > >> If not pushed yet. > > > > Add it to what? > > Sorry, add that tag when applying commit 47c7d0b1950258312 > to stable trees, since I missed adding the tag before it was merged > to master. Nah, as the tag is just needed to let me know where to backport stuff to, I don't think it matters when I add it to the stable tree itself, so I'll leave it as-is. thanks, greg k-h
On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote: > On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: > >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: > >> > That said, it's been in the kernel for 12 years without widespread > >> > complaints about corruption, so I'm not sure this warrants public > >> > disclosure via CVE/Phoronix vs. just fixing it. > >> > > >> > >> I'm not sure either. > >> My intuition tells me that the chances of hitting the data loss bug > >> given a power failure are not slim, but the chances of users knowing > >> about the data loss are slim. > > > > The chances of hitting it are slim. Power-fail vs fsync data > > integrity testing is something we do actually run as part of QE and > > have for many years. We've been running such testing for years and > > never tripped over this problem, so I think the likelihood that a > > user will hit it is extremely small. > > This sentence make me unease. > Who is We and what QE testing are you referring to? I've done it in the past myself with a modified crash/xfscrash to write patterned files (via genstream/checkstream). Unfortunately, I lost that script when the machine used for that testing suffered a fatal, completely unrecoverable ext3 root filesystem corruption during a power fail cycle... :/ RH QE also runs automated power fail cycle tests - we found lots of ext4 problems with that test rig when it was first put together, but I don't recall seeing XFS issues reported. Eryu would have to confirm, but ISTR that this testing was made part of the regular RHEL major release testing cycle... Let's not forget all the other storage vendors and apps out there that do their own crash/power fail testing that rely on a working fsync. Apps like ceph, cluster, various databases, etc all have their own data integrity testing procedures, and so if there's any obvious or easy to hit fsync bug we would have had people reporting it long ago. Then there's all the research tools that have had papers written about them testing exactly the sort of thing that dm-log writes is testing. None of these indicated any sort of problem with fsync in XFS, but we couldn't reproduce or verify the research results of the because none of those fine institutions ever open sourced their tools despite repeated requests and promises that it would happen. > Are those tests in xfstests or any other public repository? crash/xfscrash is, and now dm-log-write, but nothing else is. > My first reaction to the corruption was "no way, I need to check the test" > Second reaction after checking the test was "this must very very hard to hit" > But from closer inspection, it looks like it doesn't take more than running > a couple of fsync in parallel to get to the "at risk" state, which may persist > for seconds. That may be the case, but the reality is we don't have a body of evidence to suggest this is a problem anyone is actually hitting. In fact, we don't have any evidence it's been seen in the wild at all. > Of course the chances of users being that unlucky to also get a power > failure during "at risk" state are low, but I am puzzled how power fail tests > you claim that exists, didn't catch this sooner. Probably for the same reason app developers and users aren't reporting fsync data loss problems. 
While the bug may "look obvious in hindsight", the fact is that there is no evidence of data loss after fsync on XFS in the real world. Occam's Razor suggests that there is something that masks the problem that we don't understand yet.... Cheers, Dave.
>> Are those tests in xfstests or any other public repository? > > crash/xfscrash is, and now dm-log-write, but nothing else is. > There's also CrashMonkey (tool under active development from my research group at UT Austin) at: https://github.com/utsaslab/crashmonkey. >> My first reaction to the corruption was "no way, I need to check the test" >> Second reaction after checking the test was "this must very very hard to hit" >> But from closer inspection, it looks like it doesn't take more than running >> a couple of fsync in parallel to get to the "at risk" state, which may persist >> for seconds. > > That may be the case, but the reality is we don't have a body of > evidence to suggest this is a problem anyone is actually hitting. In > fact, we don't have any evidence it's been seen in the wild at all. Even if people did hit this bug in the wild, as Amir suggested before, the chances of them connecting it to a crash-consistency bug in the file system and reporting as such are extremely low. We do have many reports of data loss in newer file systems like btrfs, but I don't think those are connected back to crash-consistency bugs (my suspicion is there would be many such bugs, simply due to inadequate crash consistency testing). When working on CrashMonkey, we find it hard to understand what a crash consistency bug is, even when we are the ones producing the bug! I think there is a need for good debugging/investigative tools in this area (something we are also working on). Thanks, Vijay http://www.cs.utexas.edu/~vijay/
On Wed, Sep 20, 2017 at 10:40:12AM +1000, Dave Chinner wrote: > On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote: > > On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: > > > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: > > >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > > >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: > > >> > That said, it's been in the kernel for 12 years without widespread > > >> > complaints about corruption, so I'm not sure this warrants public > > >> > disclosure via CVE/Phoronix vs. just fixing it. > > >> > > > >> > > >> I'm not sure either. > > >> My intuition tells me that the chances of hitting the data loss bug > > >> given a power failure are not slim, but the chances of users knowing > > >> about the data loss are slim. > > > > > > The chances of hitting it are slim. Power-fail vs fsync data > > > integrity testing is something we do actually run as part of QE and > > > have for many years. We've been running such testing for years and > > > never tripped over this problem, so I think the likelihood that a > > > user will hit it is extremely small. > > > > This sentence make me unease. > > Who is We and what QE testing are you referring to? > > I've done it in the past myself with a modified crash/xfscrash to > write patterned files (via genstream/checkstream). Unfortunately, I > lost that script when the machine used for that testing suffered a > fatal, completely unrecoverable ext3 root filesystem corruption > during a power fail cycle... :/ > > RH QE also runs automated power fail cycle tests - we found lots of Not fully automated yet, but semi-automated :) > ext4 problems with that test rig when it was first put together, but > I don't recall seeing XFS issues reported. Eryu would have to > confirm, but ISTR that this testing was made part of the regular > RHEL major release testing cycle... Generally speaking, we don't find many bugs in our power failure tests. We filed a few ext3/4 bugs against RHEL6 kernel, but IIRC there was only one XFS bug filed, and we couldn't reproduce that bug in later RHEL minor releases. And it seems to me that we don't file any such bugs against RHEL7 kernel. And yes, this is part of regular RHEL release testings only, I haven't done any power failure tests with upstream kernels yet. Thanks, Eryu > > Let's not forget all the other storage vendors and apps out there > that do their own crash/power fail testing that rely on a working > fsync. Apps like ceph, cluster, various databases, etc all have > their own data integrity testing procedures, and so if there's any > obvious or easy to hit fsync bug we would have had people reporting > it long ago. > > Then there's all the research tools that have had papers written > about them testing exactly the sort of thing that dm-log writes is > testing. None of these indicated any sort of problem with fsync in > XFS, but we couldn't reproduce or verify the research results of the > because none of those fine institutions ever open sourced their > tools despite repeated requests and promises that it would happen. > > > Are those tests in xfstests or any other public repository? > > crash/xfscrash is, and now dm-log-write, but nothing else is. 
> > > My first reaction to the corruption was "no way, I need to check the test" > > Second reaction after checking the test was "this must very very hard to hit" > > But from closer inspection, it looks like it doesn't take more than running > > a couple of fsync in parallel to get to the "at risk" state, which may persist > > for seconds. > > That may be the case, but the reality is we don't have a body of > evidence to suggest this is a problem anyone is actually hitting. In > fact, we don't have any evidence it's been seen in the wild at all. > > > Of course the chances of users being that unlucky to also get a power > > failure during "at risk" state are low, but I am puzzled how power fail tests > > you claim that exists, didn't catch this sooner. > > Probably for the same reason app developers and users aren't > reporting fsync data loss problems. While the bug may "look obvious > in hindsight", the fact is that there are no evidence of data loss > after fsync on XFS in the real world. Occam's Razor suggests that > there is something that masks the problem that we don't understand > yet.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 19, 2017 at 9:32 AM, Greg KH <greg@kroah.com> wrote: > On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote: >> On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote: >> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: >> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong >> >> <darrick.wong@oracle.com> wrote: >> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: >> >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote: >> >> >> > When calling into _xfs_log_force{,_lsn}() with a pointer >> >> >> > to log_flushed variable, log_flushed will be set to 1 if: >> >> >> > 1. xlog_sync() is called to flush the active log buffer >> >> >> > AND/OR >> >> >> > 2. xlog_wait() is called to wait on a syncing log buffers >> >> >> > >> >> >> > xfs_file_fsync() checks the value of log_flushed after >> >> >> > _xfs_log_force_lsn() call to optimize away an explicit >> >> >> > PREFLUSH request to the data block device after writing >> >> >> > out all the file's pages to disk. >> >> >> > >> >> >> > This optimization is incorrect in the following sequence of events: >> >> >> > >> >> >> > Task A Task B >> >> >> > ------------------------------------------------------- >> >> >> > xfs_file_fsync() >> >> >> > _xfs_log_force_lsn() >> >> >> > xlog_sync() >> >> >> > [submit PREFLUSH] >> >> >> > xfs_file_fsync() >> >> >> > file_write_and_wait_range() >> >> >> > [submit WRITE X] >> >> >> > [endio WRITE X] >> >> >> > _xfs_log_force_lsn() >> >> >> > xlog_wait() >> >> >> > [endio PREFLUSH] >> >> >> > >> >> >> > The write X is not guarantied to be on persistent storage >> >> >> > when PREFLUSH request in completed, because write A was submitted >> >> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will >> >> >> > be notified of log_flushed=1 and will skip explicit flush. >> >> >> > >> >> >> > If the system crashes after fsync of task A, write X may not be >> >> >> > present on disk after reboot. >> >> >> > >> >> >> > This bug was discovered and demonstrated using Josef Bacik's >> >> >> > dm-log-writes target, which can be used to record block io operations >> >> >> > and then replay a subset of these operations onto the target device. >> >> >> > The test goes something like this: >> >> >> > - Use fsx to execute ops of a file and record ops on log device >> >> >> > - Every now and then fsync the file, store md5 of file and mark >> >> >> > the location in the log >> >> >> > - Then replay log onto device for each mark, mount fs and compare >> >> >> > md5 of file to stored value >> >> >> > >> >> >> > Cc: Christoph Hellwig <hch@lst.de> >> >> >> > Cc: Josef Bacik <jbacik@fb.com> >> >> >> > Cc: <stable@vger.kernel.org> >> >> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com> >> >> >> > --- >> >> >> > >> >> >> > Christoph, Dave, >> >> >> > >> >> >> > It's hard to believe, but I think the reported bug has been around >> >> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did >> >> >> > not try to test old kernels. >> >> >> >> >> >> Forgot to tag commit message with: >> >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support") >> >> >> >> >> >> Maybe the tag could be added when applying to recent stables, >> >> >> so distros and older downstream stables can see the tag. >> >> >> >> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder >> >> >> if possible data loss bug should also be disclosed in some distros forum? 
>> >> >> I bet some users would care more about the latter than the former. >> >> >> Coincidentally, both data loss and security bugs fix the same commit.. >> >> > >> >> > Yes the the patch ought to get sent on to stable w/ fixes tag. One >> >> > would hope that the distros will pick up the stable fixes from there. >> >> >> >> >> >> Greg, for your consideration, please add >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support") >> >> If not pushed yet. >> > >> > Add it to what? >> >> Sorry, add that tag when applying commit 47c7d0b1950258312 >> to stable trees, since I missed adding the tag before it was merged >> to master. > > Nah, as the tag is just needed to let me know where to backport stuff > to, I don't think it matters when I add it to the stable tree itself, so > I'll leave it as-is. > Greg, Related or not to above Fixes discussion, I now noticed that you never picked the patch for kernel 4.4. Ben did take it to 3.2 and 3.16 BTW. This is a very critical bug fix IMO. Were you waiting for an ACK from xfs maintainers or just an oversight? Or was it me who had to check up on that? Thanks, Amir.
On Sat, Jun 09, 2018 at 07:44:22AM +0300, Amir Goldstein wrote: > On Tue, Sep 19, 2017 at 9:32 AM, Greg KH <greg@kroah.com> wrote: > > On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote: > >> On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote: > >> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote: > >> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong > >> >> <darrick.wong@oracle.com> wrote: > >> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote: > >> >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote: > >> >> >> > When calling into _xfs_log_force{,_lsn}() with a pointer > >> >> >> > to log_flushed variable, log_flushed will be set to 1 if: > >> >> >> > 1. xlog_sync() is called to flush the active log buffer > >> >> >> > AND/OR > >> >> >> > 2. xlog_wait() is called to wait on a syncing log buffers > >> >> >> > > >> >> >> > xfs_file_fsync() checks the value of log_flushed after > >> >> >> > _xfs_log_force_lsn() call to optimize away an explicit > >> >> >> > PREFLUSH request to the data block device after writing > >> >> >> > out all the file's pages to disk. > >> >> >> > > >> >> >> > This optimization is incorrect in the following sequence of events: > >> >> >> > > >> >> >> > Task A Task B > >> >> >> > ------------------------------------------------------- > >> >> >> > xfs_file_fsync() > >> >> >> > _xfs_log_force_lsn() > >> >> >> > xlog_sync() > >> >> >> > [submit PREFLUSH] > >> >> >> > xfs_file_fsync() > >> >> >> > file_write_and_wait_range() > >> >> >> > [submit WRITE X] > >> >> >> > [endio WRITE X] > >> >> >> > _xfs_log_force_lsn() > >> >> >> > xlog_wait() > >> >> >> > [endio PREFLUSH] > >> >> >> > > >> >> >> > The write X is not guarantied to be on persistent storage > >> >> >> > when PREFLUSH request in completed, because write A was submitted > >> >> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will > >> >> >> > be notified of log_flushed=1 and will skip explicit flush. > >> >> >> > > >> >> >> > If the system crashes after fsync of task A, write X may not be > >> >> >> > present on disk after reboot. > >> >> >> > > >> >> >> > This bug was discovered and demonstrated using Josef Bacik's > >> >> >> > dm-log-writes target, which can be used to record block io operations > >> >> >> > and then replay a subset of these operations onto the target device. > >> >> >> > The test goes something like this: > >> >> >> > - Use fsx to execute ops of a file and record ops on log device > >> >> >> > - Every now and then fsync the file, store md5 of file and mark > >> >> >> > the location in the log > >> >> >> > - Then replay log onto device for each mark, mount fs and compare > >> >> >> > md5 of file to stored value > >> >> >> > > >> >> >> > Cc: Christoph Hellwig <hch@lst.de> > >> >> >> > Cc: Josef Bacik <jbacik@fb.com> > >> >> >> > Cc: <stable@vger.kernel.org> > >> >> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > >> >> >> > --- > >> >> >> > > >> >> >> > Christoph, Dave, > >> >> >> > > >> >> >> > It's hard to believe, but I think the reported bug has been around > >> >> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did > >> >> >> > not try to test old kernels. > >> >> >> > >> >> >> Forgot to tag commit message with: > >> >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support") > >> >> >> > >> >> >> Maybe the tag could be added when applying to recent stables, > >> >> >> so distros and older downstream stables can see the tag. 
> >> >> >> > >> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder > >> >> >> if possible data loss bug should also be disclosed in some distros forum? > >> >> >> I bet some users would care more about the latter than the former. > >> >> >> Coincidentally, both data loss and security bugs fix the same commit.. > >> >> > > >> >> > Yes the the patch ought to get sent on to stable w/ fixes tag. One > >> >> > would hope that the distros will pick up the stable fixes from there. > >> >> > >> >> > >> >> Greg, for your consideration, please add > >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support") > >> >> If not pushed yet. > >> > > >> > Add it to what? > >> > >> Sorry, add that tag when applying commit 47c7d0b1950258312 > >> to stable trees, since I missed adding the tag before it was merged > >> to master. > > > > Nah, as the tag is just needed to let me know where to backport stuff > > to, I don't think it matters when I add it to the stable tree itself, so > > I'll leave it as-is. > > > > Greg, > > Related or not to above Fixes discussion, I now noticed that you never picked > the patch for kernel 4.4. > > Ben did take it to 3.2 and 3.16 BTW. > > This is a very critical bug fix IMO. > Were you waiting for an ACK from xfs maintainers or just an oversight? > Or was it me who had to check up on that? It looks to be an oversight, sorry, now queued up. greg k-h
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 0053bcf..ec159c5 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3342,8 +3342,6 @@ _xfs_log_force(
 		 */
 		if (iclog->ic_state & XLOG_STATE_IOERROR)
 			return -EIO;
-		if (log_flushed)
-			*log_flushed = 1;
 	} else {
 
 no_sleep:
@@ -3447,8 +3445,6 @@ _xfs_log_force_lsn(
 
 			xlog_wait(&iclog->ic_prev->ic_write_wait,
 					&log->l_icloglock);
-			if (log_flushed)
-				*log_flushed = 1;
 			already_slept = 1;
 			goto try_again;
 		}
@@ -3482,9 +3478,6 @@ _xfs_log_force_lsn(
 		 */
 		if (iclog->ic_state & XLOG_STATE_IOERROR)
 			return -EIO;
-
-		if (log_flushed)
-			*log_flushed = 1;
 	} else {		/* just return */
 		spin_unlock(&log->l_icloglock);
 	}
When calling into _xfs_log_force{,_lsn}() with a pointer
to the log_flushed variable, log_flushed will be set to 1 if:
1. xlog_sync() is called to flush the active log buffer
AND/OR
2. xlog_wait() is called to wait on a syncing log buffer

xfs_file_fsync() checks the value of log_flushed after the
_xfs_log_force_lsn() call to optimize away an explicit
PREFLUSH request to the data block device after writing
out all the file's pages to disk.

This optimization is incorrect in the following sequence of events:

 Task A                      Task B
 -------------------------------------------------------
 xfs_file_fsync()
   _xfs_log_force_lsn()
     xlog_sync()
        [submit PREFLUSH]
                             xfs_file_fsync()
                               file_write_and_wait_range()
                                 [submit WRITE X]
                                 [endio  WRITE X]
                               _xfs_log_force_lsn()
                                 xlog_wait()
        [endio PREFLUSH]

The write X is not guaranteed to be on persistent storage
when the PREFLUSH request is completed, because write X was submitted
after the PREFLUSH request, but xfs_file_fsync() of task A will
be notified of log_flushed=1 and will skip the explicit flush.

If the system crashes after the fsync of task A, write X may not be
present on disk after reboot.

This bug was discovered and demonstrated using Josef Bacik's
dm-log-writes target, which can be used to record block io operations
and then replay a subset of these operations onto the target device.
The test goes something like this:
- Use fsx to execute ops on a file and record the ops on the log device
- Every now and then fsync the file, store the md5 of the file and mark
  the location in the log
- Then replay the log onto the device for each mark, mount the fs and
  compare the md5 of the file to the stored value

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Christoph, Dave,

It's hard to believe, but I think the reported bug has been around
since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
not try to test old kernels.

I tripped over this alleged bug a week ago when I started testing
crash consistency with dm-log-writes:
https://marc.info/?l=linux-fsdevel&m=150350333427447&w=2
Since then, I have been trying to prove that dm-log-writes was
falsely accusing, but it turned out that it wasn't.

The reproducer is an xfstest, which I am going to re-post soon
and is also available here:
https://github.com/amir73il/xfstests/commits/dm-log-writes
On a system with a spinning disk it reproduces the issue with close
to 100% probability within less than a minute.

Anyway, with Darrick going on vacation, let's expedite review
of this fix and try to merge some fix for v4.14-rc1 (although this
bug fix may be eligible for later rc's as well).

The change obviously changes behavior for the worse for a workload
of intensive parallel fsyncing tasks, but I don't see another quick
way to fix correctness without hurting this workload.
It might be worth getting feedback from file_write_and_wait_range()
on whether any dirty page writes have been issued at all.

I started running a full xfstests cycle, but wanted to send this one out
early for comments. I can test other patches if needed.

Cheers,
Amir.

 fs/xfs/xfs_log.c | 7 -------
 1 file changed, 7 deletions(-)
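As a footnote on what the diff above leaves in place: the removed assignments are the ones made on pure wait paths, while the assignment that survives sits right after the caller's own xlog_state_release_iclog() call in _xfs_log_force(), i.e. the case Christoph refers to where we submit the flush ourselves. The fragment below is a paraphrase from memory of that remaining site, not a verbatim excerpt of the kernel source; and per Christoph's caveat earlier in the thread, even this is only trustworthy when xlog_state_release_iclog() does not bail out early because another reference still holds the iclog.

```c
/*
 * Paraphrased from memory of _xfs_log_force() after this patch; not a
 * verbatim excerpt.  *log_flushed is still set here because the caller
 * itself switched and released the iclog, which normally submits the
 * log write (and its PREFLUSH) to disk.
 */
		xlog_state_switch_iclogs(log, iclog, 0);
		spin_unlock(&log->l_icloglock);

		if (xlog_state_release_iclog(log, iclog))
			return -EIO;

		if (log_flushed)
			*log_flushed = 1;	/* we submitted the flush ourselves */

		spin_lock(&log->l_icloglock);
```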