
[5.10,CANDIDATE,0/9] xfs stable candidate patches for 5.10.y (from v5.13+)

Message ID 20220726092125.3899077-1-amir73il@gmail.com

Message

Amir Goldstein July 26, 2022, 9:21 a.m. UTC
Darrick,

This backport series contains mostly fixes from v5.14 release along
with three deferred patches from the joint 5.10/5.15 series [1].

I ran the auto group 10 times on baseline (v5.10.131) and this series
with no observed regressions.

I ran the recoveryloop group 100 times with no observed regressions.
The soak group run is in progress (10+) with no observed regressions
so far.
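The repeated group runs above can be sketched as a small shell loop around fstests' ./check. This is an illustrative dry run, not the exact harness used for these results: the RUNNER wrapper defaults to echo (set RUNNER= to really execute), and only the iteration counts come from the cover letter.

```shell
# Dry-run sketch of looping fstests groups; ./check is only echoed by
# default, so nothing is actually executed against a scratch device.
RUNNER="${RUNNER:-echo}"

run_group() {   # usage: run_group <group> <iterations>
    local group=$1 iters=$2 n=0
    while [ "$n" -lt "$iters" ]; do
        $RUNNER ./check -g "$group" >/dev/null || return 1
        n=$((n+1))
    done
    echo "$group: $n runs"
}

run_group auto 10            # the auto group, 10 times
run_group recoveryloop 100   # the recoveryloop group, 100 times
```

In a real run one would drop the dry-run wrapper and capture each iteration's results directory before the next loop overwrites it.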

I am somewhat disappointed not to see any improvement in the
results of the recoveryloop tests compared to baseline.

This is the summary of the recoveryloop test results on both baseline
and backport branch:

generic/455, generic/457, generic/646: pass
generic/019, generic/475, generic/648: failing often in all configs
generic/388: failing often with reflink_1024
generic/388: failing at ~1/50 rate for any config
generic/482: failing often on V4 configs
generic/482: failing at ~1/100 rate for V5 configs
xfs/057: failing at ~1/200 rate for any config

I have observed no failures in the soak group so far, on either the
baseline or the backport branch. I will update when I have more results.

Please let me know if there is anything else that you would like me
to test.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-xfs/20220617100641.1653164-1-amir73il@gmail.com/

Brian Foster (2):
  xfs: hold buffer across unpin and potential shutdown processing
  xfs: remove dead stale buf unpin handling code

Christoph Hellwig (1):
  xfs: refactor xfs_file_fsync

Darrick J. Wong (3):
  xfs: prevent UAF in xfs_log_item_in_current_chkpt
  xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes
  xfs: force the log offline when log intent item recovery fails

Dave Chinner (3):
  xfs: xfs_log_force_lsn isn't passed a LSN
  xfs: logging the on disk inode LSN can make it go backwards
  xfs: Enforce attr3 buffer recovery order

 fs/xfs/libxfs/xfs_log_format.h  | 11 ++++-
 fs/xfs/libxfs/xfs_types.h       |  1 +
 fs/xfs/xfs_buf_item.c           | 60 ++++++++++--------------
 fs/xfs/xfs_buf_item_recover.c   |  1 +
 fs/xfs/xfs_dquot_item.c         |  2 +-
 fs/xfs/xfs_file.c               | 81 ++++++++++++++++++++-------------
 fs/xfs/xfs_inode.c              | 10 ++--
 fs/xfs/xfs_inode_item.c         |  4 +-
 fs/xfs/xfs_inode_item.h         |  2 +-
 fs/xfs/xfs_inode_item_recover.c | 39 ++++++++++++----
 fs/xfs/xfs_log.c                | 30 ++++++------
 fs/xfs/xfs_log.h                |  4 +-
 fs/xfs/xfs_log_cil.c            | 32 +++++--------
 fs/xfs/xfs_log_priv.h           | 15 +++---
 fs/xfs/xfs_log_recover.c        |  5 +-
 fs/xfs/xfs_mount.c              | 10 +++-
 fs/xfs/xfs_trans.c              |  6 +--
 fs/xfs/xfs_trans.h              |  4 +-
 18 files changed, 179 insertions(+), 138 deletions(-)

Comments

Amir Goldstein July 27, 2022, 7:17 p.m. UTC | #1
On Tue, Jul 26, 2022 at 11:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Darrick,
>
> This backport series contains mostly fixes from v5.14 release along
> with three deferred patches from the joint 5.10/5.15 series [1].
>
> I ran the auto group 10 times on baseline (v5.10.131) and this series
> with no observed regressions.
>
> I ran the recoveryloop group 100 times with no observed regressions.
> The soak group run is in progress (10+) with no observed regressions
> so far.
>
> I am somewhat disappointed not to see any improvement in the
> results of the recoveryloop tests compared to baseline.
>
> This is the summary of the recoveryloop test results on both baseline
> and backport branch:
>
> generic/455, generic/457, generic/646: pass
> generic/019, generic/475, generic/648: failing often in all configs
> generic/388: failing often with reflink_1024
> generic/388: failing at ~1/50 rate for any config
> generic/482: failing often on V4 configs
> generic/482: failing at ~1/100 rate for V5 configs
> xfs/057: failing at ~1/200 rate for any config
>
> I have observed no failures in the soak group so far, on either the
> baseline or the backport branch. I will update when I have more results.
>

Some more results after 1.5 days of spinning:
1. soak group reached 100 runs (x5 configs) with no failures
2. Ran all the tests also on debian/testing with xfsprogs 5.18 and
    observed a very similar fail/pass pattern to that seen with xfsprogs 5.10
3. Started to run the 3 passing recoveryloop tests 1000 times and
    an interesting pattern emerged -

generic/455 failed 3 times on baseline (out of 250 runs x 5 configs),
but it has not failed on the backport branch yet (700 runs x 5 configs).
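The per-test iteration in (3) can be driven the same way. A hedged sketch follows; the config-section names are placeholders (the five configs are not spelled out in this thread), ITERS defaults low for illustration, and RUNNER keeps it a dry run:

```shell
# Iterate one recoveryloop test across config sections; section names are
# hypothetical, and ./check is only echoed (set RUNNER= to really execute).
RUNNER="${RUNNER:-echo}"
ITERS="${ITERS:-3}"   # the runs above used up to 1000 iterations

total=0
for section in cfg1 cfg2 cfg3 cfg4 cfg5; do   # placeholder section names
    for i in $(seq 1 "$ITERS"); do
        $RUNNER ./check -s "$section" generic/455 >/dev/null || break
        total=$((total + 1))
    done
done
echo "completed $total runs"
```

Breaking out of the inner loop on the first failure, as above, is what lets a rare ~1/250 failure surface without burying its logs under later iterations.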

And it's not just failures, it's proper data corruptions, e.g.
"testfile2.mark1 md5sum mismatched" (and not always on mark1)

I will keep this loop spinning, but I am cautiously optimistic about
this being actual proof of a bug fix.

If these results don't change, I would be happy to get an ACK for the
series so I can post it after the long soaking.

Thanks,
Amir.
Darrick J. Wong July 28, 2022, 2:01 a.m. UTC | #2
On Wed, Jul 27, 2022 at 09:17:47PM +0200, Amir Goldstein wrote:
> On Tue, Jul 26, 2022 at 11:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Darrick,
> >
> > This backport series contains mostly fixes from v5.14 release along
> > with three deferred patches from the joint 5.10/5.15 series [1].
> >
> > I ran the auto group 10 times on baseline (v5.10.131) and this series
> > with no observed regressions.
> >
> > I ran the recoveryloop group 100 times with no observed regressions.
> > The soak group run is in progress (10+) with no observed regressions
> > so far.
> >
> > I am somewhat disappointed not to see any improvement in the
> > results of the recoveryloop tests compared to baseline.
> >
> > This is the summary of the recoveryloop test results on both baseline
> > and backport branch:
> >
> > generic/455, generic/457, generic/646: pass
> > generic/019, generic/475, generic/648: failing often in all configs

<nod> I posted a couple of patchsets to fstests@ yesterday that might
help with these recoveryloop tests failing.

https://lore.kernel.org/fstests/165886493457.1585218.32410114728132213.stgit@magnolia/T/#t
https://lore.kernel.org/fstests/165886492580.1585149.760428651537119015.stgit@magnolia/T/#t
https://lore.kernel.org/fstests/165886491119.1585061.14285332087646848837.stgit@magnolia/T/#t

> > generic/388: failing often with reflink_1024
> > generic/388: failing at ~1/50 rate for any config
> > generic/482: failing often on V4 configs
> > generic/482: failing at ~1/100 rate for V5 configs
> > xfs/057: failing at ~1/200 rate for any config
> >
> > I have observed no failures in the soak group so far, on either the
> > baseline or the backport branch. I will update when I have more results.
> >
> 
> Some more results after 1.5 days of spinning:
> 1. soak group reached 100 runs (x5 configs) with no failures
> 2. Ran all the tests also on debian/testing with xfsprogs 5.18 and
>     observed a very similar fail/pass pattern as with xfsprogs 5.10
> 3. Started to run the 3 passing recoveryloop tests 1000 times and
>     an interesting pattern emerged -
> 
> generic/455 failed 3 times on baseline (out of 250 runs x 5 configs),
> but it has not failed on the backport branch yet (700 runs x 5 configs).
> 
> And it's not just failures, it's proper data corruptions, e.g.
> "testfile2.mark1 md5sum mismatched" (and not always on mark1)

Oh good!


> 
> I will keep this loop spinning, but I am cautiously optimistic about
> this being an actual proof of bug fix.
> 
> If these results don't change, I would be happy to get an ACK for the
> series so I can post it after the long soaking.

Patches 4-9 are an easy
Acked-by: Darrick J. Wong <djwong@kernel.org>



--D

> Thanks,
> Amir.
Darrick J. Wong July 28, 2022, 2:07 a.m. UTC | #3
On Wed, Jul 27, 2022 at 07:01:15PM -0700, Darrick J. Wong wrote:
> On Wed, Jul 27, 2022 at 09:17:47PM +0200, Amir Goldstein wrote:
> > On Tue, Jul 26, 2022 at 11:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > Darrick,
> > >
> > > This backport series contains mostly fixes from v5.14 release along
> > > with three deferred patches from the joint 5.10/5.15 series [1].
> > >
> > > I ran the auto group 10 times on baseline (v5.10.131) and this series
> > > with no observed regressions.
> > >
> > > I ran the recoveryloop group 100 times with no observed regressions.
> > > The soak group run is in progress (10+) with no observed regressions
> > > so far.
> > >
> > > I am somewhat disappointed not to see any improvement in the
> > > results of the recoveryloop tests compared to baseline.
> > >
> > > This is the summary of the recoveryloop test results on both baseline
> > > and backport branch:
> > >
> > > generic/455, generic/457, generic/646: pass
> > > generic/019, generic/475, generic/648: failing often in all configs
> 
> <nod> I posted a couple of patchsets to fstests@ yesterday that might
> help with these recoveryloop tests failing.
> 
> https://lore.kernel.org/fstests/165886493457.1585218.32410114728132213.stgit@magnolia/T/#t
> https://lore.kernel.org/fstests/165886492580.1585149.760428651537119015.stgit@magnolia/T/#t
> https://lore.kernel.org/fstests/165886491119.1585061.14285332087646848837.stgit@magnolia/T/#t
> 
> > > generic/388: failing often with reflink_1024
> > > generic/388: failing at ~1/50 rate for any config
> > > generic/482: failing often on V4 configs
> > > generic/482: failing at ~1/100 rate for V5 configs
> > > xfs/057: failing at ~1/200 rate for any config
> > >
> > > I have observed no failures in the soak group so far, on either the
> > > baseline or the backport branch. I will update when I have more results.
> > >
> > 
> > Some more results after 1.5 days of spinning:
> > 1. soak group reached 100 runs (x5 configs) with no failures
> > 2. Ran all the tests also on debian/testing with xfsprogs 5.18 and
> >     observed a very similar fail/pass pattern as with xfsprogs 5.10
> > 3. Started to run the 3 passing recoveryloop tests 1000 times and
> >     an interesting pattern emerged -
> > 
> > generic/455 failed 3 times on baseline (out of 250 runs x 5 configs),
> > but it has not failed on the backport branch yet (700 runs x 5 configs).
> > 
> > And it's not just failures, it's proper data corruptions, e.g.
> > "testfile2.mark1 md5sum mismatched" (and not always on mark1)
> 
> Oh good!
> 
> 
> > 
> > I will keep this loop spinning, but I am cautiously optimistic about
> > this being an actual proof of bug fix.
> > 
> > If these results don't change, I would be happy to get an ACK for the
> > series so I can post it after the long soaking.
> 
> Patches 4-9 are an easy
> Acked-by: Darrick J. Wong <djwong@kernel.org>

I hit send too fast.

I think patches 1-3 look correct.  I still think it's sort of risky,
but your testing shows that things at least get better and don't
immediately explode or anything. :)

By my recollection of the log changes between 5.10 and 5.17 I think the
lsn/cil split didn't change all that much, so if you get to the end of
the week with no further problems then I say Acked-by for them too.

--D

> 
> 
> --D
> 
> > Thanks,
> > Amir.
Amir Goldstein July 28, 2022, 9:39 a.m. UTC | #4
On Thu, Jul 28, 2022 at 4:07 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jul 27, 2022 at 07:01:15PM -0700, Darrick J. Wong wrote:
> > On Wed, Jul 27, 2022 at 09:17:47PM +0200, Amir Goldstein wrote:
> > > On Tue, Jul 26, 2022 at 11:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > Darrick,
> > > >
> > > > This backport series contains mostly fixes from v5.14 release along
> > > > with three deferred patches from the joint 5.10/5.15 series [1].
> > > >
> > > > I ran the auto group 10 times on baseline (v5.10.131) and this series
> > > > with no observed regressions.
> > > >
> > > > I ran the recoveryloop group 100 times with no observed regressions.
> > > > The soak group run is in progress (10+) with no observed regressions
> > > > so far.
> > > >
> > > > I am somewhat disappointed not to see any improvement in the
> > > > results of the recoveryloop tests compared to baseline.
> > > >
> > > > This is the summary of the recoveryloop test results on both baseline
> > > > and backport branch:
> > > >
> > > > generic/455, generic/457, generic/646: pass
> > > > generic/019, generic/475, generic/648: failing often in all configs
> >
> > <nod> I posted a couple of patchsets to fstests@ yesterday that might
> > help with these recoveryloop tests failing.
> >
> > https://lore.kernel.org/fstests/165886493457.1585218.32410114728132213.stgit@magnolia/T/#t
> > https://lore.kernel.org/fstests/165886492580.1585149.760428651537119015.stgit@magnolia/T/#t
> > https://lore.kernel.org/fstests/165886491119.1585061.14285332087646848837.stgit@magnolia/T/#t
> >
> > > > generic/388: failing often with reflink_1024
> > > > generic/388: failing at ~1/50 rate for any config
> > > > generic/482: failing often on V4 configs
> > > > generic/482: failing at ~1/100 rate for V5 configs
> > > > xfs/057: failing at ~1/200 rate for any config
> > > >
> > > > I have observed no failures in the soak group so far, on either the
> > > > baseline or the backport branch. I will update when I have more results.
> > > >
> > >
> > > Some more results after 1.5 days of spinning:
> > > 1. soak group reached 100 runs (x5 configs) with no failures
> > > 2. Ran all the tests also on debian/testing with xfsprogs 5.18 and
> > >     observed a very similar fail/pass pattern as with xfsprogs 5.10
> > > 3. Started to run the 3 passing recoveryloop tests 1000 times and
> > >     an interesting pattern emerged -
> > >
> > > generic/455 failed 3 times on baseline (out of 250 runs x 5 configs),
> > > but it has not failed on the backport branch yet (700 runs x 5 configs).
> > >
> > > And it's not just failures, it's proper data corruptions, e.g.
> > > "testfile2.mark1 md5sum mismatched" (and not always on mark1)
> >
> > Oh good!
> >
> >
> > >
> > > I will keep this loop spinning, but I am cautiously optimistic about
> > > this being an actual proof of bug fix.

That was wishful thinking - after 1500 x 5 runs there are 2 failures on
the backport branch as well.

It is still better than 4 failures out of 800 x 5 runs on baseline, but I
cannot call that proof of a bug fix - only of no regression.
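As a quick sanity check of those counts, the implied per-run failure rates (pure arithmetic, nothing fstests-specific) can be computed directly:

```shell
# Per-run failure rates implied by the counts above:
# baseline: 4 failures in 800 x 5 runs; backport: 2 failures in 1500 x 5 runs.
awk 'BEGIN {
    printf "baseline %.5f\n", 4 / (800 * 5)
    printf "backport %.5f\n", 2 / (1500 * 5)
}'
```

The two rates are roughly a factor of four apart, which matches the "better, but not proof of a fix" reading: with failures this rare, many thousands more runs would be needed before the gap becomes statistically convincing.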

> > >
> > > If these results don't change, I would be happy to get an ACK for the
> > > series so I can post it after the long soaking.
> >
> > Patches 4-9 are an easy
> > Acked-by: Darrick J. Wong <djwong@kernel.org>
>
> I hit send too fast.
>
> I think patches 1-3 look correct.  I still think it's sort of risky,
> but your testing shows that things at least get better and don't
> immediately explode or anything. :)
>
> By my recollection of the log changes between 5.10 and 5.17 I think the
> lsn/cil split didn't change all that much, so if you get to the end of
> the week with no further problems then I say Acked-by for them too.
>

Great. I'll keep it spinning.

Thanks,
Amir.
Amir Goldstein July 29, 2022, 4:15 p.m. UTC | #5
On Thu, Jul 28, 2022 at 11:39 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Jul 28, 2022 at 4:07 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Jul 27, 2022 at 07:01:15PM -0700, Darrick J. Wong wrote:
> > > On Wed, Jul 27, 2022 at 09:17:47PM +0200, Amir Goldstein wrote:
> > > > On Tue, Jul 26, 2022 at 11:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > >
> > > > > Darrick,
> > > > >
> > > > > This backport series contains mostly fixes from v5.14 release along
> > > > > with three deferred patches from the joint 5.10/5.15 series [1].
> > > > >
> > > > > I ran the auto group 10 times on baseline (v5.10.131) and this series
> > > > > with no observed regressions.
> > > > >
> > > > > I ran the recoveryloop group 100 times with no observed regressions.
> > > > > The soak group run is in progress (10+) with no observed regressions
> > > > > so far.
> > > > >
> > > > > I am somewhat disappointed not to see any improvement in the
> > > > > results of the recoveryloop tests compared to baseline.
> > > > >
> > > > > This is the summary of the recoveryloop test results on both baseline
> > > > > and backport branch:
> > > > >
> > > > > generic/455, generic/457, generic/646: pass
> > > > > generic/019, generic/475, generic/648: failing often in all configs
> > >
> > > <nod> I posted a couple of patchsets to fstests@ yesterday that might
> > > help with these recoveryloop tests failing.
> > >
> > > https://lore.kernel.org/fstests/165886493457.1585218.32410114728132213.stgit@magnolia/T/#t
> > > https://lore.kernel.org/fstests/165886492580.1585149.760428651537119015.stgit@magnolia/T/#t
> > > https://lore.kernel.org/fstests/165886491119.1585061.14285332087646848837.stgit@magnolia/T/#t
> > >
> > > > > generic/388: failing often with reflink_1024
> > > > > generic/388: failing at ~1/50 rate for any config
> > > > > generic/482: failing often on V4 configs
> > > > > generic/482: failing at ~1/100 rate for V5 configs
> > > > > xfs/057: failing at ~1/200 rate for any config
> > > > >
> > > > > I have observed no failures in the soak group so far, on either the
> > > > > baseline or the backport branch. I will update when I have more results.
> > > > >
> > > >
> > > > Some more results after 1.5 days of spinning:
> > > > 1. soak group reached 100 runs (x5 configs) with no failures
> > > > 2. Ran all the tests also on debian/testing with xfsprogs 5.18 and
> > > >     observed a very similar fail/pass pattern as with xfsprogs 5.10
> > > > 3. Started to run the 3 passing recoveryloop tests 1000 times and
> > > >     an interesting pattern emerged -
> > > >
> > > > generic/455 failed 3 times on baseline (out of 250 runs x 5 configs),
> > > > but it has not failed on the backport branch yet (700 runs x 5 configs).
> > > >
> > > > And it's not just failures, it's proper data corruptions, e.g.
> > > > "testfile2.mark1 md5sum mismatched" (and not always on mark1)
> > >
> > > Oh good!
> > >
> > >
> > > >
> > > > I will keep this loop spinning, but I am cautiously optimistic about
> > > > this being an actual proof of bug fix.
>
> That was wishful thinking - after 1500 x 5 runs there are 2 failures on
> the backport branch as well.
>
> It is still better than 4 failures out of 800 x 5 runs on baseline, but I
> cannot call that proof of a bug fix - only of no regression.
>
> > > >
> > > > If these results don't change, I would be happy to get an ACK for the
> > > > series so I can post it after the long soaking.
> > >
> > > Patches 4-9 are an easy
> > > Acked-by: Darrick J. Wong <djwong@kernel.org>
> >
> > I hit send too fast.
> >
> > I think patches 1-3 look correct.  I still think it's sort of risky,
> > but your testing shows that things at least get better and don't
> > immediately explode or anything. :)
> >
> > By my recollection of the log changes between 5.10 and 5.17 I think the
> > lsn/cil split didn't change all that much, so if you get to the end of
> > the week with no further problems then I say Acked-by for them too.
> >
>
> Great. I'll keep it spinning.
>

Well, it's the end of my week now and the loop has passed over 4,000
runs x 5 configs on the backport branch with only 4 failures, so the
series is going to stable.

Thanks,
Amir.