Message ID | 20220726092125.3899077-1-amir73il@gmail.com (mailing list archive) |
---|---|
Series | xfs stable candidate patches for 5.10.y (from v5.13+) |
On Tue, Jul 26, 2022 at 11:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Darrick,
>
> This backport series contains mostly fixes from v5.14 release along
> with three deferred patches from the joint 5.10/5.15 series [1].
>
> I ran the auto group 10 times on baseline (v5.10.131) and this series
> with no observed regressions.
>
> I ran the recoveryloop group 100 times with no observed regressions.
> The soak group run is in progress (10+) with no observed regressions
> so far.
>
> I am somewhat disappointed at not seeing any improvement in the
> results of the recoveryloop tests compared to baseline.
>
> This is the summary of the recoveryloop test results on both baseline
> and backport branch:
>
> generic/455, generic/457, generic/646: pass
> generic/019, generic/475, generic/648: failing often in all configs
> generic/388: failing often with reflink_1024
> generic/388: failing at ~1/50 rate for any config
> generic/482: failing often on V4 configs
> generic/482: failing at ~1/100 rate for V5 configs
> xfs/057: failing at ~1/200 rate for any config
>
> I observed no failures in soak group so far, neither on baseline nor
> on backport branch. I will update when I have more results.
>

Some more results after 1.5 days of spinning:
1. soak group reached 100 runs (x5 configs) with no failures
2. Ran all the tests also on debian/testing with xfsprogs 5.18 and
   observed a very similar fail/pass pattern as with xfsprogs 5.10
3. Started to run the 3 passing recoveryloop tests 1000 times and
   an interesting pattern emerged -

   generic/455 failed 3 times on baseline (out of 250 runs x 5 configs),
   but it has not failed on backport branch yet (700 runs x 5 configs).

   And it's not just failures, it's proper data corruptions, e.g.
   "testfile2.mark1 md5sum mismatched" (and not always on mark1)

I will keep this loop spinning, but I am cautiously optimistic about
this being an actual proof of bug fix.

If these results don't change, I would be happy to get an ACK for the
series so I can post it after the long soaking.

Thanks,
Amir.
On Wed, Jul 27, 2022 at 09:17:47PM +0200, Amir Goldstein wrote:
> On Tue, Jul 26, 2022 at 11:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > This is the summary of the recoveryloop test results on both baseline
> > and backport branch:
> >
> > generic/455, generic/457, generic/646: pass
> > generic/019, generic/475, generic/648: failing often in all configs

<nod> I posted a couple of patchsets to fstests@ yesterday that might
help with these recoveryloop tests failing.

https://lore.kernel.org/fstests/165886493457.1585218.32410114728132213.stgit@magnolia/T/#t
https://lore.kernel.org/fstests/165886492580.1585149.760428651537119015.stgit@magnolia/T/#t
https://lore.kernel.org/fstests/165886491119.1585061.14285332087646848837.stgit@magnolia/T/#t

> > generic/388: failing often with reflink_1024
> > generic/388: failing at ~1/50 rate for any config
> > generic/482: failing often on V4 configs
> > generic/482: failing at ~1/100 rate for V5 configs
> > xfs/057: failing at ~1/200 rate for any config
>
> 3. Started to run the 3 passing recoveryloop tests 1000 times and
>    an interesting pattern emerged -
>
>    generic/455 failed 3 times on baseline (out of 250 runs x 5 configs),
>    but it has not failed on backport branch yet (700 runs x 5 configs).
>
>    And it's not just failures, it's proper data corruptions, e.g.
>    "testfile2.mark1 md5sum mismatched" (and not always on mark1)

Oh good!

> I will keep this loop spinning, but I am cautiously optimistic about
> this being an actual proof of bug fix.
>
> If these results don't change, I would be happy to get an ACK for the
> series so I can post it after the long soaking.

Patches 4-9 are an easy
Acked-by: Darrick J. Wong <djwong@kernel.org>

--D

> Thanks,
> Amir.
On Wed, Jul 27, 2022 at 07:01:15PM -0700, Darrick J. Wong wrote:
> On Wed, Jul 27, 2022 at 09:17:47PM +0200, Amir Goldstein wrote:
> > If these results don't change, I would be happy to get an ACK for the
> > series so I can post it after the long soaking.
>
> Patches 4-9 are an easy
> Acked-by: Darrick J. Wong <djwong@kernel.org>

I hit send too fast.

I think patches 1-3 look correct. I still think it's sort of risky,
but your testing shows that things at least get better and don't
immediately explode or anything. :)

By my recollection of the log changes between 5.10 and 5.17 I think the
lsn/cil split didn't change all that much, so if you get to the end of
the week with no further problems then I say Acked-by for them too.

--D
On Thu, Jul 28, 2022 at 4:07 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jul 27, 2022 at 07:01:15PM -0700, Darrick J. Wong wrote:
> > On Wed, Jul 27, 2022 at 09:17:47PM +0200, Amir Goldstein wrote:
> > > 3. Started to run the 3 passing recoveryloop tests 1000 times and
> > >    an interesting pattern emerged -
> > >
> > >    generic/455 failed 3 times on baseline (out of 250 runs x 5 configs),
> > >    but it has not failed on backport branch yet (700 runs x 5 configs).
> > >
> > > I will keep this loop spinning, but I am cautiously optimistic about
> > > this being an actual proof of bug fix.

That was wishful thinking - after 1500 x 5 runs there are 2 failures also
on the backport branch.

It is still better than 4 failures out of 800 x 5 runs on baseline, but I
cannot call that proof of a bug fix - only no regression.

> I think patches 1-3 look correct. I still think it's sort of risky,
> but your testing shows that things at least get better and don't
> immediately explode or anything. :)
>
> By my recollection of the log changes between 5.10 and 5.17 I think the
> lsn/cil split didn't change all that much, so if you get to the end of
> the week with no further problems then I say Acked-by for them too.
>

Great. I'll keep it spinning.

Thanks,
Amir.
On Thu, Jul 28, 2022 at 11:39 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> That was wishful thinking - after 1500 x 5 runs there are 2 failures also
> on the backport branch.
>
> It is still better than 4 failures out of 800 x 5 runs on baseline, but I
> cannot call that proof of a bug fix - only no regression.
>
> On Thu, Jul 28, 2022 at 4:07 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > I think patches 1-3 look correct. I still think it's sort of risky,
> > but your testing shows that things at least get better and don't
> > immediately explode or anything. :)
> >
> > By my recollection of the log changes between 5.10 and 5.17 I think the
> > lsn/cil split didn't change all that much, so if you get to the end of
> > the week with no further problems then I say Acked-by for them too.
> >
>
> Great. I'll keep it spinning.
>

Well, it's the end of my week now and the loop has passed over 4,000 runs
x 5 configs on the backport branch with only 4 failures, so it's going to
stable.

Thanks,
Amir.