Message ID | f3874c7e923846e453499e89f00dd4f8862e4c40.1728055880.git.ps@pks.im (mailing list archive) |
---|---|
State | Accepted |
Commit | 7355574a226e15a06c07fdcf1c401cf043298e4b |
Headers | show |
Series | [v2] t0610: work around flaky test with concurrent writers | expand |
Hi Patrick, On 04/10/2024 16:32, Patrick Steinhardt wrote: > In 6241ce2170 (refs/reftable: reload locked stack when preparing > transaction, 2024-09-24) we have introduced a new test that exercises > how the reftable backend behaves with many concurrent writers all racing > with each other. This test was introduced after a couple of fixes in > this context that should make concurrent writes behave gracefully. As it > turns out though, Windows systems do not yet handle concurrent writes > properly, as we've got two reports for Cygwin and MinGW failing in this > newly added test. > > The root cause of this is how we update the "tables.list" file: when > writing a new stack of tables we first write the data into a lockfile > and then rename that file into place. But Windows forbids us from doing > that rename when the target path is open for reading by another process. > And as the test races both readers and writers with each other we are > quite likely to hit this edge case. > > This is not a regression: the logic didn't work before the mentioned > commit, and after the commit it performs well on Linux and macOS, and > the situation on Windows should have at least improved a bit. But the > test shows that we need to put more thought into how to make this work > properly there. > > Work around the issue by disabling the test on Windows for now. While at > it, increase the locking timeout to address reported timeouts when using > either the address or memory sanitizer, which also tend to significantly > extend the runtime of this test. > > This should be revisited after Git v2.47 is out. > > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- > t/t0610-reftable-basics.sh | 17 ++++++++++++----- > 1 file changed, 12 insertions(+), 5 deletions(-) > > diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh > index 2d951c8ceb..babec7993e 100755 > --- a/t/t0610-reftable-basics.sh > +++ b/t/t0610-reftable-basics.sh > @@ -450,15 +450,22 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' > ) > ' > > -test_expect_success 'ref transaction: many concurrent writers' ' > +# This test fails most of the time on Windows systems. The root cause is > +# that Windows does not allow us to rename the "tables.list.lock" file into > +# place when "tables.list" is open for reading by a concurrent process. > +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' > test_when_finished "rm -rf repo" && > git init repo && > ( > cd repo && > - # Set a high timeout such that a busy CI machine will not abort > - # early. 10 seconds should hopefully be ample of time to make > - # this non-flaky. > - git config set reftable.lockTimeout 10000 && > + # Set a high timeout. While a couple of seconds should be > + # plenty, using the address sanitizer will significantly slow > + # us down here. So we are aiming way higher than you would ever > + # think is necessary just to keep us from flaking. We could > + # also lock indefinitely by passing -1, but that could > + # potentially block CI jobs indefinitely if there was a bug > + # here. > + git config set reftable.lockTimeout 300000 && > test_commit --no-tag initial && > > head=$(git rev-parse HEAD) && I did apply this patch and (unsurprising) it now passes just fine! :) ATB, Ramsay Jones
Patrick Steinhardt <ps@pks.im> writes: > This should be revisited after Git v2.47 is out. > > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- > t/t0610-reftable-basics.sh | 17 ++++++++++++----- > 1 file changed, 12 insertions(+), 5 deletions(-) Ah, I spoke before I finished my inbox. Let's use this version. Thanks.
On Fri, Oct 04, 2024 at 05:32:16PM +0200, Patrick Steinhardt wrote: > Work around the issue by disabling the test on Windows for now. While at > it, increase the locking timeout to address reported timeouts when using > either the address or memory sanitizer, which also tend to significantly > extend the runtime of this test. > > This should be revisited after Git v2.47 is out. This sounds OK to me. In general, I worry about papering over bugs with a "we'll revisit it later" note. We may forget about things or never prioritize them. And the worst-case scenario here is that we might later decide to promote reftable to the default format, forgetting that concurrency has problems on Windows. But given that it's an area you've been actively working in, there's already been some discussion towards the real fix, and we still consider reftables somewhat experimental, I feel OK bumping it for now. Thanks for all the digging on this (from you and from other folks). -Peff
diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 2d951c8ceb..babec7993e 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -450,15 +450,22 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' ) ' -test_expect_success 'ref transaction: many concurrent writers' ' +# This test fails most of the time on Windows systems. The root cause is +# that Windows does not allow us to rename the "tables.list.lock" file into +# place when "tables.list" is open for reading by a concurrent process. +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' test_when_finished "rm -rf repo" && git init repo && ( cd repo && - # Set a high timeout such that a busy CI machine will not abort - # early. 10 seconds should hopefully be ample of time to make - # this non-flaky. - git config set reftable.lockTimeout 10000 && + # Set a high timeout. While a couple of seconds should be + # plenty, using the address sanitizer will significantly slow + # us down here. So we are aiming way higher than you would ever + # think is necessary just to keep us from flaking. We could + # also lock indefinitely by passing -1, but that could + # potentially block CI jobs indefinitely if there was a bug + # here. + git config set reftable.lockTimeout 300000 && test_commit --no-tag initial && head=$(git rev-parse HEAD) &&
In 6241ce2170 (refs/reftable: reload locked stack when preparing transaction, 2024-09-24) we have introduced a new test that exercises how the reftable backend behaves with many concurrent writers all racing with each other. This test was introduced after a couple of fixes in this context that should make concurrent writes behave gracefully. As it turns out though, Windows systems do not yet handle concurrent writes properly, as we've got two reports for Cygwin and MinGW failing in this newly added test. The root cause of this is how we update the "tables.list" file: when writing a new stack of tables we first write the data into a lockfile and then rename that file into place. But Windows forbids us from doing that rename when the target path is open for reading by another process. And as the test races both readers and writers with each other we are quite likely to hit this edge case. This is not a regression: the logic didn't work before the mentioned commit, and after the commit it performs well on Linux and macOS, and the situation on Windows should have at least improved a bit. But the test shows that we need to put more thought into how to make this work properly there. Work around the issue by disabling the test on Windows for now. While at it, increase the locking timeout to address reported timeouts when using either the address or memory sanitizer, which also tend to significantly extend the runtime of this test. This should be revisited after Git v2.47 is out. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- t/t0610-reftable-basics.sh | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) Interdiff against v1: diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 86a746aff0..babec7993e 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -453,23 +453,18 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' # This test fails most of the time on Windows systems. The root cause is # that Windows does not allow us to rename the "tables.list.lock" file into # place when "tables.list" is open for reading by a concurrent process. -# -# The same issue does not happen on Cygwin because its implementation of -# rename(3P) is emulating POSIX-style renames, including renames over files -# that are open. -test_expect_success !MINGW 'ref transaction: many concurrent writers' ' +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' test_when_finished "rm -rf repo" && git init repo && ( cd repo && # Set a high timeout. While a couple of seconds should be # plenty, using the address sanitizer will significantly slow - # us down here. Furthermore, Cygwin is also way slower due to - # the POSIX-style rename emulation. So we are aiming way higher - # than you would ever think is necessary just to keep us from - # flaking. We could also lock indefinitely by passing -1, but - # that could potentially block CI jobs indefinitely if there - # was a bug here. + # us down here. So we are aiming way higher than you would ever + # think is necessary just to keep us from flaking. We could + # also lock indefinitely by passing -1, but that could + # potentially block CI jobs indefinitely if there was a bug + # here. git config set reftable.lockTimeout 300000 && test_commit --no-tag initial && base-commit: 111e864d69c84284441b083966c2065c2e9a4e78