diff mbox series

[v2] t0610: work around flaky test with concurrent writers

Message ID f3874c7e923846e453499e89f00dd4f8862e4c40.1728055880.git.ps@pks.im (mailing list archive)
State Accepted
Commit 7355574a226e15a06c07fdcf1c401cf043298e4b
Headers show
Series [v2] t0610: work around flaky test with concurrent writers | expand

Commit Message

Patrick Steinhardt Oct. 4, 2024, 3:32 p.m. UTC
In 6241ce2170 (refs/reftable: reload locked stack when preparing
transaction, 2024-09-24) we have introduced a new test that exercises
how the reftable backend behaves with many concurrent writers all racing
with each other. This test was introduced after a couple of fixes in
this context that should make concurrent writes behave gracefully. As it
turns out though, Windows systems do not yet handle concurrent writes
properly, as we've got two reports for Cygwin and MinGW failing in this
newly added test.

The root cause of this is how we update the "tables.list" file: when
writing a new stack of tables we first write the data into a lockfile
and then rename that file into place. But Windows forbids us from doing
that rename when the target path is open for reading by another process.
And as the test races both readers and writers with each other we are
quite likely to hit this edge case.

This is not a regression: the logic didn't work before the mentioned
commit, and after the commit it performs well on Linux and macOS, and
the situation on Windows should have at least improved a bit. But the
test shows that we need to put more thought into how to make this work
properly there.

Work around the issue by disabling the test on Windows for now. While at
it, increase the locking timeout to address reported timeouts when using
either the address or memory sanitizer, which also tend to significantly
extend the runtime of this test.

This should be revisited after Git v2.47 is out.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 t/t0610-reftable-basics.sh | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)


Interdiff against v1:
  diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
  index 86a746aff0..babec7993e 100755
  --- a/t/t0610-reftable-basics.sh
  +++ b/t/t0610-reftable-basics.sh
  @@ -453,23 +453,18 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
   # This test fails most of the time on Windows systems. The root cause is
   # that Windows does not allow us to rename the "tables.list.lock" file into
   # place when "tables.list" is open for reading by a concurrent process.
  -#
  -# The same issue does not happen on Cygwin because its implementation of
  -# rename(3P) is emulating POSIX-style renames, including renames over files
  -# that are open.
  -test_expect_success !MINGW 'ref transaction: many concurrent writers' '
  +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
   	test_when_finished "rm -rf repo" &&
   	git init repo &&
   	(
   		cd repo &&
   		# Set a high timeout. While a couple of seconds should be
   		# plenty, using the address sanitizer will significantly slow
  -		# us down here. Furthermore, Cygwin is also way slower due to
  -		# the POSIX-style rename emulation. So we are aiming way higher
  -		# than you would ever think is necessary just to keep us from
  -		# flaking. We could also lock indefinitely by passing -1, but
  -		# that could potentially block CI jobs indefinitely if there
  -		# was a bug here.
  +		# us down here. So we are aiming way higher than you would ever
  +		# think is necessary just to keep us from flaking. We could
  +		# also lock indefinitely by passing -1, but that could
  +		# potentially block CI jobs indefinitely if there was a bug
  +		# here.
   		git config set reftable.lockTimeout 300000 &&
   		test_commit --no-tag initial &&
   

base-commit: 111e864d69c84284441b083966c2065c2e9a4e78

Comments

Ramsay Jones Oct. 4, 2024, 4:32 p.m. UTC | #1
Hi Patrick,

On 04/10/2024 16:32, Patrick Steinhardt wrote:
> In 6241ce2170 (refs/reftable: reload locked stack when preparing
> transaction, 2024-09-24) we have introduced a new test that exercises
> how the reftable backend behaves with many concurrent writers all racing
> with each other. This test was introduced after a couple of fixes in
> this context that should make concurrent writes behave gracefully. As it
> turns out though, Windows systems do not yet handle concurrent writes
> properly, as we've got two reports for Cygwin and MinGW failing in this
> newly added test.
> 
> The root cause of this is how we update the "tables.list" file: when
> writing a new stack of tables we first write the data into a lockfile
> and then rename that file into place. But Windows forbids us from doing
> that rename when the target path is open for reading by another process.
> And as the test races both readers and writers with each other we are
> quite likely to hit this edge case.
> 
> This is not a regression: the logic didn't work before the mentioned
> commit, and after the commit it performs well on Linux and macOS, and
> the situation on Windows should have at least improved a bit. But the
> test shows that we need to put more thought into how to make this work
> properly there.
> 
> Work around the issue by disabling the test on Windows for now. While at
> it, increase the locking timeout to address reported timeouts when using
> either the address or memory sanitizer, which also tend to significantly
> extend the runtime of this test.
> 
> This should be revisited after Git v2.47 is out.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  t/t0610-reftable-basics.sh | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> index 2d951c8ceb..babec7993e 100755
> --- a/t/t0610-reftable-basics.sh
> +++ b/t/t0610-reftable-basics.sh
> @@ -450,15 +450,22 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
>  	)
>  '
>  
> -test_expect_success 'ref transaction: many concurrent writers' '
> +# This test fails most of the time on Windows systems. The root cause is
> +# that Windows does not allow us to rename the "tables.list.lock" file into
> +# place when "tables.list" is open for reading by a concurrent process.
> +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
>  	test_when_finished "rm -rf repo" &&
>  	git init repo &&
>  	(
>  		cd repo &&
> -		# Set a high timeout such that a busy CI machine will not abort
> -		# early. 10 seconds should hopefully be ample of time to make
> -		# this non-flaky.
> -		git config set reftable.lockTimeout 10000 &&
> +		# Set a high timeout. While a couple of seconds should be
> +		# plenty, using the address sanitizer will significantly slow
> +		# us down here. So we are aiming way higher than you would ever
> +		# think is necessary just to keep us from flaking. We could
> +		# also lock indefinitely by passing -1, but that could
> +		# potentially block CI jobs indefinitely if there was a bug
> +		# here.
> +		git config set reftable.lockTimeout 300000 &&
>  		test_commit --no-tag initial &&
>  
>  		head=$(git rev-parse HEAD) &&

I did apply this patch and (unsurprising) it now passes just fine! :)

ATB,
Ramsay Jones
Junio C Hamano Oct. 4, 2024, 4:35 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> This should be revisited after Git v2.47 is out.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  t/t0610-reftable-basics.sh | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)

Ah, I spoke before I finished my inbox.  Let's use this version.

Thanks.
Jeff King Oct. 4, 2024, 10:41 p.m. UTC | #3
On Fri, Oct 04, 2024 at 05:32:16PM +0200, Patrick Steinhardt wrote:

> Work around the issue by disabling the test on Windows for now. While at
> it, increase the locking timeout to address reported timeouts when using
> either the address or memory sanitizer, which also tend to significantly
> extend the runtime of this test.
> 
> This should be revisited after Git v2.47 is out.

This sounds OK to me.

In general, I worry about papering over bugs with a "we'll revisit it
later" note. We may forget about things or never prioritize them. And
the worst-case scenario here is that we might later decide to promote
reftable to the default format, forgetting that concurrency has problems
on Windows.

But given that it's an area you've been actively working in, there's
already been some discussion towards the real fix, and we still consider
reftables somewhat experimental, I feel OK bumping it for now.

Thanks for all the digging on this (from you and from other folks).

-Peff
diff mbox series

Patch

diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 2d951c8ceb..babec7993e 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -450,15 +450,22 @@  test_expect_success 'ref transaction: retry acquiring tables.list lock' '
 	)
 '
 
-test_expect_success 'ref transaction: many concurrent writers' '
+# This test fails most of the time on Windows systems. The root cause is
+# that Windows does not allow us to rename the "tables.list.lock" file into
+# place when "tables.list" is open for reading by a concurrent process.
+test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
 	test_when_finished "rm -rf repo" &&
 	git init repo &&
 	(
 		cd repo &&
-		# Set a high timeout such that a busy CI machine will not abort
-		# early. 10 seconds should hopefully be ample of time to make
-		# this non-flaky.
-		git config set reftable.lockTimeout 10000 &&
+		# Set a high timeout. While a couple of seconds should be
+		# plenty, using the address sanitizer will significantly slow
+		# us down here. So we are aiming way higher than you would ever
+		# think is necessary just to keep us from flaking. We could
+		# also lock indefinitely by passing -1, but that could
+		# potentially block CI jobs indefinitely if there was a bug
+		# here.
+		git config set reftable.lockTimeout 300000 &&
 		test_commit --no-tag initial &&
 
 		head=$(git rev-parse HEAD) &&