
[v4,3/3] refs/reftable: reload locked stack when preparing transaction

Message ID 9ce2d18dff2a655365b609dd86ea484a489c717a.1727155858.git.ps@pks.im (mailing list archive)
State Accepted
Commit 6241ce217046cc77d306bda06ad80ac1973b4932
Series reftable: graceful concurrent writes

Commit Message

Patrick Steinhardt Sept. 24, 2024, 5:33 a.m. UTC
When starting a reftable transaction, we lock all stacks we are about to
modify. While it may happen that the stack is out-of-date at this point
in time, we don't really care: transactional updates encode the expected
state of a certain reference, so all that we really want to verify is
that the _current_ value matches that expected state.

Pass `REFTABLE_STACK_NEW_ADDITION_RELOAD` when locking the stack such
that an out-of-date stack will be reloaded after having been locked.
This change is safe because all verifications of the expected state
happen after this step anyway.
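
To illustrate, the resulting call in prepare_transaction_update() looks
roughly as follows. This is only a sketch mirroring the change to
refs/reftable-backend.c below, with the surrounding error handling
abbreviated:

  /*
   * Lock the stack and, in case it is out-of-date, reload it while
   * holding the lock. Each queued update's expected old value is
   * verified later against this freshly reloaded state, so racing
   * with a concurrent writer is harmless.
   */
  ret = reftable_stack_new_addition(&addition, stack,
                                    REFTABLE_STACK_NEW_ADDITION_RELOAD);
  if (ret) {
          if (ret == REFTABLE_LOCK_ERROR)
                  strbuf_addstr(err, "cannot lock references");
          return ret;
  }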

Add a testcase that verifies that many writers are now able to write to
the stack concurrently without failures and with a deterministic end
result.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 refs/reftable-backend.c    |  3 ++-
 t/t0610-reftable-basics.sh | 31 +++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 1 deletion(-)

Comments

Jeff King Sept. 27, 2024, 4:07 a.m. UTC | #1
On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote:

> +test_expect_success 'ref transaction: many concurrent writers' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +		# Set a high timeout such that a busy CI machine will not abort
> +		# early. 10 seconds should hopefully be ample time to make
> +		# this non-flaky.
> +		git config set reftable.lockTimeout 10000 &&

I saw this test racily fail in the Windows CI build. The failure is as
you might imagine, a few of the background update-ref invocations
failed:

  fatal: update_ref failed for ref 'refs/heads/branch-21': reftable: transaction failure: I/O error

but of course we don't notice because they're backgrounded. And then the
expected output is missing the branch-21 entry (and in my case,
branch-64 suffered a similar fate).

At first I thought we probably needed to bump the timeout (and EIO was
just our way of passing that up the stack). But looking at the
timestamps in the Actions log, the whole loop took less than 10ms to
run.

So could this be indicative of a real contention issue specific to
Windows? I'm wondering if something like the old "you can't delete a
file somebody else has open" restriction is biting us somehow.

-Peff
Patrick Steinhardt Sept. 30, 2024, 6:49 a.m. UTC | #2
On Fri, Sep 27, 2024 at 12:07:52AM -0400, Jeff King wrote:
> On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote:
> 
> > +test_expect_success 'ref transaction: many concurrent writers' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +		# Set a high timeout such that a busy CI machine will not abort
> > +		# early. 10 seconds should hopefully be ample time to make
> > +		# this non-flaky.
> > +		git config set reftable.lockTimeout 10000 &&
> 
> I saw this test racily fail in the Windows CI build. The failure is as
> you might imagine, a few of the background update-ref invocations
> failed:
> 
>   fatal: update_ref failed for ref 'refs/heads/branch-21': reftable: transaction failure: I/O error
> 
> but of course we don't notice because they're backgrounded. And then the
> expected output is missing the branch-21 entry (and in my case,
> branch-64 suffered a similar fate).
> 
> At first I thought we probably needed to bump the timeout (and EIO was
> just our way of passing that up the stack). But looking at the
> timestamps in the Actions log, the whole loop took less than 10ms to
> run.
> 
> So could this be indicative of a real contention issue specific to
> Windows? I'm wondering if something like the old "you can't delete a
> file somebody else has open" restriction is biting us somehow.

Thanks for letting me know!

I very much expect that this is the scenario you mention, where we try
to delete a file that is still held open by another process. We try to
be mindful of this restriction in the reftable library by not failing
when a call to unlink(3P) fails for any of the tables, and we do have
logic in place to unlink them at a later point in time, once they are
no longer referenced by the "tables.list" file. So none of the calls to
unlink are error-checked at all.
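
As a rough sketch of that "best-effort unlink" pattern (note that the
helper below is hypothetical and only illustrates the idea, it is not
the actual reftable code):

  #include <unistd.h>

  /*
   * Best-effort removal of a stale table: the return value of
   * unlink(3P) is deliberately ignored, because a table that another
   * process still holds open cannot be deleted on Windows. A future
   * writer removes the file once "tables.list" no longer references it.
   */
  static void try_remove_stale_table(const char *table_path)
  {
          (void)unlink(table_path);
  }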

But there's one file that we _have_ to rewrite, and that is of course
the "tables.list" file itself. We never unlink that file, though, but
only rename the new instance into place. I think I recently discovered
that we have some problems here, because Windows seemed to allow this
in some scenarios but not in others.

In any case, I already set up a Windows VM last week, so I'll
investigate the issue later this week.

Patrick
Josh Steadmon Sept. 30, 2024, 10:19 p.m. UTC | #3
On 2024.09.27 00:07, Jeff King wrote:
> On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote:
> 
> > +test_expect_success 'ref transaction: many concurrent writers' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +		# Set a high timeout such that a busy CI machine will not abort
> > +		# early. 10 seconds should hopefully be ample time to make
> > +		# this non-flaky.
> > +		git config set reftable.lockTimeout 10000 &&
> 
> I saw this test racily fail in the Windows CI build. The failure is as
> you might imagine, a few of the background update-ref invocations
> failed:
> 
>   fatal: update_ref failed for ref 'refs/heads/branch-21': reftable: transaction failure: I/O error
> 
> but of course we don't notice because they're backgrounded. And then the
> expected output is missing the branch-21 entry (and in my case,
> branch-64 suffered a similar fate).
> 
> At first I thought we probably needed to bump the timeout (and EIO was
> just our way of passing that up the stack). But looking at the
> timestamps in the Actions log, the whole loop took less than 10ms to
> run.
> 
> So could this be indicative of a real contention issue specific to
> Windows? I'm wondering if something like the old "you can't delete a
> file somebody else has open" restriction is biting us somehow.
> 
> -Peff

We're seeing repeated failures from this test case with ASan enabled.
Unfortunately, we've only been able to reproduce this on our
$DAYJOB-specific build system. I haven't been able to get it to fail
using just the upstream Makefile so far. I'll keep trying to find a way
to reproduce this.

FWIW, we're not getting I/O errors, we see the following:
fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references

We tried increasing the timeout in the test to 2 minutes (up from 10s),
but it didn't fix the failures.
Patrick Steinhardt Oct. 1, 2024, 4:27 a.m. UTC | #4
On Mon, Sep 30, 2024 at 03:19:04PM -0700, Josh Steadmon wrote:
> On 2024.09.27 00:07, Jeff King wrote:
> > On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote:
> > 
> > > +test_expect_success 'ref transaction: many concurrent writers' '
> > > +	test_when_finished "rm -rf repo" &&
> > > +	git init repo &&
> > > +	(
> > > +		cd repo &&
> > > +		# Set a high timeout such that a busy CI machine will not abort
> > > +		# early. 10 seconds should hopefully be ample time to make
> > > +		# this non-flaky.
> > > +		git config set reftable.lockTimeout 10000 &&
> > 
> > I saw this test racily fail in the Windows CI build. The failure is as
> > you might imagine, a few of the background update-ref invocations
> > failed:
> > 
> >   fatal: update_ref failed for ref 'refs/heads/branch-21': reftable: transaction failure: I/O error
> > 
> > but of course we don't notice because they're backgrounded. And then the
> > expected output is missing the branch-21 entry (and in my case,
> > branch-64 suffered a similar fate).
> > 
> > At first I thought we probably needed to bump the timeout (and EIO was
> > just our way of passing that up the stack). But looking at the
> > timestamps in the Actions log, the whole loop took less than 10ms to
> > run.
> > 
> > So could this be indicative of a real contention issue specific to
> > Windows? I'm wondering if something like the old "you can't delete a
> > file somebody else has open" restriction is biting us somehow.
> > 
> > -Peff
> 
> We're seeing repeated failures from this test case with ASan enabled.
> Unfortunately, we've only been able to reproduce this on our
> $DAYJOB-specific build system. I haven't been able to get it to fail
> using just the upstream Makefile so far. I'll keep trying to find a way
> to reproduce this.
> 
> FWIW, we're not getting I/O errors, we see the following:
> fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references
> 
> We tried increasing the timeout in the test to 2 minutes (up from 10s),
> but it didn't fix the failures.

If this is causing problems for folks I'd say we can do the below change
for now. It's of course only a stop-gap solution until I find the time
to debug this, which should be later this week or early next week.

Patrick

diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 2d951c8ceb..ad7bb39b79 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -450,7 +450,7 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
 	)
 '
 
-test_expect_success 'ref transaction: many concurrent writers' '
+test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
 	test_when_finished "rm -rf repo" &&
 	git init repo &&
 	(
Junio C Hamano Oct. 1, 2024, 7:34 a.m. UTC | #5
Josh Steadmon <steadmon@google.com> writes:

> On 2024.09.27 00:07, Jeff King wrote:
>> On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote:
>> 
>> > +test_expect_success 'ref transaction: many concurrent writers' '
>> > +	test_when_finished "rm -rf repo" &&
>> > +	git init repo &&
>> > +	(
>> > +		cd repo &&
>> > +		# Set a high timeout such that a busy CI machine will not abort
>> > +		# early. 10 seconds should hopefully be ample time to make
>> > +		# this non-flaky.
>> > +		git config set reftable.lockTimeout 10000 &&
>> ...
>
> We're seeing repeated failures from this test case with ASan enabled.
> Unfortunately, we've only been able to reproduce this on our
> $DAYJOB-specific build system. I haven't been able to get it to fail
> using just the upstream Makefile so far. I'll keep trying to find a way
> to reproduce this.
>
> FWIW, we're not getting I/O errors, we see the following:
> fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references
>
> We tried increasing the timeout in the test to 2 minutes (up from 10s),
> but it didn't fix the failures.

Thanks for a report, and please keep digging ;-).

Is your build, like Peff's, for Windows, or your variant of Linux?

Thanks.
Josh Steadmon Oct. 1, 2024, 6:53 p.m. UTC | #6
On 2024.10.01 00:34, Junio C Hamano wrote:
> Josh Steadmon <steadmon@google.com> writes:
> 
> > On 2024.09.27 00:07, Jeff King wrote:
> >> On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote:
> >> 
> >> > +test_expect_success 'ref transaction: many concurrent writers' '
> >> > +	test_when_finished "rm -rf repo" &&
> >> > +	git init repo &&
> >> > +	(
> >> > +		cd repo &&
> >> > +		# Set a high timeout such that a busy CI machine will not abort
> >> > +		# early. 10 seconds should hopefully be ample time to make
> >> > +		# this non-flaky.
> >> > +		git config set reftable.lockTimeout 10000 &&
> >> ...
> >
> > We're seeing repeated failures from this test case with ASan enabled.
> > Unfortunately, we've only been able to reproduce this on our
> > $DAYJOB-specific build system. I haven't been able to get it to fail
> > using just the upstream Makefile so far. I'll keep trying to find a way
> > to reproduce this.
> >
> > FWIW, we're not getting I/O errors, we see the following:
> > fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references
> >
> > We tried increasing the timeout in the test to 2 minutes (up from 10s),
> > but it didn't fix the failures.
> 
> Thanks for a report, and please keep digging ;-).
> 
> Is your build, like Peff's, for Windows, or your variant of Linux?
> 
> Thanks.

It's our internal Debian-based variant of Linux. Sorry for not
specifying earlier.
Jeff King Oct. 1, 2024, 7:08 p.m. UTC | #7
On Tue, Oct 01, 2024 at 11:53:54AM -0700, Josh Steadmon wrote:

> > > We're seeing repeated failures from this test case with ASan enabled.
> > > Unfortunately, we've only been able to reproduce this on our
> > > $DAYJOB-specific build system. I haven't been able to get it to fail
> > > using just the upstream Makefile so far. I'll keep trying to find a way
> > > to reproduce this.
> > >
> > > FWIW, we're not getting I/O errors, we see the following:
> > > fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references
> > >
> > > We tried increasing the timeout in the test to 2 minutes (up from 10s),
> > > but it didn't fix the failures.
> > 
> > Thanks for a report, and please keep digging ;-).
> > 
> > Is your build, like Peff's, for Windows, or your variant of Linux?
> > 
> > Thanks.
> 
> It's our internal Debian-based variant of Linux. Sorry for not
> specifying earlier.

I just tried doing (on 'next', since the new test is added there):

  make SANITIZE=address,undefined CFLAGS=-O0
  cd t
  ./t0610-reftable-basics.sh --run=1-47 --stress

and got lots of failures like you describe. But in my case bumping the
timeout from 10s to 100s made that go away. So I think it was just the
sanitizers making things _really_ slow and causing legit timeouts.

Have you tried going beyond 2 minutes? That seems like a crazy amount of
time, but I guess it's possible on a really overloaded system. Of course
it's also quite possible that there are multiple issues and you are
seeing something else entirely. :)

-Peff
Jeff King Oct. 1, 2024, 10:54 p.m. UTC | #8
On Tue, Oct 01, 2024 at 06:27:33AM +0200, Patrick Steinhardt wrote:

> If this is causing problems for folks I'd say we can do the below change
> for now. It's of course only a stop-gap solution until I find the time
> to debug this, which should be later this week or early next week.
> 
> Patrick
> 
> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> index 2d951c8ceb..ad7bb39b79 100755
> --- a/t/t0610-reftable-basics.sh
> +++ b/t/t0610-reftable-basics.sh
> @@ -450,7 +450,7 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
>  	)
>  '
>  
> -test_expect_success 'ref transaction: many concurrent writers' '
> +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
>  	test_when_finished "rm -rf repo" &&
>  	git init repo &&
>  	(

IMHO we can live with a flaky test for a little while. It's not like
it's the only one. ;) And hopefully your digging turns up a real
solution.

It also sounds from subsequent discussion that Josh's issue was on
Linux, so it wouldn't help there.

-Peff
Junio C Hamano Oct. 1, 2024, 11:24 p.m. UTC | #9
Jeff King <peff@peff.net> writes:

> On Tue, Oct 01, 2024 at 06:27:33AM +0200, Patrick Steinhardt wrote:
>
>> If this is causing problems for folks I'd say we can do the below change
>> for now. It's of course only a stop-gap solution until I find the time
>> to debug this, which should be later this week or early next week.
>> 
>> Patrick
>> 
>> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
>> index 2d951c8ceb..ad7bb39b79 100755
>> --- a/t/t0610-reftable-basics.sh
>> +++ b/t/t0610-reftable-basics.sh
>> @@ -450,7 +450,7 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
>>  	)
>>  '
>>  
>> -test_expect_success 'ref transaction: many concurrent writers' '
>> +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
>>  	test_when_finished "rm -rf repo" &&
>>  	git init repo &&
>>  	(
>
> IMHO we can live with a flaky test for a little while. It's not like
> it's the only one. ;) And hopefully your digging turns up a real
> solution.
>
> It also sounds from subsequent discussion that Josh's issue was on
> Linux, so it wouldn't help there.

That's true.  WINDOWS prereq would not help there, even though it
would hide the breakage from CI.
Patrick Steinhardt Oct. 2, 2024, 10:58 a.m. UTC | #10
On Tue, Oct 01, 2024 at 04:24:50PM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > On Tue, Oct 01, 2024 at 06:27:33AM +0200, Patrick Steinhardt wrote:
> >
> >> If this is causing problems for folks I'd say we can do the below change
> >> for now. It's of course only a stop-gap solution until I find the time
> >> to debug this, which should be later this week or early next week.
> >> 
> >> Patrick
> >> 
> >> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> >> index 2d951c8ceb..ad7bb39b79 100755
> >> --- a/t/t0610-reftable-basics.sh
> >> +++ b/t/t0610-reftable-basics.sh
> >> @@ -450,7 +450,7 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
> >>  	)
> >>  '
> >>  
> >> -test_expect_success 'ref transaction: many concurrent writers' '
> >> +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' '
> >>  	test_when_finished "rm -rf repo" &&
> >>  	git init repo &&
> >>  	(
> >
> > IMHO we can live with a flaky test for a little while. It's not like
> > it's the only one. ;) And hopefully your digging turns up a real
> > solution.
> >
> > It also sounds from subsequent discussion that Josh's issue was on
> > Linux, so it wouldn't help there.
> 
> That's true.  WINDOWS prereq would not help there, even though it
> would hide the breakage from CI.

Okay, fair enough, thanks. I should finally get to this issue either on
Friday or early next week.

Patrick

Patch

diff --git a/refs/reftable-backend.c b/refs/reftable-backend.c
index 6ca00627dd7..efe2b0258ca 100644
--- a/refs/reftable-backend.c
+++ b/refs/reftable-backend.c
@@ -770,7 +770,8 @@  static int prepare_transaction_update(struct write_transaction_table_arg **out,
 		if (ret)
 			return ret;
 
-		ret = reftable_stack_new_addition(&addition, stack, 0);
+		ret = reftable_stack_new_addition(&addition, stack,
+						  REFTABLE_STACK_NEW_ADDITION_RELOAD);
 		if (ret) {
 			if (ret == REFTABLE_LOCK_ERROR)
 				strbuf_addstr(err, "cannot lock references");
diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 62da3e37823..2d951c8ceb6 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -450,6 +450,37 @@  test_expect_success 'ref transaction: retry acquiring tables.list lock' '
 	)
 '
 
+test_expect_success 'ref transaction: many concurrent writers' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+		# Set a high timeout such that a busy CI machine will not abort
+		# early. 10 seconds should hopefully be ample time to make
+		# this non-flaky.
+		git config set reftable.lockTimeout 10000 &&
+		test_commit --no-tag initial &&
+
+		head=$(git rev-parse HEAD) &&
+		for i in $(test_seq 100)
+		do
+			printf "%s commit\trefs/heads/branch-%s\n" "$head" "$i" ||
+			return 1
+		done >expect &&
+		printf "%s commit\trefs/heads/main\n" "$head" >>expect &&
+
+		for i in $(test_seq 100)
+		do
+			{ git update-ref refs/heads/branch-$i HEAD& } ||
+			return 1
+		done &&
+
+		wait &&
+		git for-each-ref --sort=v:refname >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_expect_success 'pack-refs: compacts tables' '
 	test_when_finished "rm -rf repo" &&
 	git init repo &&