Message ID | 9ce2d18dff2a655365b609dd86ea484a489c717a.1727155858.git.ps@pks.im (mailing list archive) |
---|---|
State | Accepted |
Commit | 6241ce217046cc77d306bda06ad80ac1973b4932 |
Headers | show |
Series | reftable: graceful concurrent writes | expand |
On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote: > +test_expect_success 'ref transaction: many concurrent writers' ' > + test_when_finished "rm -rf repo" && > + git init repo && > + ( > + cd repo && > + # Set a high timeout such that a busy CI machine will not abort > + # early. 10 seconds should hopefully be ample of time to make > + # this non-flaky. > + git config set reftable.lockTimeout 10000 && I saw this test racily fail in the Windows CI build. The failure is as you might imagine, a few of the background update-ref invocations failed: fatal: update_ref failed for ref 'refs/heads/branch-21': reftable: transaction failure: I/O error but of course we don't notice because they're backgrounded. And then the expected output is missing the branch-21 entry (and in my case, branch-64 suffered a similar fate). At first I thought we probably needed to bump the timeout (and EIO was just our way of passing that up the stack). But looking at the timestamps in the Actions log, the whole loop took less than 10ms to run. So could this be indicative of a real contention issue specific to Windows? I'm wondering if something like the old "you can't delete a file somebody else has open" restriction is biting us somehow. -Peff
On Fri, Sep 27, 2024 at 12:07:52AM -0400, Jeff King wrote: > On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote: > > > +test_expect_success 'ref transaction: many concurrent writers' ' > > + test_when_finished "rm -rf repo" && > > + git init repo && > > + ( > > + cd repo && > > + # Set a high timeout such that a busy CI machine will not abort > > + # early. 10 seconds should hopefully be ample of time to make > > + # this non-flaky. > > + git config set reftable.lockTimeout 10000 && > > I saw this test racily fail in the Windows CI build. The failure is as > you might imagine, a few of the background update-ref invocations > failed: > > fatal: update_ref failed for ref 'refs/heads/branch-21': reftable: transaction failure: I/O error > > but of course we don't notice because they're backgrounded. And then the > expected output is missing the branch-21 entry (and in my case, > branch-64 suffered a similar fate). > > At first I thought we probably needed to bump the timeout (and EIO was > just our way of passing that up the stack). But looking at the > timestamps in the Actions log, the whole loop took less than 10ms to > run. > > So could this be indicative of a real contention issue specific to > Windows? I'm wondering if something like the old "you can't delete a > file somebody else has open" restriction is biting us somehow. Thanks for letting me know! I very much expect that this is the scenario that you mention, where we try to delete a file that is still held open by another process. We're trying to be mindful about this restriction is the reftable library by not failing when a call to unlink(3P) fails for any of the tables, and we do have logic in place to unlink them at a later point in time when not referenced by the "tables.list" file. So none of the calls to unlink are error-checked at all. But there's one file that we _have_ to rewrite, and that is of course the "tables.list" file itself. We never unlink the file though but only rename the new instance into place. I think I recently discovered that we have some problems here, because Windows seemed to allow this in some scenarios but not in others. In any case, I've already set up a Windows VM last week, so I'll investigate the issue later this week. Patrick
On 2024.09.27 00:07, Jeff King wrote: > On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote: > > > +test_expect_success 'ref transaction: many concurrent writers' ' > > + test_when_finished "rm -rf repo" && > > + git init repo && > > + ( > > + cd repo && > > + # Set a high timeout such that a busy CI machine will not abort > > + # early. 10 seconds should hopefully be ample of time to make > > + # this non-flaky. > > + git config set reftable.lockTimeout 10000 && > > I saw this test racily fail in the Windows CI build. The failure is as > you might imagine, a few of the background update-ref invocations > failed: > > fatal: update_ref failed for ref 'refs/heads/branch-21': reftable: transaction failure: I/O error > > but of course we don't notice because they're backgrounded. And then the > expected output is missing the branch-21 entry (and in my case, > branch-64 suffered a similar fate). > > At first I thought we probably needed to bump the timeout (and EIO was > just our way of passing that up the stack). But looking at the > timestamps in the Actions log, the whole loop took less than 10ms to > run. > > So could this be indicative of a real contention issue specific to > Windows? I'm wondering if something like the old "you can't delete a > file somebody else has open" restriction is biting us somehow. > > -Peff We're seeing repeated failures from this test case with ASan enabled. Unfortunately, we've only been able to reproduce this on our $DAYJOB-specific build system. I haven't been able to get it to fail using just the upstream Makefile so far. I'll keep trying to find a way to reproduce this. FWIW, we're not getting I/O errors, we see the following: fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references We tried increasing the timeout in the test to 2 minutes (up from 10s), but it didn't fix the failures.
On Mon, Sep 30, 2024 at 03:19:04PM -0700, Josh Steadmon wrote: > On 2024.09.27 00:07, Jeff King wrote: > > On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote: > > > > > +test_expect_success 'ref transaction: many concurrent writers' ' > > > + test_when_finished "rm -rf repo" && > > > + git init repo && > > > + ( > > > + cd repo && > > > + # Set a high timeout such that a busy CI machine will not abort > > > + # early. 10 seconds should hopefully be ample of time to make > > > + # this non-flaky. > > > + git config set reftable.lockTimeout 10000 && > > > > I saw this test racily fail in the Windows CI build. The failure is as > > you might imagine, a few of the background update-ref invocations > > failed: > > > > fatal: update_ref failed for ref 'refs/heads/branch-21': reftable: transaction failure: I/O error > > > > but of course we don't notice because they're backgrounded. And then the > > expected output is missing the branch-21 entry (and in my case, > > branch-64 suffered a similar fate). > > > > At first I thought we probably needed to bump the timeout (and EIO was > > just our way of passing that up the stack). But looking at the > > timestamps in the Actions log, the whole loop took less than 10ms to > > run. > > > > So could this be indicative of a real contention issue specific to > > Windows? I'm wondering if something like the old "you can't delete a > > file somebody else has open" restriction is biting us somehow. > > > > -Peff > > We're seeing repeated failures from this test case with ASan enabled. > Unfortunately, we've only been able to reproduce this on our > $DAYJOB-specific build system. I haven't been able to get it to fail > using just the upstream Makefile so far. I'll keep trying to find a way > to reproduce this. > > FWIW, we're not getting I/O errors, we see the following: > fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references > > We tried increasing the timeout in the test to 2 minutes (up from 10s), > but it didn't fix the failures. If this is causing problems for folks I'd say we can do the below change for now. It's of course only a stop-gap solution until I find the time to debug this, which should be later this week or early next week. Patrick diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 2d951c8ceb..ad7bb39b79 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -450,7 +450,7 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' ) ' -test_expect_success 'ref transaction: many concurrent writers' ' +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' test_when_finished "rm -rf repo" && git init repo && (
Josh Steadmon <steadmon@google.com> writes: > On 2024.09.27 00:07, Jeff King wrote: >> On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote: >> >> > +test_expect_success 'ref transaction: many concurrent writers' ' >> > + test_when_finished "rm -rf repo" && >> > + git init repo && >> > + ( >> > + cd repo && >> > + # Set a high timeout such that a busy CI machine will not abort >> > + # early. 10 seconds should hopefully be ample of time to make >> > + # this non-flaky. >> > + git config set reftable.lockTimeout 10000 && >> ... > > We're seeing repeated failures from this test case with ASan enabled. > Unfortunately, we've only been able to reproduce this on our > $DAYJOB-specific build system. I haven't been able to get it to fail > using just the upstream Makefile so far. I'll keep trying to find a way > to reproduce this. > > FWIW, we're not getting I/O errors, we see the following: > fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references > > We tried increasing the timeout in the test to 2 minutes (up from 10s), > but it didn't fix the failures. Thanks for a report, and please keep digging ;-). Is your build, like Peff's, for Windows, or your variant of Linux? Thanks.
On 2024.10.01 00:34, Junio C Hamano wrote: > Josh Steadmon <steadmon@google.com> writes: > > > On 2024.09.27 00:07, Jeff King wrote: > >> On Tue, Sep 24, 2024 at 07:33:08AM +0200, Patrick Steinhardt wrote: > >> > >> > +test_expect_success 'ref transaction: many concurrent writers' ' > >> > + test_when_finished "rm -rf repo" && > >> > + git init repo && > >> > + ( > >> > + cd repo && > >> > + # Set a high timeout such that a busy CI machine will not abort > >> > + # early. 10 seconds should hopefully be ample of time to make > >> > + # this non-flaky. > >> > + git config set reftable.lockTimeout 10000 && > >> ... > > > > We're seeing repeated failures from this test case with ASan enabled. > > Unfortunately, we've only been able to reproduce this on our > > $DAYJOB-specific build system. I haven't been able to get it to fail > > using just the upstream Makefile so far. I'll keep trying to find a way > > to reproduce this. > > > > FWIW, we're not getting I/O errors, we see the following: > > fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references > > > > We tried increasing the timeout in the test to 2 minutes (up from 10s), > > but it didn't fix the failures. > > Thanks for a report, and please keep digging ;-). > > Is your build, like Peff's, for Windows, or your variant of Linux? > > Thanks. It's our internal Debian-based variant of Linux. Sorry for not specifying earlier.
On Tue, Oct 01, 2024 at 11:53:54AM -0700, Josh Steadmon wrote: > > > We're seeing repeated failures from this test case with ASan enabled. > > > Unfortunately, we've only been able to reproduce this on our > > > $DAYJOB-specific build system. I haven't been able to get it to fail > > > using just the upstream Makefile so far. I'll keep trying to find a way > > > to reproduce this. > > > > > > FWIW, we're not getting I/O errors, we see the following: > > > fatal: update_ref failed for ref 'refs/heads/branch-20': cannot lock references > > > > > > We tried increasing the timeout in the test to 2 minutes (up from 10s), > > > but it didn't fix the failures. > > > > Thanks for a report, and please keep digging ;-). > > > > Is your build, like Peff's, for Windows, or your variant of Linux? > > > > Thanks. > > It's our internal Debian-based variant of Linux. Sorry for not > specifying earlier. I just tried doing (on 'next', since the new test is added there): make SANITIZE=address,undefined CFLAGS=-O0 cd t ./t0610-reftable-basics.sh --run=1-47 --stress and got lots of failures like you describe. But in my case bumping the timeout from 10s to 100s made that go away. So I think it was just the sanitizers making things _really_ slow and causing legit timeouts. Have you tried going beyond 2 minutes? That seems like a crazy amount of time, but I guess it's possible on a really overloaded system. Of course it's also quite possible that there are multiple issues and you are seeing something else entirely. :) -Peff
On Tue, Oct 01, 2024 at 06:27:33AM +0200, Patrick Steinhardt wrote: > If this is causing problems for folks I'd say we can do the below change > for now. It's of course only a stop-gap solution until I find the time > to debug this, which should be later this week or early next week. > > Patrick > > diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh > index 2d951c8ceb..ad7bb39b79 100755 > --- a/t/t0610-reftable-basics.sh > +++ b/t/t0610-reftable-basics.sh > @@ -450,7 +450,7 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' > ) > ' > > -test_expect_success 'ref transaction: many concurrent writers' ' > +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' > test_when_finished "rm -rf repo" && > git init repo && > ( IMHO we can live with a flaky test for a little while. It's not like it's the only one. ;) And hopefully your digging turns up a real solution. It also sounds from subsequent discussion that Josh's issue was on Linux, so it wouldn't help there. -Peff
Jeff King <peff@peff.net> writes: > On Tue, Oct 01, 2024 at 06:27:33AM +0200, Patrick Steinhardt wrote: > >> If this is causing problems for folks I'd say we can do the below change >> for now. It's of course only a stop-gap solution until I find the time >> to debug this, which should be later this week or early next week. >> >> Patrick >> >> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh >> index 2d951c8ceb..ad7bb39b79 100755 >> --- a/t/t0610-reftable-basics.sh >> +++ b/t/t0610-reftable-basics.sh >> @@ -450,7 +450,7 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' >> ) >> ' >> >> -test_expect_success 'ref transaction: many concurrent writers' ' >> +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' >> test_when_finished "rm -rf repo" && >> git init repo && >> ( > > IMHO we can live with a flaky test for a little while. It's not like > it's the only one. ;) And hopefully your digging turns up a real > solution. > > It also sounds from subsequent discussion that Josh's issue was on > Linux, so it wouldn't help there. That's true. WINDOWS prereq would not help there, even though it would hide the breakage from CI.
On Tue, Oct 01, 2024 at 04:24:50PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > On Tue, Oct 01, 2024 at 06:27:33AM +0200, Patrick Steinhardt wrote: > > > >> If this is causing problems for folks I'd say we can do the below change > >> for now. It's of course only a stop-gap solution until I find the time > >> to debug this, which should be later this week or early next week. > >> > >> Patrick > >> > >> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh > >> index 2d951c8ceb..ad7bb39b79 100755 > >> --- a/t/t0610-reftable-basics.sh > >> +++ b/t/t0610-reftable-basics.sh > >> @@ -450,7 +450,7 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' > >> ) > >> ' > >> > >> -test_expect_success 'ref transaction: many concurrent writers' ' > >> +test_expect_success !WINDOWS 'ref transaction: many concurrent writers' ' > >> test_when_finished "rm -rf repo" && > >> git init repo && > >> ( > > > > IMHO we can live with a flaky test for a little while. It's not like > > it's the only one. ;) And hopefully your digging turns up a real > > solution. > > > > It also sounds from subsequent discussion that Josh's issue was on > > Linux, so it wouldn't help there. > > That's true. WINDOWS prereq would not help there, even though it > would hide the breakage from CI. Okay, fair enough, thanks. I should finally get to this issue either on Friday or early next week. Patrick
diff --git a/refs/reftable-backend.c b/refs/reftable-backend.c index 6ca00627dd7..efe2b0258ca 100644 --- a/refs/reftable-backend.c +++ b/refs/reftable-backend.c @@ -770,7 +770,8 @@ static int prepare_transaction_update(struct write_transaction_table_arg **out, if (ret) return ret; - ret = reftable_stack_new_addition(&addition, stack, 0); + ret = reftable_stack_new_addition(&addition, stack, + REFTABLE_STACK_NEW_ADDITION_RELOAD); if (ret) { if (ret == REFTABLE_LOCK_ERROR) strbuf_addstr(err, "cannot lock references"); diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh index 62da3e37823..2d951c8ceb6 100755 --- a/t/t0610-reftable-basics.sh +++ b/t/t0610-reftable-basics.sh @@ -450,6 +450,37 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' ' ) ' +test_expect_success 'ref transaction: many concurrent writers' ' + test_when_finished "rm -rf repo" && + git init repo && + ( + cd repo && + # Set a high timeout such that a busy CI machine will not abort + # early. 10 seconds should hopefully be ample of time to make + # this non-flaky. + git config set reftable.lockTimeout 10000 && + test_commit --no-tag initial && + + head=$(git rev-parse HEAD) && + for i in $(test_seq 100) + do + printf "%s commit\trefs/heads/branch-%s\n" "$head" "$i" || + return 1 + done >expect && + printf "%s commit\trefs/heads/main\n" "$head" >>expect && + + for i in $(test_seq 100) + do + { git update-ref refs/heads/branch-$i HEAD& } || + return 1 + done && + + wait && + git for-each-ref --sort=v:refname >actual && + test_cmp expect actual + ) +' + test_expect_success 'pack-refs: compacts tables' ' test_when_finished "rm -rf repo" && git init repo &&
When starting a reftable transaction we lock all stacks we are about to modify. While it may happen that the stack is out-of-date at this point in time we don't really care: transactional updates encode the expected state of a certain reference, so all that we really want to verify is that the _current_ value matches that expected state. Pass `REFTABLE_STACK_NEW_ADDITION_RELOAD` when locking the stack such that an out-of-date stack will be reloaded after having been locked. This change is safe because all verifications of the expected state happen after this step anyway. Add a testcase that verifies that many writers are now able to write to the stack concurrently without failures and with a deterministic end result. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- refs/reftable-backend.c | 3 ++- t/t0610-reftable-basics.sh | 31 +++++++++++++++++++++++++++++++ 2 files changed, 33 insertions(+), 1 deletion(-)