Message ID | 20250107-b4-pks-reftable-csprng-v1-2-6109a54a8756@pks.im (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | reftable/stack: stop dying on exhausted entropy pool | expand |
On January 7, 2025 10:27 AM, Patrick Steinhardt wrote: >The reftable library uses randomness in two call paths: > > - When reading a stack in case some of the referenced tables > disappears. The randomness is used to delay the next read by a > couple of milliseconds. > > - When writing a new table, where the randomness gets appended to the > table name (e.g. "0x000000000001-0x000000000002-0b1d8ddf.ref"). > >In neither of these cases do we need strong randomness. > >Unfortunately though, we have observed test failures caused by the former case. In >t0610 we have a test that spawns a 100 processes at once, all of which try to write >a new table to the stack. And given that all of the processes will require >randomness, it can happen that these processes make the entropy pool run dry, >which will then cause us to >die: > > + test_seq 100 > + printf %s commit\trefs/heads/branch-%s\n > 68d032e9edd3481ac96382786ececc37ec28709e 1 > + printf %s commit\trefs/heads/branch-%s\n > 68d032e9edd3481ac96382786ececc37ec28709e 2 > ... > + git update-ref refs/heads/branch-98 HEAD > + git update-ref refs/heads/branch-97 HEAD > + git update-ref refs/heads/branch-99 HEAD > + git update-ref refs/heads/branch-100 HEAD > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > >The report was for NonStop, which uses OpenSSL as the backend for randomness. >In the preceding commit we have adapted that backend to also return randomness >in case the entropy pool is empty and the caller passes the >`CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue. > >Reported-by: Randall S. Becker <rsbecker@nexbridge.com> >Signed-off-by: Patrick Steinhardt <ps@pks.im> >--- > reftable/stack.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > >diff --git a/reftable/stack.c b/reftable/stack.c index >6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060 >652e72c26641749d 100644 >--- a/reftable/stack.c >+++ b/reftable/stack.c >@@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct >reftable_stack *st, > close(fd); > fd = -1; > >- delay = delay + (delay * rand()) / RAND_MAX + 1; >+ delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / >+UINT32_MAX + 1; > sleep_millisec(delay); > } > >@@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st, static int >format_name(struct reftable_buf *dest, uint64_t min, uint64_t max) { > char buf[100]; >- uint32_t rnd = (uint32_t)git_rand(0); >+ uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE); > snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x", > min, max, rnd); > reftable_buf_reset(dest); I think this is a good catch. We are seeing other tests fail similarly through this path. I will retest rc3 on both platforms since we cannot exhaust randomness on x86 using the hardware randomizer inside OpenSSL.
Patrick Steinhardt <ps@pks.im> writes: > The report was for NonStop, which uses OpenSSL as the backend for > randomness. In the preceding commit we have adapted that backend to also > return randomness in case the entropy pool is empty and the caller > passes the `CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue. No kidding. This is calling rand(3). The use of the resulting value is to fuzz the delay before retrying, which wants NO CRYPTOGRAPHIC randomness. This callsite does not require it, but rand(3) is perfect for ensuring predictability/repeatability as well (via srand(3)). And it is not allowed to fail. Yet a platform replaces it with a function that returns an error or aborts? What kind of nonsense is that? Do we really need to cater to such an insanity? Use of git_rand() here goes backwards against the more recent trend in reftable/ directory to wean the code off of the rest of Git by getting rid of unnecessary dependency, doesn't it? I think [PATCH 1/2] makes sense regardless, though. But shouldn't we be pushing back this step, with "fix your rand(3)"? Thanks. > Reported-by: Randall S. Becker <rsbecker@nexbridge.com> > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- > reftable/stack.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/reftable/stack.c b/reftable/stack.c > index 6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060652e72c26641749d 100644 > --- a/reftable/stack.c > +++ b/reftable/stack.c > @@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct reftable_stack *st, > close(fd); > fd = -1; > > - delay = delay + (delay * rand()) / RAND_MAX + 1; > + delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / UINT32_MAX + 1; > sleep_millisec(delay); > } > > @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st, > static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max) > { > char buf[100]; > - uint32_t rnd = (uint32_t)git_rand(0); > + uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE); > snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x", > min, max, rnd); > reftable_buf_reset(dest);
On January 7, 2025 3:56 PM, Junio C Hamano wrote: >Patrick Steinhardt <ps@pks.im> writes: > >> The report was for NonStop, which uses OpenSSL as the backend for >> randomness. In the preceding commit we have adapted that backend to >> also return randomness in case the entropy pool is empty and the >> caller passes the `CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue. > >No kidding. > >This is calling rand(3). The use of the resulting value is to fuzz the delay before >retrying, which wants NO CRYPTOGRAPHIC randomness. >This callsite does not require it, but rand(3) is perfect for ensuring >predictability/repeatability as well (via srand(3)). > >And it is not allowed to fail. I don't think rand() is what is failing here. The call into OpenSSL is using randomness from PRNGD. A poorly configured, or overloaded instance of it will cause this situation. The ia64 platform for NonStop is using a standard PRNGD but it is being overloaded with a large volume of uses hitting that server. I think this is more of a canary-in-the-coalmine situation, where we have just brought something to light that otherwise would be hard to recreate. >Yet a platform replaces it with a function that returns an error or aborts? What kind >of nonsense is that? Do we really need to cater to such an insanity? NonStop has not replaced this call. What we did in OpenSSL is to use the x86 hardware randomizer, which is highly reliable and does not show this problem. In ia64, in OpenSSL, the PRNGD is generating randomness and that seems to be not reliable enough for git or OpenSSL when under significant load. >Use of git_rand() here goes backwards against the more recent trend in reftable/ >directory to wean the code off of the rest of Git by getting rid of unnecessary >dependency, doesn't it? > >I think [PATCH 1/2] makes sense regardless, though. But shouldn't we be pushing >back this step, with "fix your rand(3)"? > >Thanks. > >> Reported-by: Randall S. Becker <rsbecker@nexbridge.com> >> Signed-off-by: Patrick Steinhardt <ps@pks.im> >> --- >> reftable/stack.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/reftable/stack.c b/reftable/stack.c index >> >6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060 >652e >> 72c26641749d 100644 >> --- a/reftable/stack.c >> +++ b/reftable/stack.c >> @@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct >reftable_stack *st, >> close(fd); >> fd = -1; >> >> - delay = delay + (delay * rand()) / RAND_MAX + 1; >> + delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / >> +UINT32_MAX + 1; >> sleep_millisec(delay); >> } >> >> @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st, >> static int format_name(struct reftable_buf *dest, uint64_t min, >> uint64_t max) { >> char buf[100]; >> - uint32_t rnd = (uint32_t)git_rand(0); >> + uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE); >> snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x", >> min, max, rnd); >> reftable_buf_reset(dest); --Randall
Junio C Hamano <gitster@pobox.com> writes: > Yet a platform replaces it with a function that returns an error or > aborts? What kind of nonsense is that? Do we really need to cater > to such an insanity? > > Use of git_rand() here goes backwards against the more recent trend > in reftable/ directory to wean the code off of the rest of Git by > getting rid of unnecessary dependency, doesn't it? > > I think [PATCH 1/2] makes sense regardless, though. But shouldn't > we be pushing back this step, with "fix your rand(3)"? > > Thanks. Ah, I misread the patch. It has two hunks, and what I said applies to the earlier one, but the later one is already contaminated with git_rand(), and that is what is failing, i.e. it is not a nonsense platform replacing rand() with something that can fail. It may still make sense to drop the first hunk, and consider how to proceed when you further want to reduce the unnecessary dependencies for external users of the reftable library, though. Are there correctness implications if git_rand() in format_name() yields non random results (like, always using "rnd = 0" instead of calling git_rand())? I seriously hope not. And if there is no correctness implications, perhaps we can replace it with rand() or even constant "0"? Thanks. >> Reported-by: Randall S. Becker <rsbecker@nexbridge.com> >> Signed-off-by: Patrick Steinhardt <ps@pks.im> >> --- >> reftable/stack.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/reftable/stack.c b/reftable/stack.c >> index 6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060652e72c26641749d 100644 >> --- a/reftable/stack.c >> +++ b/reftable/stack.c >> @@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct reftable_stack *st, >> close(fd); >> fd = -1; >> >> - delay = delay + (delay * rand()) / RAND_MAX + 1; >> + delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / UINT32_MAX + 1; >> sleep_millisec(delay); >> } >> >> @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st, >> static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max) >> { >> char buf[100]; >> - uint32_t rnd = (uint32_t)git_rand(0); >> + uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE); >> snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x", >> min, max, rnd); >> reftable_buf_reset(dest);
<rsbecker@nexbridge.com> writes:
> I don't think rand() is what is failing here.
Yeah, sorry, I misead the patch. The problematic one is git_rand()
in the latter hunk, and I was commenting on the other, former, hunk.
>-----Original Message----- >From: Patrick Steinhardt <ps@pks.im> >Sent: January 7, 2025 10:27 AM >To: git@vger.kernel.org >Cc: Randall S. Becker <randall.becker@nexbridge.ca> >Subject: [PATCH 2/2] reftable/stack: accept insecure random bytes > >The reftable library uses randomness in two call paths: > > - When reading a stack in case some of the referenced tables > disappears. The randomness is used to delay the next read by a > couple of milliseconds. > > - When writing a new table, where the randomness gets appended to the > table name (e.g. "0x000000000001-0x000000000002-0b1d8ddf.ref"). > >In neither of these cases do we need strong randomness. > >Unfortunately though, we have observed test failures caused by the former case. In >t0610 we have a test that spawns a 100 processes at once, all of which try to write >a new table to the stack. And given that all of the processes will require >randomness, it can happen that these processes make the entropy pool run dry, >which will then cause us to >die: > > + test_seq 100 > + printf %s commit\trefs/heads/branch-%s\n > 68d032e9edd3481ac96382786ececc37ec28709e 1 > + printf %s commit\trefs/heads/branch-%s\n > 68d032e9edd3481ac96382786ececc37ec28709e 2 > ... > + git update-ref refs/heads/branch-98 HEAD > + git update-ref refs/heads/branch-97 HEAD > + git update-ref refs/heads/branch-99 HEAD > + git update-ref refs/heads/branch-100 HEAD > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > fatal: unable to get random bytes > >The report was for NonStop, which uses OpenSSL as the backend for randomness. >In the preceding commit we have adapted that backend to also return randomness >in case the entropy pool is empty and the caller passes the >`CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue. > >Reported-by: Randall S. Becker <rsbecker@nexbridge.com> >Signed-off-by: Patrick Steinhardt <ps@pks.im> >--- > reftable/stack.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > >diff --git a/reftable/stack.c b/reftable/stack.c index >6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060 >652e72c26641749d 100644 >--- a/reftable/stack.c >+++ b/reftable/stack.c >@@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct >reftable_stack *st, > close(fd); > fd = -1; > >- delay = delay + (delay * rand()) / RAND_MAX + 1; >+ delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / >+UINT32_MAX + 1; > sleep_millisec(delay); > } > >@@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st, static int >format_name(struct reftable_buf *dest, uint64_t min, uint64_t max) { > char buf[100]; >- uint32_t rnd = (uint32_t)git_rand(0); >+ uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE); > snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x", > min, max, rnd); > reftable_buf_reset(dest); > >-- >2.48.0.rc1.245.gb3e6e7acbc.dirty The tests that are all impacted on ia64 (using PRNGD) are: t0610-reftable-basics t1092-sparse-checkout-compatibility t1300-config t1404-update-ref-errors t1410-reflog t1414-reflog-walk t1415-worktree-refs t1450-fsck t1460-refs-migrate t1500-rev-parse t2006-checkout-index-basic t2009-checkout-statinfo t2013-checkout-submodule t2017-checkout-orphan t2030-unresolve-info t2400-worktree-add Everything passes on x86.
On Tue, Jan 07, 2025 at 01:03:26PM -0800, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > > Yet a platform replaces it with a function that returns an error or > > aborts? What kind of nonsense is that? Do we really need to cater > > to such an insanity? > > > > Use of git_rand() here goes backwards against the more recent trend > > in reftable/ directory to wean the code off of the rest of Git by > > getting rid of unnecessary dependency, doesn't it? It certainly does, yes, but unifying those two callsites to do the same is also something I do in a patch series that gets rid of the last deps on the Git codebase. This then allows me to move `git_rand()` into "reftable/system.h" and make it a shim provided by the respective code base that it's part of. So it's not a step backwards. > > I think [PATCH 1/2] makes sense regardless, though. But shouldn't > > we be pushing back this step, with "fix your rand(3)"? > > > > Thanks. > > Ah, I misread the patch. It has two hunks, and what I said applies > to the earlier one, but the later one is already contaminated with > git_rand(), and that is what is failing, i.e. it is not a nonsense > platform replacing rand() with something that can fail. > > It may still make sense to drop the first hunk, and consider how to > proceed when you further want to reduce the unnecessary dependencies > for external users of the reftable library, though. Are there > correctness implications if git_rand() in format_name() yields non > random results (like, always using "rnd = 0" instead of calling > git_rand())? I seriously hope not. And if there is no correctness > implications, perhaps we can replace it with rand() or even constant > "0"? No, there aren't any implications on correctness in that case. Sure, the randomized delays not being randomized can lead to more contention. But even when the randomized suffix for tables is deterministic we wouldn't have an issue as the files are still distinguished by their update indices. Patrick
Patrick Steinhardt <ps@pks.im> writes: >> It may still make sense to drop the first hunk, and consider how to >> proceed when you further want to reduce the unnecessary dependencies >> for external users of the reftable library, though. Are there >> correctness implications if git_rand() in format_name() yields non >> random results (like, always using "rnd = 0" instead of calling >> git_rand())? I seriously hope not. And if there is no correctness >> implications, perhaps we can replace it with rand() or even constant >> "0"? > > No, there aren't any implications on correctness in that case. Sure, the > randomized delays not being randomized can lead to more contention. But > even when the randomized suffix for tables is deterministic we wouldn't > have an issue as the files are still distinguished by their update > indices. OK, so they both can be turned into a simple rand() that is expected to work more reliably especially on more exotic systems (meaning: the ability the system providers can test their rand() is much better than our ability to test our git_rand() there)? It would help us solve the immediate issue reported, while removing one git specific function from the reftable library? Thanks.
On Wed, Jan 08, 2025 at 07:39:37AM -0800, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > >> It may still make sense to drop the first hunk, and consider how to > >> proceed when you further want to reduce the unnecessary dependencies > >> for external users of the reftable library, though. Are there > >> correctness implications if git_rand() in format_name() yields non > >> random results (like, always using "rnd = 0" instead of calling > >> git_rand())? I seriously hope not. And if there is no correctness > >> implications, perhaps we can replace it with rand() or even constant > >> "0"? > > > > No, there aren't any implications on correctness in that case. Sure, the > > randomized delays not being randomized can lead to more contention. But > > even when the randomized suffix for tables is deterministic we wouldn't > > have an issue as the files are still distinguished by their update > > indices. > > OK, so they both can be turned into a simple rand() that is expected > to work more reliably especially on more exotic systems (meaning: > the ability the system providers can test their rand() is much > better than our ability to test our git_rand() there)? It would > help us solve the immediate issue reported, while removing one git > specific function from the reftable library? Hm. The problem is when Git dies in the middle of a transaction: 1. We write the temporary table. 2. We compute the not-so-random suffix. 3. We write the temporary "tables.list" file. 4. We move the temporary table into place using the not-so-random suffix. 5. Git dies before updating "tables.list". Now we have the temporary table moved into place, but "tables.list" hasn't been updated yet. When the next Git process comes along and wants to update the table it would result in an error if it computed the same suffix. The reftable library knows to clean up such stale tables when not referenced by the "tables.list" file, but it doesn't do so on every write. So this would likely still cause issues in practice. I already though about this scenario when writing my mail, but didn't really think about it as "correctness". But I guess it is. Also, based on the feedback from Randall it's not only the reftable backend that has issues. It's a more general problem on ia64, where many tests are failing. So even if we fixed this one case, it's likely that other cases would still die when running low on entropy. I dunno. It feels like a platform issue, not like a Git issue, when the RNG cannot provide us a couple of integers. The OpenSSL backend seems unfit for use to me as none of the other backends have the same issue. Patrick
Patrick Steinhardt <ps@pks.im> writes: > Hm. The problem is when Git dies in the middle of a transaction: > > 1. We write the temporary table. > 2. We compute the not-so-random suffix. > 3. We write the temporary "tables.list" file. > 4. We move the temporary table into place using the not-so-random > suffix. > 5. Git dies before updating "tables.list". > > Now we have the temporary table moved into place, but "tables.list" > hasn't been updated yet. When the next Git process comes along and wants > to update the table it would result in an error if it computed the same > suffix. Here, I hear that we _do_ depend on the suffix being relatively unique. Once our random number generator decides to give the same number twice to cause collision, the reftable data gets corrupt? > The reftable library knows to clean up such stale tables when not > referenced by the "tables.list" file, but it doesn't do so on every > write. So this would likely still cause issues in practice. > > I already though about this scenario when writing my mail, but didn't > really think about it as "correctness". But I guess it is. Hmph. I am not sure how I should feel about this. Our reliance on hash functions (which can be made to collide) not colliding is one thing, but is it sensibly safe to rely on a cryptographically unpredictable random generator not to yield the same suffix twice during the lifetime of an previous invocation for correctness?
On Wed, Jan 08, 2025 at 09:40:58AM -0800, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > Hm. The problem is when Git dies in the middle of a transaction: > > > > 1. We write the temporary table. > > 2. We compute the not-so-random suffix. > > 3. We write the temporary "tables.list" file. > > 4. We move the temporary table into place using the not-so-random > > suffix. > > 5. Git dies before updating "tables.list". > > > > Now we have the temporary table moved into place, but "tables.list" > > hasn't been updated yet. When the next Git process comes along and wants > > to update the table it would result in an error if it computed the same > > suffix. > > Here, I hear that we _do_ depend on the suffix being relatively > unique. Once our random number generator decides to give the same > number twice to cause collision, the reftable data gets corrupt? No, there is no corruption. We may fail to update the stack when there are colliding files, but that's it. > > The reftable library knows to clean up such stale tables when not > > referenced by the "tables.list" file, but it doesn't do so on every > > write. So this would likely still cause issues in practice. > > > > I already though about this scenario when writing my mail, but didn't > > really think about it as "correctness". But I guess it is. > > Hmph. I am not sure how I should feel about this. Our reliance on > hash functions (which can be made to collide) not colliding is one > thing, but is it sensibly safe to rely on a cryptographically > unpredictable random generator not to yield the same suffix twice > during the lifetime of an previous invocation for correctness? This is why I've been hesistant to call it a "correctness" issue, as there is no corruption involved here. It's more of a denial of service as you may not be able to update the stack anymore until you remove the occluding file. But turns out I misremembered from 9abda98149 (reftable/stack: fix use of unseeded randomness, 2023-12-11): things indeed work alright. To demonstrate, let's update `format_name()` like this: diff --git a/reftable/stack.c b/reftable/stack.c index 531660a49f..b7422679df 100644 --- a/reftable/stack.c +++ b/reftable/stack.c @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st, static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max) { char buf[100]; - uint32_t rnd = (uint32_t)git_rand(); + uint32_t rnd = 123; snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x", min, max, rnd); reftable_buf_reset(dest); And then we create an occluding file and try to update: $ ~/Development/git/build/bin-wrappers/git init repo --ref-format=reftable Initialized empty Git repository in /tmp/repo/.git/ $ cd repo/ $ ls .git/reftable/ 0x000000000001-0x000000000001-0000007b.ref tables.list $ touch .git/reftable/0x000000000002-0x000000000002-0000007b.ref $ ~/Development/git/build/git commit --allow-empty -mx [main (root-commit) 08d02ef] x $ ls .git/reftable/ 0x000000000001-0x000000000002-0000007b.ref tables.list $ git show commit 08d02efa88f085f02c31285bf909d8d3a25c70dd (HEAD -> main) Author: Patrick Steinhardt <ps@pks.im> Date: Wed Jan 8 19:03:46 2025 +0100 x So the stack gets updated as expected, no corruption there. But this _can_ be a problem on Windows, where the file cannot be deleted in case it was still open. This is also documented as the reason in "Documentation/techincal/reftable.txt". That should've gotten better though now with the improvements I made to our rename-emulation, as it now uses POSIX semantics. That being said, I still don't think that swapping out `git_rand()` for `rand()` is the right thing to do. It does not solve the issue, but only a symptom thereof. Git can still die whenever the OpenSSL CSPRNG fails as there are other uses of `git_rand()` or `csprng_bytes()` in the codebase. So this change would regress something that works everywhere but on NonStop and make the suffixes predictable. The ia64 machine in question is being EOLd end of 2025. And the fact that this has never been a problem before v2.48.0-rc2, and that the machine was rebooted for maintenance immediately before running the tests, indicates to me that something fishy is going on on that platform. Patrick
diff --git a/reftable/stack.c b/reftable/stack.c index 6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060652e72c26641749d 100644 --- a/reftable/stack.c +++ b/reftable/stack.c @@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct reftable_stack *st, close(fd); fd = -1; - delay = delay + (delay * rand()) / RAND_MAX + 1; + delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / UINT32_MAX + 1; sleep_millisec(delay); } @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st, static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max) { char buf[100]; - uint32_t rnd = (uint32_t)git_rand(0); + uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE); snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x", min, max, rnd); reftable_buf_reset(dest);
The reftable library uses randomness in two call paths: - When reading a stack in case some of the referenced tables disappears. The randomness is used to delay the next read by a couple of milliseconds. - When writing a new table, where the randomness gets appended to the table name (e.g. "0x000000000001-0x000000000002-0b1d8ddf.ref"). In neither of these cases do we need strong randomness. Unfortunately though, we have observed test failures caused by the former case. In t0610 we have a test that spawns a 100 processes at once, all of which try to write a new table to the stack. And given that all of the processes will require randomness, it can happen that these processes make the entropy pool run dry, which will then cause us to die: + test_seq 100 + printf %s commit\trefs/heads/branch-%s\n 68d032e9edd3481ac96382786ececc37ec28709e 1 + printf %s commit\trefs/heads/branch-%s\n 68d032e9edd3481ac96382786ececc37ec28709e 2 ... + git update-ref refs/heads/branch-98 HEAD + git update-ref refs/heads/branch-97 HEAD + git update-ref refs/heads/branch-99 HEAD + git update-ref refs/heads/branch-100 HEAD fatal: unable to get random bytes fatal: unable to get random bytes fatal: unable to get random bytes fatal: unable to get random bytes fatal: unable to get random bytes fatal: unable to get random bytes fatal: unable to get random bytes The report was for NonStop, which uses OpenSSL as the backend for randomness. In the preceding commit we have adapted that backend to also return randomness in case the entropy pool is empty and the caller passes the `CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue. Reported-by: Randall S. Becker <rsbecker@nexbridge.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> --- reftable/stack.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)