diff mbox series

[2/2] reftable/stack: accept insecure random bytes

Message ID 20250107-b4-pks-reftable-csprng-v1-2-6109a54a8756@pks.im (mailing list archive)
State New
Headers show
Series reftable/stack: stop dying on exhausted entropy pool | expand

Commit Message

Patrick Steinhardt Jan. 7, 2025, 3:27 p.m. UTC
The reftable library uses randomness in two call paths:

  - When reading a stack in case some of the referenced tables
    disappears. The randomness is used to delay the next read by a
    couple of milliseconds.

  - When writing a new table, where the randomness gets appended to the
    table name (e.g. "0x000000000001-0x000000000002-0b1d8ddf.ref").

In neither of these cases do we need strong randomness.

Unfortunately though, we have observed test failures caused by the
former case. In t0610 we have a test that spawns a 100 processes at
once, all of which try to write a new table to the stack. And given that
all of the processes will require randomness, it can happen that these
processes make the entropy pool run dry, which will then cause us to
die:

    + test_seq 100
    + printf %s commit\trefs/heads/branch-%s\n
    68d032e9edd3481ac96382786ececc37ec28709e 1
    + printf %s commit\trefs/heads/branch-%s\n
    68d032e9edd3481ac96382786ececc37ec28709e 2
    ...
    + git update-ref refs/heads/branch-98 HEAD
    + git update-ref refs/heads/branch-97 HEAD
    + git update-ref refs/heads/branch-99 HEAD
    + git update-ref refs/heads/branch-100 HEAD
    fatal: unable to get random bytes
    fatal: unable to get random bytes
    fatal: unable to get random bytes
    fatal: unable to get random bytes
    fatal: unable to get random bytes
    fatal: unable to get random bytes
    fatal: unable to get random bytes

The report was for NonStop, which uses OpenSSL as the backend for
randomness. In the preceding commit we have adapted that backend to also
return randomness in case the entropy pool is empty and the caller
passes the `CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue.

Reported-by: Randall S. Becker <rsbecker@nexbridge.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Randall S. Becker Jan. 7, 2025, 3:37 p.m. UTC | #1
On January 7, 2025 10:27 AM, Patrick Steinhardt wrote:
>The reftable library uses randomness in two call paths:
>
>  - When reading a stack in case some of the referenced tables
>    disappears. The randomness is used to delay the next read by a
>    couple of milliseconds.
>
>  - When writing a new table, where the randomness gets appended to the
>    table name (e.g. "0x000000000001-0x000000000002-0b1d8ddf.ref").
>
>In neither of these cases do we need strong randomness.
>
>Unfortunately though, we have observed test failures caused by the former case. In
>t0610 we have a test that spawns a 100 processes at once, all of which try to write
>a new table to the stack. And given that all of the processes will require
>randomness, it can happen that these processes make the entropy pool run dry,
>which will then cause us to
>die:
>
>    + test_seq 100
>    + printf %s commit\trefs/heads/branch-%s\n
>    68d032e9edd3481ac96382786ececc37ec28709e 1
>    + printf %s commit\trefs/heads/branch-%s\n
>    68d032e9edd3481ac96382786ececc37ec28709e 2
>    ...
>    + git update-ref refs/heads/branch-98 HEAD
>    + git update-ref refs/heads/branch-97 HEAD
>    + git update-ref refs/heads/branch-99 HEAD
>    + git update-ref refs/heads/branch-100 HEAD
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>
>The report was for NonStop, which uses OpenSSL as the backend for randomness.
>In the preceding commit we have adapted that backend to also return randomness
>in case the entropy pool is empty and the caller passes the
>`CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue.
>
>Reported-by: Randall S. Becker <rsbecker@nexbridge.com>
>Signed-off-by: Patrick Steinhardt <ps@pks.im>
>---
> reftable/stack.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/reftable/stack.c b/reftable/stack.c index
>6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060
>652e72c26641749d 100644
>--- a/reftable/stack.c
>+++ b/reftable/stack.c
>@@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct
>reftable_stack *st,
> 		close(fd);
> 		fd = -1;
>
>-		delay = delay + (delay * rand()) / RAND_MAX + 1;
>+		delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) /
>+UINT32_MAX + 1;
> 		sleep_millisec(delay);
> 	}
>
>@@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st,  static int
>format_name(struct reftable_buf *dest, uint64_t min, uint64_t max)  {
> 	char buf[100];
>-	uint32_t rnd = (uint32_t)git_rand(0);
>+	uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE);
> 	snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x",
> 		 min, max, rnd);
> 	reftable_buf_reset(dest);

I think this is a good catch. We are seeing other tests fail similarly through this path. I will retest rc3
on both platforms since we cannot exhaust randomness on x86 using the hardware randomizer
inside OpenSSL.
Junio C Hamano Jan. 7, 2025, 8:56 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> The report was for NonStop, which uses OpenSSL as the backend for
> randomness. In the preceding commit we have adapted that backend to also
> return randomness in case the entropy pool is empty and the caller
> passes the `CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue.

No kidding.

This is calling rand(3).  The use of the resulting value is to fuzz
the delay before retrying, which wants NO CRYPTOGRAPHIC randomness.
This callsite does not require it, but rand(3) is perfect for ensuring
predictability/repeatability as well (via srand(3)).

And it is not allowed to fail.

Yet a platform replaces it with a function that returns an error or
aborts?  What kind of nonsense is that?  Do we really need to cater
to such an insanity?

Use of git_rand() here goes backwards against the more recent trend
in reftable/ directory to wean the code off of the rest of Git by
getting rid of unnecessary dependency, doesn't it?

I think [PATCH 1/2] makes sense regardless, though.  But shouldn't
we be pushing back this step, with "fix your rand(3)"?

Thanks.

> Reported-by: Randall S. Becker <rsbecker@nexbridge.com>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/stack.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/reftable/stack.c b/reftable/stack.c
> index 6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060652e72c26641749d 100644
> --- a/reftable/stack.c
> +++ b/reftable/stack.c
> @@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct reftable_stack *st,
>  		close(fd);
>  		fd = -1;
>  
> -		delay = delay + (delay * rand()) / RAND_MAX + 1;
> +		delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / UINT32_MAX + 1;
>  		sleep_millisec(delay);
>  	}
>  
> @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st,
>  static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max)
>  {
>  	char buf[100];
> -	uint32_t rnd = (uint32_t)git_rand(0);
> +	uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE);
>  	snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x",
>  		 min, max, rnd);
>  	reftable_buf_reset(dest);
Randall S. Becker Jan. 7, 2025, 9:03 p.m. UTC | #3
On January 7, 2025 3:56 PM, Junio C Hamano wrote:
>Patrick Steinhardt <ps@pks.im> writes:
>
>> The report was for NonStop, which uses OpenSSL as the backend for
>> randomness. In the preceding commit we have adapted that backend to
>> also return randomness in case the entropy pool is empty and the
>> caller passes the `CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue.
>
>No kidding.
>
>This is calling rand(3).  The use of the resulting value is to fuzz the
delay before
>retrying, which wants NO CRYPTOGRAPHIC randomness.
>This callsite does not require it, but rand(3) is perfect for ensuring
>predictability/repeatability as well (via srand(3)).
>
>And it is not allowed to fail.

I don't think rand() is what is failing here. The call into OpenSSL is using
randomness
from PRNGD. A poorly configured, or overloaded instance of it will cause
this situation.
The ia64 platform for NonStop is using a standard PRNGD but it is being
overloaded
with a large volume of uses hitting that server. I think this is more of a
canary-in-the-coalmine situation, where we have just brought something to
light
that otherwise would be hard to recreate.

>Yet a platform replaces it with a function that returns an error or aborts?
What kind
>of nonsense is that?  Do we really need to cater to such an insanity?

NonStop has not replaced this call. What we did in OpenSSL is to use the x86
hardware randomizer, which is highly reliable and does not show this
problem. In
ia64, in OpenSSL, the PRNGD is generating randomness and that seems to be
not
reliable enough for git or OpenSSL when under significant load.

>Use of git_rand() here goes backwards against the more recent trend in
reftable/
>directory to wean the code off of the rest of Git by getting rid of
unnecessary
>dependency, doesn't it?
>
>I think [PATCH 1/2] makes sense regardless, though.  But shouldn't we be
pushing
>back this step, with "fix your rand(3)"?
>
>Thanks.
>
>> Reported-by: Randall S. Becker <rsbecker@nexbridge.com>
>> Signed-off-by: Patrick Steinhardt <ps@pks.im>
>> ---
>>  reftable/stack.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/reftable/stack.c b/reftable/stack.c index
>>
>6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060
>652e
>> 72c26641749d 100644
>> --- a/reftable/stack.c
>> +++ b/reftable/stack.c
>> @@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct
>reftable_stack *st,
>>  		close(fd);
>>  		fd = -1;
>>
>> -		delay = delay + (delay * rand()) / RAND_MAX + 1;
>> +		delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) /
>> +UINT32_MAX + 1;
>>  		sleep_millisec(delay);
>>  	}
>>
>> @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st,
>> static int format_name(struct reftable_buf *dest, uint64_t min,
>> uint64_t max)  {
>>  	char buf[100];
>> -	uint32_t rnd = (uint32_t)git_rand(0);
>> +	uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE);
>>  	snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x",
>>  		 min, max, rnd);
>>  	reftable_buf_reset(dest);

--Randall
Junio C Hamano Jan. 7, 2025, 9:03 p.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:

> Yet a platform replaces it with a function that returns an error or
> aborts?  What kind of nonsense is that?  Do we really need to cater
> to such an insanity?
>
> Use of git_rand() here goes backwards against the more recent trend
> in reftable/ directory to wean the code off of the rest of Git by
> getting rid of unnecessary dependency, doesn't it?
>
> I think [PATCH 1/2] makes sense regardless, though.  But shouldn't
> we be pushing back this step, with "fix your rand(3)"?
>
> Thanks.

Ah, I misread the patch.  It has two hunks, and what I said applies
to the earlier one, but the later one is already contaminated with
git_rand(), and that is what is failing, i.e. it is not a nonsense
platform replacing rand() with something that can fail.

It may still make sense to drop the first hunk, and consider how to
proceed when you further want to reduce the unnecessary dependencies
for external users of the reftable library, though.  Are there
correctness implications if git_rand() in format_name() yields non
random results (like, always using "rnd = 0" instead of calling
git_rand())?  I seriously hope not.  And if there is no correctness
implications, perhaps we can replace it with rand() or even constant
"0"?

Thanks.

>> Reported-by: Randall S. Becker <rsbecker@nexbridge.com>
>> Signed-off-by: Patrick Steinhardt <ps@pks.im>
>> ---
>>  reftable/stack.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/reftable/stack.c b/reftable/stack.c
>> index 6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060652e72c26641749d 100644
>> --- a/reftable/stack.c
>> +++ b/reftable/stack.c
>> @@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct reftable_stack *st,
>>  		close(fd);
>>  		fd = -1;
>>  
>> -		delay = delay + (delay * rand()) / RAND_MAX + 1;
>> +		delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / UINT32_MAX + 1;
>>  		sleep_millisec(delay);
>>  	}
>>  
>> @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st,
>>  static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max)
>>  {
>>  	char buf[100];
>> -	uint32_t rnd = (uint32_t)git_rand(0);
>> +	uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE);
>>  	snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x",
>>  		 min, max, rnd);
>>  	reftable_buf_reset(dest);
Junio C Hamano Jan. 7, 2025, 9:09 p.m. UTC | #5
<rsbecker@nexbridge.com> writes:

> I don't think rand() is what is failing here.

Yeah, sorry, I misead the patch.  The problematic one is git_rand()
in the latter hunk, and I was commenting on the other, former, hunk.
Randall S. Becker Jan. 7, 2025, 11:56 p.m. UTC | #6
>-----Original Message-----
>From: Patrick Steinhardt <ps@pks.im>
>Sent: January 7, 2025 10:27 AM
>To: git@vger.kernel.org
>Cc: Randall S. Becker <randall.becker@nexbridge.ca>
>Subject: [PATCH 2/2] reftable/stack: accept insecure random bytes
>
>The reftable library uses randomness in two call paths:
>
>  - When reading a stack in case some of the referenced tables
>    disappears. The randomness is used to delay the next read by a
>    couple of milliseconds.
>
>  - When writing a new table, where the randomness gets appended to the
>    table name (e.g. "0x000000000001-0x000000000002-0b1d8ddf.ref").
>
>In neither of these cases do we need strong randomness.
>
>Unfortunately though, we have observed test failures caused by the former case. In
>t0610 we have a test that spawns a 100 processes at once, all of which try to write
>a new table to the stack. And given that all of the processes will require
>randomness, it can happen that these processes make the entropy pool run dry,
>which will then cause us to
>die:
>
>    + test_seq 100
>    + printf %s commit\trefs/heads/branch-%s\n
>    68d032e9edd3481ac96382786ececc37ec28709e 1
>    + printf %s commit\trefs/heads/branch-%s\n
>    68d032e9edd3481ac96382786ececc37ec28709e 2
>    ...
>    + git update-ref refs/heads/branch-98 HEAD
>    + git update-ref refs/heads/branch-97 HEAD
>    + git update-ref refs/heads/branch-99 HEAD
>    + git update-ref refs/heads/branch-100 HEAD
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>    fatal: unable to get random bytes
>
>The report was for NonStop, which uses OpenSSL as the backend for randomness.
>In the preceding commit we have adapted that backend to also return randomness
>in case the entropy pool is empty and the caller passes the
>`CSPRNG_BYTES_INSECURE` flag. Do so to fix the issue.
>
>Reported-by: Randall S. Becker <rsbecker@nexbridge.com>
>Signed-off-by: Patrick Steinhardt <ps@pks.im>
>---
> reftable/stack.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/reftable/stack.c b/reftable/stack.c index
>6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060
>652e72c26641749d 100644
>--- a/reftable/stack.c
>+++ b/reftable/stack.c
>@@ -493,7 +493,7 @@ static int reftable_stack_reload_maybe_reuse(struct
>reftable_stack *st,
> 		close(fd);
> 		fd = -1;
>
>-		delay = delay + (delay * rand()) / RAND_MAX + 1;
>+		delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) /
>+UINT32_MAX + 1;
> 		sleep_millisec(delay);
> 	}
>
>@@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st,  static int
>format_name(struct reftable_buf *dest, uint64_t min, uint64_t max)  {
> 	char buf[100];
>-	uint32_t rnd = (uint32_t)git_rand(0);
>+	uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE);
> 	snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x",
> 		 min, max, rnd);
> 	reftable_buf_reset(dest);
>
>--
>2.48.0.rc1.245.gb3e6e7acbc.dirty

The tests that are all impacted on ia64 (using PRNGD) are:

t0610-reftable-basics
t1092-sparse-checkout-compatibility
t1300-config
t1404-update-ref-errors
t1410-reflog
t1414-reflog-walk
t1415-worktree-refs
t1450-fsck
t1460-refs-migrate
t1500-rev-parse
t2006-checkout-index-basic
t2009-checkout-statinfo
t2013-checkout-submodule
t2017-checkout-orphan
t2030-unresolve-info
t2400-worktree-add

Everything passes on x86.
Patrick Steinhardt Jan. 8, 2025, 6:51 a.m. UTC | #7
On Tue, Jan 07, 2025 at 01:03:26PM -0800, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Yet a platform replaces it with a function that returns an error or
> > aborts?  What kind of nonsense is that?  Do we really need to cater
> > to such an insanity?
> >
> > Use of git_rand() here goes backwards against the more recent trend
> > in reftable/ directory to wean the code off of the rest of Git by
> > getting rid of unnecessary dependency, doesn't it?

It certainly does, yes, but unifying those two callsites to do the same
is also something I do in a patch series that gets rid of the last deps
on the Git codebase. This then allows me to move `git_rand()` into
"reftable/system.h" and make it a shim provided by the respective code
base that it's part of. So it's not a step backwards.

> > I think [PATCH 1/2] makes sense regardless, though.  But shouldn't
> > we be pushing back this step, with "fix your rand(3)"?
> >
> > Thanks.
> 
> Ah, I misread the patch.  It has two hunks, and what I said applies
> to the earlier one, but the later one is already contaminated with
> git_rand(), and that is what is failing, i.e. it is not a nonsense
> platform replacing rand() with something that can fail.
> 
> It may still make sense to drop the first hunk, and consider how to
> proceed when you further want to reduce the unnecessary dependencies
> for external users of the reftable library, though.  Are there
> correctness implications if git_rand() in format_name() yields non
> random results (like, always using "rnd = 0" instead of calling
> git_rand())?  I seriously hope not.  And if there is no correctness
> implications, perhaps we can replace it with rand() or even constant
> "0"?

No, there aren't any implications on correctness in that case. Sure, the
randomized delays not being randomized can lead to more contention. But
even when the randomized suffix for tables is deterministic we wouldn't
have an issue as the files are still distinguished by their update
indices.

Patrick
Junio C Hamano Jan. 8, 2025, 3:39 p.m. UTC | #8
Patrick Steinhardt <ps@pks.im> writes:

>> It may still make sense to drop the first hunk, and consider how to
>> proceed when you further want to reduce the unnecessary dependencies
>> for external users of the reftable library, though.  Are there
>> correctness implications if git_rand() in format_name() yields non
>> random results (like, always using "rnd = 0" instead of calling
>> git_rand())?  I seriously hope not.  And if there is no correctness
>> implications, perhaps we can replace it with rand() or even constant
>> "0"?
>
> No, there aren't any implications on correctness in that case. Sure, the
> randomized delays not being randomized can lead to more contention. But
> even when the randomized suffix for tables is deterministic we wouldn't
> have an issue as the files are still distinguished by their update
> indices.

OK, so they both can be turned into a simple rand() that is expected
to work more reliably especially on more exotic systems (meaning:
the ability the system providers can test their rand() is much
better than our ability to test our git_rand() there)?  It would
help us solve the immediate issue reported, while removing one git
specific function from the reftable library?

Thanks.
Patrick Steinhardt Jan. 8, 2025, 4:21 p.m. UTC | #9
On Wed, Jan 08, 2025 at 07:39:37AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> >> It may still make sense to drop the first hunk, and consider how to
> >> proceed when you further want to reduce the unnecessary dependencies
> >> for external users of the reftable library, though.  Are there
> >> correctness implications if git_rand() in format_name() yields non
> >> random results (like, always using "rnd = 0" instead of calling
> >> git_rand())?  I seriously hope not.  And if there is no correctness
> >> implications, perhaps we can replace it with rand() or even constant
> >> "0"?
> >
> > No, there aren't any implications on correctness in that case. Sure, the
> > randomized delays not being randomized can lead to more contention. But
> > even when the randomized suffix for tables is deterministic we wouldn't
> > have an issue as the files are still distinguished by their update
> > indices.
> 
> OK, so they both can be turned into a simple rand() that is expected
> to work more reliably especially on more exotic systems (meaning:
> the ability the system providers can test their rand() is much
> better than our ability to test our git_rand() there)?  It would
> help us solve the immediate issue reported, while removing one git
> specific function from the reftable library?

Hm. The problem is when Git dies in the middle of a transaction:

  1. We write the temporary table.
  2. We compute the not-so-random suffix.
  3. We write the temporary "tables.list" file.
  4. We move the temporary table into place using the not-so-random
     suffix.
  5. Git dies before updating "tables.list".

Now we have the temporary table moved into place, but "tables.list"
hasn't been updated yet. When the next Git process comes along and wants
to update the table it would result in an error if it computed the same
suffix.

The reftable library knows to clean up such stale tables when not
referenced by the "tables.list" file, but it doesn't do so on every
write. So this would likely still cause issues in practice.

I already though about this scenario when writing my mail, but didn't
really think about it as "correctness". But I guess it is.

Also, based on the feedback from Randall it's not only the reftable
backend that has issues. It's a more general problem on ia64, where many
tests are failing. So even if we fixed this one case, it's likely that
other cases would still die when running low on entropy.

I dunno. It feels like a platform issue, not like a Git issue, when the
RNG cannot provide us a couple of integers. The OpenSSL backend seems
unfit for use to me as none of the other backends have the same issue.

Patrick
Junio C Hamano Jan. 8, 2025, 5:40 p.m. UTC | #10
Patrick Steinhardt <ps@pks.im> writes:

> Hm. The problem is when Git dies in the middle of a transaction:
>
>   1. We write the temporary table.
>   2. We compute the not-so-random suffix.
>   3. We write the temporary "tables.list" file.
>   4. We move the temporary table into place using the not-so-random
>      suffix.
>   5. Git dies before updating "tables.list".
>
> Now we have the temporary table moved into place, but "tables.list"
> hasn't been updated yet. When the next Git process comes along and wants
> to update the table it would result in an error if it computed the same
> suffix.

Here, I hear that we _do_ depend on the suffix being relatively
unique.  Once our random number generator decides to give the same
number twice to cause collision, the reftable data gets corrupt?

> The reftable library knows to clean up such stale tables when not
> referenced by the "tables.list" file, but it doesn't do so on every
> write. So this would likely still cause issues in practice.
>
> I already though about this scenario when writing my mail, but didn't
> really think about it as "correctness". But I guess it is.

Hmph.  I am not sure how I should feel about this.  Our reliance on
hash functions (which can be made to collide) not colliding is one
thing, but is it sensibly safe to rely on a cryptographically
unpredictable random generator not to yield the same suffix twice
during the lifetime of an previous invocation for correctness?
Patrick Steinhardt Jan. 8, 2025, 6:16 p.m. UTC | #11
On Wed, Jan 08, 2025 at 09:40:58AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Hm. The problem is when Git dies in the middle of a transaction:
> >
> >   1. We write the temporary table.
> >   2. We compute the not-so-random suffix.
> >   3. We write the temporary "tables.list" file.
> >   4. We move the temporary table into place using the not-so-random
> >      suffix.
> >   5. Git dies before updating "tables.list".
> >
> > Now we have the temporary table moved into place, but "tables.list"
> > hasn't been updated yet. When the next Git process comes along and wants
> > to update the table it would result in an error if it computed the same
> > suffix.
> 
> Here, I hear that we _do_ depend on the suffix being relatively
> unique.  Once our random number generator decides to give the same
> number twice to cause collision, the reftable data gets corrupt?

No, there is no corruption. We may fail to update the stack when there
are colliding files, but that's it.

> > The reftable library knows to clean up such stale tables when not
> > referenced by the "tables.list" file, but it doesn't do so on every
> > write. So this would likely still cause issues in practice.
> >
> > I already though about this scenario when writing my mail, but didn't
> > really think about it as "correctness". But I guess it is.
> 
> Hmph.  I am not sure how I should feel about this.  Our reliance on
> hash functions (which can be made to collide) not colliding is one
> thing, but is it sensibly safe to rely on a cryptographically
> unpredictable random generator not to yield the same suffix twice
> during the lifetime of an previous invocation for correctness?

This is why I've been hesistant to call it a "correctness" issue, as
there is no corruption involved here. It's more of a denial of service
as you may not be able to update the stack anymore until you remove the
occluding file.

But turns out I misremembered from 9abda98149 (reftable/stack: fix use
of unseeded randomness, 2023-12-11): things indeed work alright. To
demonstrate, let's update `format_name()` like this:

diff --git a/reftable/stack.c b/reftable/stack.c
index 531660a49f..b7422679df 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st,
 static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max)
 {
 	char buf[100];
-	uint32_t rnd = (uint32_t)git_rand();
+	uint32_t rnd = 123;
 	snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x",
 		 min, max, rnd);
 	reftable_buf_reset(dest);

And then we create an occluding file and try to update:

    $ ~/Development/git/build/bin-wrappers/git init repo --ref-format=reftable
    Initialized empty Git repository in /tmp/repo/.git/
    $ cd repo/
    $ ls .git/reftable/
    0x000000000001-0x000000000001-0000007b.ref  tables.list
    $ touch .git/reftable/0x000000000002-0x000000000002-0000007b.ref
    $ ~/Development/git/build/git commit --allow-empty -mx
    [main (root-commit) 08d02ef] x
    $ ls .git/reftable/
    0x000000000001-0x000000000002-0000007b.ref  tables.list
    $ git show
    commit 08d02efa88f085f02c31285bf909d8d3a25c70dd (HEAD -> main)
    Author: Patrick Steinhardt <ps@pks.im>
    Date:   Wed Jan 8 19:03:46 2025 +0100

    x

So the stack gets updated as expected, no corruption there.

But this _can_ be a problem on Windows, where the file cannot be deleted
in case it was still open. This is also documented as the reason in
"Documentation/techincal/reftable.txt". That should've gotten better
though now with the improvements I made to our rename-emulation, as it
now uses POSIX semantics.

That being said, I still don't think that swapping out `git_rand()` for
`rand()` is the right thing to do. It does not solve the issue, but only
a symptom thereof. Git can still die whenever the OpenSSL CSPRNG fails
as there are other uses of `git_rand()` or `csprng_bytes()` in the
codebase.

So this change would regress something that works everywhere but on
NonStop and make the suffixes predictable. The ia64 machine in question
is being EOLd end of 2025. And the fact that this has never been a
problem before v2.48.0-rc2, and that the machine was rebooted for
maintenance immediately before running the tests, indicates to me that
something fishy is going on on that platform.

Patrick
diff mbox series

Patch

diff --git a/reftable/stack.c b/reftable/stack.c
index 6d0aa774e7e29d5366ed55df19725944f8eef792..572a74e00f9ed6040534e060652e72c26641749d 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -493,7 +493,7 @@  static int reftable_stack_reload_maybe_reuse(struct reftable_stack *st,
 		close(fd);
 		fd = -1;
 
-		delay = delay + (delay * rand()) / RAND_MAX + 1;
+		delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / UINT32_MAX + 1;
 		sleep_millisec(delay);
 	}
 
@@ -659,7 +659,7 @@  int reftable_stack_add(struct reftable_stack *st,
 static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max)
 {
 	char buf[100];
-	uint32_t rnd = (uint32_t)git_rand(0);
+	uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE);
 	snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x",
 		 min, max, rnd);
 	reftable_buf_reset(dest);