[v2,1/2] t/helper/test-sha1: prepare for an unsafe mode

Message ID	0e2fcee6894b7b16136ff09a69f199bea9f8c882.1730833507.git.me@ttaylorr.com (mailing list archive)
State	New
Headers	show Received: from mail-yb1-f171.google.com (mail-yb1-f171.google.com [209.85.219.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 00A451EB9FD for <git@vger.kernel.org>; Tue, 5 Nov 2024 19:05:16 +0000 (UTC) Date: Tue, 5 Nov 2024 14:05:14 -0500 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: Elijah Newren <newren@gmail.com>, Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>, "brian m. carlson" <sandals@crustytoothpaste.net> Subject: [PATCH v2 1/2] t/helper/test-sha1: prepare for an unsafe mode Message-ID: <0e2fcee6894b7b16136ff09a69f199bea9f8c882.1730833507.git.me@ttaylorr.com> References: <cover.1730833506.git.me@ttaylorr.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <cover.1730833506.git.me@ttaylorr.com>
Series	t/helper/test-tool: implement 'sha1-unsafe' helper \| expand [v2,0/2] t/helper/test-tool: implement 'sha1-unsafe' helper [v2,1/2] t/helper/test-sha1: prepare for an unsafe mode [v2,2/2] t/helper/test-tool: implement sha1-unsafe helper

Taylor Blau Nov. 5, 2024, 7:05 p.m. UTC

With the new "unsafe" SHA-1 build knob, it would be convenient to have
a test-tool that can exercise Git's unsafe SHA-1 wrappers for testing,
similar to 't/helper/test-tool sha1'.

Prepare for such a helper by altering the implementation of that
test-tool (in cmd_hash_impl(), which is generic and parameterized over
different hash functions) to conditionally run the unsafe variants of
the chosen hash function.

The following commit will add a new test-tool which makes use of this
new parameter.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c   | 17 +++++++++++++----
 t/helper/test-sha1.c   |  2 +-
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.h   |  2 +-
 4 files changed, 16 insertions(+), 7 deletions(-)

Junio C Hamano Nov. 6, 2024, 1:38 a.m. UTC | #1

Taylor Blau <me@ttaylorr.com> writes:

> -int cmd_hash_impl(int ac, const char **av, int algo)
> +int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
>  {
>  	git_hash_ctx ctx;
>  	unsigned char hash[GIT_MAX_HEXSZ];
> @@ -27,7 +27,10 @@ int cmd_hash_impl(int ac, const char **av, int algo)
>  			die("OOPS");
>  	}
>  
> -	algop->init_fn(&ctx);
> +	if (unsafe)
> +		algop->unsafe_init_fn(&ctx);
> +	else
> +		algop->init_fn(&ctx);

It may be just me, and it would not matter all that much within the
context of the project because this is merely a test helper, but
giving a pair of init/unsafe_init methods to algop looks unnerving.
It gives an impression that every design of hash algorithm must come
with normal and unsafe variant, which is not what you want to say.

I would have expected that there are different algorighm instances,
and one of them would be "unsafe SHA-1", among "normal SHA-1" and
"SHA-256" (as the last one would not even have unsafe_init_fn
method), and the callers that want to measure the performance of
each algorithm would simply pick one of these instances and go
through the usual "init", "update", "final" flow, regardless of the
"unsafe" ness of the algorithm.

IOW, ...

>  	while (1) {
>  		ssize_t sz, this_sz;
> @@ -46,9 +49,15 @@ int cmd_hash_impl(int ac, const char **av, int algo)
>  		}
>  		if (this_sz == 0)
>  			break;
> -		algop->update_fn(&ctx, buffer, this_sz);
> +		if (unsafe)
> +			algop->unsafe_update_fn(&ctx, buffer, this_sz);
> +		else
> +			algop->update_fn(&ctx, buffer, this_sz);
>  	}
> -	algop->final_fn(hash, &ctx);
> +	if (unsafe)
> +		algop->unsafe_final_fn(hash, &ctx);
> +	else
> +		algop->final_fn(hash, &ctx);

... I didn't expect any of these "if (unsafe) .. else .." switches.

brian m. carlson Nov. 7, 2024, 12:48 a.m. UTC | #2

On 2024-11-06 at 01:38:48, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> 
> > -int cmd_hash_impl(int ac, const char **av, int algo)
> > +int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
> >  {
> >  	git_hash_ctx ctx;
> >  	unsigned char hash[GIT_MAX_HEXSZ];
> > @@ -27,7 +27,10 @@ int cmd_hash_impl(int ac, const char **av, int algo)
> >  			die("OOPS");
> >  	}
> >  
> > -	algop->init_fn(&ctx);
> > +	if (unsafe)
> > +		algop->unsafe_init_fn(&ctx);
> > +	else
> > +		algop->init_fn(&ctx);
> 
> It may be just me, and it would not matter all that much within the
> context of the project because this is merely a test helper, but
> giving a pair of init/unsafe_init methods to algop looks unnerving.
> It gives an impression that every design of hash algorithm must come
> with normal and unsafe variant, which is not what you want to say.
> 
> I would have expected that there are different algorighm instances,
> and one of them would be "unsafe SHA-1", among "normal SHA-1" and
> "SHA-256" (as the last one would not even have unsafe_init_fn
> method), and the callers that want to measure the performance of
> each algorithm would simply pick one of these instances and go
> through the usual "init", "update", "final" flow, regardless of the
> "unsafe" ness of the algorithm.
> 
> IOW, ...
> 
> >  	while (1) {
> >  		ssize_t sz, this_sz;
> > @@ -46,9 +49,15 @@ int cmd_hash_impl(int ac, const char **av, int algo)
> >  		}
> >  		if (this_sz == 0)
> >  			break;
> > -		algop->update_fn(&ctx, buffer, this_sz);
> > +		if (unsafe)
> > +			algop->unsafe_update_fn(&ctx, buffer, this_sz);
> > +		else
> > +			algop->update_fn(&ctx, buffer, this_sz);
> >  	}
> > -	algop->final_fn(hash, &ctx);
> > +	if (unsafe)
> > +		algop->unsafe_final_fn(hash, &ctx);
> > +	else
> > +		algop->final_fn(hash, &ctx);
> 
> ... I didn't expect any of these "if (unsafe) .. else .." switches.

I don't think we can remove those, and here's why.  Certainly Taylor can
correct me if I'm off base, though.

In the normal case, our hash struct is a union, and different
implementations can have a different layout.  A SHA-1 implementation
will usually keep track of a 64-bit size, 5 32-bit words, and up to 64
bytes of data that might need to be processed.  Maybe SHA-1-DC, our safe
implementation, stores the size first, but our unsafe implementation
is OpenSSL and it stores the 5 hash words first, or whatever.

If we use the same update and final functions, we'll end up with
incorrect data because we'll have initialized the union contents with
data for one implementation but are trying to update or finalize it with
different data, which in the very best case will just produce broken
results, and might just cause a segfault (if one of the implementations
stores a pointer instead of storing the data in the union).

We could certainly adopt a different hash algorithm structure for safe
and unsafe code, but we have a lot of places that assume that there's
just one structure per algorithm.  It would require a substantial amount
of refactoring to remove those assumptions and deal with the fact that
we now have two SHA-1 implementations.  It would also be tricky to deal
with the fact that for SHA-1, we might use the safe or unsafe algorithm,
but for SHA-256, there's only one algorithm structure and we need to use
it for both.

Jeff King Nov. 7, 2024, 1:39 a.m. UTC | #3

On Thu, Nov 07, 2024 at 12:48:26AM +0000, brian m. carlson wrote:

> > I would have expected that there are different algorighm instances,
> > and one of them would be "unsafe SHA-1", among "normal SHA-1" and
> > "SHA-256" (as the last one would not even have unsafe_init_fn
> > method), and the callers that want to measure the performance of
> > each algorithm would simply pick one of these instances and go
> > through the usual "init", "update", "final" flow, regardless of the
> > "unsafe" ness of the algorithm.
> [...]
> > ... I didn't expect any of these "if (unsafe) .. else .." switches.
> 
> I don't think we can remove those, and here's why.  Certainly Taylor can
> correct me if I'm off base, though.
> 
> In the normal case, our hash struct is a union, and different
> implementations can have a different layout.  A SHA-1 implementation
> will usually keep track of a 64-bit size, 5 32-bit words, and up to 64
> bytes of data that might need to be processed.  Maybe SHA-1-DC, our safe
> implementation, stores the size first, but our unsafe implementation
> is OpenSSL and it stores the 5 hash words first, or whatever.
> 
> If we use the same update and final functions, we'll end up with
> incorrect data because we'll have initialized the union contents with
> data for one implementation but are trying to update or finalize it with
> different data, which in the very best case will just produce broken
> results, and might just cause a segfault (if one of the implementations
> stores a pointer instead of storing the data in the union).

I don't think Junio is proposing mixing the functions on a single data
type. Which, as you note, would be a disaster. I think the idea is for a
separate git_hash_algo struct entirely, that can be slotted in as a
pointer (since git_hash_algo is already essentially a virtual type).

That gets weird as you say here:

> We could certainly adopt a different hash algorithm structure for safe
> and unsafe code, but we have a lot of places that assume that there's
> just one structure per algorithm.  It would require a substantial amount
> of refactoring to remove those assumptions and deal with the fact that
> we now have two SHA-1 implementations.  It would also be tricky to deal
> with the fact that for SHA-1, we might use the safe or unsafe algorithm,
> but for SHA-256, there's only one algorithm structure and we need to use
> it for both.

both because we have two different algos for "sha1", but also because
they are not _really_ independent. If the_hash_algo is sha1, then
whichever implementation a given piece of code is using, it must still
be one of the sha1 variants.

So I think you wouldn't want to allocate an enum or a slot in the
hash_algos array, because it is not really an independent algorithm.
But I think it _could_ work as a separate struct that the caller derives
from the main hash algorithm. For example, imagine that the
git_hash_algo struct didn't have unsafe_init_fn, etc, but instead had:

  struct git_hash_algo *unsafe_implementation;

with a matching accessor like:

  struct git_hash_algo *unsafe_hash_algo(struct git_hash_algo *algo)
  {
	/* if we have a faster "unsafe" implementation, use that */
	if (algo->unsafe_implementation)
		return algo->unsafe_implementation;
	/* otherwise just use the default one */
	return algo;
  }

And then csum-file.c, rather than calling:

  the_hash_algo->unsafe_init_fn(...);
  ...
  the_hash_algo->unsafe_final_fn(...);

would do this:

  struct git_hash_algo *algo = unsafe_hash_algo(the_hash_algo);
  algo->init_fn(...);
  ...
  algo->final_fn(...);

And likewise this test helper would have a single conditional at the
start:

  if (unsafe)
	algo = unsafe_hash_algo(algo);

and the rest of the code would just use that.

All that said, I do not think it buys us anything for "real" code. There
the decision between safe/unsafe is in the context of how we are using
it, and not based on some conditional. So while the code in this test
helper is ugly, I don't think we'd ever write anything like that for
real. So it may not be worth the effort to refactor into a more virtual
object-oriented way.

-Peff

brian m. carlson Nov. 7, 2024, 1:49 a.m. UTC | #4

On 2024-11-07 at 01:39:15, Jeff King wrote:
> So I think you wouldn't want to allocate an enum or a slot in the
> hash_algos array, because it is not really an independent algorithm.
> But I think it _could_ work as a separate struct that the caller derives
> from the main hash algorithm. For example, imagine that the
> git_hash_algo struct didn't have unsafe_init_fn, etc, but instead had:
> 
>   struct git_hash_algo *unsafe_implementation;
> 
> with a matching accessor like:
> 
>   struct git_hash_algo *unsafe_hash_algo(struct git_hash_algo *algo)
>   {
> 	/* if we have a faster "unsafe" implementation, use that */
> 	if (algo->unsafe_implementation)
> 		return algo->unsafe_implementation;
> 	/* otherwise just use the default one */
> 	return algo;
>   }
> 
> And then csum-file.c, rather than calling:
> 
>   the_hash_algo->unsafe_init_fn(...);
>   ...
>   the_hash_algo->unsafe_final_fn(...);
> 
> would do this:
> 
>   struct git_hash_algo *algo = unsafe_hash_algo(the_hash_algo);
>   algo->init_fn(...);
>   ...
>   algo->final_fn(...);
> 
> And likewise this test helper would have a single conditional at the
> start:
> 
>   if (unsafe)
> 	algo = unsafe_hash_algo(algo);
> 
> and the rest of the code would just use that.

Ah, yes, I think that approach would be simpler and nicer to work with
than a separate entry in the `hash_algos` array.  We do, however, have
some places that assume that a `struct git_hash_algo *` points into the
`hash_algos` array (notably `hash_algo_by_ptr`), so we'd have to adjust
for that, move the function pointers out into their own struct which
we'd use for `unsafe_hash_algo`, or be careful never to call the
relevant functions on our special `git_hash_algo` struct.

> All that said, I do not think it buys us anything for "real" code. There
> the decision between safe/unsafe is in the context of how we are using
> it, and not based on some conditional. So while the code in this test
> helper is ugly, I don't think we'd ever write anything like that for
> real. So it may not be worth the effort to refactor into a more virtual
> object-oriented way.

Yeah, I don't have a strong opinion one way or the other.  I think, with
the limitation I mentioned above, it would probably require a decent
amount of refactoring if we took a different approach, and I'm fine with
going with Taylor's current approach unless he wants to do that
refactoring (in which case, great).

Jeff King Nov. 7, 2024, 2:08 a.m. UTC | #5

On Thu, Nov 07, 2024 at 01:49:35AM +0000, brian m. carlson wrote:

> Ah, yes, I think that approach would be simpler and nicer to work with
> than a separate entry in the `hash_algos` array.  We do, however, have
> some places that assume that a `struct git_hash_algo *` points into the
> `hash_algos` array (notably `hash_algo_by_ptr`), so we'd have to adjust
> for that, move the function pointers out into their own struct which
> we'd use for `unsafe_hash_algo`, or be careful never to call the
> relevant functions on our special `git_hash_algo` struct.

Yeah, I wondered if some code might be surprised by having a separate
hash algo. Another weird thing is that the sub-implementation algo
struct will have its own rawsz, hexsz, etc, even though those _must_ be
the same its parent.

If all of the virtual implementation functions were in a git_hash_impl
struct, then each git_hash_algo struct could have one embedded for the
main implementation (which in sha1's case would be a collision detecting
one), and an optional pointer to another unsafe _impl struct.

And then you get more type-safety, because you can never confuse the
_impl struct for a parent git_hash_algo.

The downside is that every single caller, even if it doesn't care about
the unsafe variant, needs to refer to the_hash_algo->impl.init_fn(),
etc, due to the extra layer of indirection. Probably not worth it.

> Yeah, I don't have a strong opinion one way or the other.  I think, with
> the limitation I mentioned above, it would probably require a decent
> amount of refactoring if we took a different approach, and I'm fine with
> going with Taylor's current approach unless he wants to do that
> refactoring (in which case, great).

Me too. If we were fixing something ugly or error-prone that we expected
to come up in real code, it might be worth it. But it's probably not for
trying to accommodate a one-off test helper.

-Peff

Junio C Hamano Nov. 7, 2024, 3:08 a.m. UTC | #6

Jeff King <peff@peff.net> writes:

> Me too. If we were fixing something ugly or error-prone that we expected
> to come up in real code, it might be worth it. But it's probably not for
> trying to accommodate a one-off test helper.

I 100% agree that it would not matter all that much within the
context of the project because this is merely a test helper.

It looked odd not to have sha1-unsafe as a first class citizen next
to sha1 and sha256, with an entry for it in the list enums and list
of hash algos.  If there is a good reason why adding the "unsafe"
variant as a first class citizen among algos, the approach posted
patch took is fine.

Thanks.

[v2,1/2] t/helper/test-sha1: prepare for an unsafe mode

Commit Message

Comments

Patch