
[2/4] hash.h: scaffolding for _fast hashing variants

Message ID 6ac6f934c32bdc600cdb8d2a08d4aa390c1f2994.1725206584.git.me@ttaylorr.com
State Superseded
Series hash.h: support choosing a separate SHA-1 for non-cryptographic uses

Commit Message

Taylor Blau Sept. 1, 2024, 4:03 p.m. UTC
Git's default SHA-1 implementation is collision-detecting, which hardens
us against known SHA-1 attacks against Git objects. This makes Git
object writes safer at the expense of some speed when hashing through
the collision-detecting implementation, which is slower than
non-collision detecting alternatives.

Prepare for loading a separate "fast" SHA-1 implementation that can be
used for non-cryptographic purposes, like computing the checksum of
files that use the hashwrite() API.

This commit does not actually introduce any new compile-time knobs to
control which implementation is used as the fast SHA-1 variant, but does
add scaffolding so that the "git_hash_algo" structure has five new
function pointers which are "fast" variants of the five existing
hashing-related function pointers:

  - git_hash_init_fn fast_init_fn
  - git_hash_clone_fn fast_clone_fn
  - git_hash_update_fn fast_update_fn
  - git_hash_final_fn fast_final_fn
  - git_hash_final_oid_fn fast_final_oid_fn

The following commit will introduce compile-time knobs to specify which
SHA-1 implementation is used for non-cryptographic uses.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        | 42 ++++++++++++++++++++++++++++++++++++++++++
 object-file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)
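
As a rough illustration of how the new pointers are meant to be consumed,
a checksum-only caller such as the hashwrite() machinery would eventually
route through the fast_* variants, while object hashing keeps using the
collision-detecting ones. A minimal sketch; the helper name below is
hypothetical and not part of this patch:

/*
 * Illustrative sketch only: a checksum-oriented caller going through
 * the fast_* function pointers added by this patch.  "checksum_buffer"
 * is a hypothetical helper, not an existing function.
 */
static void checksum_buffer(const struct git_hash_algo *algop,
                            unsigned char *out,
                            const void *buf, size_t len)
{
        git_hash_ctx ctx;

        algop->fast_init_fn(&ctx);
        algop->fast_update_fn(&ctx, buf, len);
        algop->fast_final_fn(out, &ctx);
}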

Comments

Patrick Steinhardt Sept. 2, 2024, 1:41 p.m. UTC | #1
On Sun, Sep 01, 2024 at 12:03:24PM -0400, Taylor Blau wrote:
> Git's default SHA-1 implementation is collision-detecting, which hardens
> us against known SHA-1 attacks against Git objects. This makes Git
> object writes safer at the expense of some speed when hashing through
> the collision-detecting implementation, which is slower than
> non-collision detecting alternatives.
> 
> Prepare for loading a separate "fast" SHA-1 implementation that can be
> used for non-cryptographic purposes, like computing the checksum of
> files that use the hashwrite() API.
> 
> This commit does not actually introduce any new compile-time knobs to
> control which implementation is used as the fast SHA-1 variant, but does
> add scaffolding so that the "git_hash_algo" structure has five new
> function pointers which are "fast" variants of the five existing
> hashing-related function pointers:
> 
>   - git_hash_init_fn fast_init_fn
>   - git_hash_clone_fn fast_clone_fn
>   - git_hash_update_fn fast_update_fn
>   - git_hash_final_fn fast_final_fn
>   - git_hash_final_oid_fn fast_final_oid_fn
> 
> The following commit will introduce compile-time knobs to specify which
> SHA-1 implementation is used for non-cryptographic uses.

While the property we care about in the context of this patch series
indeed is that the second hash is faster, I think the more important
property is that it's insecure. If I were seeing two APIs, one labelled
fast and one labelled slow, I would of course pick the fast one. So I
wonder whether we should rename things accordingly so that developers
aren't tempted to pick the fast one without thinking, and also to have
a more useful signal that stands out to reviewers.

> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  hash.h        | 42 ++++++++++++++++++++++++++++++++++++++++++
>  object-file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 84 insertions(+)
> 
> diff --git a/hash.h b/hash.h
> index 72ffbc862e5..f255e5c1e8a 100644
> --- a/hash.h
> +++ b/hash.h
> @@ -44,14 +44,32 @@
>  #define platform_SHA1_Final    	SHA1_Final
>  #endif
>  
> +#ifndef platform_SHA_CTX_fast
> +#define platform_SHA_CTX_fast	  platform_SHA_CTX
> +#define platform_SHA1_Init_fast	  platform_SHA1_Init
> +#define platform_SHA1_Update_fast platform_SHA1_Update
> +#define platform_SHA1_Final_fast  platform_SHA1_Final
> +#ifdef platform_SHA1_Clone
> +#define platform_SHA1_Clone_fast  platform_SHA1_Clone
> +#endif
> +#endif

We may want to apply our new coding guidelines for nested
preprocessor directives, which should also be indented.
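
If I read the guideline correctly, the indented form would look roughly
like this (a sketch, not necessarily the exact formatting a reroll would
use):

#ifndef platform_SHA_CTX_fast
# define platform_SHA_CTX_fast    platform_SHA_CTX
# define platform_SHA1_Init_fast  platform_SHA1_Init
# define platform_SHA1_Update_fast platform_SHA1_Update
# define platform_SHA1_Final_fast platform_SHA1_Final
# ifdef platform_SHA1_Clone
#  define platform_SHA1_Clone_fast platform_SHA1_Clone
# endif
#endif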

> @@ -222,6 +249,21 @@ struct git_hash_algo {
>  	/* The hash finalization function for object IDs. */
>  	git_hash_final_oid_fn final_oid_fn;
>  
> +	/* The fast hash initialization function. */

Providing some context here why there are two sets of functions would
help future readers.

> @@ -219,6 +256,11 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
>  		.update_fn = git_hash_sha256_update,
>  		.final_fn = git_hash_sha256_final,
>  		.final_oid_fn = git_hash_sha256_final_oid,
> +		.fast_init_fn = git_hash_sha256_init,
> +		.fast_clone_fn = git_hash_sha256_clone,
> +		.fast_update_fn = git_hash_sha256_update,
> +		.fast_final_fn = git_hash_sha256_final,
> +		.fast_final_oid_fn = git_hash_sha256_final_oid,
>  		.empty_tree = &empty_tree_oid_sha256,
>  		.empty_blob = &empty_blob_oid_sha256,
>  		.null_oid = &null_oid_sha256,

I was briefly wondering whether we'd rather want to have automatic
fallbacks to the secure alternative when the fast one isn't set. I guess
in the end that's not really worth it though, as not having such a
fallback saves us a branch. And we only have so many hashes, so
it's not like this would be a huge pain to maintain.
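
For what it's worth, the rejected fallback would amount to something like
the following at each call site (a sketch, not part of the patch; the
wrapper name is made up), which is where the extra branch comes from:

/*
 * Sketch of the fallback idea: use the collision-detecting update
 * function whenever the fast one is unset, at the cost of a branch.
 */
static void checksum_update(const struct git_hash_algo *algop,
                            git_hash_ctx *ctx, const void *buf, size_t len)
{
        git_hash_update_fn update = algop->fast_update_fn ?
                algop->fast_update_fn : algop->update_fn;
        update(ctx, buf, len);
}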

Patrick
Junio C Hamano Sept. 3, 2024, 5:27 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> While the property we care about in the context of this patch series
> indeed is that the second hash is faster, I think the more important
> property is that it's insecure. If I were seeing two APIs, one labelled
> fast and one labelled slow, I would of course pick the fast one. So I
> wonder whether we should rename things accordingly so that developers
> aren't tempted to pick the fast one without thinking, and also to have
> a more useful signal that stands out to reviewers.

I do not think this topic is going in the direction it set out to,
but if we are to resurrect it by 

 (1) first to ensure that we won't overwrite existing on-disk files
     and other things as needed to safely swap the tail sum to a
     cryptographically insecure hash function;

 (2) devise a transition plan to use a hash function that computes a
     value that is different from SHA-1 (or SHA-256 for that
     matter); and

 (3) pick a hash function that computes a lot faster but is insecure
     and transition to it.

we will need to clearly label the two hash functions as such.

We may also need to consider similar points if we need to name the
pseudo-random number generators we use, to clarify the requirements
of the caller (e.g., can a caller that wants security use it?).

Thanks.
Taylor Blau Sept. 3, 2024, 7:40 p.m. UTC | #3
On Mon, Sep 02, 2024 at 03:41:24PM +0200, Patrick Steinhardt wrote:
> > This commit does not actually introduce any new compile-time knobs to
> > control which implementation is used as the fast SHA-1 variant, but does
> > add scaffolding so that the "git_hash_algo" structure has five new
> > function pointers which are "fast" variants of the five existing
> > hashing-related function pointers:
> >
> >   - git_hash_init_fn fast_init_fn
> >   - git_hash_clone_fn fast_clone_fn
> >   - git_hash_update_fn fast_update_fn
> >   - git_hash_final_fn fast_final_fn
> >   - git_hash_final_oid_fn fast_final_oid_fn
> >
> > The following commit will introduce compile-time knobs to specify which
> > SHA-1 implementation is used for non-cryptographic uses.
>
> While the property we care about in the context of this patch series
> indeed is that the second hash is faster, I think the more important
> property is that it's insecure. If I were seeing two APIs, one labelled
> fast and one labelled slow, I would of course pick the fast one. So I
> wonder whether we should rename things accordingly so that developers
> aren't tempted to pick the fast one without thinking, and also to have
> a more useful signal that stands out to reviewers.

I tried to come up with a different name myself when writing this
series, and wasn't happy with any of the others that I came up with. I
thought of "insecure_init_fn()", or "non_cryptographic_init_fn()". The
first one appears scarier than the second, but both are mouthfuls.

As a middle ground, I updated the comments to say "fast /
non-cryptographic" in places where they previously just said "fast". Let
me know if you think that's sufficient; otherwise I can try to come up
with some more names.
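
For one of the new fields that could look roughly like the following; the
exact wording here is mine and not necessarily what the reroll says:

        /*
         * The "fast" (non-cryptographic) hash initialization function.
         * Only for uses where collision detection is not required, e.g.
         * the trailing checksums written via hashwrite().
         */
        git_hash_init_fn fast_init_fn;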

> We may want to apply our new coding guidelines for nested
> preprocessor directives, which should also be indented.

Gotcha.

> > @@ -222,6 +249,21 @@ struct git_hash_algo {
> >  	/* The hash finalization function for object IDs. */
> >  	git_hash_final_oid_fn final_oid_fn;
> >
> > +	/* The fast hash initialization function. */
>
> Providing some context here why there are two sets of functions would
> help future readers.

Very fair, will adjust (as above).

Thanks,
Taylor
Taylor Blau Sept. 3, 2024, 7:52 p.m. UTC | #4
On Tue, Sep 03, 2024 at 10:27:39AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > While the property we care about in the context of this patch series
> > indeed is that the second hash is faster, I think the more important
> > property is that it's insecure. If I were seeing two APIs, one labelled
> > fast and one labelled slow, I would of course pick the fast one. So I
> > wonder whether we should rename things accordingly so that developers
> > aren't tempted to pick the fast one without thinking, and also to have
> > a more useful signal that stands out to reviewers.
>
> I do not think this topic is going in the direction it set out to,
> but if we are to resurrect it by
>
>  (1) first to ensure that we won't overwrite existing on-disk files
>      and other things as needed to safely swap the tail sum to a
>      cryptographically insecure hash function;

I discussed this with brian in the sub-thread where I am talking to
them, but I think this is already the case. The pack is read in
index-pack and the checksum is verified without using the _fast hash
functions, so we would detect:

  - either half of a colliding pair of objects, when reading individual
    objects' contents to determine their SHA-1s, or

  - a colliding pack checksum, when computing the whole pack's checksum
    (which also does not use the _fast variants of these functions), and

  - a mismatched pack checksum, when verifying the pack's checksum
    against the one stored in the pack.
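
Conceptually, the trailer check boils down to something like the
following simplified sketch, which deliberately goes through the regular,
collision-detecting entry points (this is not the actual index-pack
code):

/*
 * Simplified sketch: verify a pack's trailing checksum with the
 * collision-detecting (non-_fast) hash functions.
 */
static int verify_pack_trailer(const unsigned char *data, size_t len)
{
        git_hash_ctx ctx;
        unsigned char expect[GIT_MAX_RAWSZ];
        size_t payload = len - the_hash_algo->rawsz;

        the_hash_algo->init_fn(&ctx);
        the_hash_algo->update_fn(&ctx, data, payload);
        the_hash_algo->final_fn(expect, &ctx);

        return memcmp(expect, data + payload, the_hash_algo->rawsz) ? -1 : 0;
}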

>  (2) devise a transition plan to use a hash function that computes a
>      value that is different from SHA-1 (or SHA-256 for that
>      matter); and
>
>  (3) pick a hash function that computes a lot faster but is insecure
>      and transition to it.

So I do not think that either of these two steps is necessary.

Thanks,
Taylor
Junio C Hamano Sept. 3, 2024, 8:47 p.m. UTC | #5
Taylor Blau <me@ttaylorr.com> writes:

> I discussed this with brian in the sub-thread where I am talking to
> them, but I think this is already the case. The pack is read in
> index-pack and the checksum is verified without using the _fast hash
> functions, so we would detect:
>
>   - either half of a colliding pair of objects, when reading individual
>     objects' contents to determine their SHA-1s, or
>
>   - a colliding pack checksum, when computing the whole pack's checksum
>     (which also does not use the _fast variants of these functions), and
>
>   - a mismatched pack checksum, when verifying the pack's checksum
>     against the one stored in the pack.
>
>>  (2) devise a transition plan to use a hash function that computes a
>>      value that is different from SHA-1 (or SHA-256 for that
>>      matter); and
>>
>>  (3) pick a hash function that computes a lot faster but is insecure
>>      and transition to it.
>
> So I do not think that either of these two steps are necessary.

I suspect that it is a wrong conclusion, as I meant (1) to be
a prerequisite for doing (2) and (3), which gives us the real benefit
of being able to go faster than SHA1DC or even SHA-256.  If (1) is
unnecessary (because it is already covered), that is great---we can
directly jump to (2) and (3).
Taylor Blau Sept. 3, 2024, 9:24 p.m. UTC | #6
On Tue, Sep 03, 2024 at 01:47:09PM -0700, Junio C Hamano wrote:
> > So I do not think that either of these two steps is necessary.
>
> I suspect that it is a wrong conclusion, as I meant (1) to be
> a prerequisite for doing (2) and (3), which gives us the real benefit
> of being able to go faster than SHA1DC or even SHA-256.  If (1) is
> unnecessary (because it is already covered), that is great---we can
> directly jump to (2) and (3).

Ah, yes, after re-reading your message I am definitely mistaken here. I
think that in the future doing this would be more than worthwhile.

Thanks,
Taylor
Patrick Steinhardt Sept. 4, 2024, 7:05 a.m. UTC | #7
On Tue, Sep 03, 2024 at 01:47:09PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> 
> > I discussed this with brian in the sub-thread where I am talking to
> > them, but I think this is already the case. The pack is read in
> > index-pack and the checksum is verified without using the _fast hash
> > functions, so we would detect:
> >
> >   - either half of a colliding pair of objects, when reading individual
> >     objects' contents to determine their SHA-1s, or
> >
> >   - a colliding pack checksum, when computing the whole pack's checksum
> >     (which also does not use the _fast variants of these functions), and
> >
> >   - a mismatched pack checksum, when verifying the pack's checksum
> >     against the one stored in the pack.
> >
> >>  (2) devise a transition plan to use a hash function that computes a
> >>      value that is different from SHA-1 (or SHA-256 for that
> >>      matter); and
> >>
> >>  (3) pick a hash function that computes a lot faster but is insecure
> >>      and transition to it.
> >
> > So I do not think that either of these two steps is necessary.
> 
> I suspect that it is a wrong conclusion, as I meant (1) to be
> a prerequisite for doing (2) and (3), which gives us the real benefit
> of being able to go faster than SHA1DC or even SHA-256.  If (1) is
> unnecessary (because it is already covered), that is great---we can
> directly jump to (2) and (3).

Ah, so the idea would be to not introduce SHA1_fast, but instead use a
hash function that is explicitly designed for fast hashing like xxHash
[1]? Comparing the numbers, I definitely think this makes quite some
sense: XXH3, for example, hashes at 31.5GB/s whereas SHA1 hashes at
0.8GB/s (if you believe the numbers on their site).

Doing this for data structures is almost a no-brainer if you
ask me. For packfiles it's a bit more complicated as we also have to
consider backwards compatibility -- a server of course cannot just start
to send packfiles that use xxHash.

Patrick

[1]: https://github.com/Cyan4973/xxHash
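
For a rough sense of the shape of such an API, xxHash's streaming
interface maps quite naturally onto the init/update/final layout of
"git_hash_algo". A minimal sketch against the public xxHash API (not a
proposed patch):

#include <xxhash.h>

/* Sketch only: XXH3 streaming usage in the familiar init/update/final shape. */
static XXH64_hash_t xxh3_checksum(const void *buf, size_t len)
{
        XXH3_state_t *state = XXH3_createState();
        XXH64_hash_t hash;

        XXH3_64bits_reset(state);
        XXH3_64bits_update(state, buf, len);
        hash = XXH3_64bits_digest(state);

        XXH3_freeState(state);
        return hash;
}

For whole buffers the one-shot XXH3_64bits(buf, len) would be simpler
still; the streaming form is what would map onto hashwrite()-style
incremental use.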
Junio C Hamano Sept. 4, 2024, 2:53 p.m. UTC | #8
Patrick Steinhardt <ps@pks.im> writes:

> On Tue, Sep 03, 2024 at 01:47:09PM -0700, Junio C Hamano wrote:
>> ...
>> >>  (2) devise a transition plan to use a hash function that computes a
>> >>      value that is different from SHA-1 (or SHA-256 for that
>> >>      matter); and
>> >>
>> >>  (3) pick a hash function that computes a lot faster but is insecure
>> >>      and transition to it.
>> >
>> > So I do not think that either of these two steps is necessary.
>> 
>> I suspect that it is a wrong conclusion, as I meant (1) to be
>> a prerequisite for doing (2) and (3), which gives us the real benefit
>> of being able to go faster than SHA1DC or even SHA-256.  If (1) is
>> unnecessary (because it is already covered), that is great---we can
>> directly jump to (2) and (3).
>
> Ah, so the idea would be to not introduce SHA1_fast, but instead use a
> hash function that is explicitly designed for fast hashing like xxHash
> [1]? Comparing the numbers, I definitely think this makes quite some
> sense: XXH3, for example, hashes at 31.5GB/s whereas SHA1 hashes at
> 0.8GB/s (if you believe the numbers on their site).

> Doing this for data structures is almost a no-brainer if you
> ask me. For packfiles it's a bit more complicated as we also have to
> consider backwards compatibility -- a server of course cannot just start
> to send packfiles that use xxHash.

Yup, that is where step (2) above comes in.  In the thread,
xxhash was indeed brought up as a viable candidate for the tail sum
for strictly local files.

Patch

diff --git a/hash.h b/hash.h
index 72ffbc862e5..f255e5c1e8a 100644
--- a/hash.h
+++ b/hash.h
@@ -44,14 +44,32 @@ 
 #define platform_SHA1_Final    	SHA1_Final
 #endif
 
+#ifndef platform_SHA_CTX_fast
+#define platform_SHA_CTX_fast	  platform_SHA_CTX
+#define platform_SHA1_Init_fast	  platform_SHA1_Init
+#define platform_SHA1_Update_fast platform_SHA1_Update
+#define platform_SHA1_Final_fast  platform_SHA1_Final
+#ifdef platform_SHA1_Clone
+#define platform_SHA1_Clone_fast  platform_SHA1_Clone
+#endif
+#endif
+
 #define git_SHA_CTX		platform_SHA_CTX
 #define git_SHA1_Init		platform_SHA1_Init
 #define git_SHA1_Update		platform_SHA1_Update
 #define git_SHA1_Final		platform_SHA1_Final
 
+#define git_SHA_CTX_fast	platform_SHA_CTX_fast
+#define git_SHA1_Init_fast	platform_SHA1_Init_fast
+#define git_SHA1_Update_fast	platform_SHA1_Update_fast
+#define git_SHA1_Final_fast	platform_SHA1_Final_fast
+
 #ifdef platform_SHA1_Clone
 #define git_SHA1_Clone	platform_SHA1_Clone
 #endif
+#ifdef platform_SHA1_Clone_fast
+#define git_SHA1_Clone_fast	platform_SHA1_Clone_fast
+#endif
 
 #ifndef platform_SHA256_CTX
 #define platform_SHA256_CTX	SHA256_CTX
@@ -81,6 +99,13 @@  static inline void git_SHA1_Clone(git_SHA_CTX *dst, const git_SHA_CTX *src)
 	memcpy(dst, src, sizeof(*dst));
 }
 #endif
+#ifndef SHA1_NEEDS_CLONE_HELPER_FAST
+static inline void git_SHA1_Clone_fast(git_SHA_CTX_fast *dst,
+				       const git_SHA_CTX_fast *src)
+{
+	memcpy(dst, src, sizeof(*dst));
+}
+#endif
 
 #ifndef SHA256_NEEDS_CLONE_HELPER
 static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *src)
@@ -178,6 +203,8 @@  enum get_oid_result {
 /* A suitably aligned type for stack allocations of hash contexts. */
 union git_hash_ctx {
 	git_SHA_CTX sha1;
+	git_SHA_CTX_fast sha1_fast;
+
 	git_SHA256_CTX sha256;
 };
 typedef union git_hash_ctx git_hash_ctx;
@@ -222,6 +249,21 @@  struct git_hash_algo {
 	/* The hash finalization function for object IDs. */
 	git_hash_final_oid_fn final_oid_fn;
 
+	/* The fast hash initialization function. */
+	git_hash_init_fn fast_init_fn;
+
+	/* The fast hash context cloning function. */
+	git_hash_clone_fn fast_clone_fn;
+
+	/* The fast hash update function. */
+	git_hash_update_fn fast_update_fn;
+
+	/* The fast hash finalization function. */
+	git_hash_final_fn fast_final_fn;
+
+	/* The fast hash finalization function for object IDs. */
+	git_hash_final_oid_fn fast_final_oid_fn;
+
 	/* The OID of the empty tree. */
 	const struct object_id *empty_tree;
 
diff --git a/object-file.c b/object-file.c
index c5994202ba0..9691292ef5a 100644
--- a/object-file.c
+++ b/object-file.c
@@ -115,6 +115,33 @@  static void git_hash_sha1_final_oid(struct object_id *oid, git_hash_ctx *ctx)
 	oid->algo = GIT_HASH_SHA1;
 }
 
+static void git_hash_sha1_init_fast(git_hash_ctx *ctx)
+{
+	git_SHA1_Init_fast(&ctx->sha1_fast);
+}
+
+static void git_hash_sha1_clone_fast(git_hash_ctx *dst, const git_hash_ctx *src)
+{
+	git_SHA1_Clone_fast(&dst->sha1_fast, &src->sha1_fast);
+}
+
+static void git_hash_sha1_update_fast(git_hash_ctx *ctx, const void *data,
+				      size_t len)
+{
+	git_SHA1_Update_fast(&ctx->sha1_fast, data, len);
+}
+
+static void git_hash_sha1_final_fast(unsigned char *hash, git_hash_ctx *ctx)
+{
+	git_SHA1_Final_fast(hash, &ctx->sha1_fast);
+}
+
+static void git_hash_sha1_final_oid_fast(struct object_id *oid, git_hash_ctx *ctx)
+{
+	git_SHA1_Final_fast(oid->hash, &ctx->sha1_fast);
+	memset(oid->hash + GIT_SHA1_RAWSZ, 0, GIT_MAX_RAWSZ - GIT_SHA1_RAWSZ);
+	oid->algo = GIT_HASH_SHA1;
+}
 
 static void git_hash_sha256_init(git_hash_ctx *ctx)
 {
@@ -189,6 +216,11 @@  const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_unknown_update,
 		.final_fn = git_hash_unknown_final,
 		.final_oid_fn = git_hash_unknown_final_oid,
+		.fast_init_fn = git_hash_unknown_init,
+		.fast_clone_fn = git_hash_unknown_clone,
+		.fast_update_fn = git_hash_unknown_update,
+		.fast_final_fn = git_hash_unknown_final,
+		.fast_final_oid_fn = git_hash_unknown_final_oid,
 		.empty_tree = NULL,
 		.empty_blob = NULL,
 		.null_oid = NULL,
@@ -204,6 +236,11 @@  const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha1_update,
 		.final_fn = git_hash_sha1_final,
 		.final_oid_fn = git_hash_sha1_final_oid,
+		.fast_init_fn = git_hash_sha1_init_fast,
+		.fast_clone_fn = git_hash_sha1_clone_fast,
+		.fast_update_fn = git_hash_sha1_update_fast,
+		.fast_final_fn = git_hash_sha1_final_fast,
+		.fast_final_oid_fn = git_hash_sha1_final_oid_fast,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
 		.null_oid = &null_oid_sha1,
@@ -219,6 +256,11 @@  const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha256_update,
 		.final_fn = git_hash_sha256_final,
 		.final_oid_fn = git_hash_sha256_final_oid,
+		.fast_init_fn = git_hash_sha256_init,
+		.fast_clone_fn = git_hash_sha256_clone,
+		.fast_update_fn = git_hash_sha256_update,
+		.fast_final_fn = git_hash_sha256_final,
+		.fast_final_oid_fn = git_hash_sha256_final_oid,
 		.empty_tree = &empty_tree_oid_sha256,
 		.empty_blob = &empty_blob_oid_sha256,
 		.null_oid = &null_oid_sha256,