| Message ID | 6ac6f934c32bdc600cdb8d2a08d4aa390c1f2994.1725206584.git.me@ttaylorr.com |
|---|---|
| State | Superseded |
| Series | hash.h: support choosing a separate SHA-1 for non-cryptographic uses |
On Sun, Sep 01, 2024 at 12:03:24PM -0400, Taylor Blau wrote:
> Git's default SHA-1 implementation is collision-detecting, which hardens
> us against known SHA-1 attacks against Git objects. This makes Git
> object writes safer at the expense of some speed when hashing through
> the collision-detecting implementation, which is slower than
> non-collision detecting alternatives.
>
> Prepare for loading a separate "fast" SHA-1 implementation that can be
> used for non-cryptographic purposes, like computing the checksum of
> files that use the hashwrite() API.
>
> This commit does not actually introduce any new compile-time knobs to
> control which implementation is used as the fast SHA-1 variant, but does
> add scaffolding so that the "git_hash_algo" structure has five new
> function pointers which are "fast" variants of the five existing
> hashing-related function pointers:
>
> - git_hash_init_fn fast_init_fn
> - git_hash_clone_fn fast_clone_fn
> - git_hash_update_fn fast_update_fn
> - git_hash_final_fn fast_final_fn
> - git_hash_final_oid_fn fast_final_oid_fn
>
> The following commit will introduce compile-time knobs to specify which
> SHA-1 implementation is used for non-cryptographic uses.

While the property we care about in the context of this patch series
indeed is that the second hash is faster, I think the more important
property is that it's insecure. If I were seeing two APIs, one labelled
fast and one labelled slow, I would of course pick the fast one. So I
wonder whether we should rename things accordingly so that developers
aren't tempted to pick the fast one without thinking, and also to have
a more useful signal that stands out to reviewers.

> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> hash.h | 42 ++++++++++++++++++++++++++++++++++++++++++
> object-file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 84 insertions(+)
>
> diff --git a/hash.h b/hash.h
> index 72ffbc862e5..f255e5c1e8a 100644
> --- a/hash.h
> +++ b/hash.h
> @@ -44,14 +44,32 @@
> #define platform_SHA1_Final SHA1_Final
> #endif
>
> +#ifndef platform_SHA_CTX_fast
> +#define platform_SHA_CTX_fast platform_SHA_CTX
> +#define platform_SHA1_Init_fast platform_SHA1_Init
> +#define platform_SHA1_Update_fast platform_SHA1_Update
> +#define platform_SHA1_Final_fast platform_SHA1_Final
> +#ifdef platform_SHA1_Clone
> +#define platform_SHA1_Clone_fast platform_SHA1_Clone
> +#endif
> +#endif

We may want to apply our new coding guidelines around nested
preprocessor directives, which should also use indenting.

> @@ -222,6 +249,21 @@ struct git_hash_algo {
> /* The hash finalization function for object IDs. */
> git_hash_final_oid_fn final_oid_fn;
>
> + /* The fast hash initialization function. */

Providing some context here on why there are two sets of functions would
help future readers.

> @@ -219,6 +256,11 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
> .update_fn = git_hash_sha256_update,
> .final_fn = git_hash_sha256_final,
> .final_oid_fn = git_hash_sha256_final_oid,
> + .fast_init_fn = git_hash_sha256_init,
> + .fast_clone_fn = git_hash_sha256_clone,
> + .fast_update_fn = git_hash_sha256_update,
> + .fast_final_fn = git_hash_sha256_final,
> + .fast_final_oid_fn = git_hash_sha256_final_oid,
> .empty_tree = &empty_tree_oid_sha256,
> .empty_blob = &empty_blob_oid_sha256,
> .null_oid = &null_oid_sha256,

I was briefly wondering whether we'd rather want to have automatic
fallbacks to the secure alternative when the fast one isn't set.
I guess in the end that's not really worth it though, as not having such
a fallback saves us a branch. And we only have so many hashes, so it's
not like this would be a huge pain to maintain.

Patrick
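For illustration, the fallback Patrick describes (and rejects) could look
something like the sketch below. The helper names are hypothetical and not
part of the patch, which instead fills in every fast_* pointer explicitly
and avoids the extra branch:

```c
#include "hash.h"

/*
 * Hypothetical sketch only: fall back to the collision-detecting
 * callbacks when a fast variant is not wired up.
 */
static inline git_hash_init_fn hash_fast_init(const struct git_hash_algo *algo)
{
	return algo->fast_init_fn ? algo->fast_init_fn : algo->init_fn;
}

static inline git_hash_update_fn hash_fast_update(const struct git_hash_algo *algo)
{
	return algo->fast_update_fn ? algo->fast_update_fn : algo->update_fn;
}
```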
Patrick Steinhardt <ps@pks.im> writes:

> While the property we care about in the context of this patch series
> indeed is that the second hash is faster, I think the more important
> property is that it's insecure. If I were seeing two APIs, one labelled
> fast and one labelled slow, I would of course pick the fast one. So I
> wonder whether we should rename things accordingly so that developers
> aren't tempted to pick the fast one without thinking, and also to have
> a more useful signal that stands out to reviewers.

I do not think this topic is going in the direction it set out to,
but if we are to resurrect it by

 (1) first to ensure that we won't overwrite existing on-disk files
     and other things as needed to safely swap the tail sum to a
     cryptographically insecure hash function;

 (2) devise a transition plan to use a hash function that computes a
     value that is different from SHA-1 (or SHA-256 for that
     matter); and

 (3) pick a hash function that computes a lot faster but is insecure
     and transition to it.

we will need to clearly label the two hash functions as such.

We may also need to consider similar points if we need to name
pseudo-random numbers we use, to clarify the requirement of the
caller (e.g., can a caller that wants security use it?).

Thanks.
On Mon, Sep 02, 2024 at 03:41:24PM +0200, Patrick Steinhardt wrote:
> > This commit does not actually introduce any new compile-time knobs to
> > control which implementation is used as the fast SHA-1 variant, but does
> > add scaffolding so that the "git_hash_algo" structure has five new
> > function pointers which are "fast" variants of the five existing
> > hashing-related function pointers:
> >
> > - git_hash_init_fn fast_init_fn
> > - git_hash_clone_fn fast_clone_fn
> > - git_hash_update_fn fast_update_fn
> > - git_hash_final_fn fast_final_fn
> > - git_hash_final_oid_fn fast_final_oid_fn
> >
> > The following commit will introduce compile-time knobs to specify which
> > SHA-1 implementation is used for non-cryptographic uses.
>
> While the property we care about in the context of this patch series
> indeed is that the second hash is faster, I think the more important
> property is that it's insecure. If I were seeing two APIs, one labelled
> fast and one labelled slow, I would of course pick the fast one. So I
> wonder whether we should rename things accordingly so that developers
> aren't tempted to pick the fast one without thinking, and also to have
> a more useful signal that stands out to reviewers.

I tried to come up with a different name myself when writing this
series, and wasn't happy with any of the others that I came up with. I
thought of "insecure_init_fn()" or "non_cryptographic_init_fn()". The
first one appears scarier than the second, but both are mouthfuls. As a
middle ground, I updated the comments to say "fast / non-cryptographic"
in places where they just said "fast" previously. Let me know if you
think that's sufficient; otherwise I can try to come up with some more
names.

> We may want to apply our new coding guidelines around nested
> preprocessor directives, which should also use indenting.

Gotcha.

> > @@ -222,6 +249,21 @@ struct git_hash_algo {
> > /* The hash finalization function for object IDs. */
> > git_hash_final_oid_fn final_oid_fn;
> >
> > + /* The fast hash initialization function. */
>
> Providing some context here on why there are two sets of functions would
> help future readers.

Very fair, will adjust (as above).

Thanks,
Taylor
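As a rough illustration of the guideline Patrick mentions, the new block
might be written with the nesting indented after the hash mark, something
like the following; the exact whitespace is whatever
Documentation/CodingGuidelines prescribes, so this rendering is only an
assumption:

```c
#ifndef platform_SHA_CTX_fast
# define platform_SHA_CTX_fast platform_SHA_CTX
# define platform_SHA1_Init_fast platform_SHA1_Init
# define platform_SHA1_Update_fast platform_SHA1_Update
# define platform_SHA1_Final_fast platform_SHA1_Final
# ifdef platform_SHA1_Clone
#  define platform_SHA1_Clone_fast platform_SHA1_Clone
# endif
#endif
```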
On Tue, Sep 03, 2024 at 10:27:39AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > While the property we care about in the context of this patch series
> > indeed is that the second hash is faster, I think the more important
> > property is that it's insecure. If I were seeing two APIs, one labelled
> > fast and one labelled slow, I would of course pick the fast one. So I
> > wonder whether we should rename things accordingly so that developers
> > aren't tempted to pick the fast one without thinking, and also to have
> > a more useful signal that stands out to reviewers.
>
> I do not think this topic is going in the direction it set out to,
> but if we are to resurrect it by
>
> (1) first to ensure that we won't overwrite existing on-disk files
> and other things as needed to safely swap the tail sum to a
> cryptographically insecure hash function;

I discussed this with brian in the sub-thread where I am talking to
them, but I think this is already the case. The pack is read in
index-pack and the checksum is verified without using the _fast hash
functions, so we would detect:

- either half of a colliding pair of objects, when reading individual
  objects' contents to determine their SHA-1s, or

- a colliding pack checksum, when computing the whole pack's checksum
  (which also does not use the _fast variants of these functions), and

- a mismatched pack checksum, when verifying the pack's checksum
  against the one stored in the pack.

> (2) devise a transition plan to use a hash function that computes a
> value that is different from SHA-1 (or SHA-256 for that
> matter); and
>
> (3) pick a hash function that computes a lot faster but is insecure
> and transition to it.

So I do not think that either of these two steps is necessary.

Thanks,
Taylor
Taylor Blau <me@ttaylorr.com> writes:

> I discussed this with brian in the sub-thread where I am talking to
> them, but I think this is already the case. The pack is read in
> index-pack and the checksum is verified without using the _fast hash
> functions, so we would detect:
>
> - either half of a colliding pair of objects, when reading individual
>   objects' contents to determine their SHA-1s, or
>
> - a colliding pack checksum, when computing the whole pack's checksum
>   (which also does not use the _fast variants of these functions), and
>
> - a mismatched pack checksum, when verifying the pack's checksum
>   against the one stored in the pack.
>
>> (2) devise a transition plan to use a hash function that computes a
>> value that is different from SHA-1 (or SHA-256 for that
>> matter); and
>>
>> (3) pick a hash function that computes a lot faster but is insecure
>> and transition to it.
>
> So I do not think that either of these two steps is necessary.

I suspect that it is a wrong conclusion, as I meant (1) to be
prerequisite for doing (2) and (3), that gives us the real benefit
of being able to go faster than SHA1DC or even SHA-256. If (1) is
unnecessary (because it is already covered), that is great---we can
directly jump to (2) and (3).
On Tue, Sep 03, 2024 at 01:47:09PM -0700, Junio C Hamano wrote:
> > So I do not think that either of these two steps is necessary.
>
> I suspect that it is a wrong conclusion, as I meant (1) to be
> prerequisite for doing (2) and (3), that gives us the real benefit
> of being able to go faster than SHA1DC or even SHA-256. If (1) is
> unnecessary (because it is already covered), that is great---we can
> directly jump to (2) and (3).

Ah, yes, after re-reading your message I am definitely mistaken here. I
think that in the future doing this would be more than worthwhile.

Thanks,
Taylor
On Tue, Sep 03, 2024 at 01:47:09PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > I discussed this with brian in the sub-thread where I am talking to
> > them, but I think this is already the case. The pack is read in
> > index-pack and the checksum is verified without using the _fast hash
> > functions, so we would detect:
> >
> > - either half of a colliding pair of objects, when reading individual
> >   objects' contents to determine their SHA-1s, or
> >
> > - a colliding pack checksum, when computing the whole pack's checksum
> >   (which also does not use the _fast variants of these functions), and
> >
> > - a mismatched pack checksum, when verifying the pack's checksum
> >   against the one stored in the pack.
> >
> >> (2) devise a transition plan to use a hash function that computes a
> >> value that is different from SHA-1 (or SHA-256 for that
> >> matter); and
> >>
> >> (3) pick a hash function that computes a lot faster but is insecure
> >> and transition to it.
> >
> > So I do not think that either of these two steps is necessary.
>
> I suspect that it is a wrong conclusion, as I meant (1) to be
> prerequisite for doing (2) and (3), that gives us the real benefit
> of being able to go faster than SHA1DC or even SHA-256. If (1) is
> unnecessary (because it is already covered), that is great---we can
> directly jump to (2) and (3).

Ah, so the idea would be to not introduce SHA1_fast, but instead use a
hash function that is explicitly designed for fast hashing like xxHash
[1]? When you compare numbers I definitely think that this makes quite
some sense as XXH3 for example hashes at 31.5GB/s whereas SHA1 hashes at
0.8GB/s (if you believe the numbers on their site).
Doing this for data structures is almost a no-brainer if you
ask me. For packfiles it's a bit more complicated as we also have to
consider backwards compatibility -- a server of course cannot just start
to send packfiles that use xxHash.

Patrick

[1]: https://github.com/Cyan4973/xxHash
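For reference, the xxHash API Patrick points to looks roughly like the
sketch below. This is illustration only: nothing in this series wires XXH3
into git_hash_algo, and the buffer and variable names here are made up.

```c
#include <stdio.h>
#include <string.h>
#include <xxhash.h>

int main(void)
{
	const char *data = "example payload";

	/* One-shot hashing of a complete buffer. */
	XXH64_hash_t oneshot = XXH3_64bits(data, strlen(data));

	/* Streaming variant, closer to how hashwrite() feeds data in chunks. */
	XXH3_state_t *state = XXH3_createState();
	XXH3_64bits_reset(state);
	XXH3_64bits_update(state, data, strlen(data));
	XXH64_hash_t streamed = XXH3_64bits_digest(state);
	XXH3_freeState(state);

	printf("%016llx %016llx\n",
	       (unsigned long long)oneshot,
	       (unsigned long long)streamed);
	return 0;
}
```

Built against libxxhash (e.g. `cc demo.c -lxxhash`), both calls produce the
same 64-bit value for the same input.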
Patrick Steinhardt <ps@pks.im> writes:

> On Tue, Sep 03, 2024 at 01:47:09PM -0700, Junio C Hamano wrote:
>> ...
>> >> (2) devise a transition plan to use a hash function that computes a
>> >> value that is different from SHA-1 (or SHA-256 for that
>> >> matter); and
>> >>
>> >> (3) pick a hash function that computes a lot faster but is insecure
>> >> and transition to it.
>> >
>> > So I do not think that either of these two steps is necessary.
>>
>> I suspect that it is a wrong conclusion, as I meant (1) to be
>> prerequisite for doing (2) and (3), that gives us the real benefit
>> of being able to go faster than SHA1DC or even SHA-256. If (1) is
>> unnecessary (because it is already covered), that is great---we can
>> directly jump to (2) and (3).
>
> Ah, so the idea would be to not introduce SHA1_fast, but instead use a
> hash function that is explicitly designed for fast hashing like xxHash
> [1]? When you compare numbers I definitely think that this makes quite
> some sense as XXH3 for example hashes at 31.5GB/s whereas SHA1 hashes at
> 0.8GB/s (if you believe the numbers on their site).
> Doing this for data structures is almost a no-brainer if you
> ask me. For packfiles it's a bit more complicated as we also have to
> consider backwards compatibility -- a server of course cannot just start
> to send packfiles that use xxHash.

Yup, that is where step (2) above comes in. In the thread, xxhash
was indeed brought up as a viable candidate for the tail sum for
strictly local files.
diff --git a/hash.h b/hash.h
index 72ffbc862e5..f255e5c1e8a 100644
--- a/hash.h
+++ b/hash.h
@@ -44,14 +44,32 @@
 #define platform_SHA1_Final SHA1_Final
 #endif
 
+#ifndef platform_SHA_CTX_fast
+#define platform_SHA_CTX_fast platform_SHA_CTX
+#define platform_SHA1_Init_fast platform_SHA1_Init
+#define platform_SHA1_Update_fast platform_SHA1_Update
+#define platform_SHA1_Final_fast platform_SHA1_Final
+#ifdef platform_SHA1_Clone
+#define platform_SHA1_Clone_fast platform_SHA1_Clone
+#endif
+#endif
+
 #define git_SHA_CTX platform_SHA_CTX
 #define git_SHA1_Init platform_SHA1_Init
 #define git_SHA1_Update platform_SHA1_Update
 #define git_SHA1_Final platform_SHA1_Final
 
+#define git_SHA_CTX_fast platform_SHA_CTX_fast
+#define git_SHA1_Init_fast platform_SHA1_Init_fast
+#define git_SHA1_Update_fast platform_SHA1_Update_fast
+#define git_SHA1_Final_fast platform_SHA1_Final_fast
+
 #ifdef platform_SHA1_Clone
 #define git_SHA1_Clone platform_SHA1_Clone
 #endif
+#ifdef platform_SHA1_Clone_fast
+#define git_SHA1_Clone_fast platform_SHA1_Clone_fast
+#endif
 
 #ifndef platform_SHA256_CTX
 #define platform_SHA256_CTX SHA256_CTX
@@ -81,6 +99,13 @@ static inline void git_SHA1_Clone(git_SHA_CTX *dst, const git_SHA_CTX *src)
         memcpy(dst, src, sizeof(*dst));
 }
 #endif
+#ifndef SHA1_NEEDS_CLONE_HELPER_FAST
+static inline void git_SHA1_Clone_fast(git_SHA_CTX_fast *dst,
+                                        const git_SHA_CTX_fast *src)
+{
+        memcpy(dst, src, sizeof(*dst));
+}
+#endif
 
 #ifndef SHA256_NEEDS_CLONE_HELPER
 static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *src)
@@ -178,6 +203,8 @@ enum get_oid_result {
 /* A suitably aligned type for stack allocations of hash contexts. */
 union git_hash_ctx {
         git_SHA_CTX sha1;
+        git_SHA_CTX_fast sha1_fast;
+
         git_SHA256_CTX sha256;
 };
 typedef union git_hash_ctx git_hash_ctx;
@@ -222,6 +249,21 @@ struct git_hash_algo {
         /* The hash finalization function for object IDs. */
         git_hash_final_oid_fn final_oid_fn;
 
+        /* The fast hash initialization function. */
+        git_hash_init_fn fast_init_fn;
+
+        /* The fast hash context cloning function. */
+        git_hash_clone_fn fast_clone_fn;
+
+        /* The fast hash update function. */
+        git_hash_update_fn fast_update_fn;
+
+        /* The fast hash finalization function. */
+        git_hash_final_fn fast_final_fn;
+
+        /* The fast hash finalization function for object IDs. */
+        git_hash_final_oid_fn fast_final_oid_fn;
+
         /* The OID of the empty tree. */
         const struct object_id *empty_tree;
 
diff --git a/object-file.c b/object-file.c
index c5994202ba0..9691292ef5a 100644
--- a/object-file.c
+++ b/object-file.c
@@ -115,6 +115,33 @@ static void git_hash_sha1_final_oid(struct object_id *oid, git_hash_ctx *ctx)
         oid->algo = GIT_HASH_SHA1;
 }
 
+static void git_hash_sha1_init_fast(git_hash_ctx *ctx)
+{
+        git_SHA1_Init_fast(&ctx->sha1_fast);
+}
+
+static void git_hash_sha1_clone_fast(git_hash_ctx *dst, const git_hash_ctx *src)
+{
+        git_SHA1_Clone_fast(&dst->sha1_fast, &src->sha1_fast);
+}
+
+static void git_hash_sha1_update_fast(git_hash_ctx *ctx, const void *data,
+                                      size_t len)
+{
+        git_SHA1_Update_fast(&ctx->sha1_fast, data, len);
+}
+
+static void git_hash_sha1_final_fast(unsigned char *hash, git_hash_ctx *ctx)
+{
+        git_SHA1_Final_fast(hash, &ctx->sha1_fast);
+}
+
+static void git_hash_sha1_final_oid_fast(struct object_id *oid, git_hash_ctx *ctx)
+{
+        git_SHA1_Final_fast(oid->hash, &ctx->sha1_fast);
+        memset(oid->hash + GIT_SHA1_RAWSZ, 0, GIT_MAX_RAWSZ - GIT_SHA1_RAWSZ);
+        oid->algo = GIT_HASH_SHA1;
+}
 
 static void git_hash_sha256_init(git_hash_ctx *ctx)
 {
@@ -189,6 +216,11 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
         .update_fn = git_hash_unknown_update,
         .final_fn = git_hash_unknown_final,
         .final_oid_fn = git_hash_unknown_final_oid,
+        .fast_init_fn = git_hash_unknown_init,
+        .fast_clone_fn = git_hash_unknown_clone,
+        .fast_update_fn = git_hash_unknown_update,
+        .fast_final_fn = git_hash_unknown_final,
+        .fast_final_oid_fn = git_hash_unknown_final_oid,
         .empty_tree = NULL,
         .empty_blob = NULL,
         .null_oid = NULL,
@@ -204,6 +236,11 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
         .update_fn = git_hash_sha1_update,
         .final_fn = git_hash_sha1_final,
         .final_oid_fn = git_hash_sha1_final_oid,
+        .fast_init_fn = git_hash_sha1_init_fast,
+        .fast_clone_fn = git_hash_sha1_clone_fast,
+        .fast_update_fn = git_hash_sha1_update_fast,
+        .fast_final_fn = git_hash_sha1_final_fast,
+        .fast_final_oid_fn = git_hash_sha1_final_oid_fast,
         .empty_tree = &empty_tree_oid,
         .empty_blob = &empty_blob_oid,
         .null_oid = &null_oid_sha1,
@@ -219,6 +256,11 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
         .update_fn = git_hash_sha256_update,
         .final_fn = git_hash_sha256_final,
         .final_oid_fn = git_hash_sha256_final_oid,
+        .fast_init_fn = git_hash_sha256_init,
+        .fast_clone_fn = git_hash_sha256_clone,
+        .fast_update_fn = git_hash_sha256_update,
+        .fast_final_fn = git_hash_sha256_final,
+        .fast_final_oid_fn = git_hash_sha256_final_oid,
         .empty_tree = &empty_tree_oid_sha256,
         .empty_blob = &empty_blob_oid_sha256,
         .null_oid = &null_oid_sha256,
Git's default SHA-1 implementation is collision-detecting, which hardens
us against known SHA-1 attacks against Git objects. This makes Git
object writes safer at the expense of some speed when hashing through
the collision-detecting implementation, which is slower than
non-collision detecting alternatives.

Prepare for loading a separate "fast" SHA-1 implementation that can be
used for non-cryptographic purposes, like computing the checksum of
files that use the hashwrite() API.

This commit does not actually introduce any new compile-time knobs to
control which implementation is used as the fast SHA-1 variant, but does
add scaffolding so that the "git_hash_algo" structure has five new
function pointers which are "fast" variants of the five existing
hashing-related function pointers:

- git_hash_init_fn fast_init_fn
- git_hash_clone_fn fast_clone_fn
- git_hash_update_fn fast_update_fn
- git_hash_final_fn fast_final_fn
- git_hash_final_oid_fn fast_final_oid_fn

The following commit will introduce compile-time knobs to specify which
SHA-1 implementation is used for non-cryptographic uses.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        | 42 ++++++++++++++++++++++++++++++++++++++++++
 object-file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)
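To make the intended use of the new callbacks concrete, here is a sketch of
how a hashwrite()-style caller could checksum a buffer with them. The helper
name and the direct calls are illustrative only and do not appear in this
patch:

```c
#include "hash.h"

/*
 * Illustrative only: compute a non-cryptographic trailing checksum using
 * the new fast_* callbacks.  "algo" would point at one of the hash_algos[]
 * entries populated by the patch above.
 */
static void checksum_buffer(const struct git_hash_algo *algo,
			    unsigned char *out,
			    const void *buf, size_t len)
{
	git_hash_ctx ctx;

	algo->fast_init_fn(&ctx);
	algo->fast_update_fn(&ctx, buf, len);
	algo->fast_final_fn(out, &ctx);
}
```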