Message ID | cover.1737151386.git.me@ttaylorr.com (mailing list archive) |
---|---|
Headers | show |
Series | hash: introduce unsafe_hash_algo(), drop unsafe_ variants | expand |
On Fri, Jan 17, 2025 at 05:03:07PM -0500, Taylor Blau wrote: > + Note that hash_algo_by_ptr() needs an adjustment to allow passing in the > + unsafe variant of a hash function. All other query functions on the > + hash_algos array will continue to return the safe variants of any > + function. > + > Suggested-by: Jeff King <peff@peff.net> > Signed-off-by: Taylor Blau <me@ttaylorr.com> > > @@ hash.h: struct git_hash_algo { > }; > extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS]; > > -@@ hash.h: static inline int hash_algo_by_ptr(const struct git_hash_algo *p) > - return p - hash_algos; > +@@ hash.h: int hash_algo_by_length(int len); > + /* Identical, except for a pointer to struct git_hash_algo. */ > + static inline int hash_algo_by_ptr(const struct git_hash_algo *p) > + { > +- return p - hash_algos; > ++ size_t i; > ++ for (i = 0; i < GIT_HASH_NALGOS; i++) { > ++ const struct git_hash_algo *algop = &hash_algos[i]; > ++ if (p == algop || (algop->unsafe && p == algop->unsafe)) > ++ return i; > ++ } > ++ return GIT_HASH_UNKNOWN; > } OK, so this version introduces the loop we discussed earlier. I think we can probably dismiss any performance loss as theoretical unless somebody can think of a good way to measure. It seems like worrying about it is probably a premature micro-optimization. It is a little quirky that it loses the transitive nature of hash_algo_by_ptr() and hash_algo_by_id(). So this is unsafe: /* start with unsafe variant */ algo = unsafe_hash_algo(the_hash_algo); algo->init_fn(...); /* returns GIT_HASH_SHA1, even for the unsafe variant */ id = hash_algo_by_ptr(algo); /* returns the safe variant */ algo = hash_algo_by_id(id); /* oops, this is mix-and-matching */ alog->update_fn(...); Now obviously that sort of round-trip is a silly thing to do in a single function. Could it happen across a larger call-chain, where the id is stored in a struct and passed somewhere far away? I guess so, but it also seems kind of unlikely. It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN for the unsafe variant. Then we'd at least get a loud error from the caller (as opposed to the code before your patch, which is undefined behavior). I dunno. -Peff
On Sat, Jan 18, 2025 at 07:28:31AM -0500, Jeff King wrote: > It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN > for the unsafe variant. Then we'd at least get a loud error from the > caller (as opposed to the code before your patch, which is undefined > behavior). I dunno. Doing this: diff --git a/hash.h b/hash.h index 68d4292e6d..ad2c919991 100644 --- a/hash.h +++ b/hash.h @@ -311,7 +311,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p) size_t i; for (i = 0; i < GIT_HASH_NALGOS; i++) { const struct git_hash_algo *algop = &hash_algos[i]; - if (p == algop || (algop->unsafe && p == algop->unsafe)) + if (p == algop) return i; } return GIT_HASH_UNKNOWN; on top of your series doesn't trigger any issues in the test suite. Which I think shows that we would never have triggered the UB from the original implementation, either[1]. Still, I think I prefer your loop to being one errant function call away from undefined behavior. But maybe returning UNKNOWN (as above) is a safer bet than losing the inverse nature of the by_ptr() and by_id() functions, which could otherwise cause hard-to-find interactions? -Peff [1] One thing that still puzzles me. If you further add a BUG() to that function when we'd return UNKNOWN, you can see that we trigger a few cases in the test suite where we pass in NULL (e.g., because we're not in a repo). I think these are already UB with the existing "p - hash_algos" implementation! We'll return what is effectively "p" cast to an int. E.g. if I do this: diff --git a/hash.h b/hash.h index 756166ce5e..ff31156416 100644 --- a/hash.h +++ b/hash.h @@ -320,6 +320,8 @@ int hash_algo_by_length(int len); /* Identical, except for a pointer to struct git_hash_algo. */ static inline int hash_algo_by_ptr(const struct git_hash_algo *p) { + if (!p) + BUG("null hash, return is %d", (int)(p - hash_algos)); return p - hash_algos; } then t1517 shows grep dying with: BUG: hash.h:324: null hash, return is -1646754892 I think it "works" because the backtrace is: [...] #5 0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324, fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335 #6 0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324 #7 0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392 #8 0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums", baselen=0) at dir.c:1699 [...] #16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257 So we probably write a totally garbage algo field into the object_id, but nobody ever ends up looking at it. Probably something we should clean up, but way out of scope for your series. But I do think it reinforces that returning UNKNOWN is an improvement; it avoids undefined behavior and anybody who tried to use it should get a BUG() from calling git_hash_unknown_*() functions.
Jeff King <peff@peff.net> writes: > .... But maybe > returning UNKNOWN (as above) is a safer bet than losing the inverse > nature of the by_ptr() and by_id() functions, which could otherwise > cause hard-to-find interactions? Yeah, that sounds sensible. > I think it "works" because the backtrace is: > > [...] > #5 0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324, > fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335 > #6 0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324 > #7 0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392 > #8 0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums", > baselen=0) at dir.c:1699 > [...] > #16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257 > > So we probably write a totally garbage algo field into the > object_id, but nobody ever ends up looking at it. Probably > something we should clean up, but way out of scope for your series. Yeah, that looks bad, but I think it is fine to leave it outside the series to fix that. > But I do think it reinforces that returning UNKNOWN is an > improvement; it avoids undefined behavior and anybody who tried to > use it should get a BUG() from calling git_hash_unknown_*() > functions. Sounds like a sensitive direction to go in. Thanks.
(This series is based on 14650065b7 (RelNotes/2.48.0: fix typos etc., 2025-01-07)). The bulk of this series is unchanged since last time, save for a couple of typofixes on spots noticed by Peff and Patrick Steinhardt. More importantly, it fixes hash_algo_by_ptr() when passing the unsafe SHA-1 variant. There were a couple of other ideas floated around, but I don't think they panned out as we had hoped in practice, so I think that this version should be good to go. The original cover letter is as follows: ------------ This series implements an idea discussed in [2] which suggests that we introduce a way to access a wrapped version of a 'struct git_hash_algo' which represents the unsafe variant of that algorithm, rather than having individual unsafe_ functions (like unsafe_init_fn() versus init_fn(), etc.). This approach is relatively straightforward to implement, and removes a significant deficiency in the original implementation of unsafe/non-cryptographic hash functions by making it impossible to switch between safe- and unsafe variants of hash functions. It also cleans up the sha1-unsafe test helper's implementation by removing a large number of "if (unsafe)"-style conditionals. The series is laid out as follows: * The first two patches prepare the hashfile API for the upcoming change: csum-file: store the hash algorithm as a struct field csum-file.c: extract algop from hashfile_checksum_valid() * The next patch implements the new 'unsafe_hash_algo()' function at the heart of this series' approach: hash.h: introduce `unsafe_hash_algo()` * The next two patches convert existing callers to use the new 'unsafe_hash_algo()' function, instead of switching between safe and unsafe_ variants of individual functions: csum-file.c: use unsafe_hash_algo() t/helper/test-hash.c: use unsafe_hash_algo() * The final patch drops the unsafe_ function variants following all callers being converted to use the new pattern: hash.h: drop unsafe_ function variants Thanks in advance for your review! [1]: https://lore.kernel.org/git/20241230-pks-meson-sha1-unsafe-v1-0-efb276e171f5@pks.im/ [2]: https://lore.kernel.org/git/20241107013915.GA961214@coredump.intra.peff.net/ Taylor Blau (8): t/helper/test-tool: implement sha1-unsafe helper csum-file: store the hash algorithm as a struct field csum-file.c: extract algop from hashfile_checksum_valid() hash.h: introduce `unsafe_hash_algo()` csum-file.c: use unsafe_hash_algo() t/helper/test-hash.c: use unsafe_hash_algo() csum-file: introduce hashfile_checkpoint_init() hash.h: drop unsafe_ function variants builtin/fast-import.c | 2 +- bulk-checkin.c | 9 ++++++--- csum-file.c | 40 +++++++++++++++++++++++++--------------- csum-file.h | 2 ++ hash.h | 28 ++++++++++++---------------- object-file.c | 41 ++++++++++++++++++++++++++--------------- t/helper/test-hash.c | 4 +++- t/helper/test-sha1.c | 7 ++++++- t/helper/test-sha1.sh | 38 ++++++++++++++++++++++---------------- t/helper/test-sha256.c | 2 +- t/helper/test-tool.c | 1 + t/helper/test-tool.h | 3 ++- 12 files changed, 107 insertions(+), 70 deletions(-) Range-diff against v2: 1: 4c1523a04f1 = 1: ae6b8c75294 t/helper/test-tool: implement sha1-unsafe helper 2: 99cc44895b5 ! 2: 2b79c76e471 csum-file: store the hash algorithm as a struct field @@ Commit message csum-file: store the hash algorithm as a struct field Throughout the hashfile API, we rely on a reference to 'the_hash_algo', - and call its _usnafe function variants directly. + and call its _unsafe function variants directly. Prepare for a future change where we may use a different 'git_hash_algo' pointer (instead of just relying on 'the_hash_algo' throughout) by 3: 1ffab2f8289 = 3: d7deb3f338e csum-file.c: extract algop from hashfile_checksum_valid() 4: 99dcbe2e716 ! 4: b6b2bb2714f hash.h: introduce `unsafe_hash_algo()` @@ Commit message if (unsafe) algop = unsafe_hash_algo(algop); - the_hash_algo->init_fn(...); - the_hash_algo->update_fn(...); - the_hash_algo->final_fn(...); + algop->init_fn(...); + algop->update_fn(...); + algop->final_fn(...); This removes the existing shortcoming by no longer forcing the caller to "remember" which variant of the hash functions it wants to call, only to @@ Commit message functions, this too will go away after subsequent commits remove all direct calls to the unsafe_ variants. + Note that hash_algo_by_ptr() needs an adjustment to allow passing in the + unsafe variant of a hash function. All other query functions on the + hash_algos array will continue to return the safe variants of any + function. + Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> @@ hash.h: struct git_hash_algo { }; extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS]; -@@ hash.h: static inline int hash_algo_by_ptr(const struct git_hash_algo *p) - return p - hash_algos; +@@ hash.h: int hash_algo_by_length(int len); + /* Identical, except for a pointer to struct git_hash_algo. */ + static inline int hash_algo_by_ptr(const struct git_hash_algo *p) + { +- return p - hash_algos; ++ size_t i; ++ for (i = 0; i < GIT_HASH_NALGOS; i++) { ++ const struct git_hash_algo *algop = &hash_algos[i]; ++ if (p == algop || (algop->unsafe && p == algop->unsafe)) ++ return i; ++ } ++ return GIT_HASH_UNKNOWN; } +const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop); 5: 2dcc2aa6803 = 5: ca67de80971 csum-file.c: use unsafe_hash_algo() 6: a2b9ef03080 = 6: 21b175b07ff t/helper/test-hash.c: use unsafe_hash_algo() 7: 94c07fd8a55 = 7: 850d4f407db csum-file: introduce hashfile_checkpoint_init() 8: f5579883816 ! 8: 0c4d006e6e8 hash.h: drop unsafe_ function variants @@ Commit message to - unsafe_hash_algo(the_hash_algo)->unsafe_init_fn(); + unsafe_hash_algo(the_hash_algo)->init_fn(); and similar, we can remove the scaffolding for the unsafe_ function variants and force callers to use the new unsafe_hash_algo() mechanic base-commit: 14650065b76b28d3cfa9453356ac5669b19e706e