mbox series

[v3,0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants

Message ID cover.1737151386.git.me@ttaylorr.com (mailing list archive)
Headers show
Series hash: introduce unsafe_hash_algo(), drop unsafe_ variants | expand

Message

Taylor Blau Jan. 17, 2025, 10:03 p.m. UTC
(This series is based on 14650065b7 (RelNotes/2.48.0: fix typos etc.,
2025-01-07)).

The bulk of this series is unchanged since last time, save for a couple
of typofixes on spots noticed by Peff and Patrick Steinhardt. More
importantly, it fixes hash_algo_by_ptr() when passing the unsafe SHA-1
variant.

There were a couple of other ideas floated around, but I don't think
they panned out as we had hoped in practice, so I think that this
version should be good to go.

The original cover letter is as follows:

------------

This series implements an idea discussed in [2] which suggests that we
introduce a way to access a wrapped version of a 'struct git_hash_algo'
which represents the unsafe variant of that algorithm, rather than
having individual unsafe_ functions (like unsafe_init_fn() versus
init_fn(), etc.).

This approach is relatively straightforward to implement, and removes a
significant deficiency in the original implementation of
unsafe/non-cryptographic hash functions by making it impossible to
switch between safe- and unsafe variants of hash functions. It also
cleans up the sha1-unsafe test helper's implementation by removing a
large number of "if (unsafe)"-style conditionals.

The series is laid out as follows:

  * The first two patches prepare the hashfile API for the upcoming
    change:

      csum-file: store the hash algorithm as a struct field
      csum-file.c: extract algop from hashfile_checksum_valid()

  * The next patch implements the new 'unsafe_hash_algo()' function at
    the heart of this series' approach:

      hash.h: introduce `unsafe_hash_algo()`

  * The next two patches convert existing callers to use the new
    'unsafe_hash_algo()' function, instead of switching between safe and
    unsafe_ variants of individual functions:

      csum-file.c: use unsafe_hash_algo()
      t/helper/test-hash.c: use unsafe_hash_algo()

  * The final patch drops the unsafe_ function variants following all
    callers being converted to use the new pattern:

      hash.h: drop unsafe_ function variants

Thanks in advance for your review!

[1]: https://lore.kernel.org/git/20241230-pks-meson-sha1-unsafe-v1-0-efb276e171f5@pks.im/
[2]: https://lore.kernel.org/git/20241107013915.GA961214@coredump.intra.peff.net/

Taylor Blau (8):
  t/helper/test-tool: implement sha1-unsafe helper
  csum-file: store the hash algorithm as a struct field
  csum-file.c: extract algop from hashfile_checksum_valid()
  hash.h: introduce `unsafe_hash_algo()`
  csum-file.c: use unsafe_hash_algo()
  t/helper/test-hash.c: use unsafe_hash_algo()
  csum-file: introduce hashfile_checkpoint_init()
  hash.h: drop unsafe_ function variants

 builtin/fast-import.c  |  2 +-
 bulk-checkin.c         |  9 ++++++---
 csum-file.c            | 40 +++++++++++++++++++++++++---------------
 csum-file.h            |  2 ++
 hash.h                 | 28 ++++++++++++----------------
 object-file.c          | 41 ++++++++++++++++++++++++++---------------
 t/helper/test-hash.c   |  4 +++-
 t/helper/test-sha1.c   |  7 ++++++-
 t/helper/test-sha1.sh  | 38 ++++++++++++++++++++++----------------
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.c   |  1 +
 t/helper/test-tool.h   |  3 ++-
 12 files changed, 107 insertions(+), 70 deletions(-)

Range-diff against v2:
1:  4c1523a04f1 = 1:  ae6b8c75294 t/helper/test-tool: implement sha1-unsafe helper
2:  99cc44895b5 ! 2:  2b79c76e471 csum-file: store the hash algorithm as a struct field
    @@ Commit message
         csum-file: store the hash algorithm as a struct field
     
         Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
    -    and call its _usnafe function variants directly.
    +    and call its _unsafe function variants directly.
     
         Prepare for a future change where we may use a different 'git_hash_algo'
         pointer (instead of just relying on 'the_hash_algo' throughout) by
3:  1ffab2f8289 = 3:  d7deb3f338e csum-file.c: extract algop from hashfile_checksum_valid()
4:  99dcbe2e716 ! 4:  b6b2bb2714f hash.h: introduce `unsafe_hash_algo()`
    @@ Commit message
             if (unsafe)
               algop = unsafe_hash_algo(algop);
     
    -        the_hash_algo->init_fn(...);
    -        the_hash_algo->update_fn(...);
    -        the_hash_algo->final_fn(...);
    +        algop->init_fn(...);
    +        algop->update_fn(...);
    +        algop->final_fn(...);
     
         This removes the existing shortcoming by no longer forcing the caller to
         "remember" which variant of the hash functions it wants to call, only to
    @@ Commit message
         functions, this too will go away after subsequent commits remove all
         direct calls to the unsafe_ variants.
     
    +    Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
    +    unsafe variant of a hash function. All other query functions on the
    +    hash_algos array will continue to return the safe variants of any
    +    function.
    +
         Suggested-by: Jeff King <peff@peff.net>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ hash.h: struct git_hash_algo {
      };
      extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
      
    -@@ hash.h: static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
    - 	return p - hash_algos;
    +@@ hash.h: int hash_algo_by_length(int len);
    + /* Identical, except for a pointer to struct git_hash_algo. */
    + static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
    + {
    +-	return p - hash_algos;
    ++	size_t i;
    ++	for (i = 0; i < GIT_HASH_NALGOS; i++) {
    ++		const struct git_hash_algo *algop = &hash_algos[i];
    ++		if (p == algop || (algop->unsafe && p == algop->unsafe))
    ++			return i;
    ++	}
    ++	return GIT_HASH_UNKNOWN;
      }
      
     +const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop);
5:  2dcc2aa6803 = 5:  ca67de80971 csum-file.c: use unsafe_hash_algo()
6:  a2b9ef03080 = 6:  21b175b07ff t/helper/test-hash.c: use unsafe_hash_algo()
7:  94c07fd8a55 = 7:  850d4f407db csum-file: introduce hashfile_checkpoint_init()
8:  f5579883816 ! 8:  0c4d006e6e8 hash.h: drop unsafe_ function variants
    @@ Commit message
     
         to
     
    -        unsafe_hash_algo(the_hash_algo)->unsafe_init_fn();
    +        unsafe_hash_algo(the_hash_algo)->init_fn();
     
         and similar, we can remove the scaffolding for the unsafe_ function
         variants and force callers to use the new unsafe_hash_algo() mechanic

base-commit: 14650065b76b28d3cfa9453356ac5669b19e706e

Comments

Jeff King Jan. 18, 2025, 12:28 p.m. UTC | #1
On Fri, Jan 17, 2025 at 05:03:07PM -0500, Taylor Blau wrote:

>     +    Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
>     +    unsafe variant of a hash function. All other query functions on the
>     +    hash_algos array will continue to return the safe variants of any
>     +    function.
>     +
>          Suggested-by: Jeff King <peff@peff.net>
>          Signed-off-by: Taylor Blau <me@ttaylorr.com>
>      
>     @@ hash.h: struct git_hash_algo {
>       };
>       extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
>       
>     -@@ hash.h: static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>     - 	return p - hash_algos;
>     +@@ hash.h: int hash_algo_by_length(int len);
>     + /* Identical, except for a pointer to struct git_hash_algo. */
>     + static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>     + {
>     +-	return p - hash_algos;
>     ++	size_t i;
>     ++	for (i = 0; i < GIT_HASH_NALGOS; i++) {
>     ++		const struct git_hash_algo *algop = &hash_algos[i];
>     ++		if (p == algop || (algop->unsafe && p == algop->unsafe))
>     ++			return i;
>     ++	}
>     ++	return GIT_HASH_UNKNOWN;
>       }

OK, so this version introduces the loop we discussed earlier. I think we
can probably dismiss any performance loss as theoretical unless somebody
can think of a good way to measure. It seems like worrying about it is
probably a premature micro-optimization.

It is a little quirky that it loses the transitive nature of
hash_algo_by_ptr() and hash_algo_by_id(). So this is unsafe:

  /* start with unsafe variant */
  algo = unsafe_hash_algo(the_hash_algo);
  algo->init_fn(...);

  /* returns GIT_HASH_SHA1, even for the unsafe variant */
  id = hash_algo_by_ptr(algo);

  /* returns the safe variant */
  algo = hash_algo_by_id(id);

  /* oops, this is mix-and-matching */
  alog->update_fn(...);

Now obviously that sort of round-trip is a silly thing to do in a single
function. Could it happen across a larger call-chain, where the id is
stored in a struct and passed somewhere far away? I guess so, but it
also seems kind of unlikely.

It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN
for the unsafe variant. Then we'd at least get a loud error from the
caller (as opposed to the code before your patch, which is undefined
behavior). I dunno.

-Peff
Jeff King Jan. 18, 2025, 12:43 p.m. UTC | #2
On Sat, Jan 18, 2025 at 07:28:31AM -0500, Jeff King wrote:

> It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN
> for the unsafe variant. Then we'd at least get a loud error from the
> caller (as opposed to the code before your patch, which is undefined
> behavior). I dunno.

Doing this:

diff --git a/hash.h b/hash.h
index 68d4292e6d..ad2c919991 100644
--- a/hash.h
+++ b/hash.h
@@ -311,7 +311,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 	size_t i;
 	for (i = 0; i < GIT_HASH_NALGOS; i++) {
 		const struct git_hash_algo *algop = &hash_algos[i];
-		if (p == algop || (algop->unsafe && p == algop->unsafe))
+		if (p == algop)
 			return i;
 	}
 	return GIT_HASH_UNKNOWN;

on top of your series doesn't trigger any issues in the test suite.
Which I think shows that we would never have triggered the UB from the
original implementation, either[1]. Still, I think I prefer your loop to
being one errant function call away from undefined behavior. But maybe
returning UNKNOWN (as above) is a safer bet than losing the inverse
nature of the by_ptr() and by_id() functions, which could otherwise
cause hard-to-find interactions?

-Peff

[1] One thing that still puzzles me. If you further add a BUG() to that
    function when we'd return UNKNOWN, you can see that we trigger a few
    cases in the test suite where we pass in NULL (e.g., because we're
    not in a repo). I think these are already UB with the existing
    "p - hash_algos" implementation! We'll return what is effectively
    "p" cast to an int.

    E.g. if I do this:

    diff --git a/hash.h b/hash.h
    index 756166ce5e..ff31156416 100644
    --- a/hash.h
    +++ b/hash.h
    @@ -320,6 +320,8 @@ int hash_algo_by_length(int len);
     /* Identical, except for a pointer to struct git_hash_algo. */
     static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
     {
    +	if (!p)
    +		BUG("null hash, return is %d", (int)(p - hash_algos));
     	return p - hash_algos;
     }

     then t1517 shows grep dying with:

       BUG: hash.h:324: null hash, return is -1646754892

     I think it "works" because the backtrace is:

      [...]
      #5  0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324,
          fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335
      #6  0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324
      #7  0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392
      #8  0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums",
          baselen=0) at dir.c:1699
      [...]
      #16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257

     So we probably write a totally garbage algo field into the
     object_id, but nobody ever ends up looking at it. Probably
     something we should clean up, but way out of scope for your series.
     But I do think it reinforces that returning UNKNOWN is an
     improvement; it avoids undefined behavior and anybody who tried to
     use it should get a BUG() from calling git_hash_unknown_*()
     functions.
Junio C Hamano Jan. 22, 2025, 9:31 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> .... But maybe
> returning UNKNOWN (as above) is a safer bet than losing the inverse
> nature of the by_ptr() and by_id() functions, which could otherwise
> cause hard-to-find interactions?

Yeah, that sounds sensible.

>      I think it "works" because the backtrace is:
>
>       [...]
>       #5  0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324,
>           fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335
>       #6  0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324
>       #7  0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392
>       #8  0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums",
>           baselen=0) at dir.c:1699
>       [...]
>       #16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257
>
>      So we probably write a totally garbage algo field into the
>      object_id, but nobody ever ends up looking at it. Probably
>      something we should clean up, but way out of scope for your series.

Yeah, that looks bad, but I think it is fine to leave it outside the
series to fix that.

>      But I do think it reinforces that returning UNKNOWN is an
>      improvement; it avoids undefined behavior and anybody who tried to
>      use it should get a BUG() from calling git_hash_unknown_*()
>      functions.

Sounds like a sensitive direction to go in.

Thanks.