mbox series

[v5,0/8] hash.h: support choosing a separate SHA-1 for non-cryptographic uses

Message ID cover.1727364141.git.me@ttaylorr.com (mailing list archive)
Headers show
Series hash.h: support choosing a separate SHA-1 for non-cryptographic uses | expand

Message

Taylor Blau Sept. 26, 2024, 3:22 p.m. UTC
Here is a minor reroll of mine and Peff's series to add a build-time
knob to allow selecting an alternative SHA-1 implementation for
non-cryptographic hashing within Git, starting with the `hashwrite()`
family of functions.

This version has changes which are limited to the commit messages only,
and amount to:

  - Updated rationale for skipping the collision check from within
    finalize_object_file() when handling loose objects.

  - Updated commit message with some over-eager s/fast/unsafe/
    conversions in the final patch.

I think most of the review dust has settled up to this point, so I'm
imagining that this is the final version of this series for now, or at
least very close to it. But if something new comes up, please let me
know!

Thanks in advance for your review!

Taylor Blau (8):
  finalize_object_file(): check for name collision before renaming
  finalize_object_file(): refactor unlink_or_warn() placement
  finalize_object_file(): implement collision check
  pack-objects: use finalize_object_file() to rename pack/idx/etc
  sha1: do not redefine `platform_SHA_CTX` and friends
  hash.h: scaffolding for _unsafe hashing variants
  Makefile: allow specifying a SHA-1 for non-cryptographic uses
  csum-file.c: use unsafe SHA-1 implementation when available

 Makefile                              |  25 ++++++
 block-sha1/sha1.h                     |   2 +
 csum-file.c                           |  18 ++--
 hash.h                                |  72 +++++++++++++++
 object-file.c                         | 124 ++++++++++++++++++++++++--
 object-file.h                         |   6 ++
 pack-write.c                          |   7 +-
 sha1/openssl.h                        |   2 +
 sha1dc_git.h                          |   3 +
 t/t5303-pack-corruption-resilience.sh |   7 +-
 tmp-objdir.c                          |  26 ++++--
 11 files changed, 266 insertions(+), 26 deletions(-)

Range-diff against v4:
-:  ---------- > 1:  6f1ee91fff finalize_object_file(): check for name collision before renaming
-:  ---------- > 2:  133047ca8c finalize_object_file(): refactor unlink_or_warn() placement
1:  ed9eeef851 ! 3:  41d38352a4 finalize_object_file(): implement collision check
    @@ Commit message
             object name, so checking for collisions at the content level doesn't
             add anything.
     
    -        This is why we do not bother to check the inflated object contents
    -        for collisions either, since either (a) the object contents have the
    -        fingerprint of a SHA-1 collision, in which case the collision
    -        detecting SHA-1 implementation used to hash the contents to give us
    -        a path would have already rejected it, or (b) the contents are part
    -        of a colliding pair which does not bear the same fingerprints of
    -        known collision attacks, in which case we would not have caught it
    -        anyway.
    +        Adding a content-level collision check would have to happen at a
    +        higher level than in finalize_object_file(), since (avoiding race
    +        conditions) writing an object loose which already exists in the
    +        repository will prevent us from even reaching finalize_object_file()
    +        via the object freshening code.
    +
    +        There is a collision check in index-pack via its `check_collision()`
    +        function, but there isn't an analogous function in unpack-objects,
    +        which just feeds the result to write_object_file().
     
             So skipping the collision check here does not change for better or
             worse the hardness of loose object writes.
    @@ Commit message
     
         Co-authored-by: Jeff King <peff@peff.net>
         Signed-off-by: Jeff King <peff@peff.net>
    +    Helped-by: Elijah Newren <newren@gmail.com>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## object-file.c ##
2:  3cc7f7b1f6 = 4:  611475d83e pack-objects: use finalize_object_file() to rename pack/idx/etc
3:  8f8ac0f5b0 = 5:  9913a5d971 sha1: do not redefine `platform_SHA_CTX` and friends
4:  d300e9c688 = 6:  65de6d724d hash.h: scaffolding for _unsafe hashing variants
5:  af8fd9aa4e = 7:  3884cd0e3a Makefile: allow specifying a SHA-1 for non-cryptographic uses
6:  4b83dd05e9 ! 8:  62abddf73d csum-file.c: use unsafe SHA-1 implementation when available
    @@ Commit message
     
         These callers only use the_hash_algo to produce a checksum, which we
         depend on for data integrity, but not for cryptographic purposes, so
    -    these callers are safe to use the unsafe (and potentially non-collision
    -    detecting) SHA-1 implementation.
    +    these callers are safe to use the unsafe (non-collision detecting) SHA-1
    +    implementation.
     
         To time this, I took a freshly packed copy of linux.git, and ran the
         following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
    @@ Commit message
             $ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
              59,164,001,176 (58.79%)  /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
     
    -    , and generate the resulting "clone" much unsafeer, in only 11.597 seconds
    +    , and generate the resulting "clone" much faster, in only 11.597 seconds
         of wall time, 11.37 seconds of user time, and 0.23 seconds of system
         time, for a ~40% speed-up.
     

base-commit: 6258f68c3c1092c901337895c864073dcdea9213

Comments

Elijah Newren Sept. 26, 2024, 10:47 p.m. UTC | #1
Hi...

On Thu, Sep 26, 2024 at 8:22 AM Taylor Blau <me@ttaylorr.com> wrote:
...
> I think most of the review dust has settled up to this point, so I'm
> imagining that this is the final version of this series for now, or at
> least very close to it. But if something new comes up, please let me
> know!
>
...
> Range-diff against v4:
> -:  ---------- > 1:  6f1ee91fff finalize_object_file(): check for name collision before renaming
> -:  ---------- > 2:  133047ca8c finalize_object_file(): refactor unlink_or_warn() placement
> 1:  ed9eeef851 ! 3:  41d38352a4 finalize_object_file(): implement collision check
>     @@ Commit message
>              object name, so checking for collisions at the content level doesn't
>              add anything.
>
>     -        This is why we do not bother to check the inflated object contents
>     -        for collisions either, since either (a) the object contents have the
>     -        fingerprint of a SHA-1 collision, in which case the collision
>     -        detecting SHA-1 implementation used to hash the contents to give us
>     -        a path would have already rejected it, or (b) the contents are part
>     -        of a colliding pair which does not bear the same fingerprints of
>     -        known collision attacks, in which case we would not have caught it
>     -        anyway.
>     +        Adding a content-level collision check would have to happen at a
>     +        higher level than in finalize_object_file(), since (avoiding race
>     +        conditions) writing an object loose which already exists in the
>     +        repository will prevent us from even reaching finalize_object_file()
>     +        via the object freshening code.
>     +
>     +        There is a collision check in index-pack via its `check_collision()`
>     +        function, but there isn't an analogous function in unpack-objects,
>     +        which just feeds the result to write_object_file().
>
>              So skipping the collision check here does not change for better or
>              worse the hardness of loose object writes.
>     @@ Commit message
>
>          Co-authored-by: Jeff King <peff@peff.net>
>          Signed-off-by: Jeff King <peff@peff.net>
>     +    Helped-by: Elijah Newren <newren@gmail.com>
>          Signed-off-by: Taylor Blau <me@ttaylorr.com>
>
>       ## object-file.c ##
> 2:  3cc7f7b1f6 = 4:  611475d83e pack-objects: use finalize_object_file() to rename pack/idx/etc
> 3:  8f8ac0f5b0 = 5:  9913a5d971 sha1: do not redefine `platform_SHA_CTX` and friends
> 4:  d300e9c688 = 6:  65de6d724d hash.h: scaffolding for _unsafe hashing variants
> 5:  af8fd9aa4e = 7:  3884cd0e3a Makefile: allow specifying a SHA-1 for non-cryptographic uses
> 6:  4b83dd05e9 ! 8:  62abddf73d csum-file.c: use unsafe SHA-1 implementation when available
>     @@ Commit message
>
>          These callers only use the_hash_algo to produce a checksum, which we
>          depend on for data integrity, but not for cryptographic purposes, so
>     -    these callers are safe to use the unsafe (and potentially non-collision
>     -    detecting) SHA-1 implementation.
>     +    these callers are safe to use the unsafe (non-collision detecting) SHA-1
>     +    implementation.
>
>          To time this, I took a freshly packed copy of linux.git, and ran the
>          following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
>     @@ Commit message
>              $ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
>               59,164,001,176 (58.79%)  /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
>
>     -    , and generate the resulting "clone" much unsafeer, in only 11.597 seconds
>     +    , and generate the resulting "clone" much faster, in only 11.597 seconds
>          of wall time, 11.37 seconds of user time, and 0.23 seconds of system
>          time, for a ~40% speed-up.

This round looks good to me.
Junio C Hamano Sept. 27, 2024, 12:44 a.m. UTC | #2
Elijah Newren <newren@gmail.com> writes:

> Hi...
>
>> ...
>>     -    , and generate the resulting "clone" much unsafeer, in only 11.597 seconds
>>     +    , and generate the resulting "clone" much faster, in only 11.597 seconds
>>          of wall time, 11.37 seconds of user time, and 0.23 seconds of system
>>          time, for a ~40% speed-up.
>
> This round looks good to me.

Thanks, both.
Jeff King Sept. 27, 2024, 3:57 a.m. UTC | #3
On Thu, Sep 26, 2024 at 11:22:27AM -0400, Taylor Blau wrote:

> Range-diff against v4:
> -:  ---------- > 1:  6f1ee91fff finalize_object_file(): check for name collision before renaming
> -:  ---------- > 2:  133047ca8c finalize_object_file(): refactor unlink_or_warn() placement
> 1:  ed9eeef851 ! 3:  41d38352a4 finalize_object_file(): implement collision check
> [...]
> 2:  3cc7f7b1f6 = 4:  611475d83e pack-objects: use finalize_object_file() to rename pack/idx/etc
> 3:  8f8ac0f5b0 = 5:  9913a5d971 sha1: do not redefine `platform_SHA_CTX` and friends
> 4:  d300e9c688 = 6:  65de6d724d hash.h: scaffolding for _unsafe hashing variants
> 5:  af8fd9aa4e = 7:  3884cd0e3a Makefile: allow specifying a SHA-1 for non-cryptographic uses
> 6:  4b83dd05e9 ! 8:  62abddf73d csum-file.c: use unsafe SHA-1 implementation when available

Funny range-diff, but I think those patches are here in the series as
expected.

Anyway, the result looks good to me!

-Peff