diff mbox series

[36/44] builtin/index-pack: add option to specify hash algorithm

Message ID 20200513005424.81369-37-sandals@crustytoothpaste.net (mailing list archive)
State New, archived
Headers show
Series SHA-256 part 2/3: protocol functionality | expand

Commit Message

brian m. carlson May 13, 2020, 12:54 a.m. UTC
git index-pack is usually run in a repository, but need not be. Since
packs don't contains information on the algorithm in use, instead
relying on context, add an option to index-pack to tell it which one
we're using in case someone runs it outside of a repository.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 builtin/index-pack.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Martin Ågren May 16, 2020, 11:18 a.m. UTC | #1
On Wed, 13 May 2020 at 02:56, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> git index-pack is usually run in a repository, but need not be. Since
> packs don't contains information on the algorithm in use, instead
> relying on context, add an option to index-pack to tell it which one
> we're using in case someone runs it outside of a repository.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  builtin/index-pack.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 7bea1fba52..89f4962a00 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1760,6 +1760,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>                                         die(_("bad %s"), arg);
>                         } else if (skip_prefix(arg, "--max-input-size=", &arg)) {
>                                 max_input_size = strtoumax(arg, NULL, 10);
> +                       } else if (skip_prefix(arg, "--object-format=", &arg)) {
> +                               int hash_algo = hash_algo_by_name(arg);
> +                               if (hash_algo == GIT_HASH_UNKNOWN)
> +                                       die(_("unknown hash algorithm '%s'"), arg);
> +                               repo_set_hash_algo(the_repository, hash_algo);
>                         } else

Patch 27 added `--hash` to `git show-index` and I almost commented on
"hash" vs "object-format". In the end I figured the object format was a
more technical (protocol) term. But now I wonder. Should we try to align
such options from the start? Or is there perhaps a reason for those
different approaches?

Similar to an earlier patch where we modify `the_hash_algo` like this, I
feel a bit nervous. What happens if you pass in a "wrong" algo here,
i.e., SHA-1 in a SHA-256 repo? Or, given the motivation in the commit
message, should this only be allowed if we really *are* outside a repo?


Martin
brian m. carlson May 16, 2020, 8:47 p.m. UTC | #2
On 2020-05-16 at 11:18:12, Martin Ågren wrote:
> On Wed, 13 May 2020 at 02:56, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> >
> > git index-pack is usually run in a repository, but need not be. Since
> > packs don't contains information on the algorithm in use, instead
> > relying on context, add an option to index-pack to tell it which one
> > we're using in case someone runs it outside of a repository.
> >
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > ---
> >  builtin/index-pack.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> > index 7bea1fba52..89f4962a00 100644
> > --- a/builtin/index-pack.c
> > +++ b/builtin/index-pack.c
> > @@ -1760,6 +1760,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
> >                                         die(_("bad %s"), arg);
> >                         } else if (skip_prefix(arg, "--max-input-size=", &arg)) {
> >                                 max_input_size = strtoumax(arg, NULL, 10);
> > +                       } else if (skip_prefix(arg, "--object-format=", &arg)) {
> > +                               int hash_algo = hash_algo_by_name(arg);
> > +                               if (hash_algo == GIT_HASH_UNKNOWN)
> > +                                       die(_("unknown hash algorithm '%s'"), arg);
> > +                               repo_set_hash_algo(the_repository, hash_algo);
> >                         } else
> 
> Patch 27 added `--hash` to `git show-index` and I almost commented on
> "hash" vs "object-format". In the end I figured the object format was a
> more technical (protocol) term. But now I wonder. Should we try to align
> such options from the start? Or is there perhaps a reason for those
> different approaches?

I'll bring them into sync.

> Similar to an earlier patch where we modify `the_hash_algo` like this, I
> feel a bit nervous. What happens if you pass in a "wrong" algo here,
> i.e., SHA-1 in a SHA-256 repo? Or, given the motivation in the commit
> message, should this only be allowed if we really *are* outside a repo?

Unfortunately, we can't prevent the user from being inside repository A,
which is SHA-1, while invoking git index-pack on repository B, which is
SHA-256.  That is valid without --stdin, if uncommon, and it needs to be
supported.  I can prevent it from being used with --stdin, though.

If you pass in a wrong algorithm, we usually blow up with an inflate
error because we consume more bytes than expected with our ref deltas.
I'm not aware of any cases where we segfault or access invalid memory;
we just blow up in a nonobvious way.  That's true, too, if you manually
tamper with the algorithm in extensions.objectformat; usually we blow up
(but not segfault) because the index is "corrupt".
Martin Ågren May 17, 2020, 6:16 p.m. UTC | #3
On Sat, 16 May 2020 at 22:47, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2020-05-16 at 11:18:12, Martin Ågren wrote:
> > On Wed, 13 May 2020 at 02:56, brian m. carlson
> > <sandals@crustytoothpaste.net> wrote:
> > >
> > > git index-pack is usually run in a repository, but need not be. Since
> > > packs don't contains information on the algorithm in use, instead
> > > relying on context, add an option to index-pack to tell it which one
> > > we're using in case someone runs it outside of a repository.
>
> > Similar to an earlier patch where we modify `the_hash_algo` like this, I
> > feel a bit nervous. What happens if you pass in a "wrong" algo here,
> > i.e., SHA-1 in a SHA-256 repo? Or, given the motivation in the commit
> > message, should this only be allowed if we really *are* outside a repo?
>
> Unfortunately, we can't prevent the user from being inside repository A,
> which is SHA-1, while invoking git index-pack on repository B, which is
> SHA-256.

Ah, I see.

>  That is valid without --stdin, if uncommon, and it needs to be
> supported.  I can prevent it from being used with --stdin, though.

Hmm, that might make sense. I suppose it could quickly get out of
control with bug reports coming in along the lines of "if I do this
really crazy git index-pack invocation, I manage to mess things up". The
easiest way to address this might be through documentation, i.e., "don't
use this option", "for internal use" or even "to be used by the test
suite only" for which there is even precedence in git-index-pack(1).

On the other hand, if we need to detect such hash mismatch even once the
SHA-256 work is 100% complete, then I suppose we really should try a
bit to catch bad invocations.

As a tangent, I see that v2.27.0 will come with `git init
--object-format=<format>` and `GIT_DEFAULT_HASH_ALGORITHM`. The docs for
the former mentions "(if enabled)". Should we add something more scary
to those to make it clear that they shouldn't be used and that you
basically shouldn't even try to figure out how to enable them? I can
already see the tweets and blog posts a few weeks from now about how you
can build Git from source setting a single switch, run

  git init --object-format=sha256

and you're in the future! Which will just lead to pain some days or
weeks later.... "I've done lots of work. How do I convert my repo to
SHA-1 so I can share it?"...

We've added "experimental" things before and tried to document the
experimental nature. Maybe here we're not even "experimental" -- more
like "if you use this in production, you *will* suffer"?

> If you pass in a wrong algorithm, we usually blow up with an inflate
> error because we consume more bytes than expected with our ref deltas.
> I'm not aware of any cases where we segfault or access invalid memory;
> we just blow up in a nonobvious way.  That's true, too, if you manually
> tamper with the algorithm in extensions.objectformat; usually we blow up
> (but not segfault) because the index is "corrupt".

Ok, I see. I suppose "some time", we could tweak error messages to hint
about an object-format mismatch, but I don't think that needs to block
your work here now.

Martin
brian m. carlson May 17, 2020, 8:52 p.m. UTC | #4
On 2020-05-17 at 18:16:37, Martin Ågren wrote:
> On Sat, 16 May 2020 at 22:47, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> >  That is valid without --stdin, if uncommon, and it needs to be
> > supported.  I can prevent it from being used with --stdin, though.
> 
> Hmm, that might make sense. I suppose it could quickly get out of
> control with bug reports coming in along the lines of "if I do this
> really crazy git index-pack invocation, I manage to mess things up". The
> easiest way to address this might be through documentation, i.e., "don't
> use this option", "for internal use" or even "to be used by the test
> suite only" for which there is even precedence in git-index-pack(1).
> 
> On the other hand, if we need to detect such hash mismatch even once the
> SHA-256 work is 100% complete, then I suppose we really should try a
> bit to catch bad invocations.

I can add documentation and a warning there.

If we actually verified the checksum at the end of the pack first, then
we'd be able to distinguish the two cases, because we'd try to compute a
clearly invalid hash over the body, and the likelihood of it matching
would be very small.  We don't at the moment, for reasons I'm unclear
about, but it's probably performance.

> As a tangent, I see that v2.27.0 will come with `git init
> --object-format=<format>` and `GIT_DEFAULT_HASH_ALGORITHM`. The docs for
> the former mentions "(if enabled)". Should we add something more scary
> to those to make it clear that they shouldn't be used and that you
> basically shouldn't even try to figure out how to enable them? I can
> already see the tweets and blog posts a few weeks from now about how you
> can build Git from source setting a single switch, run
> 
>   git init --object-format=sha256
> 
> and you're in the future! Which will just lead to pain some days or
> weeks later.... "I've done lots of work. How do I convert my repo to
> SHA-1 so I can share it?"...
> 
> We've added "experimental" things before and tried to document the
> experimental nature. Maybe here we're not even "experimental" -- more
> like "if you use this in production, you *will* suffer"?

Well, the option is there, but it produces the following:

  % git init --object-format=sha256
  fatal: The hash algorithm sha256 is not supported in this build.

which can be distinguished from this:

  % git init --object-format=blake2b
  fatal: unknown hash algorithm 'blake2b'

Right now it's pretty broken without this series, so you can't use it.
I mean, you have the source and can remove the check, but it doesn't
work as it stands, so I'm not too worried about people trying to do that
at the moment.  I'll sneak in some documentation for the end product,
though.
diff mbox series

Patch

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 7bea1fba52..89f4962a00 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1760,6 +1760,11 @@  int cmd_index_pack(int argc, const char **argv, const char *prefix)
 					die(_("bad %s"), arg);
 			} else if (skip_prefix(arg, "--max-input-size=", &arg)) {
 				max_input_size = strtoumax(arg, NULL, 10);
+			} else if (skip_prefix(arg, "--object-format=", &arg)) {
+				int hash_algo = hash_algo_by_name(arg);
+				if (hash_algo == GIT_HASH_UNKNOWN)
+					die(_("unknown hash algorithm '%s'"), arg);
+				repo_set_hash_algo(the_repository, hash_algo);
 			} else
 				usage(index_pack_usage);
 			continue;