Message ID | 20200513005424.81369-37-sandals@crustytoothpaste.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | SHA-256 part 2/3: protocol functionality | expand |
On Wed, 13 May 2020 at 02:56, brian m. carlson <sandals@crustytoothpaste.net> wrote: > > git index-pack is usually run in a repository, but need not be. Since > packs don't contains information on the algorithm in use, instead > relying on context, add an option to index-pack to tell it which one > we're using in case someone runs it outside of a repository. > > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> > --- > builtin/index-pack.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/builtin/index-pack.c b/builtin/index-pack.c > index 7bea1fba52..89f4962a00 100644 > --- a/builtin/index-pack.c > +++ b/builtin/index-pack.c > @@ -1760,6 +1760,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix) > die(_("bad %s"), arg); > } else if (skip_prefix(arg, "--max-input-size=", &arg)) { > max_input_size = strtoumax(arg, NULL, 10); > + } else if (skip_prefix(arg, "--object-format=", &arg)) { > + int hash_algo = hash_algo_by_name(arg); > + if (hash_algo == GIT_HASH_UNKNOWN) > + die(_("unknown hash algorithm '%s'"), arg); > + repo_set_hash_algo(the_repository, hash_algo); > } else Patch 27 added `--hash` to `git show-index` and I almost commented on "hash" vs "object-format". In the end I figured the object format was a more technical (protocol) term. But now I wonder. Should we try to align such options from the start? Or is there perhaps a reason for those different approaches? Similar to an earlier patch where we modify `the_hash_algo` like this, I feel a bit nervous. What happens if you pass in a "wrong" algo here, i.e., SHA-1 in a SHA-256 repo? Or, given the motivation in the commit message, should this only be allowed if we really *are* outside a repo? Martin
On 2020-05-16 at 11:18:12, Martin Ågren wrote: > On Wed, 13 May 2020 at 02:56, brian m. carlson > <sandals@crustytoothpaste.net> wrote: > > > > git index-pack is usually run in a repository, but need not be. Since > > packs don't contains information on the algorithm in use, instead > > relying on context, add an option to index-pack to tell it which one > > we're using in case someone runs it outside of a repository. > > > > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> > > --- > > builtin/index-pack.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/builtin/index-pack.c b/builtin/index-pack.c > > index 7bea1fba52..89f4962a00 100644 > > --- a/builtin/index-pack.c > > +++ b/builtin/index-pack.c > > @@ -1760,6 +1760,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix) > > die(_("bad %s"), arg); > > } else if (skip_prefix(arg, "--max-input-size=", &arg)) { > > max_input_size = strtoumax(arg, NULL, 10); > > + } else if (skip_prefix(arg, "--object-format=", &arg)) { > > + int hash_algo = hash_algo_by_name(arg); > > + if (hash_algo == GIT_HASH_UNKNOWN) > > + die(_("unknown hash algorithm '%s'"), arg); > > + repo_set_hash_algo(the_repository, hash_algo); > > } else > > Patch 27 added `--hash` to `git show-index` and I almost commented on > "hash" vs "object-format". In the end I figured the object format was a > more technical (protocol) term. But now I wonder. Should we try to align > such options from the start? Or is there perhaps a reason for those > different approaches? I'll bring them into sync. > Similar to an earlier patch where we modify `the_hash_algo` like this, I > feel a bit nervous. What happens if you pass in a "wrong" algo here, > i.e., SHA-1 in a SHA-256 repo? Or, given the motivation in the commit > message, should this only be allowed if we really *are* outside a repo? Unfortunately, we can't prevent the user from being inside repository A, which is SHA-1, while invoking git index-pack on repository B, which is SHA-256. That is valid without --stdin, if uncommon, and it needs to be supported. I can prevent it from being used with --stdin, though. If you pass in a wrong algorithm, we usually blow up with an inflate error because we consume more bytes than expected with our ref deltas. I'm not aware of any cases where we segfault or access invalid memory; we just blow up in a nonobvious way. That's true, too, if you manually tamper with the algorithm in extensions.objectformat; usually we blow up (but not segfault) because the index is "corrupt".
On Sat, 16 May 2020 at 22:47, brian m. carlson <sandals@crustytoothpaste.net> wrote: > > On 2020-05-16 at 11:18:12, Martin Ågren wrote: > > On Wed, 13 May 2020 at 02:56, brian m. carlson > > <sandals@crustytoothpaste.net> wrote: > > > > > > git index-pack is usually run in a repository, but need not be. Since > > > packs don't contains information on the algorithm in use, instead > > > relying on context, add an option to index-pack to tell it which one > > > we're using in case someone runs it outside of a repository. > > > Similar to an earlier patch where we modify `the_hash_algo` like this, I > > feel a bit nervous. What happens if you pass in a "wrong" algo here, > > i.e., SHA-1 in a SHA-256 repo? Or, given the motivation in the commit > > message, should this only be allowed if we really *are* outside a repo? > > Unfortunately, we can't prevent the user from being inside repository A, > which is SHA-1, while invoking git index-pack on repository B, which is > SHA-256. Ah, I see. > That is valid without --stdin, if uncommon, and it needs to be > supported. I can prevent it from being used with --stdin, though. Hmm, that might make sense. I suppose it could quickly get out of control with bug reports coming in along the lines of "if I do this really crazy git index-pack invocation, I manage to mess things up". The easiest way to address this might be through documentation, i.e., "don't use this option", "for internal use" or even "to be used by the test suite only" for which there is even precedence in git-index-pack(1). On the other hand, if we need to detect such hash mismatch even once the SHA-256 work is 100% complete, then I suppose we really should try a bit to catch bad invocations. As a tangent, I see that v2.27.0 will come with `git init --object-format=<format>` and `GIT_DEFAULT_HASH_ALGORITHM`. The docs for the former mentions "(if enabled)". Should we add something more scary to those to make it clear that they shouldn't be used and that you basically shouldn't even try to figure out how to enable them? I can already see the tweets and blog posts a few weeks from now about how you can build Git from source setting a single switch, run git init --object-format=sha256 and you're in the future! Which will just lead to pain some days or weeks later.... "I've done lots of work. How do I convert my repo to SHA-1 so I can share it?"... We've added "experimental" things before and tried to document the experimental nature. Maybe here we're not even "experimental" -- more like "if you use this in production, you *will* suffer"? > If you pass in a wrong algorithm, we usually blow up with an inflate > error because we consume more bytes than expected with our ref deltas. > I'm not aware of any cases where we segfault or access invalid memory; > we just blow up in a nonobvious way. That's true, too, if you manually > tamper with the algorithm in extensions.objectformat; usually we blow up > (but not segfault) because the index is "corrupt". Ok, I see. I suppose "some time", we could tweak error messages to hint about an object-format mismatch, but I don't think that needs to block your work here now. Martin
On 2020-05-17 at 18:16:37, Martin Ågren wrote: > On Sat, 16 May 2020 at 22:47, brian m. carlson > <sandals@crustytoothpaste.net> wrote: > > That is valid without --stdin, if uncommon, and it needs to be > > supported. I can prevent it from being used with --stdin, though. > > Hmm, that might make sense. I suppose it could quickly get out of > control with bug reports coming in along the lines of "if I do this > really crazy git index-pack invocation, I manage to mess things up". The > easiest way to address this might be through documentation, i.e., "don't > use this option", "for internal use" or even "to be used by the test > suite only" for which there is even precedence in git-index-pack(1). > > On the other hand, if we need to detect such hash mismatch even once the > SHA-256 work is 100% complete, then I suppose we really should try a > bit to catch bad invocations. I can add documentation and a warning there. If we actually verified the checksum at the end of the pack first, then we'd be able to distinguish the two cases, because we'd try to compute a clearly invalid hash over the body, and the likelihood of it matching would be very small. We don't at the moment, for reasons I'm unclear about, but it's probably performance. > As a tangent, I see that v2.27.0 will come with `git init > --object-format=<format>` and `GIT_DEFAULT_HASH_ALGORITHM`. The docs for > the former mentions "(if enabled)". Should we add something more scary > to those to make it clear that they shouldn't be used and that you > basically shouldn't even try to figure out how to enable them? I can > already see the tweets and blog posts a few weeks from now about how you > can build Git from source setting a single switch, run > > git init --object-format=sha256 > > and you're in the future! Which will just lead to pain some days or > weeks later.... "I've done lots of work. How do I convert my repo to > SHA-1 so I can share it?"... > > We've added "experimental" things before and tried to document the > experimental nature. Maybe here we're not even "experimental" -- more > like "if you use this in production, you *will* suffer"? Well, the option is there, but it produces the following: % git init --object-format=sha256 fatal: The hash algorithm sha256 is not supported in this build. which can be distinguished from this: % git init --object-format=blake2b fatal: unknown hash algorithm 'blake2b' Right now it's pretty broken without this series, so you can't use it. I mean, you have the source and can remove the check, but it doesn't work as it stands, so I'm not too worried about people trying to do that at the moment. I'll sneak in some documentation for the end product, though.
diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 7bea1fba52..89f4962a00 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1760,6 +1760,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix) die(_("bad %s"), arg); } else if (skip_prefix(arg, "--max-input-size=", &arg)) { max_input_size = strtoumax(arg, NULL, 10); + } else if (skip_prefix(arg, "--object-format=", &arg)) { + int hash_algo = hash_algo_by_name(arg); + if (hash_algo == GIT_HASH_UNKNOWN) + die(_("unknown hash algorithm '%s'"), arg); + repo_set_hash_algo(the_repository, hash_algo); } else usage(index_pack_usage); continue;
git index-pack is usually run in a repository, but need not be. Since packs don't contains information on the algorithm in use, instead relying on context, add an option to index-pack to tell it which one we're using in case someone runs it outside of a repository. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- builtin/index-pack.c | 5 +++++ 1 file changed, 5 insertions(+)