mbox series

[v2,0/2] object-name: fix resolution of object names containing curly braces

Message ID pull.1844.v2.git.1735949870.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series object-name: fix resolution of object names containing curly braces | expand

Message

Elijah Newren via GitGitGadget Jan. 4, 2025, 12:17 a.m. UTC
Maintainer note: these bugs both date back to 2006; neither is a regression
in this cycle.

Changes since v1:

 * Covered the ^{...} cases, and added a testcase for those
 * Added a second patch for another bug discovered by the same reporter,
   where branch:path/to/file/named/major-gaffed is interpreted as a request
   for a commit (namely affed) rather than a blob. (At least, assuming
   commit affed exists)

The second patch has some backward compatibility concerns. People used to be
able to do e.g. git show ${garbage}-g${hash}. I tightened it to
${valid_refname}-${number}-g${hash}, but do we want to allow e.g.
${valid_refname}-g${hash} (allowing the count to be omitted) or maybe even
allow a subset of invalid refnames?

Also for the second patch, while the repository the reporter found the issue
in was something else, I found two open source examples:

 * lore.git: git cat-file -t master:random/path/major-gaffed
 * git.git: git cat-file -t super-invalid~///\\.....@.lock-gfd0bba94e

Elijah Newren (2):
  object-name: fix resolution of object names containing curly braces
  object-name: be more strict in parsing describe-like output

 object-name.c       | 63 ++++++++++++++++++++++++++++++++++++++++++---
 t/t1006-cat-file.sh | 31 +++++++++++++++++++++-
 t/t6120-describe.sh | 22 ++++++++++++++++
 3 files changed, 111 insertions(+), 5 deletions(-)


base-commit: 063bcebf0c917140ca0e705cbe0fdea127e90086
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1844%2Fnewren%2Fobject-name-fix-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1844/newren/object-name-fix-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1844

Range-diff vs v1:

 1:  1671b773fcc ! 1:  13f68bebe90 object-name: fix resolution of object names containing curly braces
     @@ Commit message
          should succeed (assuming that branch had a README.md file, of course).
          However, the change in cce91a2caef9 (Change 'master@noon' syntax to
          'master@{noon}'., 2006-05-19) presumed that curly braces would always
     -    come after an '@' and be paired, causing 'foo{bar:README.md' to
     -    entirely miss the ':' and assume there's no object being referenced.
     +    come after an '@' or '^' and be paired, causing e.g. 'foo{bar:README.md'
     +    to entirely miss the ':' and assume there's no object being referenced.
          In short, git would report:
      
              fatal: Not a valid object name foo{bar:README.md
      
          Change the parsing to only make the assumption of paired curly braces
     -    immediately after a '@' character appears.
     +    immediately after either a '@' or '^' character appears.
      
     -    Add tests for both this and 'foo@@{...}' cases, which an initial version
     -    of this patch broke.
     +    Add tests for this, as well as for a few other test cases that initial
     +    versions of this patch broke:
     +      * 'foo@@{...}'
     +      * 'foo^{/${SEARCH_TEXT_WITH_COLON}}:${PATH}'
      
          Reported-by: Gabriel Amaral <gabriel-amaral@github.com>
          Helped-by: Michael Haggerty <mhagger@github.com>
     @@ object-name.c: static enum get_oid_result get_oid_with_context_1(struct reposito
       	}
       	for (cp = name, bracket_depth = 0; *cp; cp++) {
      -		if (*cp == '{')
     -+		if (*cp == '@' && *(cp+1) == '{') {
     ++		if (*(cp+1) == '{' && (*cp == '@' || *cp == '^')) {
      +			cp++;
       			bracket_depth++;
      -		else if (bracket_depth && *cp == '}')
     @@ object-name.c: static enum get_oid_result get_oid_with_context_1(struct reposito
       		struct object_id tree_oid;
      
       ## t/t1006-cat-file.sh ##
     +@@ t/t1006-cat-file.sh: test_expect_success "setup" '
     + 	git config extensions.objectformat $test_hash_algo &&
     + 	git config extensions.compatobjectformat $test_compat_hash_algo &&
     + 	echo_without_newline "$hello_content" > hello &&
     +-	git update-index --add hello
     ++	git update-index --add hello &&
     ++	git commit -m "add hello file"
     + '
     + 
     + run_blob_tests () {
      @@ t/t1006-cat-file.sh: test_expect_success FUNNYNAMES '--batch-check, -Z with newline in input' '
       	test_cmp expect actual
       '
       
     -+test_expect_success FUNNYNAMES 'setup with curly braches in input' '
     -+	git branch "foo{bar" &&
     -+	git branch "foo@"
     ++test_expect_success 'setup with curly braches in input' '
     ++	git branch "foo{bar" HEAD &&
     ++	git branch "foo@" HEAD
      +'
      +
     -+test_expect_success FUNNYNAMES 'object reference with curly brace' '
     ++test_expect_success 'object reference with curly brace' '
      +	git cat-file -p "foo{bar:hello" >actual &&
      +	git cat-file -p HEAD:hello >expect &&
      +	test_cmp expect actual
      +'
      +
     -+test_expect_success FUNNYNAMES 'object reference with at-sign' '
     ++test_expect_success 'object reference with at-sign' '
      +	git cat-file -p "foo@@{0}:hello" >actual &&
      +	git cat-file -p HEAD:hello >expect &&
      +	test_cmp expect actual
      +'
     ++
     ++test_expect_success 'setup with commit with colon' '
     ++	git commit-tree -m "testing: just a bunch of junk" HEAD^{tree} >out &&
     ++	git branch other $(cat out)
     ++'
     ++
     ++test_expect_success 'object reference via commit text search' '
     ++	git cat-file -p "other^{/testing:}:hello" >actual &&
     ++	git cat-file -p HEAD:hello >expect &&
     ++	test_cmp expect actual
     ++'
      +
       test_expect_success 'setup blobs which are likely to delta' '
       	test-tool genrandom foo 10240 >foo &&
 -:  ----------- > 2:  31f1c37b31a object-name: be more strict in parsing describe-like output

Comments

Junio C Hamano Jan. 4, 2025, 2:35 p.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  * Added a second patch for another bug discovered by the same reporter,
>    where branch:path/to/file/named/major-gaffed is interpreted as a request
>    for a commit (namely affed) rather than a blob. (At least, assuming
>    commit affed exists)
>
> The second patch has some backward compatibility concerns. People used to be
> able to do e.g. git show ${garbage}-g${hash}. I tightened it to
> ${valid_refname}-${number}-g${hash}, but do we want to allow e.g.
> ${valid_refname}-g${hash} (allowing the count to be omitted) or maybe even
> allow a subset of invalid refnames?

My take on it is that it is an absolute no-no if we require that
${valid_refname} exists locally, and it is still iffy if we checked
${valid_refname} with check_format() (because the definition of
validity can change over time, and we would not know the rules that
were valid back when the reference to the commit was written).

Otherwise a tightened rule would make "${garbage}-g${hash}" less
useful to copy-and-paste from a text file to command line.

In general what would we do if a string can be interpreted in
multiple ways in _different_ parts of the object-name codepaths.  We
all know that "affed" would trigger the "ambiguous object name"
error if there are more than one object whose object name begins
with "affed", but if "${garbage}-gaffed" can be interpreted as the
name of an object whose object name begins with "affed" and also can
be interpreted as the name of another object that sits at a path
that ends with "-gaffed" in some tree object, regardless of how the
leading part "${garbage}" looks like, it would be desirable if we
declared such a string as "ambiguous" the same way.
Elijah Newren Jan. 4, 2025, 3:55 p.m. UTC | #2
On Sat, Jan 4, 2025 at 6:35 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> >  * Added a second patch for another bug discovered by the same reporter,
> >    where branch:path/to/file/named/major-gaffed is interpreted as a request
> >    for a commit (namely affed) rather than a blob. (At least, assuming
> >    commit affed exists)
> >
> > The second patch has some backward compatibility concerns. People used to be
> > able to do e.g. git show ${garbage}-g${hash}. I tightened it to
> > ${valid_refname}-${number}-g${hash}, but do we want to allow e.g.
> > ${valid_refname}-g${hash} (allowing the count to be omitted) or maybe even
> > allow a subset of invalid refnames?
>
> My take on it is that it is an absolute no-no if we require that
> ${valid_refname} exists locally, and it is still iffy if we checked
> ${valid_refname} with check_format() (because the definition of
> validity can change over time, and we would not know the rules that
> were valid back when the reference to the commit was written).

Fair enough.  However...

> Otherwise a tightened rule would make "${garbage}-g${hash}" less
> useful to copy-and-paste from a text file to command line.
>
> In general what would we do if a string can be interpreted in
> multiple ways in _different_ parts of the object-name codepaths.  We
> all know that "affed" would trigger the "ambiguous object name"
> error if there are more than one object whose object name begins
> with "affed", but if "${garbage}-gaffed" can be interpreted as the
> name of an object whose object name begins with "affed" and also can
> be interpreted as the name of another object that sits at a path
> that ends with "-gaffed" in some tree object, regardless of how the
> leading part "${garbage}" looks like, it would be desirable if we
> declared such a string as "ambiguous" the same way.

How would that be desirable?  There's no possible way to disambiguate.
While abbreviated revisions can just be modified to be less
abbreviated, paths cannot be spelled any other way.  How would you
spell
      master:path/to/who-gabbed
in a "less ambiguous" way to differentiate it from commit abbed?  As
far as I can tell, this proposal just leaves the user stuck with an
error with no way to get the path they want.

If you don't like check_format() being called on the leading part of
the string, can we at least enforce that there is no ':', so that we
can successfully request explicit paths of given revisions and know
that we'll get them?  (That'd disallow e.g. next^{/doc:}-12-gabbed,
but that clearly was never a valid describe output anyway.)
Junio C Hamano Jan. 4, 2025, 5:51 p.m. UTC | #3
Elijah Newren <newren@gmail.com> writes:

>> In general what would we do if a string can be interpreted in
>> multiple ways in _different_ parts of the object-name codepaths.  We
>> all know that "affed" would trigger the "ambiguous object name"
>> error if there are more than one object whose object name begins
>> with "affed", but if "${garbage}-gaffed" can be interpreted as the
>> name of an object whose object name begins with "affed" and also can
>> be interpreted as the name of another object that sits at a path
>> that ends with "-gaffed" in some tree object, regardless of how the
>> leading part "${garbage}" looks like, it would be desirable if we
>> declared such a string as "ambiguous" the same way.
>
> How would that be desirable?

In "a:b/c-0-gabcde", *if* "a:b/c-0" *were* a valid way to spell a
valid refname, then the whole thing is an ambiguous object name,
i.e. it could be "something reachable from object 'a:b/c' whose
object name begins with abcde", or it could be "object at the path
b/c-0-gabcde in a tree-ish a", and in such a case our code should be
set up to allow us to give a "that's ambiguous" error, instead of
yielding the first possible interpretation (i.e. if we happen to
have checked the describe name first and "$garbage-0-gabcde", we
yield "abcde" before even checking if $garbage part gives a possible
leading part of a tree-ish; but if a future refactoring of the code
flips the order of checking, we may end up yielding 'an object at a
path, which ends with -0-gabcde, sitting in a tree-ish', without
checking if that could be a valid describe name).

Of course we should make sure that the syntax cannot be ambiguous
when we introduce a new syntax to represent a new feature ;-)

Now, I think ":" has always been a byte that is invalid as a part of
any refname, so "${garbage}-gabcde" with a colon in ${garbage}
cannot be a describe name.  So in the above about "a:b/c-0" is an
impossible example, but I was wondering more about the general
principle we should follow.
Elijah Newren Jan. 4, 2025, 6:55 p.m. UTC | #4
On Sat, Jan 4, 2025 at 9:51 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> In general what would we do if a string can be interpreted in
> >> multiple ways in _different_ parts of the object-name codepaths.  We
> >> all know that "affed" would trigger the "ambiguous object name"
> >> error if there are more than one object whose object name begins
> >> with "affed", but if "${garbage}-gaffed" can be interpreted as the
> >> name of an object whose object name begins with "affed" and also can
> >> be interpreted as the name of another object that sits at a path
> >> that ends with "-gaffed" in some tree object, regardless of how the
> >> leading part "${garbage}" looks like, it would be desirable if we
> >> declared such a string as "ambiguous" the same way.
> >
> > How would that be desirable?
>
> In "a:b/c-0-gabcde", *if* "a:b/c-0" *were* a valid way to spell a
> valid refname, then the whole thing is an ambiguous object name,
> i.e. it could be "something reachable from object 'a:b/c' whose
> object name begins with abcde", or it could be "object at the path
> b/c-0-gabcde in a tree-ish a", and in such a case our code should be
> set up to allow us to give a "that's ambiguous" error, instead of
> yielding the first possible interpretation (i.e. if we happen to
> have checked the describe name first and "$garbage-0-gabcde", we
> yield "abcde" before even checking if $garbage part gives a possible
> leading part of a tree-ish; but if a future refactoring of the code
> flips the order of checking, we may end up yielding 'an object at a
> path, which ends with -0-gabcde, sitting in a tree-ish', without
> checking if that could be a valid describe name).
>
> Of course we should make sure that the syntax cannot be ambiguous
> when we introduce a new syntax to represent a new feature ;-)
>
> Now, I think ":" has always been a byte that is invalid as a part of
> any refname, so "${garbage}-gabcde" with a colon in ${garbage}
> cannot be a describe name.  So in the above about "a:b/c-0" is an
> impossible example, but I was wondering more about the general
> principle we should follow.

Are you only interested in the general principle for the "possible
examples"?  What about the general principle for the "impossible
examples"?  Things like "master:path/to/who-gabbed" are unambiguously
a reference to a path within a revision that cannot be spelled any
alternate way, but the code currently gives the user a commit instead.
What's the right way to fix these "impossible examples"?  I've given
three proposals and implemented the first of them:
  - ${POSSIBLY_VALID_REFNAME}-${INTEGER}-g${HASH}
  - ${POSSIBLY_VALID_REFNAME}-g${HASH}
  - ${ANYTHING_WITHOUT_A_COLON}-g${HASH}

You said you don't like the first two because check_refname() rules
might change, and not commented on the third.

Also, as far as I can tell, the set of "possible examples" you are
focusing on is currently the empty set.  A change of syntax might in
the future expand that to a non-empty-set, and then bring us backward
compatibility headaches because we have been allowing
"${garbage}-g${hash}" to mean a reference to ${hash} and we'd then
have to deal with it becoming ambiguous (and potentially also having
no way to disambiguate those cases, similar to how if colon is allowed
in garbage then we have no way to disambiguate paths).  If we want to
allow future object naming extensions, it seems like we should lock
down and rule out as many existing forms of known ${garbage} as we
can, but that'd push us towards the
${POSSIBLY_VALID_REFNAME}-${INTEGER}-g${HASH} solution I implemented
that you don't seem to like.  Is there a middle ground that you do
like?
Junio C Hamano Jan. 6, 2025, 5:29 p.m. UTC | #5
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Maintainer note: these bugs both date back to 2006; neither is a regression
> in this cycle.

While I was preparing today's -rc2 release, I noticed that this
change broke some of my release scripts with

    $ git rev-parse --verify v2.48.0-rc2-161-g6c2274cdbc^0
    fatal: Needed a single revision

which is the construct that has been there almost forever.  Its
expected output is

    $ git rev-parse --verify v2.48.0-rc2-161-g6c2274cdbc^0
    6c2274cdbca14b7eb70fb182ffac80bf6950e137

The series seems to need a bit more work.

Thanks.
Elijah Newren Jan. 6, 2025, 7:26 p.m. UTC | #6
On Mon, Jan 6, 2025 at 9:29 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Maintainer note: these bugs both date back to 2006; neither is a regression
> > in this cycle.
>
> While I was preparing today's -rc2 release, I noticed that this
> change broke some of my release scripts with
>
>     $ git rev-parse --verify v2.48.0-rc2-161-g6c2274cdbc^0
>     fatal: Needed a single revision
>
> which is the construct that has been there almost forever.  Its
> expected output is
>
>     $ git rev-parse --verify v2.48.0-rc2-161-g6c2274cdbc^0
>     6c2274cdbca14b7eb70fb182ffac80bf6950e137
>
> The series seems to need a bit more work.

Gah, I made sure to copy the object name into a string buf, so I could
operate on just the relevant part, but then ignored that and operated
on the full thing.

This fixes it:

diff --git a/object-name.c b/object-name.c
index 614520954c7..cb96a0e6161 100644
--- a/object-name.c
+++ b/object-name.c
@@ -1318,7 +1318,7 @@ static int ref_and_count_parts_valid(const char
*name, int len)
        len = cp - name;
        strbuf_init(&sb, len);
        strbuf_add(&sb, name, len);
-       ret = !check_refname_format(name, flags);
+       ret = !check_refname_format(sb.buf, flags);
        strbuf_release(&sb);
        return ret;
 }

I'll include it with all my other fixes in a reroll, which I'll
probably send out after 2.48 to avoid distracting from the release.
Junio C Hamano Jan. 6, 2025, 8:38 p.m. UTC | #7
Elijah Newren <newren@gmail.com> writes:

> On Mon, Jan 6, 2025 at 9:29 AM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>> > Maintainer note: these bugs both date back to 2006; neither is a regression
>> > in this cycle.
>>
>> While I was preparing today's -rc2 release, I noticed that this
>> change broke some of my release scripts with
>>
>>     $ git rev-parse --verify v2.48.0-rc2-161-g6c2274cdbc^0
>>     fatal: Needed a single revision
>>
>> which is the construct that has been there almost forever.  Its
>> expected output is
>>
>>     $ git rev-parse --verify v2.48.0-rc2-161-g6c2274cdbc^0
>>     6c2274cdbca14b7eb70fb182ffac80bf6950e137
>>
>> The series seems to need a bit more work.
>
> Gah, I made sure to copy the object name into a string buf, so I could
> operate on just the relevant part, but then ignored that and operated
> on the full thing.
>
> This fixes it:
>
> diff --git a/object-name.c b/object-name.c
> index 614520954c7..cb96a0e6161 100644
> --- a/object-name.c
> +++ b/object-name.c
> @@ -1318,7 +1318,7 @@ static int ref_and_count_parts_valid(const char
> *name, int len)
>         len = cp - name;
>         strbuf_init(&sb, len);
>         strbuf_add(&sb, name, len);
> -       ret = !check_refname_format(name, flags);
> +       ret = !check_refname_format(sb.buf, flags);
>         strbuf_release(&sb);
>         return ret;
>  }
>
> I'll include it with all my other fixes in a reroll, which I'll
> probably send out after 2.48 to avoid distracting from the release.

In existing tests, we seem to be lacking coverage to notice this
breakage, so let's make sure we add one or two.

Thanks.