ref-filter: don't look for objects when outside of a repository
diff mbox series

Message ID 20181114122725.18659-1-szeder.dev@gmail.com
State New
Headers show
Series
  • ref-filter: don't look for objects when outside of a repository
Related show

Commit Message

SZEDER Gábor Nov. 14, 2018, 12:27 p.m. UTC
The command 'git ls-remote --sort=authordate <remote>' segfaults when
run outside of a repository, ever since the introduction of its
'--sort' option in 1fb20dfd8e (ls-remote: create '--sort' option,
2018-04-09).

While in general the 'git ls-remote' command can be run outside of a
repository just fine, its '--sort=<key>' option with certain keys does
require access to the referenced objects.  This sorting is implemented
using the generic ref-filter sorting facility, which already handles
missing objects gracefully with the appropriate 'missing object
deadbeef for HEAD' message.  However, being generic means that it
checks replace refs while trying to retrieve an object, and while
doing so it accesses the 'git_replace_ref_base' variable, which has
not been initialized and is still a NULL pointer when outside of a
repository, thus causing the segfault.

Make ref-filter more careful upfront while parsing the format string,
and make it error out when encountering a format atom requiring object
access when we are not in a repository.  Also add a test to ensure
that 'git ls-remote --sort' fails gracefully when executed outside of
a repository.

Reported-by: H.Merijn Brand <h.m.brand@xs4all.nl>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---

On Tue, Sep 25, 2018 at 01:57:38PM -0700, Junio C Hamano wrote:
> SZEDER Gábor <szeder.dev@gmail.com> writes:
> 
> > However, if we go for a more informative error message, then wouldn't
> > it be better to add this condition in populate_value() before it even
> > calls get_object()?  Then we could also add the problematic format
> > specifier to the error message (I think, but didn't actually check),
> > just in case someone specified multiple sort keys.
> 
> Even though I suspect that verify_ref_format() is the logically the
> right place to do this (after all, it is about seeing if the format
> makes sense, and a format that requires an object access used
> outside a repository should trigger an verification error), doing
> that in populate_value() probably strikes the best balance, I would
> think.

We are dealing with format specifiers used for sorting here, and those
don't go through verify_ref_format().

So how about this patch instead?

I think it will catch all cases where a user would try to use a format
specifier, for any purpose, requiring object access outside of a
repository (though I don't know whether there are any other cases
besides 'git ls-remote --sort=...'; but perhaps in the future
'ls-remote' will get a '--format' option as well), and it does so
before performing a potentially expensive query to the remote.  OTOH,
it won't change the documented "missing object" error message when run
inside a repo but the necessary object is indeed missing.


 ref-filter.c         | 4 ++++
 t/t5512-ls-remote.sh | 6 ++++++
 2 files changed, 10 insertions(+)

Comments

Jeff King Nov. 15, 2018, 9:38 a.m. UTC | #1
On Wed, Nov 14, 2018 at 01:27:25PM +0100, SZEDER Gábor wrote:

> The command 'git ls-remote --sort=authordate <remote>' segfaults when
> run outside of a repository, ever since the introduction of its
> '--sort' option in 1fb20dfd8e (ls-remote: create '--sort' option,
> 2018-04-09).
> 
> While in general the 'git ls-remote' command can be run outside of a
> repository just fine, its '--sort=<key>' option with certain keys does
> require access to the referenced objects.  This sorting is implemented
> using the generic ref-filter sorting facility, which already handles
> missing objects gracefully with the appropriate 'missing object
> deadbeef for HEAD' message.  However, being generic means that it
> checks replace refs while trying to retrieve an object, and while
> doing so it accesses the 'git_replace_ref_base' variable, which has
> not been initialized and is still a NULL pointer when outside of a
> repository, thus causing the segfault.
> 
> Make ref-filter more careful upfront while parsing the format string,
> and make it error out when encountering a format atom requiring object
> access when we are not in a repository.  Also add a test to ensure
> that 'git ls-remote --sort' fails gracefully when executed outside of
> a repository.

Thanks for picking up this loose end. I like the general approach here,
but...

> diff --git a/ref-filter.c b/ref-filter.c
> index 0c45ed9d94..a1290659af 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -534,6 +534,10 @@ static int parse_ref_filter_atom(const struct ref_format *format,
>  	if (ARRAY_SIZE(valid_atom) <= i)
>  		return strbuf_addf_ret(err, -1, _("unknown field name: %.*s"),
>  				       (int)(ep-atom), atom);
> +	if (valid_atom[i].source != SOURCE_NONE && !have_git_dir())
> +		return strbuf_addf_ret(err, -1,
> +				       _("not a git repository, but the field '%.*s' requires access to object data"),
> +				       (int)(ep-atom), atom);

Is SOURCE_NONE a complete match for what we want?

I see problems in both directions:

 - sorting by "objectname" works now, but it's marked with SOURCE_OBJ,
   and would be forbidden with your patch.  I'm actually not sure if
   SOURCE_OBJ is accurate; we shouldn't need to access the object to
   show it (and we are probably wasting effort loading the full contents
   for tools like for-each-ref).

   However, that's not the full story. For objectname:short, it _does_ call
   find_unique_abbrev(). So we expect to have an object directory.

 - sorting by "HEAD" hits a BUG(), and would still be allowed with your
   patch.

So I like the idea here that the particular atoms would tell us whether
they're going to need to be in a repository or not, but I think the
annotations have to be cleaned up first.

> diff --git a/t/t5512-ls-remote.sh b/t/t5512-ls-remote.sh
> index 91ee6841c1..32e722db2e 100755
> --- a/t/t5512-ls-remote.sh
> +++ b/t/t5512-ls-remote.sh
> @@ -302,6 +302,12 @@ test_expect_success 'ls-remote works outside repository' '
>  	nongit git ls-remote dst.git
>  '
>  
> +test_expect_success 'ls-remote --sort fails gracefully outside repository' '
> +	# Use a sort key that requires access to the referenced objects.
> +	nongit test_must_fail git ls-remote --sort=authordate "$TRASH_DIRECTORY" 2>err &&
> +	test_i18ngrep "^fatal: not a git repository, but the field '\''authordate'\'' requires access to object data" err
> +'

Regardless of our solution, we probably want to add an extra test making
sure that something vanilla like:

  nongit git ls-remote --sort=v:refname "$TRASH_DIRECTORY"

continues to work (we do test ls-remote outside a repo already, but not
with a sort specifier).

-Peff
Jeff King Nov. 15, 2018, 9:43 a.m. UTC | #2
On Thu, Nov 15, 2018 at 04:38:44AM -0500, Jeff King wrote:

> Is SOURCE_NONE a complete match for what we want?
> 
> I see problems in both directions:
> 
>  - sorting by "objectname" works now, but it's marked with SOURCE_OBJ,
>    and would be forbidden with your patch.  I'm actually not sure if
>    SOURCE_OBJ is accurate; we shouldn't need to access the object to
>    show it (and we are probably wasting effort loading the full contents
>    for tools like for-each-ref).
> 
>    However, that's not the full story. For objectname:short, it _does_ call
>    find_unique_abbrev(). So we expect to have an object directory.

Oops, I'm apparently bad at reading. It is in fact SOURCE_OTHER, which
makes sense (outside of this whole "--sort outside a repo thing").

But we'd ideally distinguish between "objectname" (which should be OK
outside a repo) and "objectname:short" (which currently segfaults).

-Peff
Junio C Hamano Nov. 16, 2018, 5:09 a.m. UTC | #3
Jeff King <peff@peff.net> writes:

> On Thu, Nov 15, 2018 at 04:38:44AM -0500, Jeff King wrote:
>
>> Is SOURCE_NONE a complete match for what we want?
>> 
>> I see problems in both directions:
>> 
>>  - sorting by "objectname" works now, but it's marked with SOURCE_OBJ,
>>    and would be forbidden with your patch.  I'm actually not sure if
>>    SOURCE_OBJ is accurate; we shouldn't need to access the object to
>>    show it (and we are probably wasting effort loading the full contents
>>    for tools like for-each-ref).
>> 
>>    However, that's not the full story. For objectname:short, it _does_ call
>>    find_unique_abbrev(). So we expect to have an object directory.
>
> Oops, I'm apparently bad at reading. It is in fact SOURCE_OTHER, which
> makes sense (outside of this whole "--sort outside a repo thing").
>
> But we'd ideally distinguish between "objectname" (which should be OK
> outside a repo) and "objectname:short" (which currently segfaults).

Arguably, use of ref-filter machinery in ls-remote, whether it is
given from inside or outside a repo, was a mistake in 1fb20dfd
("ls-remote: create '--sort' option", 2018-04-09), as the whole
point of "ls-remote" is to peek the list of refs and it is perfectly
normal that the objects listed are not available.

"ls-remote --sort=authorname" that is run in a repository may not
segfault on a ref that points at a yet-to-be-fetched commit, but it
cannot be doing anything sensible.  Is it still better to silently
produce a nonsense result than refusing to --sort no matter what the
sort keys are, whether we are inside or outside a repository?
Jeff King Nov. 16, 2018, 8:56 a.m. UTC | #4
On Fri, Nov 16, 2018 at 02:09:07PM +0900, Junio C Hamano wrote:

> >> I see problems in both directions:
> >> 
> >>  - sorting by "objectname" works now, but it's marked with SOURCE_OBJ,
> >>    and would be forbidden with your patch.  I'm actually not sure if
> >>    SOURCE_OBJ is accurate; we shouldn't need to access the object to
> >>    show it (and we are probably wasting effort loading the full contents
> >>    for tools like for-each-ref).
> >> 
> >>    However, that's not the full story. For objectname:short, it _does_ call
> >>    find_unique_abbrev(). So we expect to have an object directory.
> >
> > Oops, I'm apparently bad at reading. It is in fact SOURCE_OTHER, which
> > makes sense (outside of this whole "--sort outside a repo thing").
> >
> > But we'd ideally distinguish between "objectname" (which should be OK
> > outside a repo) and "objectname:short" (which currently segfaults).
> 
> Arguably, use of ref-filter machinery in ls-remote, whether it is
> given from inside or outside a repo, was a mistake in 1fb20dfd
> ("ls-remote: create '--sort' option", 2018-04-09), as the whole
> point of "ls-remote" is to peek the list of refs and it is perfectly
> normal that the objects listed are not available.

I think it's conceptually reasonable to use the ref-filter machinery.
It's just that it was underprepared to handle this out-of-repo case. I
think we're not too far off, though.

> "ls-remote --sort=authorname" that is run in a repository may not
> segfault on a ref that points at a yet-to-be-fetched commit, but it
> cannot be doing anything sensible.  Is it still better to silently
> produce a nonsense result than refusing to --sort no matter what the
> sort keys are, whether we are inside or outside a repository?

I don't think we produce silent nonsense in the current code (or after
any of the discussed solutions), either in a repo or out. We say "fatal:
missing object ..." inside a repo if the request cannot be fulfilled.
That's not incredibly illuminating, perhaps, but it means we fulfill
whatever we _can_ on behalf of the user's request, and bail otherwise.

If you are arguing that even in a repo we should reject "authorname"
early (just as we would outside of a repo), I could buy that.
Technically we can make it work sometimes (if we happen to have fetched
everything the other side has), but behaving consistently (and with a
decent error message) may trump that.

-Peff
Junio C Hamano Nov. 16, 2018, 10:07 a.m. UTC | #5
Jeff King <peff@peff.net> writes:

> If you are arguing that even in a repo we should reject "authorname"
> early (just as we would outside of a repo), I could buy that.

Yup, that (and replace 'authorname' with anything that won't work
with missing objects) for consistency was what I meant.
SZEDER Gábor Nov. 16, 2018, 1:16 p.m. UTC | #6
On Fri, Nov 16, 2018 at 02:09:07PM +0900, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > On Thu, Nov 15, 2018 at 04:38:44AM -0500, Jeff King wrote:
> >
> >> Is SOURCE_NONE a complete match for what we want?
> >> 
> >> I see problems in both directions:
> >> 
> >>  - sorting by "objectname" works now, but it's marked with SOURCE_OBJ,
> >>    and would be forbidden with your patch.  I'm actually not sure if
> >>    SOURCE_OBJ is accurate; we shouldn't need to access the object to
> >>    show it (and we are probably wasting effort loading the full contents
> >>    for tools like for-each-ref).
> >> 
> >>    However, that's not the full story. For objectname:short, it _does_ call
> >>    find_unique_abbrev(). So we expect to have an object directory.
> >
> > Oops, I'm apparently bad at reading. It is in fact SOURCE_OTHER, which
> > makes sense (outside of this whole "--sort outside a repo thing").
> >
> > But we'd ideally distinguish between "objectname" (which should be OK
> > outside a repo) and "objectname:short" (which currently segfaults).
> 
> Arguably, use of ref-filter machinery in ls-remote, whether it is
> given from inside or outside a repo, was a mistake in 1fb20dfd
> ("ls-remote: create '--sort' option", 2018-04-09), as the whole
> point of "ls-remote" is to peek the list of refs and it is perfectly
> normal that the objects listed are not available.

I hope that one day 'git ls-remote' will learn to '--format=...' its
output, and I think that (re)using the ref-filter machinery would be
the right way to go to achive that.  Sure, ref-filter supports a lot
of format specifiers that don't at all make sense in the context of
'ls-remote' (perhaps we should have a dedicated set of valid_atoms for
that), but I think it's perfectly reasonable to do something like:

  git ls-remote --format=%(refname:strip=2) remote

A concrete use case for that could be to eliminate the last remaining
shell loops from refs completion.

Patch
diff mbox series

diff --git a/ref-filter.c b/ref-filter.c
index 0c45ed9d94..a1290659af 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -534,6 +534,10 @@  static int parse_ref_filter_atom(const struct ref_format *format,
 	if (ARRAY_SIZE(valid_atom) <= i)
 		return strbuf_addf_ret(err, -1, _("unknown field name: %.*s"),
 				       (int)(ep-atom), atom);
+	if (valid_atom[i].source != SOURCE_NONE && !have_git_dir())
+		return strbuf_addf_ret(err, -1,
+				       _("not a git repository, but the field '%.*s' requires access to object data"),
+				       (int)(ep-atom), atom);
 
 	/* Add it in, including the deref prefix */
 	at = used_atom_cnt;
diff --git a/t/t5512-ls-remote.sh b/t/t5512-ls-remote.sh
index 91ee6841c1..32e722db2e 100755
--- a/t/t5512-ls-remote.sh
+++ b/t/t5512-ls-remote.sh
@@ -302,6 +302,12 @@  test_expect_success 'ls-remote works outside repository' '
 	nongit git ls-remote dst.git
 '
 
+test_expect_success 'ls-remote --sort fails gracefully outside repository' '
+	# Use a sort key that requires access to the referenced objects.
+	nongit test_must_fail git ls-remote --sort=authordate "$TRASH_DIRECTORY" 2>err &&
+	test_i18ngrep "^fatal: not a git repository, but the field '\''authordate'\'' requires access to object data" err
+'
+
 test_expect_success 'ls-remote patterns work with all protocol versions' '
 	git for-each-ref --format="%(objectname)	%(refname)" \
 		refs/heads/master refs/remotes/origin/master >expect &&