diff mbox series

[v3,1/2] list-objects-filter: only parse sparse OID when 'have_git_dir'

Message ID 20190829231925.15223-2-jon@jonsimons.org (mailing list archive)
State New, archived
Headers show
Series partial-clone: fix two issues with sparse filter handling | expand

Commit Message

Jon Simons Aug. 29, 2019, 11:19 p.m. UTC
Fix a bug in partial cloning with sparse filters by ensuring to check
for 'have_git_dir' before attempting to resolve the sparse filter OID.

Otherwise the client will trigger:

    BUG: refs.c:1851: attempting to get main_ref_store outside of repository

when attempting to git clone with a sparse filter.

Note that this fix is the minimal one which avoids the BUG and allows
for the clone to complete successfully:

There is an open question as to whether there should be any attempt
to resolve the OID provided by the client in this context, as a filter
for the clone to be used on the remote side.  For cases where local
and remote OID resolutions differ, resolving on the client side could
be considered a bug.  For now, the minimal approach here is used to
unblock further testing for partial clones with sparse filters, while
a more invasive fix could make sense to pursue as a future direction.

t5616 is updated to demonstrate the change.

Signed-off-by: Jon Simons <jon@jonsimons.org>
---
 list-objects-filter-options.c |  3 ++-
 t/t5616-partial-clone.sh      | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Aug. 30, 2019, 6:08 p.m. UTC | #1
Jon Simons <jon@jonsimons.org> writes:

> diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
> index 1cb20c659c..aaba312edb 100644
> --- a/list-objects-filter-options.c
> +++ b/list-objects-filter-options.c
> @@ -71,7 +71,8 @@ static int gently_parse_list_objects_filter(
>  		 * command, but DO NOT complain if we don't have the blob or
>  		 * ref locally.
>  		 */
> -		if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB,
> +		if (have_git_dir() &&
> +		    !get_oid_with_context(the_repository, v0, GET_OID_BLOB,
>  					  &sparse_oid, &oc))
>  			filter_options->sparse_oid_value = oiddup(&sparse_oid);
>  		filter_options->choice = LOFC_SPARSE_OID;

Sorry, I do not quite understand what this wants to do.  We say "we
parsed this correctly, this filter is sparse:oid=<blob>" without
filling sparse_oid_value field at all.  What do the consumers of
such a filter_options instance do to such a half-read option?

If they say "ah, the parser wanted to do sparse:oid but we couldn't
really figure the <blob> part out, so let's ignore it", that might
be the best they could do anyway, but it somewhat feels iffy.  I
wonder if we are better off inventing a new "we tried to parse but
we lack sufficient info to make it useful" value to use in .choice
field and return it instead.

In the longer term, what do we want to happen in such a case where
"read this blob to figure out what I want you to do" cannot be
satisfied due to chicken-and-egg situation like this?  Can we
postpone fetching or cloning that *wants* to use the named <blob>
when we discover that the <blob> is not available (of which, your
"!have_git_dir()" is a subset), grab that single <blob> first from
the other side before doing the main transfer, and then resume the
transfer that wants to use the <blob> after we successfully do so,
or something?
Jeff King Sept. 4, 2019, 4:54 a.m. UTC | #2
On Fri, Aug 30, 2019 at 11:08:34AM -0700, Junio C Hamano wrote:

> Jon Simons <jon@jonsimons.org> writes:
> 
> > diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
> > index 1cb20c659c..aaba312edb 100644
> > --- a/list-objects-filter-options.c
> > +++ b/list-objects-filter-options.c
> > @@ -71,7 +71,8 @@ static int gently_parse_list_objects_filter(
> >  		 * command, but DO NOT complain if we don't have the blob or
> >  		 * ref locally.
> >  		 */
> > -		if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB,
> > +		if (have_git_dir() &&
> > +		    !get_oid_with_context(the_repository, v0, GET_OID_BLOB,
> >  					  &sparse_oid, &oc))
> >  			filter_options->sparse_oid_value = oiddup(&sparse_oid);
> >  		filter_options->choice = LOFC_SPARSE_OID;
> 
> Sorry, I do not quite understand what this wants to do.  We say "we
> parsed this correctly, this filter is sparse:oid=<blob>" without
> filling sparse_oid_value field at all.  What do the consumers of
> such a filter_options instance do to such a half-read option?

Empirically, they either die() or segfault. ;)

I agree the code even before Jon's patch is pretty funny. Forgetting the
"not a repo" case for a moment, we know that get_oid_with_context()
might not find the name. In the case of rev-list, we manually check the
validity of sparse_oid_value later and die().  Which seems like a
layering violation, but at least gives the desired outcome.

In the case of clone, it gets parsed twice:

  1. On the client side, we don't have a repo yet, and so we BUG(). But
     once fixed, we don't care about the value at all! We never use it,
     and instead send the original filter spec over to the server, so at
     most this whole option-parsing is giving us an early syntax check.

  2. On the server side, upload-pack receives the filter spec, parses
     it, but doesn't remember to check whether it's a valid oid. It
     segfaults if it isn't (that's patch 2 here).

So these patches are punting on the greater question of why we want to
parse so early, and are not making anything worse. AFAICT, "clone
--filter=sparse:oid" has never worked (even though our tests did cover
the underlying rev-list and pack-objects code paths).

> If they say "ah, the parser wanted to do sparse:oid but we couldn't
> really figure the <blob> part out, so let's ignore it", that might
> be the best they could do anyway, but it somewhat feels iffy.  I
> wonder if we are better off inventing a new "we tried to parse but
> we lack sufficient info to make it useful" value to use in .choice
> field and return it instead.

TBH, I'm not sure why the original is so eager to parse early. I guess
it allows:

  - a dual use of the options parser; we can use it both to sanity-check
    the options before sending them to a server, and to actually use the
    filter ourselves.

  - earlier detection maybe gives us a cleaner error path (e.g.,
    rev-list can do its own error handling). But I'd think doing it when
    we actually initialize the filter would be enough.

I.e., if we want to go all the way, I think this two-patch series could
basically be replaced with something like the (totally untested)
approach below, which just pushes the parsing closer to the
point-of-use.

Adding Jeff Hostetler to the cc, in case he recalls any reason not to
use that approach.

> In the longer term, what do we want to happen in such a case where
> "read this blob to figure out what I want you to do" cannot be
> satisfied due to chicken-and-egg situation like this?  Can we
> postpone fetching or cloning that *wants* to use the named <blob>
> when we discover that the <blob> is not available (of which, your
> "!have_git_dir()" is a subset), grab that single <blob> first from
> the other side before doing the main transfer, and then resume the
> transfer that wants to use the <blob> after we successfully do so,
> or something?

I don't think there's any chicken-and-egg here. The client side of a
clone does not ever care about what's in sparse_oid_value at all, before
or after (and the name needs to be resolved from the server's view of
the refs, anyway). I think there could be a feature where we do a narrow
clone based on the spec in an oid, and then do a matching sparse
checkout. And that feature would want to make sure it knows how the
server resolved the spec in the first step. But AFAIK we don't yet have
such a feature (and the simplest thing would probably be a capability
where the server tells us the blob oid, and makes sure it is
transmitted).

-Peff

---
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 301ccb970b..74dbfad73d 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -471,10 +471,6 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 			parse_list_objects_filter(&filter_options, arg);
 			if (filter_options.choice && !revs.blob_objects)
 				die(_("object filtering requires --objects"));
-			if (filter_options.choice == LOFC_SPARSE_OID &&
-			    !filter_options.sparse_oid_value)
-				die(_("invalid sparse value '%s'"),
-				    filter_options.filter_spec);
 			continue;
 		}
 		if (!strcmp(arg, ("--no-" CL_ARG__FILTER))) {
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index 1cb20c659c..76a9a49a75 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -63,17 +63,8 @@ static int gently_parse_list_objects_filter(
 		return 0;
 
 	} else if (skip_prefix(arg, "sparse:oid=", &v0)) {
-		struct object_context oc;
-		struct object_id sparse_oid;
-
-		/*
-		 * Try to parse <oid-expression> into an OID for the current
-		 * command, but DO NOT complain if we don't have the blob or
-		 * ref locally.
-		 */
-		if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB,
-					  &sparse_oid, &oc))
-			filter_options->sparse_oid_value = oiddup(&sparse_oid);
+		/* v0 is durable because it points into our saved filter_spec */
+		filter_options->sparse_oid_name = v0;
 		filter_options->choice = LOFC_SPARSE_OID;
 		return 0;
 
@@ -138,7 +129,6 @@ void list_objects_filter_release(
 	struct list_objects_filter_options *filter_options)
 {
 	free(filter_options->filter_spec);
-	free(filter_options->sparse_oid_value);
 	memset(filter_options, 0, sizeof(*filter_options));
 }
 
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index c54f0000fb..a819e42ffb 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -42,7 +42,7 @@ struct list_objects_filter_options {
 	 * choice-specific; not all values will be defined for any given
 	 * choice.
 	 */
-	struct object_id *sparse_oid_value;
+	const char *sparse_oid_name;
 	unsigned long blob_limit_value;
 	unsigned long tree_exclude_depth;
 };
diff --git a/list-objects-filter.c b/list-objects-filter.c
index 36e1f774bc..130c17b95c 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -463,9 +463,16 @@ static void *filter_sparse_oid__init(
 	filter_free_fn *filter_free_fn)
 {
 	struct filter_sparse_data *d = xcalloc(1, sizeof(*d));
+	struct object_context oc;
+	struct object_id sparse_oid;
+
+	if (!get_oid_with_context(the_repository,
+				  filter_options->sparse_oid_name,
+				  GET_OID_BLOB, &sparse_oid, &oc))
+		die("unable to access sparse blob in '%s'",
+		    filter_options->sparse_oid_name);
 	d->omits = omitted;
-	if (add_excludes_from_blob_to_list(filter_options->sparse_oid_value,
-					   NULL, 0, &d->el) < 0)
+	if (add_excludes_from_blob_to_list(&sparse_oid, NULL, 0, &d->el) < 0)
 		die("could not load filter specification");
 
 	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
Junio C Hamano Sept. 5, 2019, 6:57 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> So these patches are punting on the greater question of why we want to
> parse so early, and are not making anything worse. AFAICT, "clone
> --filter=sparse:oid" has never worked (even though our tests did cover
> the underlying rev-list and pack-objects code paths).
> ...
> TBH, I'm not sure why the original is so eager to parse early. I guess
> it allows:
>
>   - a dual use of the options parser; we can use it both to sanity-check
>     the options before sending them to a server, and to actually use the
>     filter ourselves.
>
>   - earlier detection maybe gives us a cleaner error path (e.g.,
>     rev-list can do its own error handling). But I'd think doing it when
>     we actually initialize the filter would be enough.
>
> I.e., if we want to go all the way, I think this two-patch series could
> basically be replaced with something like the (totally untested)
> approach below, which just pushes the parsing closer to the
> point-of-use.
>
> Adding Jeff Hostetler to the cc, in case he recalls any reason not to
> use that approach.

Thanks.
Jeff Hostetler Sept. 9, 2019, 1:54 p.m. UTC | #4
On 9/5/2019 2:57 PM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
>> So these patches are punting on the greater question of why we want to
>> parse so early, and are not making anything worse. AFAICT, "clone
>> --filter=sparse:oid" has never worked (even though our tests did cover
>> the underlying rev-list and pack-objects code paths).
>> ...
>> TBH, I'm not sure why the original is so eager to parse early. I guess
>> it allows:
>>
>>    - a dual use of the options parser; we can use it both to sanity-check
>>      the options before sending them to a server, and to actually use the
>>      filter ourselves.
>>
>>    - earlier detection maybe gives us a cleaner error path (e.g.,
>>      rev-list can do its own error handling). But I'd think doing it when
>>      we actually initialize the filter would be enough.
>>
>> I.e., if we want to go all the way, I think this two-patch series could
>> basically be replaced with something like the (totally untested)
>> approach below, which just pushes the parsing closer to the
>> point-of-use.
>>
>> Adding Jeff Hostetler to the cc, in case he recalls any reason not to
>> use that approach.
> 
> Thanks.
> 

I think both of Peff's guesses are correct.

IIRC I wrote the original parse_list_objects_filter() and friends to
syntax check the command line arguments of rev-list.  In hindsight,
this looks a bit aggressive at that layer, or rather now that it is
being used by various places in other commands (such as parsing
messages from the wire), it shouldn't call die() as Peff suggests.

I like the code Peff suggests.  Making parse_list_objects_filter()
a bit simpler and not call die().  Callers should then check the
function return value as necessary.

It would be nice if we could continue to let parse_list_objects_filter()
do the syntax checking -- that is, we can still check that we received a
ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex
value, for example.  Just save the actual <oid> lookup to the higher
layer, if and when that makes sense.

Jeff
Jeff King Sept. 9, 2019, 5:08 p.m. UTC | #5
On Mon, Sep 09, 2019 at 09:54:36AM -0400, Jeff Hostetler wrote:

> It would be nice if we could continue to let parse_list_objects_filter()
> do the syntax checking -- that is, we can still check that we received a
> ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex
> value, for example.  Just save the actual <oid> lookup to the higher
> layer, if and when that makes sense.

Yeah, I agree that is the right place to do syntactic checking. But I
think we can't do much checking for the sparse-oid. We currently accept
not just a hex oid, but any name. And I think that is useful; I should
be able to say "master:sparse-file" and have it resolved by the remote
side.

So it really is "any name which is syntactically resolvable as a sha1
expression". At which point I think you might as well punt and just wait
until we _actually_ try to resolve the name to see if it's valid.

I'll work up what I sent earlier into a real patch, and include some of
this discussion.

-Peff
Junio C Hamano Sept. 9, 2019, 5:12 p.m. UTC | #6
Jeff Hostetler <git@jeffhostetler.com> writes:

> It would be nice if we could continue to let parse_list_objects_filter()
> do the syntax checking -- that is, we can still check that we received a
> ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex
> value, for example.  Just save the actual <oid> lookup to the higher
> layer, if and when that makes sense.

Hmmm, am I misremembering things or did I hear somebody mention in
this thread that people give "sparse:oid:master" (not blob object
name in hex, but a refname) and expect the other side to resolve it?
Jeff Hostetler Sept. 9, 2019, 7:49 p.m. UTC | #7
On 9/9/2019 1:12 PM, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
>> It would be nice if we could continue to let parse_list_objects_filter()
>> do the syntax checking -- that is, we can still check that we received a
>> ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex
>> value, for example.  Just save the actual <oid> lookup to the higher
>> layer, if and when that makes sense.
> 
> Hmmm, am I misremembering things or did I hear somebody mention in
> this thread that people give "sparse:oid:master" (not blob object
> name in hex, but a refname) and expect the other side to resolve it?
> 

You're right.  I misspoke.  I was thinking about the hex OID case
and forgetting about the <rev>:<path> form.

Jeff
Jeff Hostetler Sept. 9, 2019, 8:03 p.m. UTC | #8
On 9/9/2019 1:08 PM, Jeff King wrote:
> On Mon, Sep 09, 2019 at 09:54:36AM -0400, Jeff Hostetler wrote:
> 
>> It would be nice if we could continue to let parse_list_objects_filter()
>> do the syntax checking -- that is, we can still check that we received a
>> ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex
>> value, for example.  Just save the actual <oid> lookup to the higher
>> layer, if and when that makes sense.
> 
> Yeah, I agree that is the right place to do syntactic checking. But I
> think we can't do much checking for the sparse-oid. We currently accept
> not just a hex oid, but any name. And I think that is useful; I should
> be able to say "master:sparse-file" and have it resolved by the remote
> side.

Right. I forgot about the "master:sparse-file" case and was only
thinking about the hex oid case.  The sparse-file case is very
useful.

> 
> So it really is "any name which is syntactically resolvable as a sha1
> expression". At which point I think you might as well punt and just wait
> until we _actually_ try to resolve the name to see if it's valid.
> 
> I'll work up what I sent earlier into a real patch, and include some of
> this discussion.
> 
> -Peff
> 

thanks
Jeff
diff mbox series

Patch

diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index 1cb20c659c..aaba312edb 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -71,7 +71,8 @@  static int gently_parse_list_objects_filter(
 		 * command, but DO NOT complain if we don't have the blob or
 		 * ref locally.
 		 */
-		if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB,
+		if (have_git_dir() &&
+		    !get_oid_with_context(the_repository, v0, GET_OID_BLOB,
 					  &sparse_oid, &oc))
 			filter_options->sparse_oid_value = oiddup(&sparse_oid);
 		filter_options->choice = LOFC_SPARSE_OID;
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 565254558f..5c68431d10 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -241,6 +241,27 @@  test_expect_success 'fetch what is specified on CLI even if already promised' '
 	! grep "?$(cat blob)" missing_after
 '
 
+test_expect_success 'setup src repo for sparse filter' '
+	git init sparse-src &&
+	git -C sparse-src config --local uploadpack.allowfilter 1 &&
+	git -C sparse-src config --local uploadpack.allowanysha1inwant 1 &&
+	for n in 1 2 3 4
+	do
+		test_commit -C sparse-src "this-is-file-$n" file.$n.txt || return 1
+	done &&
+	test_write_lines /file1.txt /file3.txt >sparse-src/odd-files &&
+	test_write_lines /file2.txt /file4.txt >sparse-src/even-files &&
+	echo "/*" >sparse-src/all-files &&
+	git -C sparse-src add odd-files even-files all-files &&
+	git -C sparse-src commit -m "some sparse checkout files"
+'
+
+test_expect_success 'partial clone with sparse filter succeeds' '
+	git clone --no-local --no-checkout --filter=sparse:oid=master:all-files sparse-src pc-all &&
+	git clone --no-local --no-checkout --filter=sparse:oid=master:even-files sparse-src pc-even &&
+	git clone --no-local --no-checkout --filter=sparse:oid=master:odd-files sparse-src pc-odd
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd