Message ID | 20190829231925.15223-2-jon@jonsimons.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | partial-clone: fix two issues with sparse filter handling | expand |
Jon Simons <jon@jonsimons.org> writes: > diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c > index 1cb20c659c..aaba312edb 100644 > --- a/list-objects-filter-options.c > +++ b/list-objects-filter-options.c > @@ -71,7 +71,8 @@ static int gently_parse_list_objects_filter( > * command, but DO NOT complain if we don't have the blob or > * ref locally. > */ > - if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB, > + if (have_git_dir() && > + !get_oid_with_context(the_repository, v0, GET_OID_BLOB, > &sparse_oid, &oc)) > filter_options->sparse_oid_value = oiddup(&sparse_oid); > filter_options->choice = LOFC_SPARSE_OID; Sorry, I do not quite understand what this wants to do. We say "we parsed this correctly, this filter is sparse:oid=<blob>" without filling sparse_oid_value field at all. What do the consumers of such a filter_options instance do to such a half-read option? If they say "ah, the parser wanted to do sparse:oid but we couldn't really figure the <blob> part out, so let's ignore it", that might be the best they could do anyway, but it somewhat feels iffy. I wonder if we are better off inventing a new "we tried to parse but we lack sufficient info to make it useful" value to use in .choice field and return it instead. In the longer term, what do we want to happen in such a case where "read this blob to figure out what I want you to do" cannot be satisfied due to chicken-and-egg situation like this? Can we postpone fetching or cloning that *wants* to use the named <blob> when we discover that the <blob> is not available (of which, your "!have_git_dir()" is a subset), grab that single <blob> first from the other side before doing the main transfer, and then resume the transfer that wants to use the <blob> after we successfully do so, or something?
On Fri, Aug 30, 2019 at 11:08:34AM -0700, Junio C Hamano wrote: > Jon Simons <jon@jonsimons.org> writes: > > > diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c > > index 1cb20c659c..aaba312edb 100644 > > --- a/list-objects-filter-options.c > > +++ b/list-objects-filter-options.c > > @@ -71,7 +71,8 @@ static int gently_parse_list_objects_filter( > > * command, but DO NOT complain if we don't have the blob or > > * ref locally. > > */ > > - if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB, > > + if (have_git_dir() && > > + !get_oid_with_context(the_repository, v0, GET_OID_BLOB, > > &sparse_oid, &oc)) > > filter_options->sparse_oid_value = oiddup(&sparse_oid); > > filter_options->choice = LOFC_SPARSE_OID; > > Sorry, I do not quite understand what this wants to do. We say "we > parsed this correctly, this filter is sparse:oid=<blob>" without > filling sparse_oid_value field at all. What do the consumers of > such a filter_options instance do to such a half-read option? Empirically, they either die() or segfault. ;) I agree the code even before Jon's patch is pretty funny. Forgetting the "not a repo" case for a moment, we know that get_oid_with_context() might not find the name. In the case of rev-list, we manually check the validity of sparse_oid_value later and die(). Which seems like a layering violation, but at least gives the desired outcome. In the case of clone, it gets parsed twice: 1. On the client side, we don't have a repo yet, and so we BUG(). But once fixed, we don't care about the value at all! We never use it, and instead send the original filter spec over to the server, so at most this whole option-parsing is giving us an early syntax check. 2. On the server side, upload-pack receives the filter spec, parses it, but doesn't remember to check whether it's a valid oid. It segfaults if it isn't (that's patch 2 here). So these patches are punting on the greater question of why we want to parse so early, and are not making anything worse. AFAICT, "clone --filter=sparse:oid" has never worked (even though our tests did cover the underlying rev-list and pack-objects code paths). > If they say "ah, the parser wanted to do sparse:oid but we couldn't > really figure the <blob> part out, so let's ignore it", that might > be the best they could do anyway, but it somewhat feels iffy. I > wonder if we are better off inventing a new "we tried to parse but > we lack sufficient info to make it useful" value to use in .choice > field and return it instead. TBH, I'm not sure why the original is so eager to parse early. I guess it allows: - a dual use of the options parser; we can use it both to sanity-check the options before sending them to a server, and to actually use the filter ourselves. - earlier detection maybe gives us a cleaner error path (e.g., rev-list can do its own error handling). But I'd think doing it when we actually initialize the filter would be enough. I.e., if we want to go all the way, I think this two-patch series could basically be replaced with something like the (totally untested) approach below, which just pushes the parsing closer to the point-of-use. Adding Jeff Hostetler to the cc, in case he recalls any reason not to use that approach. > In the longer term, what do we want to happen in such a case where > "read this blob to figure out what I want you to do" cannot be > satisfied due to chicken-and-egg situation like this? Can we > postpone fetching or cloning that *wants* to use the named <blob> > when we discover that the <blob> is not available (of which, your > "!have_git_dir()" is a subset), grab that single <blob> first from > the other side before doing the main transfer, and then resume the > transfer that wants to use the <blob> after we successfully do so, > or something? I don't think there's any chicken-and-egg here. The client side of a clone does not ever care about what's in sparse_oid_value at all, before or after (and the name needs to be resolved from the server's view of the refs, anyway). I think there could be a feature where we do a narrow clone based on the spec in an oid, and then do a matching sparse checkout. And that feature would want to make sure it knows how the server resolved the spec in the first step. But AFAIK we don't yet have such a feature (and the simplest thing would probably be a capability where the server tells us the blob oid, and makes sure it is transmitted). -Peff --- diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 301ccb970b..74dbfad73d 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -471,10 +471,6 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) parse_list_objects_filter(&filter_options, arg); if (filter_options.choice && !revs.blob_objects) die(_("object filtering requires --objects")); - if (filter_options.choice == LOFC_SPARSE_OID && - !filter_options.sparse_oid_value) - die(_("invalid sparse value '%s'"), - filter_options.filter_spec); continue; } if (!strcmp(arg, ("--no-" CL_ARG__FILTER))) { diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c index 1cb20c659c..76a9a49a75 100644 --- a/list-objects-filter-options.c +++ b/list-objects-filter-options.c @@ -63,17 +63,8 @@ static int gently_parse_list_objects_filter( return 0; } else if (skip_prefix(arg, "sparse:oid=", &v0)) { - struct object_context oc; - struct object_id sparse_oid; - - /* - * Try to parse <oid-expression> into an OID for the current - * command, but DO NOT complain if we don't have the blob or - * ref locally. - */ - if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB, - &sparse_oid, &oc)) - filter_options->sparse_oid_value = oiddup(&sparse_oid); + /* v0 is durable because it points into our saved filter_spec */ + filter_options->sparse_oid_name = v0; filter_options->choice = LOFC_SPARSE_OID; return 0; @@ -138,7 +129,6 @@ void list_objects_filter_release( struct list_objects_filter_options *filter_options) { free(filter_options->filter_spec); - free(filter_options->sparse_oid_value); memset(filter_options, 0, sizeof(*filter_options)); } diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h index c54f0000fb..a819e42ffb 100644 --- a/list-objects-filter-options.h +++ b/list-objects-filter-options.h @@ -42,7 +42,7 @@ struct list_objects_filter_options { * choice-specific; not all values will be defined for any given * choice. */ - struct object_id *sparse_oid_value; + const char *sparse_oid_name; unsigned long blob_limit_value; unsigned long tree_exclude_depth; }; diff --git a/list-objects-filter.c b/list-objects-filter.c index 36e1f774bc..130c17b95c 100644 --- a/list-objects-filter.c +++ b/list-objects-filter.c @@ -463,9 +463,16 @@ static void *filter_sparse_oid__init( filter_free_fn *filter_free_fn) { struct filter_sparse_data *d = xcalloc(1, sizeof(*d)); + struct object_context oc; + struct object_id sparse_oid; + + if (!get_oid_with_context(the_repository, + filter_options->sparse_oid_name, + GET_OID_BLOB, &sparse_oid, &oc)) + die("unable to access sparse blob in '%s'", + filter_options->sparse_oid_name); d->omits = omitted; - if (add_excludes_from_blob_to_list(filter_options->sparse_oid_value, - NULL, 0, &d->el) < 0) + if (add_excludes_from_blob_to_list(&sparse_oid, NULL, 0, &d->el) < 0) die("could not load filter specification"); ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
Jeff King <peff@peff.net> writes: > So these patches are punting on the greater question of why we want to > parse so early, and are not making anything worse. AFAICT, "clone > --filter=sparse:oid" has never worked (even though our tests did cover > the underlying rev-list and pack-objects code paths). > ... > TBH, I'm not sure why the original is so eager to parse early. I guess > it allows: > > - a dual use of the options parser; we can use it both to sanity-check > the options before sending them to a server, and to actually use the > filter ourselves. > > - earlier detection maybe gives us a cleaner error path (e.g., > rev-list can do its own error handling). But I'd think doing it when > we actually initialize the filter would be enough. > > I.e., if we want to go all the way, I think this two-patch series could > basically be replaced with something like the (totally untested) > approach below, which just pushes the parsing closer to the > point-of-use. > > Adding Jeff Hostetler to the cc, in case he recalls any reason not to > use that approach. Thanks.
On 9/5/2019 2:57 PM, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >> So these patches are punting on the greater question of why we want to >> parse so early, and are not making anything worse. AFAICT, "clone >> --filter=sparse:oid" has never worked (even though our tests did cover >> the underlying rev-list and pack-objects code paths). >> ... >> TBH, I'm not sure why the original is so eager to parse early. I guess >> it allows: >> >> - a dual use of the options parser; we can use it both to sanity-check >> the options before sending them to a server, and to actually use the >> filter ourselves. >> >> - earlier detection maybe gives us a cleaner error path (e.g., >> rev-list can do its own error handling). But I'd think doing it when >> we actually initialize the filter would be enough. >> >> I.e., if we want to go all the way, I think this two-patch series could >> basically be replaced with something like the (totally untested) >> approach below, which just pushes the parsing closer to the >> point-of-use. >> >> Adding Jeff Hostetler to the cc, in case he recalls any reason not to >> use that approach. > > Thanks. > I think both of Peff's guesses are correct. IIRC I wrote the original parse_list_objects_filter() and friends to syntax check the command line arguments of rev-list. In hindsight, this looks a bit aggressive at that layer, or rather now that it is being used by various places in other commands (such as parsing messages from the wire), it shouldn't call die() as Peff suggests. I like the code Peff suggests. Making parse_list_objects_filter() a bit simpler and not call die(). Callers should then check the function return value as necessary. It would be nice if we could continue to let parse_list_objects_filter() do the syntax checking -- that is, we can still check that we received a ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex value, for example. Just save the actual <oid> lookup to the higher layer, if and when that makes sense. Jeff
On Mon, Sep 09, 2019 at 09:54:36AM -0400, Jeff Hostetler wrote: > It would be nice if we could continue to let parse_list_objects_filter() > do the syntax checking -- that is, we can still check that we received a > ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex > value, for example. Just save the actual <oid> lookup to the higher > layer, if and when that makes sense. Yeah, I agree that is the right place to do syntactic checking. But I think we can't do much checking for the sparse-oid. We currently accept not just a hex oid, but any name. And I think that is useful; I should be able to say "master:sparse-file" and have it resolved by the remote side. So it really is "any name which is syntactically resolvable as a sha1 expression". At which point I think you might as well punt and just wait until we _actually_ try to resolve the name to see if it's valid. I'll work up what I sent earlier into a real patch, and include some of this discussion. -Peff
Jeff Hostetler <git@jeffhostetler.com> writes: > It would be nice if we could continue to let parse_list_objects_filter() > do the syntax checking -- that is, we can still check that we received a > ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex > value, for example. Just save the actual <oid> lookup to the higher > layer, if and when that makes sense. Hmmm, am I misremembering things or did I hear somebody mention in this thread that people give "sparse:oid:master" (not blob object name in hex, but a refname) and expect the other side to resolve it?
On 9/9/2019 1:12 PM, Junio C Hamano wrote: > Jeff Hostetler <git@jeffhostetler.com> writes: > >> It would be nice if we could continue to let parse_list_objects_filter() >> do the syntax checking -- that is, we can still check that we received a >> ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex >> value, for example. Just save the actual <oid> lookup to the higher >> layer, if and when that makes sense. > > Hmmm, am I misremembering things or did I hear somebody mention in > this thread that people give "sparse:oid:master" (not blob object > name in hex, but a refname) and expect the other side to resolve it? > You're right. I misspoke. I was thinking about the hex OID case and forgetting about the <rev>:<path> form. Jeff
On 9/9/2019 1:08 PM, Jeff King wrote: > On Mon, Sep 09, 2019 at 09:54:36AM -0400, Jeff Hostetler wrote: > >> It would be nice if we could continue to let parse_list_objects_filter() >> do the syntax checking -- that is, we can still check that we received a >> ulong in "blob:limit:<nr>" and that "sparse:oid:<oid>" looks like a hex >> value, for example. Just save the actual <oid> lookup to the higher >> layer, if and when that makes sense. > > Yeah, I agree that is the right place to do syntactic checking. But I > think we can't do much checking for the sparse-oid. We currently accept > not just a hex oid, but any name. And I think that is useful; I should > be able to say "master:sparse-file" and have it resolved by the remote > side. Right. I forgot about the "master:sparse-file" case and was only thinking about the hex oid case. The sparse-file case is very useful. > > So it really is "any name which is syntactically resolvable as a sha1 > expression". At which point I think you might as well punt and just wait > until we _actually_ try to resolve the name to see if it's valid. > > I'll work up what I sent earlier into a real patch, and include some of > this discussion. > > -Peff > thanks Jeff
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c index 1cb20c659c..aaba312edb 100644 --- a/list-objects-filter-options.c +++ b/list-objects-filter-options.c @@ -71,7 +71,8 @@ static int gently_parse_list_objects_filter( * command, but DO NOT complain if we don't have the blob or * ref locally. */ - if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB, + if (have_git_dir() && + !get_oid_with_context(the_repository, v0, GET_OID_BLOB, &sparse_oid, &oc)) filter_options->sparse_oid_value = oiddup(&sparse_oid); filter_options->choice = LOFC_SPARSE_OID; diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh index 565254558f..5c68431d10 100755 --- a/t/t5616-partial-clone.sh +++ b/t/t5616-partial-clone.sh @@ -241,6 +241,27 @@ test_expect_success 'fetch what is specified on CLI even if already promised' ' ! grep "?$(cat blob)" missing_after ' +test_expect_success 'setup src repo for sparse filter' ' + git init sparse-src && + git -C sparse-src config --local uploadpack.allowfilter 1 && + git -C sparse-src config --local uploadpack.allowanysha1inwant 1 && + for n in 1 2 3 4 + do + test_commit -C sparse-src "this-is-file-$n" file.$n.txt || return 1 + done && + test_write_lines /file1.txt /file3.txt >sparse-src/odd-files && + test_write_lines /file2.txt /file4.txt >sparse-src/even-files && + echo "/*" >sparse-src/all-files && + git -C sparse-src add odd-files even-files all-files && + git -C sparse-src commit -m "some sparse checkout files" +' + +test_expect_success 'partial clone with sparse filter succeeds' ' + git clone --no-local --no-checkout --filter=sparse:oid=master:all-files sparse-src pc-all && + git clone --no-local --no-checkout --filter=sparse:oid=master:even-files sparse-src pc-even && + git clone --no-local --no-checkout --filter=sparse:oid=master:odd-files sparse-src pc-odd +' + . "$TEST_DIRECTORY"/lib-httpd.sh start_httpd
Fix a bug in partial cloning with sparse filters by ensuring to check for 'have_git_dir' before attempting to resolve the sparse filter OID. Otherwise the client will trigger: BUG: refs.c:1851: attempting to get main_ref_store outside of repository when attempting to git clone with a sparse filter. Note that this fix is the minimal one which avoids the BUG and allows for the clone to complete successfully: There is an open question as to whether there should be any attempt to resolve the OID provided by the client in this context, as a filter for the clone to be used on the remote side. For cases where local and remote OID resolutions differ, resolving on the client side could be considered a bug. For now, the minimal approach here is used to unblock further testing for partial clones with sparse filters, while a more invasive fix could make sense to pursue as a future direction. t5616 is updated to demonstrate the change. Signed-off-by: Jon Simons <jon@jonsimons.org> --- list-objects-filter-options.c | 3 ++- t/t5616-partial-clone.sh | 21 +++++++++++++++++++++ 2 files changed, 23 insertions(+), 1 deletion(-)