Message ID | 1f95597eedc4c651868601c0ff7c4a4d97ca4457.1558484115.git.matvore@google.com (mailing list archive)
---|---
State | New, archived
Series | Filter combination
On 5/21/2019 8:21 PM, Matthew DeVore wrote:
> Allow combining filters such that only objects accepted by all filters
> are shown. The motivation for this is to allow getting directory
> listings without also fetching blobs. This can be done by combining
> blob:none with tree:<depth>. There are massive repositories that have
> larger-than-expected trees - even if you include only a single commit.
>
> The current usage requires passing the filter to rev-list, or sending
> it over the wire, as:
>
> combine:<FILTER1>+<FILTER2>

I must admit I'm not a fan of this syntax and the URL-encoding that it
requires. I see that this was already discussed in the RFC version [1]
last week, but I'll repeat it here. I like the repeated use of the
"--filter=<f_k>" command line option.

In the RFC version, there was discussion [2] of the wire format and the
need to be backwards compatible with existing servers and so use the
"combine:" syntax so that we only have a single filter line on the wire.
Would it be better to have compliant servers advertise a "filters"
(plural) capability in addition to the existing "filter" (singular)
capability? Then the client would know that it could send a series of
filter lines using the existing syntax. Likewise, if the "filters"
capability was omitted, the client could error out without the extra
round-trip.

[1] https://public-inbox.org/git/xmqqwoip3gp0.fsf@gitster-ct.c.googlers.com/
[2] https://public-inbox.org/git/1E174CAA-BD57-400B-A83B-4AABFAFBC04B@comcast.net/

[...]

> standard input when --stdin is used). <depth>=1 will include only the
> tree and blobs which are referenced directly by a commit reachable from
> <commit> or an explicitly-given object. <depth>=2 is like <depth>=1
> while also including trees and blobs one more level removed from an
> explicitly-given commit or tree.
> ++
> +The form '--filter=combine:<filter1>+<filter2>+...<filterN>' combines
> +several filters.

We are allowing an unlimited number of filters in the composition.
In the code, the compose filter data has space for a LHS and RHS, so I'm
assuming we're mapping --filter=f1 --filter=f2 --filter=f3 --filter=f4
or --filter=combine:f1+f2+f3+f4 into basically
(compose f1 (compose f2 (compose f3 f4))). I wonder if it would be
easier to understand if we just built an array or linked list, but I'll
read on.

> Only objects which are accepted by every filter are
> +included. Filters are joined by '{plus}' and individual filters are %-encoded
> +(i.e. URL-encoded). Besides the '{plus}' and '%' characters, the following
> +characters are reserved and also must be encoded:
> +`~!@#$^&*()[]{}\;",<>?`+'`+ as well as all characters with ASCII code
> +<= `0x20`, which includes space and newline.

[...]

> diff --git a/list-objects-filter.c b/list-objects-filter.c
> index 8e8616b9b8..b97277a46f 100644
> --- a/list-objects-filter.c
> +++ b/list-objects-filter.c
> @@ -453,34 +453,148 @@ static void filter_sparse_path__init(
>
> 	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
> 	d->array_frame[d->nr].defval = 0; /* default to include */
> 	d->array_frame[d->nr].child_prov_omit = 0;
>
> 	ctx->filter_fn = filter_sparse;
> 	ctx->free_fn = filter_sparse_free;
> 	ctx->data = d;
> }
>
> +struct filter_combine_data {
> +	/* sub[0] corresponds to lhs, sub[1] to rhs. */
> +	struct {
> +		struct filter_context ctx;
> +		struct oidset seen;
> +		struct object_id skip_tree;
> +		unsigned is_skipping_tree : 1;
> +	} sub[2];
> +
> +	struct oidset rhs_omits;
> +};
> +
> +static void add_all(struct oidset *dest, struct oidset *src) {
> +	struct oidset_iter iter;
> +	struct object_id *src_oid;
> +
> +	oidset_iter_init(src, &iter);
> +	while ((src_oid = oidset_iter_next(&iter)) != NULL)
> +		oidset_insert(dest, src_oid);
> +}
> +
> +static void filter_combine_free(void *filter_data)
> +{
> +	struct filter_combine_data *d = filter_data;
> +	int i;
> +
> +	/* Anything omitted by rhs should be added to the overall omits set.
*/
> +	if (d->sub[0].ctx.omits)
> +		add_all(d->sub[0].ctx.omits, d->sub[1].ctx.omits);
> +
> +	for (i = 0; i < 2; i++) {
> +		list_objects_filter__release(&d->sub[i].ctx);
> +		oidset_clear(&d->sub[i].seen);
> +	}
> +	oidset_clear(&d->rhs_omits);
> +	free(d);
> +}
> +
> +static int should_delegate(enum list_objects_filter_situation filter_situation,
> +			   struct object *obj,
> +			   struct filter_combine_data *d,
> +			   int side)
> +{
> +	if (!d->sub[side].is_skipping_tree)
> +		return 1;
> +	if (filter_situation == LOFS_END_TREE &&
> +	    oideq(&obj->oid, &d->sub[side].skip_tree)) {
> +		d->sub[side].is_skipping_tree = 0;
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static enum list_objects_filter_result filter_combine(
> +	struct repository *r,
> +	enum list_objects_filter_situation filter_situation,
> +	struct object *obj,
> +	const char *pathname,
> +	const char *filename,
> +	struct filter_context *ctx)
> +{
> +	struct filter_combine_data *d = ctx->data;
> +	enum list_objects_filter_result result[2];
> +	enum list_objects_filter_result combined_result = LOFR_ZERO;
> +	int i;
> +
> +	for (i = 0; i < 2; i++) {
> +		if (oidset_contains(&d->sub[i].seen, &obj->oid) ||
> +		    !should_delegate(filter_situation, obj, d, i)) {

Should we swap the order of the terms in the || so that we always clear
the d->sub[i].is_skipping_tree on LOFS_END_TREE ?

> +			result[i] = LOFR_ZERO;
> +			continue;
> +		}
> +
> +		result[i] = d->sub[i].ctx.filter_fn(
> +			r, filter_situation, obj, pathname, filename,
> +			&d->sub[i].ctx);
> +
> +		if (result[i] & LOFR_MARK_SEEN)
> +			oidset_insert(&d->sub[i].seen, &obj->oid);

So filter[i] has said it never wants to show this object (hard omit).
And the guard at the top of the loop will prevent future invocations
from checking it again if the object is revisited.

> +
> +		if (result[i] & LOFR_SKIP_TREE) {
> +			d->sub[i].is_skipping_tree = 1;
> +			d->sub[i].skip_tree = obj->oid;

So this marks the tree object at the top of the skip as far as
filter[i] is concerned.
> +		}
> +	}
> +
> +	if ((result[0] & LOFR_DO_SHOW) && (result[1] & LOFR_DO_SHOW))
> +		combined_result |= LOFR_DO_SHOW;
> +	if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree)
> +		combined_result |= LOFR_SKIP_TREE;

Something about the above bothers me, but I can't quite say what it is.
Do we need to do:

	if ((result[0] & LOFR_MARK_SEEN) && (result[1] & LOFR_MARK_SEEN))
		combined_result |= LOFR_MARK_SEEN;

> +
> +	return combined_result;
> +}

[...]

I'm out of time now, will pick this up again next week.

Thanks
Jeff
Jeff Hostetler <git@jeffhostetler.com> writes:

> In the RFC version, there was discussion [2] of the wire format
> and the need to be backwards compatible with existing servers and
> so use the "combine:" syntax so that we only have a single filter
> line on the wire. Would it be better to have compliant servers
> advertise a "filters" (plural) capability in addition to the
> existing "filter" (singular) capability? Then the client would
> know that it could send a series of filter lines using the existing
> syntax. Likewise, if the "filters" capability was omitted, the
> client could error out without the extra round-trip.

All good ideas.
On Tue, May 21, 2019 at 05:21:52PM -0700, Matthew DeVore wrote: > Allow combining filters such that only objects accepted by all filters > are shown. The motivation for this is to allow getting directory > listings without also fetching blobs. This can be done by combining > blob:none with tree:<depth>. There are massive repositories that have > larger-than-expected trees - even if you include only a single commit. > > The current usage requires passing the filter to rev-list, or sending > it over the wire, as: > > combine:<FILTER1>+<FILTER2> > > (i.e.: git rev-list --filter=combine:tree:2+blob:limit=32k). This is > potentially awkward because individual filters must be URL-encoded if > they contain + or %. This can potentially be improved by supporting a > repeated flag syntax, e.g.: > > $ git rev-list --filter=tree:2 --filter=blob:limit=32k > > Such usage is currently an error, so giving it a meaning is backwards- > compatible. > > Signed-off-by: Matthew DeVore <matvore@google.com> > --- > Documentation/rev-list-options.txt | 12 ++ > contrib/completion/git-completion.bash | 2 +- > list-objects-filter-options.c | 161 ++++++++++++++++++++++++- > list-objects-filter-options.h | 14 ++- > list-objects-filter.c | 114 +++++++++++++++++ > t/t6112-rev-list-filters-objects.sh | 159 +++++++++++++++++++++++- > 6 files changed, 455 insertions(+), 7 deletions(-) > > diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt > index ddbc1de43f..4fb0c4fbb0 100644 > --- a/Documentation/rev-list-options.txt > +++ b/Documentation/rev-list-options.txt > @@ -730,20 +730,32 @@ specification contained in <path>. > + > The form '--filter=tree:<depth>' omits all blobs and trees whose depth > from the root tree is >= <depth> (minimum depth if an object is located > at multiple depths in the commits traversed). <depth>=0 will not include > any trees or blobs unless included explicitly in the command-line (or > standard input when --stdin is used). 
<depth>=1 will include only the > tree and blobs which are referenced directly by a commit reachable from > <commit> or an explicitly-given object. <depth>=2 is like <depth>=1 > while also including trees and blobs one more level removed from an > explicitly-given commit or tree. > ++ > +The form '--filter=combine:<filter1>+<filter2>+...<filterN>' combines > +several filters. Only objects which are accepted by every filter are > +included. Filters are joined by '{plus}' and individual filters are %-encoded > +(i.e. URL-encoded). Besides the '{plus}' and '%' characters, the following > +characters are reserved and also must be encoded: > +`~!@#$^&*()[]{}\;",<>?`+'`+ as well as all characters with ASCII code > +<= `0x20`, which includes space and newline. > ++ > +Other arbitrary characters can also be encoded. For instance, > +'combine:tree:3+blob:none' and 'combine:tree%3A2+blob%3Anone' are > +equivalent. > > --no-filter:: > Turn off any previous `--filter=` argument. > > --filter-print-omitted:: > Only useful with `--filter=`; prints a list of the objects omitted > by the filter. Object IDs are prefixed with a ``~'' character. > > --missing=<missing-action>:: > A debug option to help with future "partial clone" development. 
> diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash > index 3eefbabdb1..0fd0a10d0c 100644 > --- a/contrib/completion/git-completion.bash > +++ b/contrib/completion/git-completion.bash > @@ -1529,21 +1529,21 @@ _git_difftool () > __git_fetch_recurse_submodules="yes on-demand no" > > _git_fetch () > { > case "$cur" in > --recurse-submodules=*) > __gitcomp "$__git_fetch_recurse_submodules" "" "${cur##--recurse-submodules=}" > return > ;; > --filter=*) > - __gitcomp "blob:none blob:limit= sparse:oid= sparse:path=" "" "${cur##--filter=}" > + __gitcomp "blob:none blob:limit= sparse:oid= sparse:path= combine: tree:" "" "${cur##--filter=}" > return > ;; > --*) > __gitcomp_builtin fetch > return > ;; > esac > __git_complete_remote_or_refspec > } > > diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c > index e46ea467bc..d7a1516188 100644 > --- a/list-objects-filter-options.c > +++ b/list-objects-filter-options.c > @@ -1,19 +1,24 @@ > #include "cache.h" > #include "commit.h" > #include "config.h" > #include "revision.h" > #include "argv-array.h" > #include "list-objects.h" > #include "list-objects-filter.h" > #include "list-objects-filter-options.h" > > +static int parse_combine_filter( > + struct list_objects_filter_options *filter_options, > + const char *arg, > + struct strbuf *errbuf); > + > /* > * Parse value of the argument to the "filter" keyword. > * On the command line this looks like: > * --filter=<arg> > * and in the pack protocol as: > * "filter" SP <arg> > * > * The filter keyword will be used by many commands. > * See Documentation/rev-list-options.txt for allowed values for <arg>. 
> * > @@ -31,22 +36,20 @@ static int gently_parse_list_objects_filter( > > if (filter_options->choice) { > if (errbuf) { > strbuf_addstr( > errbuf, > _("multiple filter-specs cannot be combined")); > } > return 1; > } > > - filter_options->filter_spec = strdup(arg); > - > if (!strcmp(arg, "blob:none")) { > filter_options->choice = LOFC_BLOB_NONE; > return 0; > > } else if (skip_prefix(arg, "blob:limit=", &v0)) { > if (git_parse_ulong(v0, &filter_options->blob_limit_value)) { > filter_options->choice = LOFC_BLOB_LIMIT; > return 0; > } > > @@ -74,37 +77,183 @@ static int gently_parse_list_objects_filter( > if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB, > &sparse_oid, &oc)) > filter_options->sparse_oid_value = oiddup(&sparse_oid); > filter_options->choice = LOFC_SPARSE_OID; > return 0; > > } else if (skip_prefix(arg, "sparse:path=", &v0)) { > filter_options->choice = LOFC_SPARSE_PATH; > filter_options->sparse_path_value = strdup(v0); > return 0; > + > + } else if (skip_prefix(arg, "combine:", &v0)) { > + int sub_parse_res = parse_combine_filter( > + filter_options, v0, errbuf); > + if (sub_parse_res) > + return sub_parse_res; > + return 0; Couldn't the three lines above be said more succinctly as "return sub_parse_res;"? > + > } > /* > * Please update _git_fetch() in git-completion.bash when you > * add new filters > */ > > if (errbuf) > strbuf_addf(errbuf, _("invalid filter-spec '%s'"), arg); > > memset(filter_options, 0, sizeof(*filter_options)); > return 1; > } > > +static int digit_value(int c, struct strbuf *errbuf) { > + if (c >= '0' && c <= '9') > + return c - '0'; > + if (c >= 'a' && c <= 'f') > + return c - 'a' + 10; > + if (c >= 'A' && c <= 'F') > + return c - 'A' + 10; I'm sure there's something I'm missing here. But why are you manually decoding hex instead of using strtol or sscanf or something? 
> + > + if (!errbuf) > + return -1; > + > + strbuf_addf(errbuf, _("error in filter-spec - ")); > + if (c) > + strbuf_addf( > + errbuf, > + _("expect two hex digits after %%, but got: '%c'"), > + c); > + else > + strbuf_addf( > + errbuf, > + _("not enough hex digits after %%; expected two")); > + > + return -1; > +} > + > +static int url_decode(struct strbuf *s, struct strbuf *errbuf) { > + char *dest = s->buf; > + char *src = s->buf; > + size_t new_len; > + > + while (*src) { > + int digit_value_0, digit_value_1; > + > + if (src[0] != '%') { > + *dest++ = *src++; > + continue; > + } > + src++; > + > + digit_value_0 = digit_value(*src++, errbuf); > + if (digit_value_0 < 0) > + return 1; > + digit_value_1 = digit_value(*src++, errbuf); > + if (digit_value_1 < 0) > + return 1; > + *dest++ = digit_value_0 * 16 + digit_value_1; > + } > + new_len = dest - s->buf; > + strbuf_remove(s, new_len, s->len - new_len); > + > + return 0; > +} > + > +static const char *RESERVED_NON_WS = "~`!@#$^&*()[]{}\\;'\",<>?"; > + > +static int has_reserved_character( > + struct strbuf *sub_spec, struct strbuf *errbuf) > +{ > + const char *c = sub_spec->buf; > + while (*c) { > + if (*c <= ' ' || strchr(RESERVED_NON_WS, *c)) > + goto found_reserved; > + c++; > + } > + > + return 0; > + > +found_reserved: What's the value of doing this in a goto instead of embedded in the while loop? 
> + if (errbuf) > + strbuf_addf(errbuf, > + "must escape char in sub-filter-spec: '%c'", > + *c); > + return 1; > +} > + > +static int parse_combine_filter( > + struct list_objects_filter_options *filter_options, > + const char *arg, > + struct strbuf *errbuf) > +{ > + struct strbuf **sub_specs = strbuf_split_str(arg, '+', 2); > + int result; > + > + if (!sub_specs[0]) { > + if (errbuf) > + strbuf_addf(errbuf, > + _("expected something after combine:")); > + result = 1; > + goto cleanup; > + } > + > + result = has_reserved_character(sub_specs[0], errbuf); > + if (result) > + goto cleanup; > + > + /* > + * Only decode the first sub-filter, since the rest will be decoded on > + * the recursive call. > + */ > + result = url_decode(sub_specs[0], errbuf); > + if (result) > + goto cleanup; > + > + if (!sub_specs[1]) { > + /* > + * There is only one sub-filter, so we don't need the > + * combine: - just parse it as a non-composite filter. > + */ > + result = gently_parse_list_objects_filter( > + filter_options, sub_specs[0]->buf, errbuf); > + goto cleanup; > + } > + > + /* Remove trailing "+" so we can parse it. */ > + assert(sub_specs[0]->buf[sub_specs[0]->len - 1] == '+'); > + strbuf_remove(sub_specs[0], sub_specs[0]->len - 1, 1); > + > + filter_options->choice = LOFC_COMBINE; > + filter_options->lhs = xcalloc(1, sizeof(*filter_options->lhs)); > + filter_options->rhs = xcalloc(1, sizeof(*filter_options->rhs)); > + > + result = gently_parse_list_objects_filter(filter_options->lhs, > + sub_specs[0]->buf, > + errbuf) || > + parse_combine_filter(filter_options->rhs, > + sub_specs[1]->buf, > + errbuf); I guess you're recursing to combine filter 2 onto filter 1 which has been combined onto filter 0 here. But why not just use a list or array? 
> + > +cleanup: > + strbuf_list_free(sub_specs); > + if (result) { > + list_objects_filter_release(filter_options); > + memset(filter_options, 0, sizeof(*filter_options)); > + } > + return result; > +} > + > int parse_list_objects_filter(struct list_objects_filter_options *filter_options, > const char *arg) > { > struct strbuf buf = STRBUF_INIT; > + filter_options->filter_spec = strdup(arg); > if (gently_parse_list_objects_filter(filter_options, arg, &buf)) > die("%s", buf.buf); > return 0; > } > > int opt_parse_list_objects_filter(const struct option *opt, > const char *arg, int unset) > { > struct list_objects_filter_options *filter_options = opt->value; > > @@ -127,23 +276,29 @@ void expand_list_objects_filter_spec( > else if (filter->choice == LOFC_TREE_DEPTH) > strbuf_addf(expanded_spec, "tree:%lu", > filter->tree_exclude_depth); > else > strbuf_addstr(expanded_spec, filter->filter_spec); > } > > void list_objects_filter_release( > struct list_objects_filter_options *filter_options) > { > + if (!filter_options) > + return; > free(filter_options->filter_spec); > free(filter_options->sparse_oid_value); > free(filter_options->sparse_path_value); > + list_objects_filter_release(filter_options->lhs); > + free(filter_options->lhs); > + list_objects_filter_release(filter_options->rhs); > + free(filter_options->rhs); Is there a reason that the free shouldn't be included in list_objects_filter_release()? Maybe this is a common style guideline I've missed, but it seems to me like I'd expect a magic memory cleanup function to do it all, and not leave it to me to free. 
> memset(filter_options, 0, sizeof(*filter_options)); > } > > void partial_clone_register( > const char *remote, > const struct list_objects_filter_options *filter_options) > { > /* > * Record the name of the partial clone remote in the > * config and in the global variable -- the latter is > @@ -171,14 +326,16 @@ void partial_clone_register( > } > > void partial_clone_get_default_filter_spec( > struct list_objects_filter_options *filter_options) > { > /* > * Parse default value, but silently ignore it if it is invalid. > */ > if (!core_partial_clone_filter_default) > return; > + > + filter_options->filter_spec = strdup(core_partial_clone_filter_default); > gently_parse_list_objects_filter(filter_options, > core_partial_clone_filter_default, > NULL); > } > diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h > index e3adc78ebf..6c0f0ecd08 100644 > --- a/list-objects-filter-options.h > +++ b/list-objects-filter-options.h > @@ -7,20 +7,21 @@ > /* > * The list of defined filters for list-objects. > */ > enum list_objects_filter_choice { > LOFC_DISABLED = 0, > LOFC_BLOB_NONE, > LOFC_BLOB_LIMIT, > LOFC_TREE_DEPTH, > LOFC_SPARSE_OID, > LOFC_SPARSE_PATH, > + LOFC_COMBINE, > LOFC__COUNT /* must be last */ > }; > > struct list_objects_filter_options { > /* > * 'filter_spec' is the raw argument value given on the command line > * or protocol request. (The part after the "--keyword=".) For > * commands that launch filtering sub-processes, or for communication > * over the network, don't use this value; use the result of > * expand_list_objects_filter_spec() instead. > @@ -32,28 +33,35 @@ struct list_objects_filter_options { > * the filtering algorithm to use. > */ > enum list_objects_filter_choice choice; > > /* > * Choice is LOFC_DISABLED because "--no-filter" was requested. > */ > unsigned int no_filter : 1; > > /* > - * Parsed values (fields) from within the filter-spec. 
These are > - * choice-specific; not all values will be defined for any given > - * choice. > + * BEGIN choice-specific parsed values from within the filter-spec. Only > + * some values will be defined for any given choice. > */ > + > struct object_id *sparse_oid_value; > char *sparse_path_value; > unsigned long blob_limit_value; > unsigned long tree_exclude_depth; > + > + /* LOFC_COMBINE values */ > + struct list_objects_filter_options *lhs, *rhs; > + > + /* > + * END choice-specific parsed values. > + */ > }; > > /* Normalized command line arguments */ > #define CL_ARG__FILTER "filter" > > int parse_list_objects_filter( > struct list_objects_filter_options *filter_options, > const char *arg); > > int opt_parse_list_objects_filter(const struct option *opt, > diff --git a/list-objects-filter.c b/list-objects-filter.c > index 8e8616b9b8..b97277a46f 100644 > --- a/list-objects-filter.c > +++ b/list-objects-filter.c > @@ -453,34 +453,148 @@ static void filter_sparse_path__init( > > ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc); > d->array_frame[d->nr].defval = 0; /* default to include */ > d->array_frame[d->nr].child_prov_omit = 0; > > ctx->filter_fn = filter_sparse; > ctx->free_fn = filter_sparse_free; > ctx->data = d; > } > > +struct filter_combine_data { > + /* sub[0] corresponds to lhs, sub[1] to rhs. */ Jeff H had a comment about this too, but this seems unwieldy for >2 filters. (I also personally don't like using set index to incidate lhs/rhs.) Why not an array of multiple `struct sub`? There's a macro utility to generate types and helpers for an array of arbitrary struct that may suit... 
> + struct { > + struct filter_context ctx; > + struct oidset seen; > + struct object_id skip_tree; > + unsigned is_skipping_tree : 1; > + } sub[2]; > + > + struct oidset rhs_omits; > +}; > + > +static void add_all(struct oidset *dest, struct oidset *src) { > + struct oidset_iter iter; > + struct object_id *src_oid; > + > + oidset_iter_init(src, &iter); > + while ((src_oid = oidset_iter_next(&iter)) != NULL) > + oidset_insert(dest, src_oid); > +} > + > +static void filter_combine_free(void *filter_data) > +{ > + struct filter_combine_data *d = filter_data; > + int i; > + > + /* Anything omitted by rhs should be added to the overall omits set. */ > + if (d->sub[0].ctx.omits) > + add_all(d->sub[0].ctx.omits, d->sub[1].ctx.omits); > + > + for (i = 0; i < 2; i++) { > + list_objects_filter__release(&d->sub[i].ctx); > + oidset_clear(&d->sub[i].seen); > + } > + oidset_clear(&d->rhs_omits); > + free(d); > +} > + > +static int should_delegate(enum list_objects_filter_situation filter_situation, > + struct object *obj, > + struct filter_combine_data *d, > + int side) > +{ > + if (!d->sub[side].is_skipping_tree) > + return 1; > + if (filter_situation == LOFS_END_TREE && > + oideq(&obj->oid, &d->sub[side].skip_tree)) { > + d->sub[side].is_skipping_tree = 0; > + return 1; > + } > + return 0; > +} > + > +static enum list_objects_filter_result filter_combine( > + struct repository *r, > + enum list_objects_filter_situation filter_situation, > + struct object *obj, > + const char *pathname, > + const char *filename, > + struct filter_context *ctx) > +{ > + struct filter_combine_data *d = ctx->data; > + enum list_objects_filter_result result[2]; > + enum list_objects_filter_result combined_result = LOFR_ZERO; > + int i; > + > + for (i = 0; i < 2; i++) { I suppose your lhs and rhs are in sub[0] and sub[1] in part for the sake of this loop. 
But I think it would be easier to understand what is going on if you were to perform the loop contents in a helper function (as the name of the function would provide some more documentation). > + if (oidset_contains(&d->sub[i].seen, &obj->oid) || > + !should_delegate(filter_situation, obj, d, i)) { > + result[i] = LOFR_ZERO; > + continue; > + } > + > + result[i] = d->sub[i].ctx.filter_fn( > + r, filter_situation, obj, pathname, filename, > + &d->sub[i].ctx); > + > + if (result[i] & LOFR_MARK_SEEN) > + oidset_insert(&d->sub[i].seen, &obj->oid); > + > + if (result[i] & LOFR_SKIP_TREE) { > + d->sub[i].is_skipping_tree = 1; > + d->sub[i].skip_tree = obj->oid; > + } > + } > + > + if ((result[0] & LOFR_DO_SHOW) && (result[1] & LOFR_DO_SHOW)) > + combined_result |= LOFR_DO_SHOW; > + if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree) > + combined_result |= LOFR_SKIP_TREE; > + > + return combined_result; > +} I see that you tested that >2 filters works okay. But by doing it the way you have it seems like you're setting up to need recursion all over the place to check against all the filters. I suppose I don't see the benefit of doing all this recursively, as compared to doing it iteratively. - Emily
On Tue, May 28, 2019 at 10:59:31AM -0700, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
>
> > In the RFC version, there was discussion [2] of the wire format
> > and the need to be backwards compatible with existing servers and
> > so use the "combine:" syntax so that we only have a single filter
> > line on the wire. Would it be better to have compliant servers
> > advertise a "filters" (plural) capability in addition to the

This is a good idea and I hadn't considered it. It does seem to make the
repeated filter lines a safer bet than I thought.

> > existing "filter" (singular) capability? Then the client would
> > know that it could send a series of filter lines using the existing
> > syntax. Likewise, if the "filters" capability was omitted, the
> > client could error out without the extra round-trip.
>
> All good ideas.

After hacking the code halfway together to make the above idea work, and
learning quite a lot in the process, I saw set_git_option in transport.c
and realized that all existing transport options are assumed to be
? (0 or 1) rather than * (0 or more). So "filter" would be the first
transport option that is repeated.

Even though multiple reviewers have weighed in supporting repeated
filter lines, I'm still conflicted about it. It seems the drawback to
the + syntax is the requirement for encoding the individual filters, but
this encoding is no longer required since the sparse:path=... filter no
longer has to be supported. And the URL encoding, if it is ever
reintroduced, is just boilerplate and is unlikely to change later or
cause a significant maintenance burden.

The essence of the repeated filter line is that we need additional
high-level machinery just for the sake of making the lower-level
machinery... marginally simpler, hopefully? And if we ever need to add
new filter combinations (like OR or XOR rather than AND), this repeated
filter line thing will be a legacy annoyance (users will wonder why
repeated "filter" means AND rather than one of the other supported
combination methods). Repeating filter lines seems like a leaky
abstraction to me.

It would help me if someone reiterated why the repeated filter lines are
a good idea in light of the fact that URL escaping is no longer required
to make it work.
On 5/29/2019 11:02 AM, Matthew DeVore wrote: > On Tue, May 28, 2019 at 10:59:31AM -0700, Junio C Hamano wrote: >> Jeff Hostetler <git@jeffhostetler.com> writes: >> >>> In the RFC version, there was discussion [2] of the wire format >>> and the need to be backwards compatible with existing servers and >>> so use the "combine:" syntax so that we only have a single filter >>> line on the wire. Would it be better to have compliant servers >>> advertise a "filters" (plural) capability in addition to the > > This is a good idea and I hadn't considered it. It does seem to make the > repeated filter lines a safer bet than I though. > >>> existing "filter" (singular) capability? Then the client would >>> know that it could send a series of filter lines using the existing >>> syntax. Likewise, if the "filters" capability was omitted, the >>> client could error out without the extra round-trip. >> >> All good ideas. > > After hacking the code halfway together to make the above idea work, and > learning quite a lot in the process, I saw set_git_option in transport.c and > realized that all existing transport options are assumed to be ? (0 or 1) rather > than * (0 or more). So "filter" would be the first transport option that is > repeated. > > Even though multiple reviewers have weighed in supporting repeated filter lines, > I'm still conflicted about it. It seems the drawback to the + syntax is the > requirement for encoding the individual filters, but this encoding is no longer > required since the sparse:path=... filter no longer has to be supported. And the > URL encoding, if it is ever reintroduced, is just boilerplate and is unlikely to > change later or cause a significant maintainance burden. Was sparse:path filter the only reason for needing all the URL encoding? The sparse:oid form allows values <ref>:<path> and these (or at least the <path> portion) may contain special characters. So don't we need to URL encode this form too? 
> > The essence of the repeated filter line is that we need additional high-level > machinery just for the sake of making the lower-level machinery... marginally > simpler, hopefully? And if we ever need to add new filter combinations (like OR > or XOR rather than AND) this repeated filter line thing will be a legacy > annoyance (users will wonder why does repeated "filter" mean AND rather than > one of the other supported combination methods?). Repeating filter lines seems > like a leaky abstraction to me. > > I would be helped if someone re-iterated why the repeated filter lines are a > good idea in light of the fact that URL escaping is no longer required to make > it work. >
On Wed, May 29, 2019 at 05:29:14PM -0400, Jeff Hostetler wrote:
> Was sparse:path filter the only reason for needing all the URL encoding?
> The sparse:oid form allows values <ref>:<path> and these (or at least
> the <path> portion) may contain special characters. So don't we need to
> URL encode this form too?

Oh, I missed this. I was only thinking an oid was allowed after
"sparse:". So as I suspected I was overlooking something obvious.

Now I just want to understand the objection to URL encoding a little
better. I haven't worked on a project that requires a lot of boilerplate
before, so I may be asking obvious things again. If so, sorry in
advance.

So the objections, as I interpret them so far, are that:

a. the URL encoding/decoding complicates the code base
b. explaining the URL encoding, while it allows for future expansion,
   requires some verbose documentation in git-rev-list that is
   potentially distracting or confusing
c. there may be a better way to allow for future expansion that does not
   require URL encoding
d. the URL encoding is unpleasant to use (note that my patchset makes it
   optional for the user to use and it is only mandatory in sending it
   over the wire)

I think these are reasonable and I'm willing to stop digging my heels in
:) Does the above sum everything up?
On 5/29/2019 7:27 PM, Matthew DeVore wrote: > On Wed, May 29, 2019 at 05:29:14PM -0400, Jeff Hostetler wrote: >> Was sparse:path filter the only reason for needing all the URL encoding? >> The sparse:oid form allows values <ref>:<path> and these (or at least >> the <path> portion) may contain special characters. So don't we need to >> URL encode this form too? > > Oh, I missed this. I was only thinking an oid was allowed after "sparse:". So as > I suspected I was overlooking something obvious. > > Now I just want to understand the objection to URL encoding a little better. I > haven't worked with in a project that requires a lot of boilerplate before, so I > may be asking obvious things again. If so, sorry in advance. > > So the objections, as I interpret them so far, are that: > > a the URL encoding/decoding complicates the code base > b explaining the URL encoding, while it allows for future expansion, requires > some verbose documentation in git-rev-list that is potentially distracting or > confusing > c there may be a better way to allow for future expansion that does not require > URL encoding > d the URL encoding is unpleasant to use (note that my patchset makes it > optional for the user to use and it is only mandatory in sending it over the > wire) > > I think these are reasonable and I'm willing to stop digging my heels in :) Does > the above sum everything up? > My primary concern was how awkward it would be to use the URL encoding syntax on the command line, but as you say, that can be avoided by using the multiple --filter args. And to be honest, the wire format is hidden from user view, so it doesn't really matter there. So either approach is fine. I was hoping that the "filters (plural)" approach would let us avoid URL encoding, but that comes with its own baggage as you suggested. And besides, URL encoding is well-understood. And I don't want to prematurely complicate this with ANDs ORs and XORs as you mention in another thread. 
So don't let me stop this effort.

BTW, I don't think I've seen this mentioned anywhere and I don't remember if
this got into the code or not. But we discussed having a repo-local config
setting to remember the filter-spec used by the partial clone that would be
inherited by a subsequent (partial) fetch. Or would be set by the first partial
fetch following a normal clone. Having a single composite filter spec would
help with this.

Jeff
On Tue, May 28, 2019 at 02:53:59PM -0700, Emily Shaffer wrote:
> > +	} else if (skip_prefix(arg, "combine:", &v0)) {
> > +		int sub_parse_res = parse_combine_filter(
> > +			filter_options, v0, errbuf);
> > +		if (sub_parse_res)
> > +			return sub_parse_res;
> > +		return 0;
>
> Couldn't the three lines above be said more succinctly as "return
> sub_parse_res;"?

Oh yes, that's much better. Don't even need the sub_parse_res variable.

> > +static int digit_value(int c, struct strbuf *errbuf) {
> > +	if (c >= '0' && c <= '9')
> > +		return c - '0';
> > +	if (c >= 'a' && c <= 'f')
> > +		return c - 'a' + 10;
> > +	if (c >= 'A' && c <= 'F')
> > +		return c - 'A' + 10;
>
> I'm sure there's something I'm missing here. But why are you manually
> decoding hex instead of using strtol or sscanf or something?

I'll have to give this a try. Thank you for the suggestion.

> > +static int has_reserved_character(
> > +	struct strbuf *sub_spec, struct strbuf *errbuf)
> > +{
> > +	const char *c = sub_spec->buf;
> > +	while (*c) {
> > +		if (*c <= ' ' || strchr(RESERVED_NON_WS, *c))
> > +			goto found_reserved;
> > +		c++;
> > +	}
> > +
> > +	return 0;
> > +
> > +found_reserved:
>
> What's the value of doing this in a goto instead of embedded in the
> while loop?

That's to reduce indentation. Note that if I "inlined" the goto logic in the
while loop, I'd get at least 5 tabs of indentation, and the error message would
be split across a couple of lines.

> > +
> > +	result = gently_parse_list_objects_filter(filter_options->lhs,
> > +						  sub_specs[0]->buf,
> > +						  errbuf) ||
> > +		 parse_combine_filter(filter_options->rhs,
> > +				      sub_specs[1]->buf,
> > +				      errbuf);
>
> I guess you're recursing to combine filter 2 onto filter 1 which has
> been combined onto filter 0 here. But why not just use a list or array?

I switched this to use an array at your and Jeff's proddings, and it's much
better now. Thanks! It will be in the next roll-up.
> >
> > void list_objects_filter_release(
> > 	struct list_objects_filter_options *filter_options)
> > {
> > +	if (!filter_options)
> > +		return;
> > 	free(filter_options->filter_spec);
> > 	free(filter_options->sparse_oid_value);
> > 	free(filter_options->sparse_path_value);
> > +	list_objects_filter_release(filter_options->lhs);
> > +	free(filter_options->lhs);
> > +	list_objects_filter_release(filter_options->rhs);
> > +	free(filter_options->rhs);
>
> Is there a reason that the free shouldn't be included in
> list_objects_filter_release()? Maybe this is a common style guideline
> I've missed, but it seems to me like I'd expect a magic memory cleanup
> function to do it all, and not leave it to me to free.

Because there are a couple of places where the list_objects_filter_options
struct is allocated on the stack or inline in some other struct. This is
similar to how strbuf and other such utility structs are used.

> Jeff H had a comment about this too, but this seems unwieldy for >2
> filters. (I also personally don't like using set index to indicate
> lhs/rhs.) Why not an array of multiple `struct sub`? There's a macro
> utility to generate types and helpers for an array of arbitrary struct
> that may suit...

This code is cleaner now that it's using an array.

> > +static enum list_objects_filter_result filter_combine(
> > +	struct repository *r,
> > +	enum list_objects_filter_situation filter_situation,
> > +	struct object *obj,
> > +	const char *pathname,
> > +	const char *filename,
> > +	struct filter_context *ctx)
> > +{
> > +	struct filter_combine_data *d = ctx->data;
> > +	enum list_objects_filter_result result[2];
> > +	enum list_objects_filter_result combined_result = LOFR_ZERO;
> > +	int i;
> > +
> > +	for (i = 0; i < 2; i++) {
>
> I suppose your lhs and rhs are in sub[0] and sub[1] in part for the sake
> of this loop. But I think it would be easier to understand what is going
> on if you were to perform the loop contents in a helper function (as the
> name of the function would provide some more documentation).

Agreed, this is how it will be done in the next roll-up.

> I see that you tested that >2 filters works okay. But by doing it the
> way you have it seems like you're setting up to need recursion all over
> the place to check against all the filters. I suppose I don't see the
> benefit of doing all this recursively, as compared to doing it
> iteratively.

Somehow, the recursive approach made more sense to me when I was first writing
the code. But using an array is nicer.
On Thu, May 30, 2019 at 10:01:47AM -0400, Jeff Hostetler wrote:
> BTW, I don't think I've seen this mentioned anywhere and I don't
> remember if this got into the code or not. But we discussed having
> a repo-local config setting to remember the filter-spec used by the
> partial clone that would be inherited by a subsequent (partial) fetch.
> Or would be set by the first partial fetch following a normal clone.
> Having a single composite filter spec would help with this.

Isn't that what the partial_clone_get_default_filter_spec function is for? I
forgot about that. Perhaps with Emily's suggestion to use parsing functions in
the C library and the other cleanups I've applied since the first roll-up,
using the URL encoding will seem nicer. Let me try that...
On Fri, May 31, 2019 at 01:48:21PM -0700, Matthew DeVore wrote:
> > > +static int digit_value(int c, struct strbuf *errbuf) {
> > > +	if (c >= '0' && c <= '9')
> > > +		return c - '0';
> > > +	if (c >= 'a' && c <= 'f')
> > > +		return c - 'a' + 10;
> > > +	if (c >= 'A' && c <= 'F')
> > > +		return c - 'A' + 10;
> >
> > I'm sure there's something I'm missing here. But why are you manually
> > decoding hex instead of using strtol or sscanf or something?
>
> I'll have to give this a try. Thank you for the suggestion.

Try our hex_to_bytes() helper (or if you really want to go low-level, your
conditionals can be replaced by lookups in the hexval table).

-Peff
On Fri, May 24, 2019 at 05:01:15PM -0400, Jeff Hostetler wrote:
> We are allowing an unlimited number of filters in the composition.
> In the code, the compose filter data has space for a LHS and RHS, so
> I'm assuming we're mapping
>
>     --filter=f1 --filter=f2 --filter=f3 --filter=f4
>     or --filter=combine:f1+f2+f3+f4
> into basically
>     (compose f1 (compose f2 (compose f3 f4)))
>
> I wonder if it would be easier to understand if we just built an array
> or linked list, but I'll read on.

As I mentioned in earlier messages, I have changed this to use an array. It's
nicer now. (nit: the filters were left-associative rather than
right-associative)

> Should we swap the order of the terms in the || so that we always
> clear the d->sub[i].is_skipping_tree on LOFS_END_TREE?

Done, and added a comment:

	/*
	 * Check should_delegate before oidset_contains so that
	 * is_skipping_tree gets unset even when the object is marked as seen.
	 * As of this writing, no filter uses LOFR_MARK_SEEN on trees that also
	 * uses LOFR_SKIP_TREE, so the ordering is only theoretically
	 * important. Be cautious if you change the order of the below checks
	 * and more filters have been added!
	 */

> > +			result[i] = LOFR_ZERO;
> > +			continue;
> > +		}
> > +
> > +		result[i] = d->sub[i].ctx.filter_fn(
> > +			r, filter_situation, obj, pathname, filename,
> > +			&d->sub[i].ctx);
> > +
> > +		if (result[i] & LOFR_MARK_SEEN)
> > +			oidset_insert(&d->sub[i].seen, &obj->oid);
>
> So filter[i] has said it never wants to show this object (hard omit).
> And the guard at the top of the loop will prevent future invocations
> from checking it again if the object is revisited.

Yes.

> > +
> > +		if (result[i] & LOFR_SKIP_TREE) {
> > +			d->sub[i].is_skipping_tree = 1;
> > +			d->sub[i].skip_tree = obj->oid;
>
> So this marks the tree object at the top of the skip as far as
> filter[i] is concerned.

Yes.

> > +		}
> > +	}
> > +
> > +	if ((result[0] & LOFR_DO_SHOW) && (result[1] & LOFR_DO_SHOW))
> > +		combined_result |= LOFR_DO_SHOW;
> > +	if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree)
> > +		combined_result |= LOFR_SKIP_TREE;
>
> Something about the above bothers me, but I can't quite say what
> it is.

It looks nicer now that it's array-based. Let me know what you think after I
send the next roll-up.

> Do we need to do:
>     if ((result[0] & LOFR_MARK_SEEN) && (result[1] & LOFR_MARK_SEEN))
>         combined_result |= LOFR_MARK_SEEN;

This should be an O(1) sort of optimization, since if we don't set it, the top
filter will still be called, but won't delegate to any sub-filters. It doesn't
complicate the code much, so it seems worth it to add. Done.

> I'm out of time now, will pick this up again next week.

Thank you for taking a look and for your patience so far.
On Fri, May 31, 2019 at 05:10:42PM -0400, Jeff King wrote:
> On Fri, May 31, 2019 at 01:48:21PM -0700, Matthew DeVore wrote:
>
> > > > +static int digit_value(int c, struct strbuf *errbuf) {
> > > > +	if (c >= '0' && c <= '9')
> > > > +		return c - '0';
> > > > +	if (c >= 'a' && c <= 'f')
> > > > +		return c - 'a' + 10;
> > > > +	if (c >= 'A' && c <= 'F')
> > > > +		return c - 'A' + 10;
> > >
> > > I'm sure there's something I'm missing here. But why are you manually
> > > decoding hex instead of using strtol or sscanf or something?
> >
> > I'll have to give this a try. Thank you for the suggestion.
>
> Try our hex_to_bytes() helper (or if you really want to go low-level,
> your conditionals can be replaced by lookups in the hexval table).
>
> -Peff

Using hex_to_bytes worked out quite nicely, thanks!
On Fri, May 31, 2019 at 05:12:31PM -0700, Matthew DeVore wrote:
> On Fri, May 31, 2019 at 05:10:42PM -0400, Jeff King wrote:
> > On Fri, May 31, 2019 at 01:48:21PM -0700, Matthew DeVore wrote:
> >
> > > > > +static int digit_value(int c, struct strbuf *errbuf) {
> > > > > +	if (c >= '0' && c <= '9')
> > > > > +		return c - '0';
> > > > > +	if (c >= 'a' && c <= 'f')
> > > > > +		return c - 'a' + 10;
> > > > > +	if (c >= 'A' && c <= 'F')
> > > > > +		return c - 'A' + 10;
> > > >
> > > > I'm sure there's something I'm missing here. But why are you manually
> > > > decoding hex instead of using strtol or sscanf or something?
> > >
> > > I'll have to give this a try. Thank you for the suggestion.
> >
> > Try our hex_to_bytes() helper (or if you really want to go low-level,
> > your conditionals can be replaced by lookups in the hexval table).
>
> Using hex_to_bytes worked out quite nicely, thanks!

Great. We might want to stop there, but it's possible we could reuse even more
code. I didn't look closely before, but it seems this code is decoding a URL.
We already have a url_decode() routine in url.c. Could it be reused?

-Peff
On 5/31/2019 4:53 PM, Matthew DeVore wrote:
> On Thu, May 30, 2019 at 10:01:47AM -0400, Jeff Hostetler wrote:
>> BTW, I don't think I've seen this mentioned anywhere and I don't
>> remember if this got into the code or not. But we discussed having
>> a repo-local config setting to remember the filter-spec used by the
>> partial clone that would be inherited by a subsequent (partial) fetch.
>> Or would be set by the first partial fetch following a normal clone.
>> Having a single composite filter spec would help with this.
>
> Isn't that what the partial_clone_get_default_filter_spec function is for? I
> forgot about that. Perhaps with Emily's suggestion to use parsing functions in
> the C library and the other cleanups I've applied since the first roll-up,
> using the URL encoding will seem nicer. Let me try that...

Yes, thanks. That's what I was thinking about.

Jeff
On Mon, Jun 03, 2019 at 08:34:35AM -0400, Jeff King wrote: > Great. We might want to stop there, but it's possible could reuse even > more code. I didn't look closely before, but it seems this code is > decoding a URL. We already have a url_decode() routine in url.c. Could > it be reused? Very nice. Here is an interdiff and the changes will be included in v3 of my patchset: diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c index ed02c88eb6..0f135602a7 100644 --- a/list-objects-filter-options.c +++ b/list-objects-filter-options.c @@ -1,19 +1,20 @@ #include "cache.h" #include "commit.h" #include "config.h" #include "revision.h" #include "argv-array.h" #include "list-objects.h" #include "list-objects-filter.h" #include "list-objects-filter-options.h" #include "trace.h" +#include "url.h" static int parse_combine_filter( struct list_objects_filter_options *filter_options, const char *arg, struct strbuf *errbuf); /* * Parse value of the argument to the "filter" keyword. 
* On the command line this looks like: * --filter=<arg> @@ -84,54 +85,20 @@ static int gently_parse_list_objects_filter( * Please update _git_fetch() in git-completion.bash when you * add new filters */ strbuf_addf(errbuf, "invalid filter-spec '%s'", arg); memset(filter_options, 0, sizeof(*filter_options)); return 1; } -static int url_decode(struct strbuf *s, struct strbuf *errbuf) -{ - char *dest = s->buf; - char *src = s->buf; - size_t new_len; - - while (*src) { - if (src[0] != '%') { - *dest++ = *src++; - continue; - } - - if (hex_to_bytes((unsigned char *)dest, src + 1, 1)) { - strbuf_addstr(errbuf, - "error in filter-spec - " - "invalid hex sequence after %"); - return 1; - } - - if (!*dest) { - strbuf_addstr(errbuf, - "error in filter-spec - unexpected %00"); - return 1; - } - - src += 3; - dest++; - } - new_len = dest - s->buf; - strbuf_remove(s, new_len, s->len - new_len); - - return 0; -} - static const char *RESERVED_NON_WS = "~`!@#$^&*()[]{}\\;'\",<>?"; static int has_reserved_character( struct strbuf *sub_spec, struct strbuf *errbuf) { const char *c = sub_spec->buf; while (*c) { if (*c <= ' ' || strchr(RESERVED_NON_WS, *c)) { strbuf_addf(errbuf, "must escape char in sub-filter-spec: '%c'", @@ -147,56 +114,57 @@ static int has_reserved_character( static int parse_combine_subfilter( struct list_objects_filter_options *filter_options, struct strbuf *subspec, struct strbuf *errbuf) { size_t new_index = filter_options->sub_nr; ALLOC_GROW_BY(filter_options->sub, filter_options->sub_nr, 1, filter_options->sub_alloc); - return has_reserved_character(subspec, errbuf) || - url_decode(subspec, errbuf) || - gently_parse_list_objects_filter( - &filter_options->sub[new_index], subspec->buf, errbuf); + decoded = url_percent_decode(subspec->buf); + + result = gently_parse_list_objects_filter( + &filter_options->sub[new_index], decoded, errbuf); + + free(decoded); + return result; } static int parse_combine_filter( struct list_objects_filter_options *filter_options, 
const char *arg, struct strbuf *errbuf) { struct strbuf **subspecs = strbuf_split_str(arg, '+', 0); size_t sub; - int result; + int result = 0; if (!subspecs[0]) { strbuf_addf(errbuf, _("expected something after combine:")); result = 1; goto cleanup; } - for (sub = 0; subspecs[sub]; sub++) { + for (sub = 0; subspecs[sub] && !result; sub++) { if (subspecs[sub + 1]) { /* * This is not the last subspec. Remove trailing "+" so * we can parse it. */ size_t last = subspecs[sub]->len - 1; assert(subspecs[sub]->buf[last] == '+'); strbuf_remove(subspecs[sub], last, 1); } result = parse_combine_subfilter( filter_options, subspecs[sub], errbuf); - if (result) - goto cleanup; } filter_options->choice = LOFC_COMBINE; cleanup: strbuf_list_free(subspecs); if (result) { list_objects_filter_release(filter_options); memset(filter_options, 0, sizeof(*filter_options)); } diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh index 7fb5e50cde..e1bf3ed038 100755 --- a/t/t6112-rev-list-filters-objects.sh +++ b/t/t6112-rev-list-filters-objects.sh @@ -405,32 +405,20 @@ test_expect_success 'combine:... while URL-encoding things that should not be' ' test_expect_success 'combine: with nothing after the :' ' expect_invalid_filter_spec combine: "expected something after combine:" ' test_expect_success 'parse error in first sub-filter in combine:' ' expect_invalid_filter_spec combine:tree:asdf+blob:none \ "expected .tree:<depth>." ' -test_expect_success 'combine:... with invalid URL-encoded sequences' ' - # Not enough hex chars - expect_invalid_filter_spec combine:tree:2+blob:non%a \ - "error in filter-spec - invalid hex sequence after %" && - # Non-hex digit after % - expect_invalid_filter_spec combine:tree:2+blob%G5none \ - "error in filter-spec - invalid hex sequence after %" && - # Null byte encoded by % - expect_invalid_filter_spec combine:tree:2+blob%00none \ - "error in filter-spec - unexpected %00" -' - test_expect_success 'combine:... 
with non-encoded reserved chars' ' expect_invalid_filter_spec combine:tree:2+sparse:@xyz \ "must escape char in sub-filter-spec: .@." && expect_invalid_filter_spec combine:tree:2+sparse:\` \ "must escape char in sub-filter-spec: .\`." && expect_invalid_filter_spec combine:tree:2+sparse:~abc \ "must escape char in sub-filter-spec: .\~." ' test_expect_success 'validate err msg for "combine:<valid-filter>+"' ' diff --git a/url.c b/url.c index 25576c390b..bdede647bc 100644 --- a/url.c +++ b/url.c @@ -79,20 +79,26 @@ char *url_decode_mem(const char *url, int len) /* Skip protocol part if present */ if (colon && url < colon) { strbuf_add(&out, url, colon - url); len -= colon - url; url = colon; } return url_decode_internal(&url, len, NULL, &out, 0); } +char *url_percent_decode(const char *encoded) +{ + struct strbuf out = STRBUF_INIT; + return url_decode_internal(&encoded, strlen(encoded), NULL, &out, 0); +} + char *url_decode_parameter_name(const char **query) { struct strbuf out = STRBUF_INIT; return url_decode_internal(query, -1, "&=", &out, 1); } char *url_decode_parameter_value(const char **query) { struct strbuf out = STRBUF_INIT; return url_decode_internal(query, -1, "&", &out, 1); diff --git a/url.h b/url.h index 00b7d58c33..2a27c34277 100644 --- a/url.h +++ b/url.h @@ -1,16 +1,24 @@ #ifndef URL_H #define URL_H struct strbuf; int is_url(const char *url); int is_urlschemechar(int first_flag, int ch); char *url_decode(const char *url); char *url_decode_mem(const char *url, int len); + +/* + * Similar to the url_decode_{,mem} methods above, but doesn't assume there + * is a scheme followed by a : at the start of the string. Instead, %-sequences + * before any : are also parsed. 
+ */ +char *url_percent_decode(const char *encoded); + char *url_decode_parameter_name(const char **query); char *url_decode_parameter_value(const char **query); void end_url_with_slash(struct strbuf *buf, const char *url); void str_end_url_with_slash(const char *url, char **dest); #endif /* URL_H */
On Mon, Jun 03, 2019 at 03:22:47PM -0700, Matthew DeVore wrote:
> On Mon, Jun 03, 2019 at 08:34:35AM -0400, Jeff King wrote:
> > Great. We might want to stop there, but it's possible we could reuse even
> > more code. I didn't look closely before, but it seems this code is
> > decoding a URL. We already have a url_decode() routine in url.c. Could
> > it be reused?
>
> Very nice. Here is an interdiff and the changes will be included in v3 of my
> patchset:

Nice to see a reduction in duplication (and I see you found some problems in
the existing code elsewhere; thanks for cleaning that up).

> -	return has_reserved_character(subspec, errbuf) ||
> -	       url_decode(subspec, errbuf) ||
> -	       gently_parse_list_objects_filter(
> -		       &filter_options->sub[new_index], subspec->buf, errbuf);
> +	decoded = url_percent_decode(subspec->buf);

I think you can get rid of has_reserved_character() now, too.

The reserved character list is still used on the encoding side. But I think
you could switch to strbuf_add_urlencode() there?

-Peff
On Tue, Jun 04, 2019 at 12:13:32PM -0400, Jeff King wrote:
> > -	return has_reserved_character(subspec, errbuf) ||
> > -	       url_decode(subspec, errbuf) ||
> > -	       gently_parse_list_objects_filter(
> > -		       &filter_options->sub[new_index], subspec->buf, errbuf);
> > +	decoded = url_percent_decode(subspec->buf);
>
> I think you can get rid of has_reserved_character() now, too.

The purpose of has_reserved_character is to allow for future extensibility if
someone decides to implement a more sophisticated DSL and give meaning to these
characters. That may be a long shot, but it seems worth it.

> The reserved character list is still used on the encoding side. But I
> think you could switch to strbuf_add_urlencode() there?

strbuf_addstr_urlencode will either escape or not escape all rfc3986 reserved
characters, and that set includes both : and +. The former should not require
escaping since it's a common character in filter specs, and I would like the
hand-encoded combine specs to be relatively easy to type and read. The + must
be escaped since it is used as part of the combine:... syntax to delimit
sub-filters. So strbuf_addstr_urlencode would have to be more customizable to
make it work for this context. I'd like to add a parameterizable should_escape
predicate (i.e. a function pointer) which strbuf_addstr_urlencode accepts. I
actually think this will be more readable than the current strbuf API.
On Tue, Jun 04, 2019 at 10:19:52AM -0700, Matthew DeVore wrote:
> On Tue, Jun 04, 2019 at 12:13:32PM -0400, Jeff King wrote:
> > > -	return has_reserved_character(subspec, errbuf) ||
> > > -	       url_decode(subspec, errbuf) ||
> > > -	       gently_parse_list_objects_filter(
> > > -		       &filter_options->sub[new_index], subspec->buf, errbuf);
> > > +	decoded = url_percent_decode(subspec->buf);
> >
> > I think you can get rid of has_reserved_character() now, too.
>
> The purpose of has_reserved_character is to allow for future
> extensibility if someone decides to implement a more sophisticated DSL
> and give meaning to these characters. That may be a long shot, but it
> seems worth it.

I think you'll find that -Wunused-function complains, though, if nobody is
calling it. I wasn't sure if what you showed in the interdiff was meant to be
final (I had to add a few other variable declarations to make it compile, too).

> > The reserved character list is still used on the encoding side. But I
> > think you could switch to strbuf_add_urlencode() there?
>
> strbuf_addstr_urlencode will either escape or not escape all rfc3986
> reserved characters, and that set includes both : and +. The former
> should not require escaping since it's a common character in filter
> specs, and I would like the hand-encoded combine specs to be relatively
> easy to type and read. The + must be escaped since it is used as part of
> the combine:... syntax to delimit sub-filters. So
> strbuf_addstr_urlencode would have to be more customizable to make it
> work for this context. I'd like to add a parameterizable should_escape
> predicate (i.e. a function pointer) which strbuf_addstr_urlencode
> accepts. I actually think this will be more readable than the current
> strbuf API.

That makes some sense, and I agree that readability is a good goal. Do we not
need to be escaping colons in other URLs? Or are the strings you're generating
not true by-the-book URLs? I'm just wondering if we could take this opportunity
to improve the URLs we output elsewhere, too.

-Peff
On Tue, Jun 04, 2019 at 02:51:08PM -0400, Jeff King wrote: > > The purpose of has_reserved_character is to allow for future > > extensibility if someone decides to implement a more sophisticated DSL > > and give meaning to these characters. That may be a long-shot, but it > > seems worth it. > > I think you'll find that -Wunused-function complains, though, if nobody > is calling it. I wasn't sure if what you showed in the interdiff was > meant to be final (I had to add a few other variable declarations to > make it compile, too). Sorry, my last interdiff was a mess because I made a mistake during git rebase -i. It was missing a call to has_reserved_char. Below is another diff that fixes the problems: diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c index 0f135602a7..6b206dc58b 100644 --- a/list-objects-filter-options.c +++ b/list-objects-filter-options.c @@ -110,28 +110,31 @@ static int has_reserved_character( return 0; } static int parse_combine_subfilter( struct list_objects_filter_options *filter_options, struct strbuf *subspec, struct strbuf *errbuf) { size_t new_index = filter_options->sub_nr; + char *decoded; + int result; ALLOC_GROW_BY(filter_options->sub, filter_options->sub_nr, 1, filter_options->sub_alloc); decoded = url_percent_decode(subspec->buf); - result = gently_parse_list_objects_filter( - &filter_options->sub[new_index], decoded, errbuf); + result = has_reserved_character(subspec, errbuf) || + gently_parse_list_objects_filter( + &filter_options->sub[new_index], decoded, errbuf); free(decoded); return result; } static int parse_combine_filter( struct list_objects_filter_options *filter_options, const char *arg, struct strbuf *errbuf) { > > strbuf_addstr_urlencode will either escape or not escape all rfc3986 > > reserved characters, and that set includes both : and +. 
The former > > should not require escaping since it's a common character in filter > > specs, and I would like the hand-encoded combine specs to be relatively > > easy to type and read. The + must be escaped since it is used as part of > > the combine:... syntax to delimit sub filters. So > > strbuf_addstr_url_encode would have to be more customizable to make it > > work for this context. I'd like to add a parameterizable should_escape > > predicate (iow function pointer) which strbuf_addstr_urlencode accepts. > > I actually think this will be more readable than the current strbuf API. > > That makes some sense, and I agree that readability is a good goal. Do > we not need to be escaping colons in other URLs? Or are the strings > you're generating not true by-the-book URLs? I'm just wondering if we > could take this opportunity to improve the URLs we output elsewhere, > too. The strings I'm generating are not URLs. Also, in http.c, we have to use : to delimit a username and password: strbuf_addstr_urlencode(&s, proxy_auth.username, 1); strbuf_addch(&s, ':'); strbuf_addstr_urlencode(&s, proxy_auth.password, 1); I think this is dictated by libcurl and is not flexible.
On Tue, Jun 04, 2019 at 03:59:21PM -0700, Matthew DeVore wrote:
> > I think you'll find that -Wunused-function complains, though, if nobody
> > is calling it. I wasn't sure if what you showed in the interdiff was
> > meant to be final (I had to add a few other variable declarations to
> > make it compile, too).
>
> Sorry, my last interdiff was a mess because I made a mistake during git rebase
> -i. It was missing a call to has_reserved_char. Below is another diff that
> fixes the problems:

Ah, OK, that makes sense then (and keeping the function is obviously the right
thing to do).

> > That makes some sense, and I agree that readability is a good goal. Do
> > we not need to be escaping colons in other URLs? Or are the strings
> > you're generating not true by-the-book URLs? I'm just wondering if we
> > could take this opportunity to improve the URLs we output elsewhere,
> > too.
>
> The strings I'm generating are not URLs. Also, in http.c, we have to use : to
> delimit a username and password:
>
> 	strbuf_addstr_urlencode(&s, proxy_auth.username, 1);
> 	strbuf_addch(&s, ':');
> 	strbuf_addstr_urlencode(&s, proxy_auth.password, 1);
>
> I think this is dictated by libcurl and is not flexible.

Right, that has to be a real colon because it's syntactically significant (but
a colon in the username _must_ be encoded). That strbuf function doesn't really
understand whole URLs, and it's up to the caller to assemble the parts.

Anyway, we've veered off of your patch series enough. Yeah, it sounds like
using strbuf's url-encoding is not quite what you want.

-Peff
On Tue, Jun 04, 2019 at 07:14:18PM -0400, Jeff King wrote: > Right, that has to be a real colon because it's syntactically > significant (but a colon in the username _must_ be encoded). That strbuf > function doesn't really understand whole URLs, and it's up to the caller > to assemble the parts. > > Anyway, we've veered off of your patch series enough. Yeah, it sounds > like using strbuf's url-encoding is not quite what you want. I tried to do it anyway :) I think this makes the strbuf API a bit easier to reason about, and strbuf.h is a bit more self-documenting. WDYT? diff --git a/credential-store.c b/credential-store.c index ac295420dd..c010497cb2 100644 --- a/credential-store.c +++ b/credential-store.c @@ -65,29 +65,30 @@ static void rewrite_credential_file(const char *fn, struct credential *c, parse_credential_file(fn, c, NULL, print_line); if (commit_lock_file(&credential_lock) < 0) die_errno("unable to write credential store"); } static void store_credential_file(const char *fn, struct credential *c) { struct strbuf buf = STRBUF_INIT; strbuf_addf(&buf, "%s://", c->protocol); - strbuf_addstr_urlencode(&buf, c->username, 1); + strbuf_addstr_urlencode(&buf, c->username, is_rfc3986_unreserved); strbuf_addch(&buf, ':'); - strbuf_addstr_urlencode(&buf, c->password, 1); + strbuf_addstr_urlencode(&buf, c->password, is_rfc3986_unreserved); strbuf_addch(&buf, '@'); if (c->host) - strbuf_addstr_urlencode(&buf, c->host, 1); + strbuf_addstr_urlencode(&buf, c->host, is_rfc3986_unreserved); if (c->path) { strbuf_addch(&buf, '/'); - strbuf_addstr_urlencode(&buf, c->path, 0); + strbuf_addstr_urlencode(&buf, c->path, + is_rfc3986_reserved_or_unreserved); } rewrite_credential_file(fn, c, &buf); strbuf_release(&buf); } static void store_credential(const struct string_list *fns, struct credential *c) { struct string_list_item *fn; diff --git a/http.c b/http.c index 27aa0a3192..938b9e55af 100644 --- a/http.c +++ b/http.c @@ -506,23 +506,25 @@ static void var_override(const char 
**var, char *value) static void set_proxyauth_name_password(CURL *result) { #if LIBCURL_VERSION_NUM >= 0x071301 curl_easy_setopt(result, CURLOPT_PROXYUSERNAME, proxy_auth.username); curl_easy_setopt(result, CURLOPT_PROXYPASSWORD, proxy_auth.password); #else struct strbuf s = STRBUF_INIT; - strbuf_addstr_urlencode(&s, proxy_auth.username, 1); + strbuf_addstr_urlencode(&s, proxy_auth.username, + is_rfc3986_unreserved); strbuf_addch(&s, ':'); - strbuf_addstr_urlencode(&s, proxy_auth.password, 1); + strbuf_addstr_urlencode(&s, proxy_auth.password, + is_rfc3986_unreserved); curl_proxyuserpwd = strbuf_detach(&s, NULL); curl_easy_setopt(result, CURLOPT_PROXYUSERPWD, curl_proxyuserpwd); #endif } static void init_curl_proxy_auth(CURL *result) { if (proxy_auth.username) { if (!proxy_auth.password) credential_fill(&proxy_auth); diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c index 6b206dc58b..9a5677c2c8 100644 --- a/list-objects-filter-options.c +++ b/list-objects-filter-options.c @@ -167,30 +167,25 @@ static int parse_combine_filter( cleanup: strbuf_list_free(subspecs); if (result) { list_objects_filter_release(filter_options); memset(filter_options, 0, sizeof(*filter_options)); } return result; } -static void add_url_encoded(struct strbuf *dest, const char *s) +static int allow_unencoded(char ch) { - while (*s) { - if (*s <= ' ' || strchr(RESERVED_NON_WS, *s) || - *s == '%' || *s == '+') - strbuf_addf(dest, "%%%02X", (int)*s); - else - strbuf_addf(dest, "%c", *s); - s++; - } + if (ch <= ' ' || ch == '%' || ch == '+') + return 0; + return !strchr(RESERVED_NON_WS, ch); } /* * Changes filter_options into an equivalent LOFC_COMBINE filter options * instance. Does not do anything if filter_options is already LOFC_COMBINE. 
*/ static void transform_to_combine_type( struct list_objects_filter_options *filter_options) { assert(filter_options->choice); @@ -202,22 +197,23 @@ static void transform_to_combine_type( xcalloc(initial_sub_alloc, sizeof(*sub_array)); sub_array[0] = *filter_options; memset(filter_options, 0, sizeof(*filter_options)); filter_options->sub = sub_array; filter_options->sub_alloc = initial_sub_alloc; } filter_options->sub_nr = 1; filter_options->choice = LOFC_COMBINE; strbuf_init(&filter_options->filter_spec, 0); strbuf_addstr(&filter_options->filter_spec, "combine:"); - add_url_encoded(&filter_options->filter_spec, - filter_options->sub[0].filter_spec.buf); + strbuf_addstr_urlencode(&filter_options->filter_spec, + filter_options->sub[0].filter_spec.buf, + allow_unencoded); /* * We don't need the filter_spec strings for subfilter specs, only the * top level. */ strbuf_release(&filter_options->sub[0].filter_spec); } void list_objects_filter_die_if_populated( struct list_objects_filter_options *filter_options) { @@ -239,21 +235,22 @@ void parse_list_objects_filter( parse_error = gently_parse_list_objects_filter( filter_options, arg, &errbuf); } else { /* * Make filter_options an LOFC_COMBINE spec so we can trivially * add subspecs to it. 
*/ transform_to_combine_type(filter_options); strbuf_addstr(&filter_options->filter_spec, "+"); - add_url_encoded(&filter_options->filter_spec, arg); + strbuf_addstr_urlencode(&filter_options->filter_spec, arg, + allow_unencoded); trace_printf("Generated composite filter-spec: %s\n", filter_options->filter_spec.buf); ALLOC_GROW_BY(filter_options->sub, filter_options->sub_nr, 1, filter_options->sub_alloc); parse_error = gently_parse_list_objects_filter( &filter_options->sub[filter_options->sub_nr - 1], arg, &errbuf); } if (parse_error) diff --git a/strbuf.c b/strbuf.c index 0e18b259ce..60ab5144f2 100644 --- a/strbuf.c +++ b/strbuf.c @@ -767,55 +767,56 @@ void strbuf_addstr_xml_quoted(struct strbuf *buf, const char *s) case '&': strbuf_addstr(buf, "&"); break; case 0: return; } s++; } } -static int is_rfc3986_reserved(char ch) +int is_rfc3986_reserved_or_unreserved(char ch) { + if (is_rfc3986_unreserved(ch)) + return 1; switch (ch) { case '!': case '*': case '\'': case '(': case ')': case ';': case ':': case '@': case '&': case '=': case '+': case '$': case ',': case '/': case '?': case '#': case '[': case ']': return 1; } return 0; } -static int is_rfc3986_unreserved(char ch) +int is_rfc3986_unreserved(char ch) { return isalnum(ch) || ch == '-' || ch == '_' || ch == '.' 
|| ch == '~'; } static void strbuf_add_urlencode(struct strbuf *sb, const char *s, size_t len, - int reserved) + char_predicate allow_unencoded_fn) { strbuf_grow(sb, len); while (len--) { char ch = *s++; - if (is_rfc3986_unreserved(ch) || - (!reserved && is_rfc3986_reserved(ch))) + if (allow_unencoded_fn(ch)) strbuf_addch(sb, ch); else strbuf_addf(sb, "%%%02x", (unsigned char)ch); } } void strbuf_addstr_urlencode(struct strbuf *sb, const char *s, - int reserved) + char_predicate allow_unencoded_fn) { - strbuf_add_urlencode(sb, s, strlen(s), reserved); + strbuf_add_urlencode(sb, s, strlen(s), allow_unencoded_fn); } void strbuf_humanise_bytes(struct strbuf *buf, off_t bytes) { if (bytes > 1 << 30) { strbuf_addf(buf, "%u.%2.2u GiB", (unsigned)(bytes >> 30), (unsigned)(bytes & ((1 << 30) - 1)) / 10737419); } else if (bytes > 1 << 20) { unsigned x = bytes + 5243; /* for rounding */ diff --git a/strbuf.h b/strbuf.h index c8d98dfb95..346d722492 100644 --- a/strbuf.h +++ b/strbuf.h @@ -659,22 +659,27 @@ void strbuf_branchname(struct strbuf *sb, const char *name, unsigned allowed); /* * Like strbuf_branchname() above, but confirm that the result is * syntactically valid to be used as a local branch name in refs/heads/. * * The return value is "0" if the result is valid, and "-1" otherwise. */ int strbuf_check_branch_ref(struct strbuf *sb, const char *name); +typedef int (*char_predicate)(char ch); + +int is_rfc3986_unreserved(char ch); +int is_rfc3986_reserved_or_unreserved(char ch); + void strbuf_addstr_urlencode(struct strbuf *sb, const char *name, - int reserved); + char_predicate allow_unencoded_fn); __attribute__((format (printf,1,2))) int printf_ln(const char *fmt, ...); __attribute__((format (printf,2,3))) int fprintf_ln(FILE *fp, const char *fmt, ...); char *xstrdup_tolower(const char *); char *xstrdup_toupper(const char *); /**
On Tue, Jun 04, 2019 at 04:49:51PM -0700, Matthew DeVore wrote:

> I tried to do it anyway :) I think this makes the strbuf API a bit easier to
> reason about, and strbuf.h is a bit more self-documenting. WDYT?
>
> [...]
>
> +typedef int (*char_predicate)(char ch);
> +
> +int is_rfc3986_unreserved(char ch);
> +int is_rfc3986_reserved_or_unreserved(char ch);
> +
>  void strbuf_addstr_urlencode(struct strbuf *sb, const char *name,
> -			     int reserved);
> +			     char_predicate allow_unencoded_fn);

Yeah, that seems reasonable. I worry slightly about adding function-call
overhead to something that's processing a string character-by-character,
but these strings tend to be short and infrequent.

-Peff
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index ddbc1de43f..4fb0c4fbb0 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -730,20 +730,32 @@ specification contained in <path>.
 +
 The form '--filter=tree:<depth>' omits all blobs and trees whose depth
 from the root tree is >= <depth> (minimum depth if an object is located
 at multiple depths in the commits traversed). <depth>=0 will not include
 any trees or blobs unless included explicitly in the command-line (or
 standard input when --stdin is used). <depth>=1 will include only the
 tree and blobs which are referenced directly by a commit reachable from
 <commit> or an explicitly-given object. <depth>=2 is like <depth>=1
 while also including trees and blobs one more level removed from an
 explicitly-given commit or tree.
++
+The form '--filter=combine:<filter1>+<filter2>+...<filterN>' combines
+several filters. Only objects which are accepted by every filter are
+included. Filters are joined by '{plus}' and individual filters are %-encoded
+(i.e. URL-encoded). Besides the '{plus}' and '%' characters, the following
+characters are reserved and also must be encoded:
+`~!@#$^&*()[]{}\;",<>?`+'`+ as well as all characters with ASCII code
+<= `0x20`, which includes space and newline.
++
+Other arbitrary characters can also be encoded. For instance,
+'combine:tree:3+blob:none' and 'combine:tree%3A2+blob%3Anone' are
+equivalent.
 
 --no-filter::
 	Turn off any previous `--filter=` argument.
 
 --filter-print-omitted::
 	Only useful with `--filter=`; prints a list of the objects omitted
 	by the filter.  Object IDs are prefixed with a ``~'' character.
 
 --missing=<missing-action>::
 	A debug option to help with future "partial clone" development.
diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash index 3eefbabdb1..0fd0a10d0c 100644 --- a/contrib/completion/git-completion.bash +++ b/contrib/completion/git-completion.bash @@ -1529,21 +1529,21 @@ _git_difftool () __git_fetch_recurse_submodules="yes on-demand no" _git_fetch () { case "$cur" in --recurse-submodules=*) __gitcomp "$__git_fetch_recurse_submodules" "" "${cur##--recurse-submodules=}" return ;; --filter=*) - __gitcomp "blob:none blob:limit= sparse:oid= sparse:path=" "" "${cur##--filter=}" + __gitcomp "blob:none blob:limit= sparse:oid= sparse:path= combine: tree:" "" "${cur##--filter=}" return ;; --*) __gitcomp_builtin fetch return ;; esac __git_complete_remote_or_refspec } diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c index e46ea467bc..d7a1516188 100644 --- a/list-objects-filter-options.c +++ b/list-objects-filter-options.c @@ -1,19 +1,24 @@ #include "cache.h" #include "commit.h" #include "config.h" #include "revision.h" #include "argv-array.h" #include "list-objects.h" #include "list-objects-filter.h" #include "list-objects-filter-options.h" +static int parse_combine_filter( + struct list_objects_filter_options *filter_options, + const char *arg, + struct strbuf *errbuf); + /* * Parse value of the argument to the "filter" keyword. * On the command line this looks like: * --filter=<arg> * and in the pack protocol as: * "filter" SP <arg> * * The filter keyword will be used by many commands. * See Documentation/rev-list-options.txt for allowed values for <arg>. 
* @@ -31,22 +36,20 @@ static int gently_parse_list_objects_filter( if (filter_options->choice) { if (errbuf) { strbuf_addstr( errbuf, _("multiple filter-specs cannot be combined")); } return 1; } - filter_options->filter_spec = strdup(arg); - if (!strcmp(arg, "blob:none")) { filter_options->choice = LOFC_BLOB_NONE; return 0; } else if (skip_prefix(arg, "blob:limit=", &v0)) { if (git_parse_ulong(v0, &filter_options->blob_limit_value)) { filter_options->choice = LOFC_BLOB_LIMIT; return 0; } @@ -74,37 +77,183 @@ static int gently_parse_list_objects_filter( if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB, &sparse_oid, &oc)) filter_options->sparse_oid_value = oiddup(&sparse_oid); filter_options->choice = LOFC_SPARSE_OID; return 0; } else if (skip_prefix(arg, "sparse:path=", &v0)) { filter_options->choice = LOFC_SPARSE_PATH; filter_options->sparse_path_value = strdup(v0); return 0; + + } else if (skip_prefix(arg, "combine:", &v0)) { + int sub_parse_res = parse_combine_filter( + filter_options, v0, errbuf); + if (sub_parse_res) + return sub_parse_res; + return 0; + } /* * Please update _git_fetch() in git-completion.bash when you * add new filters */ if (errbuf) strbuf_addf(errbuf, _("invalid filter-spec '%s'"), arg); memset(filter_options, 0, sizeof(*filter_options)); return 1; } +static int digit_value(int c, struct strbuf *errbuf) { + if (c >= '0' && c <= '9') + return c - '0'; + if (c >= 'a' && c <= 'f') + return c - 'a' + 10; + if (c >= 'A' && c <= 'F') + return c - 'A' + 10; + + if (!errbuf) + return -1; + + strbuf_addf(errbuf, _("error in filter-spec - ")); + if (c) + strbuf_addf( + errbuf, + _("expect two hex digits after %%, but got: '%c'"), + c); + else + strbuf_addf( + errbuf, + _("not enough hex digits after %%; expected two")); + + return -1; +} + +static int url_decode(struct strbuf *s, struct strbuf *errbuf) { + char *dest = s->buf; + char *src = s->buf; + size_t new_len; + + while (*src) { + int digit_value_0, digit_value_1; + + if (src[0] != 
'%') { + *dest++ = *src++; + continue; + } + src++; + + digit_value_0 = digit_value(*src++, errbuf); + if (digit_value_0 < 0) + return 1; + digit_value_1 = digit_value(*src++, errbuf); + if (digit_value_1 < 0) + return 1; + *dest++ = digit_value_0 * 16 + digit_value_1; + } + new_len = dest - s->buf; + strbuf_remove(s, new_len, s->len - new_len); + + return 0; +} + +static const char *RESERVED_NON_WS = "~`!@#$^&*()[]{}\\;'\",<>?"; + +static int has_reserved_character( + struct strbuf *sub_spec, struct strbuf *errbuf) +{ + const char *c = sub_spec->buf; + while (*c) { + if (*c <= ' ' || strchr(RESERVED_NON_WS, *c)) + goto found_reserved; + c++; + } + + return 0; + +found_reserved: + if (errbuf) + strbuf_addf(errbuf, + "must escape char in sub-filter-spec: '%c'", + *c); + return 1; +} + +static int parse_combine_filter( + struct list_objects_filter_options *filter_options, + const char *arg, + struct strbuf *errbuf) +{ + struct strbuf **sub_specs = strbuf_split_str(arg, '+', 2); + int result; + + if (!sub_specs[0]) { + if (errbuf) + strbuf_addf(errbuf, + _("expected something after combine:")); + result = 1; + goto cleanup; + } + + result = has_reserved_character(sub_specs[0], errbuf); + if (result) + goto cleanup; + + /* + * Only decode the first sub-filter, since the rest will be decoded on + * the recursive call. + */ + result = url_decode(sub_specs[0], errbuf); + if (result) + goto cleanup; + + if (!sub_specs[1]) { + /* + * There is only one sub-filter, so we don't need the + * combine: - just parse it as a non-composite filter. + */ + result = gently_parse_list_objects_filter( + filter_options, sub_specs[0]->buf, errbuf); + goto cleanup; + } + + /* Remove trailing "+" so we can parse it. 
*/ + assert(sub_specs[0]->buf[sub_specs[0]->len - 1] == '+'); + strbuf_remove(sub_specs[0], sub_specs[0]->len - 1, 1); + + filter_options->choice = LOFC_COMBINE; + filter_options->lhs = xcalloc(1, sizeof(*filter_options->lhs)); + filter_options->rhs = xcalloc(1, sizeof(*filter_options->rhs)); + + result = gently_parse_list_objects_filter(filter_options->lhs, + sub_specs[0]->buf, + errbuf) || + parse_combine_filter(filter_options->rhs, + sub_specs[1]->buf, + errbuf); + +cleanup: + strbuf_list_free(sub_specs); + if (result) { + list_objects_filter_release(filter_options); + memset(filter_options, 0, sizeof(*filter_options)); + } + return result; +} + int parse_list_objects_filter(struct list_objects_filter_options *filter_options, const char *arg) { struct strbuf buf = STRBUF_INIT; + filter_options->filter_spec = strdup(arg); if (gently_parse_list_objects_filter(filter_options, arg, &buf)) die("%s", buf.buf); return 0; } int opt_parse_list_objects_filter(const struct option *opt, const char *arg, int unset) { struct list_objects_filter_options *filter_options = opt->value; @@ -127,23 +276,29 @@ void expand_list_objects_filter_spec( else if (filter->choice == LOFC_TREE_DEPTH) strbuf_addf(expanded_spec, "tree:%lu", filter->tree_exclude_depth); else strbuf_addstr(expanded_spec, filter->filter_spec); } void list_objects_filter_release( struct list_objects_filter_options *filter_options) { + if (!filter_options) + return; free(filter_options->filter_spec); free(filter_options->sparse_oid_value); free(filter_options->sparse_path_value); + list_objects_filter_release(filter_options->lhs); + free(filter_options->lhs); + list_objects_filter_release(filter_options->rhs); + free(filter_options->rhs); memset(filter_options, 0, sizeof(*filter_options)); } void partial_clone_register( const char *remote, const struct list_objects_filter_options *filter_options) { /* * Record the name of the partial clone remote in the * config and in the global variable -- the latter is @@ -171,14 
+326,16 @@ void partial_clone_register( } void partial_clone_get_default_filter_spec( struct list_objects_filter_options *filter_options) { /* * Parse default value, but silently ignore it if it is invalid. */ if (!core_partial_clone_filter_default) return; + + filter_options->filter_spec = strdup(core_partial_clone_filter_default); gently_parse_list_objects_filter(filter_options, core_partial_clone_filter_default, NULL); } diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h index e3adc78ebf..6c0f0ecd08 100644 --- a/list-objects-filter-options.h +++ b/list-objects-filter-options.h @@ -7,20 +7,21 @@ /* * The list of defined filters for list-objects. */ enum list_objects_filter_choice { LOFC_DISABLED = 0, LOFC_BLOB_NONE, LOFC_BLOB_LIMIT, LOFC_TREE_DEPTH, LOFC_SPARSE_OID, LOFC_SPARSE_PATH, + LOFC_COMBINE, LOFC__COUNT /* must be last */ }; struct list_objects_filter_options { /* * 'filter_spec' is the raw argument value given on the command line * or protocol request. (The part after the "--keyword=".) For * commands that launch filtering sub-processes, or for communication * over the network, don't use this value; use the result of * expand_list_objects_filter_spec() instead. @@ -32,28 +33,35 @@ struct list_objects_filter_options { * the filtering algorithm to use. */ enum list_objects_filter_choice choice; /* * Choice is LOFC_DISABLED because "--no-filter" was requested. */ unsigned int no_filter : 1; /* - * Parsed values (fields) from within the filter-spec. These are - * choice-specific; not all values will be defined for any given - * choice. + * BEGIN choice-specific parsed values from within the filter-spec. Only + * some values will be defined for any given choice. */ + struct object_id *sparse_oid_value; char *sparse_path_value; unsigned long blob_limit_value; unsigned long tree_exclude_depth; + + /* LOFC_COMBINE values */ + struct list_objects_filter_options *lhs, *rhs; + + /* + * END choice-specific parsed values. 
+ */ }; /* Normalized command line arguments */ #define CL_ARG__FILTER "filter" int parse_list_objects_filter( struct list_objects_filter_options *filter_options, const char *arg); int opt_parse_list_objects_filter(const struct option *opt, diff --git a/list-objects-filter.c b/list-objects-filter.c index 8e8616b9b8..b97277a46f 100644 --- a/list-objects-filter.c +++ b/list-objects-filter.c @@ -453,34 +453,148 @@ static void filter_sparse_path__init( ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc); d->array_frame[d->nr].defval = 0; /* default to include */ d->array_frame[d->nr].child_prov_omit = 0; ctx->filter_fn = filter_sparse; ctx->free_fn = filter_sparse_free; ctx->data = d; } +struct filter_combine_data { + /* sub[0] corresponds to lhs, sub[1] to rhs. */ + struct { + struct filter_context ctx; + struct oidset seen; + struct object_id skip_tree; + unsigned is_skipping_tree : 1; + } sub[2]; + + struct oidset rhs_omits; +}; + +static void add_all(struct oidset *dest, struct oidset *src) { + struct oidset_iter iter; + struct object_id *src_oid; + + oidset_iter_init(src, &iter); + while ((src_oid = oidset_iter_next(&iter)) != NULL) + oidset_insert(dest, src_oid); +} + +static void filter_combine_free(void *filter_data) +{ + struct filter_combine_data *d = filter_data; + int i; + + /* Anything omitted by rhs should be added to the overall omits set. 
*/ + if (d->sub[0].ctx.omits) + add_all(d->sub[0].ctx.omits, d->sub[1].ctx.omits); + + for (i = 0; i < 2; i++) { + list_objects_filter__release(&d->sub[i].ctx); + oidset_clear(&d->sub[i].seen); + } + oidset_clear(&d->rhs_omits); + free(d); +} + +static int should_delegate(enum list_objects_filter_situation filter_situation, + struct object *obj, + struct filter_combine_data *d, + int side) +{ + if (!d->sub[side].is_skipping_tree) + return 1; + if (filter_situation == LOFS_END_TREE && + oideq(&obj->oid, &d->sub[side].skip_tree)) { + d->sub[side].is_skipping_tree = 0; + return 1; + } + return 0; +} + +static enum list_objects_filter_result filter_combine( + struct repository *r, + enum list_objects_filter_situation filter_situation, + struct object *obj, + const char *pathname, + const char *filename, + struct filter_context *ctx) +{ + struct filter_combine_data *d = ctx->data; + enum list_objects_filter_result result[2]; + enum list_objects_filter_result combined_result = LOFR_ZERO; + int i; + + for (i = 0; i < 2; i++) { + if (oidset_contains(&d->sub[i].seen, &obj->oid) || + !should_delegate(filter_situation, obj, d, i)) { + result[i] = LOFR_ZERO; + continue; + } + + result[i] = d->sub[i].ctx.filter_fn( + r, filter_situation, obj, pathname, filename, + &d->sub[i].ctx); + + if (result[i] & LOFR_MARK_SEEN) + oidset_insert(&d->sub[i].seen, &obj->oid); + + if (result[i] & LOFR_SKIP_TREE) { + d->sub[i].is_skipping_tree = 1; + d->sub[i].skip_tree = obj->oid; + } + } + + if ((result[0] & LOFR_DO_SHOW) && (result[1] & LOFR_DO_SHOW)) + combined_result |= LOFR_DO_SHOW; + if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree) + combined_result |= LOFR_SKIP_TREE; + + return combined_result; +} + +static void filter_combine__init( + struct list_objects_filter_options *filter_options, + struct filter_context *ctx) +{ + struct filter_combine_data *d = xcalloc(1, sizeof(*d)); + + if (ctx->omits) + oidset_init(&d->rhs_omits, 16); + + list_objects_filter__init(ctx->omits, 
filter_options->lhs, + &d->sub[0].ctx); + list_objects_filter__init(&d->rhs_omits, filter_options->rhs, + &d->sub[1].ctx); + + ctx->filter_fn = filter_combine; + ctx->free_fn = filter_combine_free; + ctx->data = d; +} + typedef void (*filter_init_fn)( struct list_objects_filter_options *filter_options, struct filter_context *ctx); /* * Must match "enum list_objects_filter_choice". */ static filter_init_fn s_filters[] = { NULL, filter_blobs_none__init, filter_blobs_limit__init, filter_trees_depth__init, filter_sparse_oid__init, filter_sparse_path__init, + filter_combine__init, }; void list_objects_filter__init( struct oidset *omitted, struct list_objects_filter_options *filter_options, struct filter_context *ctx) { filter_init_fn init_fn; assert((sizeof(s_filters) / sizeof(s_filters[0])) == LOFC__COUNT); diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh index 9c11427719..ddfacb1a1a 100755 --- a/t/t6112-rev-list-filters-objects.sh +++ b/t/t6112-rev-list-filters-objects.sh @@ -284,21 +284,33 @@ test_expect_success 'verify tree:0 includes trees in "filtered" output' ' # Make sure tree:0 does not iterate through any trees. test_expect_success 'verify skipping tree iteration when not collecting omits' ' GIT_TRACE=1 git -C r3 rev-list \ --objects --filter=tree:0 HEAD 2>filter_trace && grep "Skipping contents of tree [.][.][.]" filter_trace >actual && # One line for each commit traversed. test_line_count = 2 actual && # Make sure no other trees were considered besides the root. - ! grep "Skipping contents of tree [^.]" filter_trace + ! grep "Skipping contents of tree [^.]" filter_trace && + + # Try this again with "combine:". If both sub-filters are skipping + # trees, the composite filter should also skip trees. This is not + # important unless the user does combine:tree:X+tree:Y or another filter + # besides "tree:" is implemented in the future which can skip trees. 
+ GIT_TRACE=1 git -C r3 rev-list \ + --objects --filter=combine:tree:1+tree:3 HEAD 2>filter_trace && + + # Only skip the dir1/ tree, which is shared between the two commits. + grep "Skipping contents of tree " filter_trace >actual && + test_write_lines "Skipping contents of tree dir1/..." >expected && + test_cmp expected actual ' # Test tree:# filters. expect_has () { commit=$1 && name=$2 && hash=$(git -C r3 rev-parse $commit:$name) && grep "^$hash $name$" actual @@ -336,20 +348,134 @@ test_expect_success 'verify tree:3 includes everything expected' ' expect_has HEAD dir1/sparse1 && expect_has HEAD dir1/sparse2 && expect_has HEAD pattern && expect_has HEAD sparse1 && expect_has HEAD sparse2 && # There are also 2 commit objects test_line_count = 10 actual ' +test_expect_success 'combine:... for a simple combination' ' + git -C r3 rev-list --objects --filter=combine:tree:2+blob:none HEAD \ + >actual && + + expect_has HEAD "" && + expect_has HEAD~1 "" && + expect_has HEAD dir1 && + + # There are also 2 commit objects + test_line_count = 5 actual +' + +test_expect_success 'combine:... with URL encoding' ' + git -C r3 rev-list --objects \ + --filter=combine:tree%3a2+blob:%6Eon%65 HEAD >actual && + + expect_has HEAD "" && + expect_has HEAD~1 "" && + expect_has HEAD dir1 && + + # There are also 2 commit objects + test_line_count = 5 actual +' + +expect_invalid_filter_spec () { + spec="$1" && + err="$2" && + + test_must_fail git -C r3 rev-list --objects --filter="$spec" HEAD \ + >actual 2>actual_stderr && + test_must_be_empty actual && + test_i18ngrep "$err" actual_stderr +} + +test_expect_success 'combine:... 
while URL-encoding things that should not be' ' + expect_invalid_filter_spec combine%3Atree:2+blob:none \ + "invalid filter-spec" +' + +test_expect_success 'combine: with nothing after the :' ' + expect_invalid_filter_spec combine: "expected something after combine:" +' + +test_expect_success 'parse error in first sub-filter in combine:' ' + expect_invalid_filter_spec combine:tree:asdf+blob:none \ + "expected .tree:<depth>." +' + +test_expect_success 'combine:... with invalid URL-encoded sequences' ' + expect_invalid_filter_spec combine:tree:2+blob:non%a \ + "error in filter-spec - not enough hex digits after %" && + # Edge cases for non-hex chars: "Gg/:" + expect_invalid_filter_spec combine:tree:2+blob%G5none \ + "error in filter-spec - expect two hex digits .*: .G." && + expect_invalid_filter_spec combine:tree:2+blob%g5none \ + "error in filter-spec - expect two hex digits .*: .g." && + expect_invalid_filter_spec combine:tree:2+blob%5/none \ + "error in filter-spec - expect two hex digits .*: ./." && + expect_invalid_filter_spec combine:%:5tree:2+blob:none \ + "error in filter-spec - expect two hex digits .*: .:." +' + +test_expect_success 'combine:... with non-encoded reserved chars' ' + expect_invalid_filter_spec combine:tree:2+sparse:@xyz \ + "must escape char in sub-filter-spec: .@." && + expect_invalid_filter_spec combine:tree:2+sparse:\` \ + "must escape char in sub-filter-spec: .\`." && + expect_invalid_filter_spec combine:tree:2+sparse:~abc \ + "must escape char in sub-filter-spec: .\~." +' + +test_expect_success 'validate err msg for "combine:<valid-filter>+"' ' + expect_invalid_filter_spec combine:tree:2+ "expected .tree:<depth>." +' + +test_expect_success 'combine:... 
with edge-case hex digits: Ff Aa 0 9' ' + git -C r3 rev-list --objects --filter="combine:tree:2+bl%6Fb:n%6fne" \ + HEAD >actual && + test_line_count = 5 actual && + git -C r3 rev-list --objects --filter="combine:tree%3A2+blob%3anone" \ + HEAD >actual && + test_line_count = 5 actual && + git -C r3 rev-list --objects --filter="combine:tree:%30" HEAD >actual && + test_line_count = 2 actual && + git -C r3 rev-list --objects --filter="combine:tree:%39+blob:none" \ + HEAD >actual && + test_line_count = 5 actual +' + +test_expect_success 'combine:... with more than two sub-filters' ' + git -C r3 rev-list --objects \ + --filter=combine:tree:3+blob:limit=40+sparse:path=../pattern1 \ + HEAD >actual && + + expect_has HEAD "" && + expect_has HEAD~1 "" && + expect_has HEAD dir1 && + expect_has HEAD dir1/sparse1 && + expect_has HEAD dir1/sparse2 && + + # Should also have 2 commits + test_line_count = 7 actual && + + # Try again, this time making sure the last sub-filter is only + # URL-decoded once. + cp pattern1 pattern1+renamed% && + cp actual expect && + + git -C r3 rev-list --objects \ + --filter=combine:tree:3+blob:limit=40+sparse:path=../pattern1%2brenamed%25 \ + HEAD >actual && + test_cmp expect actual +' + # Test provisional omit collection logic with a repo that has objects appearing # at multiple depths - first deeper than the filter's threshold, then shallow. test_expect_success 'setup r4' ' git init r4 && echo foo > r4/foo && mkdir r4/subdir && echo bar > r4/subdir/bar && @@ -379,20 +505,51 @@ test_expect_success 'test tree:# filter provisional omit for blob and tree' ' test_expect_success 'verify skipping tree iteration when collecting omits' ' GIT_TRACE=1 git -C r4 rev-list --filter-print-omitted \ --objects --filter=tree:0 HEAD 2>filter_trace && grep "^Skipping contents of tree " filter_trace >actual && echo "Skipping contents of tree subdir/..." 
>expect && test_cmp expect actual ' +test_expect_success 'setup r5' ' + git init r5 && + mkdir -p r5/subdir && + + echo 1 >r5/short-root && + echo 12345 >r5/long-root && + echo a >r5/subdir/short-subdir && + echo abcde >r5/subdir/long-subdir && + + git -C r5 add short-root long-root subdir && + git -C r5 commit -m "commit msg" +' + +test_expect_success 'verify collecting omits in combined: filter' ' + # Note that this test guards against the naive implementation of simply + # giving both filters the same "omits" set and expecting it to + # automatically merge them. + git -C r5 rev-list --objects --quiet --filter-print-omitted \ + --filter=combine:tree:2+blob:limit=3 HEAD >actual && + + # Expect 0 trees/commits, 3 blobs omitted (all blobs except short-root) + omitted_1=$(echo 12345 | git hash-object --stdin) && + omitted_2=$(echo a | git hash-object --stdin) && + omitted_3=$(echo abcde | git hash-object --stdin) && + + grep ~$omitted_1 actual && + grep ~$omitted_2 actual && + grep ~$omitted_3 actual && + test_line_count = 3 actual +' + # Test tree:<depth> where a tree is iterated to twice - once where a subentry is # too deep to be included, and again where the blob inside it is shallow enough # to be included. This makes sure we don't use LOFR_MARK_SEEN incorrectly (we # can't use it because a tree can be iterated over again at a lower depth). test_expect_success 'tree:<depth> where we iterate over tree at two levels' ' git init r5 && mkdir -p r5/a/subdir/b && echo foo > r5/a/subdir/b/foo &&
Allow combining filters such that only objects accepted by all filters
are shown. The motivation for this is to allow getting directory
listings without also fetching blobs. This can be done by combining
blob:none with tree:<depth>. There are massive repositories that have
larger-than-expected trees - even if you include only a single commit.

The current usage requires passing the filter to rev-list, or sending
it over the wire, as:

	combine:<FILTER1>+<FILTER2>

(e.g.: git rev-list --filter=combine:tree:2+blob:limit=32k). This is
potentially awkward because individual filters must be URL-encoded if
they contain '+' or '%'. This can potentially be improved by supporting
a repeated flag syntax, e.g.:

	$ git rev-list --filter=tree:2 --filter=blob:limit=32k

Such usage is currently an error, so giving it a meaning is
backwards-compatible.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 Documentation/rev-list-options.txt     |  12 ++
 contrib/completion/git-completion.bash |   2 +-
 list-objects-filter-options.c          | 161 ++++++++++++++++++++++++-
 list-objects-filter-options.h          |  14 ++-
 list-objects-filter.c                  | 114 +++++++++++++++++
 t/t6112-rev-list-filters-objects.sh    | 159 +++++++++++++++++++++++-
 6 files changed, 455 insertions(+), 7 deletions(-)