[v1,3/5] list-objects-filter: implement composite filters

Message ID	1f95597eedc4c651868601c0ff7c4a4d97ca4457.1558484115.git.matvore@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Tue, 21 May 2019 17:21:52 -0700 In-Reply-To: <cover.1558484115.git.matvore@google.com> Message-Id: <1f95597eedc4c651868601c0ff7c4a4d97ca4457.1558484115.git.matvore@google.com> Mime-Version: 1.0 References: <cover.1558484115.git.matvore@google.com> Subject: [PATCH v1 3/5] From: Matthew DeVore <matvore@google.com> To: jonathantanmy@google.com, jrn@google.com, git@vger.kernel.org, dstolee@microsoft.com, jeffhost@microsoft.com, jrnieder@gmail.com, pclouds@gmail.com Cc: Matthew DeVore <matvore@google.com>, matvore@comcast.net Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Filter combination \| expand [v1,0/5] Filter combination [v1,1/5] list-objects-filter: refactor into a context struct [v1,2/5] list-objects-filter-options: error is localizeable [v1,3/5] list-objects-filter: implement composite filters [v1,4/5] list-objects-filter-options: move error check up [v1,5/5] list-objects-filter-options: allow mult. --filter

Matthew DeVore May 22, 2019, 12:21 a.m. UTC

Allow combining filters such that only objects accepted by all filters
are shown. The motivation for this is to allow getting directory
listings without also fetching blobs. This can be done by combining
blob:none with tree:<depth>. There are massive repositories that have
larger-than-expected trees - even if you include only a single commit.

The current usage requires passing the filter to rev-list, or sending
it over the wire, as:

	combine:<FILTER1>+<FILTER2>

(i.e.: git rev-list --filter=combine:tree:2+blob:limit=32k). This is
potentially awkward because individual filters must be URL-encoded if
they contain + or %. This can potentially be improved by supporting a
repeated flag syntax, e.g.:

	$ git rev-list --filter=tree:2 --filter=blob:limit=32k

Such usage is currently an error, so giving it a meaning is backwards-
compatible.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 Documentation/rev-list-options.txt     |  12 ++
 contrib/completion/git-completion.bash |   2 +-
 list-objects-filter-options.c          | 161 ++++++++++++++++++++++++-
 list-objects-filter-options.h          |  14 ++-
 list-objects-filter.c                  | 114 +++++++++++++++++
 t/t6112-rev-list-filters-objects.sh    | 159 +++++++++++++++++++++++-
 6 files changed, 455 insertions(+), 7 deletions(-)

Jeff Hostetler May 24, 2019, 9:01 p.m. UTC | #1

On 5/21/2019 8:21 PM, Matthew DeVore wrote:
> Allow combining filters such that only objects accepted by all filters
> are shown. The motivation for this is to allow getting directory
> listings without also fetching blobs. This can be done by combining
> blob:none with tree:<depth>. There are massive repositories that have
> larger-than-expected trees - even if you include only a single commit.
> 
> The current usage requires passing the filter to rev-list, or sending
> it over the wire, as:
> 
> 	combine:<FILTER1>+<FILTER2>

I must admit I'm not a fan of this syntax and the URL-encoding
that it requires.  I see that this was already discussed in the
RFC version [1] last week, but I'll repeat it here.

I like the repeated used of the "--filter=<f_k>" command line option.


In the RFC version, there was discussion [2] of the wire format
and the need to be backwards compatible with existing servers and
so use the "combine:" syntax so that we only have a single filter
line on the wire.  Would it be better to have compliant servers
advertise a "filters" (plural) capability in addition to the
existing "filter" (singular) capability?  Then the client would
know that it could send a series of filter lines using the existing
syntax.  Likewise, if the "filters" capability was omitted, the
client could error out without the extra round-trip.


[1] https://public-inbox.org/git/xmqqwoip3gp0.fsf@gitster-ct.c.googlers.com/
[2] 
https://public-inbox.org/git/1E174CAA-BD57-400B-A83B-4AABFAFBC04B@comcast.net/


[...]
>   standard input when --stdin is used). <depth>=1 will include only the
>   tree and blobs which are referenced directly by a commit reachable from
>   <commit> or an explicitly-given object. <depth>=2 is like <depth>=1
>   while also including trees and blobs one more level removed from an
>   explicitly-given commit or tree.
> ++
> +The form '--filter=combine:<filter1>+<filter2>+...<filterN>' combines
> +several filters.

We are allowing an unlimited number of filters in the composition.
In the code, the compose filter data has space for a LHS and RHS, so
I'm assuming we're mapping

     --filter=f1 --filter=f2 --filter=f3 --filter=f4
or  --filter=combine:f1+f2+f3+f4
into basically
     (compose f1 (compose f2 (compose (f3 f4)))

I wonder if it would be easier to understand if we just built an array
or linked list, but I'll read on.

>                    Only objects which are accepted by every filter are
> +included. Filters are joined by '{plus}' and individual filters are %-encoded
> +(i.e. URL-encoded). Besides the '{plus}' and '%' characters, the following
> +characters are reserved and also must be encoded:
> +`~!@#$^&*()[]{}\;",<>?`+&#39;&#96;+ as well as all characters with ASCII code
> +&lt;= `0x20`, which includes space and newline.
[...]


> diff --git a/list-objects-filter.c b/list-objects-filter.c
> index 8e8616b9b8..b97277a46f 100644
> --- a/list-objects-filter.c
> +++ b/list-objects-filter.c
> @@ -453,34 +453,148 @@ static void filter_sparse_path__init(
>   
>   	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
>   	d->array_frame[d->nr].defval = 0; /* default to include */
>   	d->array_frame[d->nr].child_prov_omit = 0;
>   
>   	ctx->filter_fn = filter_sparse;
>   	ctx->free_fn = filter_sparse_free;
>   	ctx->data = d;
>   }
>   
> +struct filter_combine_data {
> +	/* sub[0] corresponds to lhs, sub[1] to rhs. */
> +	struct {
> +		struct filter_context ctx;
> +		struct oidset seen;
> +		struct object_id skip_tree;
> +		unsigned is_skipping_tree : 1;
> +	} sub[2];
> +
> +	struct oidset rhs_omits;
> +};
> +
> +static void add_all(struct oidset *dest, struct oidset *src) {
> +	struct oidset_iter iter;
> +	struct object_id *src_oid;
> +
> +	oidset_iter_init(src, &iter);
> +	while ((src_oid = oidset_iter_next(&iter)) != NULL)
> +		oidset_insert(dest, src_oid);
> +}
> +
> +static void filter_combine_free(void *filter_data)
> +{
> +	struct filter_combine_data *d = filter_data;
> +	int i;
> +
> +	/* Anything omitted by rhs should be added to the overall omits set. */
> +	if (d->sub[0].ctx.omits)
> +		add_all(d->sub[0].ctx.omits, d->sub[1].ctx.omits);
> +
> +	for (i = 0; i < 2; i++) {
> +		list_objects_filter__release(&d->sub[i].ctx);
> +		oidset_clear(&d->sub[i].seen);
> +	}
> +	oidset_clear(&d->rhs_omits);
> +	free(d);
> +}
> +
> +static int should_delegate(enum list_objects_filter_situation filter_situation,
> +			   struct object *obj,
> +			   struct filter_combine_data *d,
> +			   int side)
> +{
> +	if (!d->sub[side].is_skipping_tree)
> +		return 1;
> +	if (filter_situation == LOFS_END_TREE &&
> +		oideq(&obj->oid, &d->sub[side].skip_tree)) {
> +		d->sub[side].is_skipping_tree = 0;
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static enum list_objects_filter_result filter_combine(
> +	struct repository *r,
> +	enum list_objects_filter_situation filter_situation,
> +	struct object *obj,
> +	const char *pathname,
> +	const char *filename,
> +	struct filter_context *ctx)
> +{
> +	struct filter_combine_data *d = ctx->data;
> +	enum list_objects_filter_result result[2];
> +	enum list_objects_filter_result combined_result = LOFR_ZERO;
> +	int i;
> +
> +	for (i = 0; i < 2; i++) {
> +		if (oidset_contains(&d->sub[i].seen, &obj->oid) ||
> +			!should_delegate(filter_situation, obj, d, i)) {

Should we swap the order of the terms in the || so that we always
clear the d->sub[i].is_skipping_tree on LOFS_END_TREE ?


> +			result[i] = LOFR_ZERO;
> +			continue;
> +		}
> +
> +		result[i] = d->sub[i].ctx.filter_fn(
> +			r, filter_situation, obj, pathname, filename,
> +			&d->sub[i].ctx);
> +
> +		if (result[i] & LOFR_MARK_SEEN)
> +			oidset_insert(&d->sub[i].seen, &obj->oid);

So filter[i] has said it never wants to show this object (hard omit).
And the guard at the top of the loop will prevent future invocations
from checking it again if the object is revisited.

> +
> +		if (result[i] & LOFR_SKIP_TREE) {
> +			d->sub[i].is_skipping_tree = 1;
> +			d->sub[i].skip_tree = obj->oid;

So this marks the tree object at the top of the skip as far as
filter[i] is concerned.

> +		}
> +	}
> +
> +	if ((result[0] & LOFR_DO_SHOW) && (result[1] & LOFR_DO_SHOW))
> +		combined_result |= LOFR_DO_SHOW;
> +	if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree)
> +		combined_result |= LOFR_SKIP_TREE;

Something about the above bothers me, but I can't quite say what
it is.

Do we need to do:
     if ((result[0] & LOFR_MARK_SEEN) && (result[1] & LOFR_MARK_SEEN))
         combined_result |= LOFR_MARK_SEEN;


> +
> +	return combined_result;
> +}
[...]


I'm out of time now, will pick this up again next week.

Thanks
Jeff

Junio C Hamano May 28, 2019, 5:59 p.m. UTC | #2

Jeff Hostetler <git@jeffhostetler.com> writes:

> In the RFC version, there was discussion [2] of the wire format
> and the need to be backwards compatible with existing servers and
> so use the "combine:" syntax so that we only have a single filter
> line on the wire.  Would it be better to have compliant servers
> advertise a "filters" (plural) capability in addition to the
> existing "filter" (singular) capability?  Then the client would
> know that it could send a series of filter lines using the existing
> syntax.  Likewise, if the "filters" capability was omitted, the
> client could error out without the extra round-trip.

All good ideas.

Emily Shaffer May 28, 2019, 9:53 p.m. UTC | #3

On Tue, May 21, 2019 at 05:21:52PM -0700, Matthew DeVore wrote:
> Allow combining filters such that only objects accepted by all filters
> are shown. The motivation for this is to allow getting directory
> listings without also fetching blobs. This can be done by combining
> blob:none with tree:<depth>. There are massive repositories that have
> larger-than-expected trees - even if you include only a single commit.
> 
> The current usage requires passing the filter to rev-list, or sending
> it over the wire, as:
> 
> 	combine:<FILTER1>+<FILTER2>
> 
> (i.e.: git rev-list --filter=combine:tree:2+blob:limit=32k). This is
> potentially awkward because individual filters must be URL-encoded if
> they contain + or %. This can potentially be improved by supporting a
> repeated flag syntax, e.g.:
> 
> 	$ git rev-list --filter=tree:2 --filter=blob:limit=32k
> 
> Such usage is currently an error, so giving it a meaning is backwards-
> compatible.
> 
> Signed-off-by: Matthew DeVore <matvore@google.com>
> ---
>  Documentation/rev-list-options.txt     |  12 ++
>  contrib/completion/git-completion.bash |   2 +-
>  list-objects-filter-options.c          | 161 ++++++++++++++++++++++++-
>  list-objects-filter-options.h          |  14 ++-
>  list-objects-filter.c                  | 114 +++++++++++++++++
>  t/t6112-rev-list-filters-objects.sh    | 159 +++++++++++++++++++++++-
>  6 files changed, 455 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
> index ddbc1de43f..4fb0c4fbb0 100644
> --- a/Documentation/rev-list-options.txt
> +++ b/Documentation/rev-list-options.txt
> @@ -730,20 +730,32 @@ specification contained in <path>.
>  +
>  The form '--filter=tree:<depth>' omits all blobs and trees whose depth
>  from the root tree is >= <depth> (minimum depth if an object is located
>  at multiple depths in the commits traversed). <depth>=0 will not include
>  any trees or blobs unless included explicitly in the command-line (or
>  standard input when --stdin is used). <depth>=1 will include only the
>  tree and blobs which are referenced directly by a commit reachable from
>  <commit> or an explicitly-given object. <depth>=2 is like <depth>=1
>  while also including trees and blobs one more level removed from an
>  explicitly-given commit or tree.
> ++
> +The form '--filter=combine:<filter1>+<filter2>+...<filterN>' combines
> +several filters. Only objects which are accepted by every filter are
> +included. Filters are joined by '{plus}' and individual filters are %-encoded
> +(i.e. URL-encoded). Besides the '{plus}' and '%' characters, the following
> +characters are reserved and also must be encoded:
> +`~!@#$^&*()[]{}\;",<>?`+&#39;&#96;+ as well as all characters with ASCII code
> +&lt;= `0x20`, which includes space and newline.
> ++
> +Other arbitrary characters can also be encoded. For instance,
> +'combine:tree:3+blob:none' and 'combine:tree%3A2+blob%3Anone' are
> +equivalent.
>  
>  --no-filter::
>  	Turn off any previous `--filter=` argument.
>  
>  --filter-print-omitted::
>  	Only useful with `--filter=`; prints a list of the objects omitted
>  	by the filter.  Object IDs are prefixed with a ``~'' character.
>  
>  --missing=<missing-action>::
>  	A debug option to help with future "partial clone" development.
> diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
> index 3eefbabdb1..0fd0a10d0c 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -1529,21 +1529,21 @@ _git_difftool ()
>  __git_fetch_recurse_submodules="yes on-demand no"
>  
>  _git_fetch ()
>  {
>  	case "$cur" in
>  	--recurse-submodules=*)
>  		__gitcomp "$__git_fetch_recurse_submodules" "" "${cur##--recurse-submodules=}"
>  		return
>  		;;
>  	--filter=*)
> -		__gitcomp "blob:none blob:limit= sparse:oid= sparse:path=" "" "${cur##--filter=}"
> +		__gitcomp "blob:none blob:limit= sparse:oid= sparse:path= combine: tree:" "" "${cur##--filter=}"
>  		return
>  		;;
>  	--*)
>  		__gitcomp_builtin fetch
>  		return
>  		;;
>  	esac
>  	__git_complete_remote_or_refspec
>  }
>  
> diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
> index e46ea467bc..d7a1516188 100644
> --- a/list-objects-filter-options.c
> +++ b/list-objects-filter-options.c
> @@ -1,19 +1,24 @@
>  #include "cache.h"
>  #include "commit.h"
>  #include "config.h"
>  #include "revision.h"
>  #include "argv-array.h"
>  #include "list-objects.h"
>  #include "list-objects-filter.h"
>  #include "list-objects-filter-options.h"
>  
> +static int parse_combine_filter(
> +	struct list_objects_filter_options *filter_options,
> +	const char *arg,
> +	struct strbuf *errbuf);
> +
>  /*
>   * Parse value of the argument to the "filter" keyword.
>   * On the command line this looks like:
>   *       --filter=<arg>
>   * and in the pack protocol as:
>   *       "filter" SP <arg>
>   *
>   * The filter keyword will be used by many commands.
>   * See Documentation/rev-list-options.txt for allowed values for <arg>.
>   *
> @@ -31,22 +36,20 @@ static int gently_parse_list_objects_filter(
>  
>  	if (filter_options->choice) {
>  		if (errbuf) {
>  			strbuf_addstr(
>  				errbuf,
>  				_("multiple filter-specs cannot be combined"));
>  		}
>  		return 1;
>  	}
>  
> -	filter_options->filter_spec = strdup(arg);
> -
>  	if (!strcmp(arg, "blob:none")) {
>  		filter_options->choice = LOFC_BLOB_NONE;
>  		return 0;
>  
>  	} else if (skip_prefix(arg, "blob:limit=", &v0)) {
>  		if (git_parse_ulong(v0, &filter_options->blob_limit_value)) {
>  			filter_options->choice = LOFC_BLOB_LIMIT;
>  			return 0;
>  		}
>  
> @@ -74,37 +77,183 @@ static int gently_parse_list_objects_filter(
>  		if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB,
>  					  &sparse_oid, &oc))
>  			filter_options->sparse_oid_value = oiddup(&sparse_oid);
>  		filter_options->choice = LOFC_SPARSE_OID;
>  		return 0;
>  
>  	} else if (skip_prefix(arg, "sparse:path=", &v0)) {
>  		filter_options->choice = LOFC_SPARSE_PATH;
>  		filter_options->sparse_path_value = strdup(v0);
>  		return 0;
> +
> +	} else if (skip_prefix(arg, "combine:", &v0)) {
> +		int sub_parse_res = parse_combine_filter(
> +			filter_options, v0, errbuf);
> +		if (sub_parse_res)
> +			return sub_parse_res;
> +		return 0;

Couldn't the three lines above be said more succinctly as "return
sub_parse_res;"?

> +
>  	}
>  	/*
>  	 * Please update _git_fetch() in git-completion.bash when you
>  	 * add new filters
>  	 */
>  
>  	if (errbuf)
>  		strbuf_addf(errbuf, _("invalid filter-spec '%s'"), arg);
>  
>  	memset(filter_options, 0, sizeof(*filter_options));
>  	return 1;
>  }
>  
> +static int digit_value(int c, struct strbuf *errbuf) {
> +	if (c >= '0' && c <= '9')
> +		return c - '0';
> +	if (c >= 'a' && c <= 'f')
> +		return c - 'a' + 10;
> +	if (c >= 'A' && c <= 'F')
> +		return c - 'A' + 10;

I'm sure there's something I'm missing here. But why are you manually
decoding hex instead of using strtol or sscanf or something?

> +
> +	if (!errbuf)
> +		return -1;
> +
> +	strbuf_addf(errbuf, _("error in filter-spec - "));
> +	if (c)
> +		strbuf_addf(
> +			errbuf,
> +			_("expect two hex digits after %%, but got: '%c'"),
> +			c);
> +	else
> +		strbuf_addf(
> +			errbuf,
> +			_("not enough hex digits after %%; expected two"));
> +
> +	return -1;
> +}
> +
> +static int url_decode(struct strbuf *s, struct strbuf *errbuf) {
> +	char *dest = s->buf;
> +	char *src = s->buf;
> +	size_t new_len;
> +
> +	while (*src) {
> +		int digit_value_0, digit_value_1;
> +
> +		if (src[0] != '%') {
> +			*dest++ = *src++;
> +			continue;
> +		}
> +		src++;
> +
> +		digit_value_0 = digit_value(*src++, errbuf);
> +		if (digit_value_0 < 0)
> +			return 1;
> +		digit_value_1 = digit_value(*src++, errbuf);
> +		if (digit_value_1 < 0)
> +			return 1;
> +		*dest++ = digit_value_0 * 16 + digit_value_1;
> +	}
> +	new_len = dest - s->buf;
> +	strbuf_remove(s, new_len, s->len - new_len);
> +
> +	return 0;
> +}
> +
> +static const char *RESERVED_NON_WS = "~`!@#$^&*()[]{}\\;'\",<>?";
> +
> +static int has_reserved_character(
> +	struct strbuf *sub_spec, struct strbuf *errbuf)
> +{
> +	const char *c = sub_spec->buf;
> +	while (*c) {
> +		if (*c <= ' ' || strchr(RESERVED_NON_WS, *c))
> +			goto found_reserved;
> +		c++;
> +	}
> +
> +	return 0;
> +
> +found_reserved:

What's the value of doing this in a goto instead of embedded in the
while loop?

> +	if (errbuf)
> +		strbuf_addf(errbuf,
> +			    "must escape char in sub-filter-spec: '%c'",
> +			    *c);
> +	return 1;
> +}
> +
> +static int parse_combine_filter(
> +	struct list_objects_filter_options *filter_options,
> +	const char *arg,
> +	struct strbuf *errbuf)
> +{
> +	struct strbuf **sub_specs = strbuf_split_str(arg, '+', 2);
> +	int result;
> +
> +	if (!sub_specs[0]) {
> +		if (errbuf)
> +			strbuf_addf(errbuf,
> +				    _("expected something after combine:"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	result = has_reserved_character(sub_specs[0], errbuf);
> +	if (result)
> +		goto cleanup;
> +
> +	/*
> +	 * Only decode the first sub-filter, since the rest will be decoded on
> +	 * the recursive call.
> +	 */
> +	result = url_decode(sub_specs[0], errbuf);
> +	if (result)
> +		goto cleanup;
> +
> +	if (!sub_specs[1]) {
> +		/*
> +		 * There is only one sub-filter, so we don't need the
> +		 * combine: - just parse it as a non-composite filter.
> +		 */
> +		result = gently_parse_list_objects_filter(
> +			filter_options, sub_specs[0]->buf, errbuf);
> +		goto cleanup;
> +	}
> +
> +	/* Remove trailing "+" so we can parse it. */
> +	assert(sub_specs[0]->buf[sub_specs[0]->len - 1] == '+');
> +	strbuf_remove(sub_specs[0], sub_specs[0]->len - 1, 1);
> +
> +	filter_options->choice = LOFC_COMBINE;
> +	filter_options->lhs = xcalloc(1, sizeof(*filter_options->lhs));
> +	filter_options->rhs = xcalloc(1, sizeof(*filter_options->rhs));
> +
> +	result = gently_parse_list_objects_filter(filter_options->lhs,
> +						  sub_specs[0]->buf,
> +						  errbuf) ||
> +		parse_combine_filter(filter_options->rhs,
> +				      sub_specs[1]->buf,
> +				      errbuf);

I guess you're recursing to combine filter 2 onto filter 1 which has
been combined onto filter 0 here. But why not just use a list or array?

> +
> +cleanup:
> +	strbuf_list_free(sub_specs);
> +	if (result) {
> +		list_objects_filter_release(filter_options);
> +		memset(filter_options, 0, sizeof(*filter_options));
> +	}
> +	return result;
> +}
> +
>  int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
>  			      const char *arg)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> +	filter_options->filter_spec = strdup(arg);
>  	if (gently_parse_list_objects_filter(filter_options, arg, &buf))
>  		die("%s", buf.buf);
>  	return 0;
>  }
>  
>  int opt_parse_list_objects_filter(const struct option *opt,
>  				  const char *arg, int unset)
>  {
>  	struct list_objects_filter_options *filter_options = opt->value;
>  
> @@ -127,23 +276,29 @@ void expand_list_objects_filter_spec(
>  	else if (filter->choice == LOFC_TREE_DEPTH)
>  		strbuf_addf(expanded_spec, "tree:%lu",
>  			    filter->tree_exclude_depth);
>  	else
>  		strbuf_addstr(expanded_spec, filter->filter_spec);
>  }
>  
>  void list_objects_filter_release(
>  	struct list_objects_filter_options *filter_options)
>  {
> +	if (!filter_options)
> +		return;
>  	free(filter_options->filter_spec);
>  	free(filter_options->sparse_oid_value);
>  	free(filter_options->sparse_path_value);
> +	list_objects_filter_release(filter_options->lhs);
> +	free(filter_options->lhs);
> +	list_objects_filter_release(filter_options->rhs);
> +	free(filter_options->rhs);

Is there a reason that the free shouldn't be included in
list_objects_filter_release()? Maybe this is a common style guideline
I've missed, but it seems to me like I'd expect a magic memory cleanup
function to do it all, and not leave it to me to free.

>  	memset(filter_options, 0, sizeof(*filter_options));
>  }
>  
>  void partial_clone_register(
>  	const char *remote,
>  	const struct list_objects_filter_options *filter_options)
>  {
>  	/*
>  	 * Record the name of the partial clone remote in the
>  	 * config and in the global variable -- the latter is
> @@ -171,14 +326,16 @@ void partial_clone_register(
>  }
>  
>  void partial_clone_get_default_filter_spec(
>  	struct list_objects_filter_options *filter_options)
>  {
>  	/*
>  	 * Parse default value, but silently ignore it if it is invalid.
>  	 */
>  	if (!core_partial_clone_filter_default)
>  		return;
> +
> +	filter_options->filter_spec = strdup(core_partial_clone_filter_default);
>  	gently_parse_list_objects_filter(filter_options,
>  					 core_partial_clone_filter_default,
>  					 NULL);
>  }
> diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
> index e3adc78ebf..6c0f0ecd08 100644
> --- a/list-objects-filter-options.h
> +++ b/list-objects-filter-options.h
> @@ -7,20 +7,21 @@
>  /*
>   * The list of defined filters for list-objects.
>   */
>  enum list_objects_filter_choice {
>  	LOFC_DISABLED = 0,
>  	LOFC_BLOB_NONE,
>  	LOFC_BLOB_LIMIT,
>  	LOFC_TREE_DEPTH,
>  	LOFC_SPARSE_OID,
>  	LOFC_SPARSE_PATH,
> +	LOFC_COMBINE,
>  	LOFC__COUNT /* must be last */
>  };
>  
>  struct list_objects_filter_options {
>  	/*
>  	 * 'filter_spec' is the raw argument value given on the command line
>  	 * or protocol request.  (The part after the "--keyword=".)  For
>  	 * commands that launch filtering sub-processes, or for communication
>  	 * over the network, don't use this value; use the result of
>  	 * expand_list_objects_filter_spec() instead.
> @@ -32,28 +33,35 @@ struct list_objects_filter_options {
>  	 * the filtering algorithm to use.
>  	 */
>  	enum list_objects_filter_choice choice;
>  
>  	/*
>  	 * Choice is LOFC_DISABLED because "--no-filter" was requested.
>  	 */
>  	unsigned int no_filter : 1;
>  
>  	/*
> -	 * Parsed values (fields) from within the filter-spec.  These are
> -	 * choice-specific; not all values will be defined for any given
> -	 * choice.
> +	 * BEGIN choice-specific parsed values from within the filter-spec. Only
> +	 * some values will be defined for any given choice.
>  	 */
> +
>  	struct object_id *sparse_oid_value;
>  	char *sparse_path_value;
>  	unsigned long blob_limit_value;
>  	unsigned long tree_exclude_depth;
> +
> +	/* LOFC_COMBINE values */
> +	struct list_objects_filter_options *lhs, *rhs;
> +
> +	/*
> +	 * END choice-specific parsed values.
> +	 */
>  };
>  
>  /* Normalized command line arguments */
>  #define CL_ARG__FILTER "filter"
>  
>  int parse_list_objects_filter(
>  	struct list_objects_filter_options *filter_options,
>  	const char *arg);
>  
>  int opt_parse_list_objects_filter(const struct option *opt,
> diff --git a/list-objects-filter.c b/list-objects-filter.c
> index 8e8616b9b8..b97277a46f 100644
> --- a/list-objects-filter.c
> +++ b/list-objects-filter.c
> @@ -453,34 +453,148 @@ static void filter_sparse_path__init(
>  
>  	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
>  	d->array_frame[d->nr].defval = 0; /* default to include */
>  	d->array_frame[d->nr].child_prov_omit = 0;
>  
>  	ctx->filter_fn = filter_sparse;
>  	ctx->free_fn = filter_sparse_free;
>  	ctx->data = d;
>  }
>  
> +struct filter_combine_data {
> +	/* sub[0] corresponds to lhs, sub[1] to rhs. */

Jeff H had a comment about this too, but this seems unwieldy for >2
filters. (I also personally don't like using set index to incidate
lhs/rhs.) Why not an array of multiple `struct sub`? There's a macro
utility to generate types and helpers for an array of arbitrary struct
that may suit...

> +	struct {
> +		struct filter_context ctx;
> +		struct oidset seen;
> +		struct object_id skip_tree;
> +		unsigned is_skipping_tree : 1;
> +	} sub[2];
> +
> +	struct oidset rhs_omits;
> +};
> +
> +static void add_all(struct oidset *dest, struct oidset *src) {
> +	struct oidset_iter iter;
> +	struct object_id *src_oid;
> +
> +	oidset_iter_init(src, &iter);
> +	while ((src_oid = oidset_iter_next(&iter)) != NULL)
> +		oidset_insert(dest, src_oid);
> +}
> +
> +static void filter_combine_free(void *filter_data)
> +{
> +	struct filter_combine_data *d = filter_data;
> +	int i;
> +
> +	/* Anything omitted by rhs should be added to the overall omits set. */
> +	if (d->sub[0].ctx.omits)
> +		add_all(d->sub[0].ctx.omits, d->sub[1].ctx.omits);
> +
> +	for (i = 0; i < 2; i++) {
> +		list_objects_filter__release(&d->sub[i].ctx);
> +		oidset_clear(&d->sub[i].seen);
> +	}
> +	oidset_clear(&d->rhs_omits);
> +	free(d);
> +}
> +
> +static int should_delegate(enum list_objects_filter_situation filter_situation,
> +			   struct object *obj,
> +			   struct filter_combine_data *d,
> +			   int side)
> +{
> +	if (!d->sub[side].is_skipping_tree)
> +		return 1;
> +	if (filter_situation == LOFS_END_TREE &&
> +		oideq(&obj->oid, &d->sub[side].skip_tree)) {
> +		d->sub[side].is_skipping_tree = 0;
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static enum list_objects_filter_result filter_combine(
> +	struct repository *r,
> +	enum list_objects_filter_situation filter_situation,
> +	struct object *obj,
> +	const char *pathname,
> +	const char *filename,
> +	struct filter_context *ctx)
> +{
> +	struct filter_combine_data *d = ctx->data;
> +	enum list_objects_filter_result result[2];
> +	enum list_objects_filter_result combined_result = LOFR_ZERO;
> +	int i;
> +
> +	for (i = 0; i < 2; i++) {

I suppose your lhs and rhs are in sub[0] and sub[1] in part for the sake
of this loop. But I think it would be easier to understand what is going
on if you were to perform the loop contents in a helper function (as the
name of the function would provide some more documentation).

> +		if (oidset_contains(&d->sub[i].seen, &obj->oid) ||
> +			!should_delegate(filter_situation, obj, d, i)) {
> +			result[i] = LOFR_ZERO;
> +			continue;
> +		}
> +
> +		result[i] = d->sub[i].ctx.filter_fn(
> +			r, filter_situation, obj, pathname, filename,
> +			&d->sub[i].ctx);
> +
> +		if (result[i] & LOFR_MARK_SEEN)
> +			oidset_insert(&d->sub[i].seen, &obj->oid);
> +
> +		if (result[i] & LOFR_SKIP_TREE) {
> +			d->sub[i].is_skipping_tree = 1;
> +			d->sub[i].skip_tree = obj->oid;
> +		}
> +	}
> +
> +	if ((result[0] & LOFR_DO_SHOW) && (result[1] & LOFR_DO_SHOW))
> +		combined_result |= LOFR_DO_SHOW;
> +	if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree)
> +		combined_result |= LOFR_SKIP_TREE;
> +
> +	return combined_result;
> +}

I see that you tested that >2 filters works okay. But by doing it the
way you have it seems like you're setting up to need recursion all over
the place to check against all the filters. I suppose I don't see the
benefit of doing all this recursively, as compared to doing it
iteratively.

 - Emily

Matthew DeVore May 29, 2019, 3:02 p.m. UTC | #4

On Tue, May 28, 2019 at 10:59:31AM -0700, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
> > In the RFC version, there was discussion [2] of the wire format
> > and the need to be backwards compatible with existing servers and
> > so use the "combine:" syntax so that we only have a single filter
> > line on the wire.  Would it be better to have compliant servers
> > advertise a "filters" (plural) capability in addition to the

This is a good idea and I hadn't considered it. It does seem to make the
repeated filter lines a safer bet than I though.

> > existing "filter" (singular) capability?  Then the client would
> > know that it could send a series of filter lines using the existing
> > syntax.  Likewise, if the "filters" capability was omitted, the
> > client could error out without the extra round-trip.
> 
> All good ideas.

After hacking the code halfway together to make the above idea work, and
learning quite a lot in the process, I saw set_git_option in transport.c and
realized that all existing transport options are assumed to be ? (0 or 1) rather
than * (0 or more). So "filter" would be the first transport option that is
repeated.

Even though multiple reviewers have weighed in supporting repeated filter lines,
I'm still conflicted about it. It seems the drawback to the + syntax is the
requirement for encoding the individual filters, but this encoding is no longer
required since the sparse:path=... filter no longer has to be supported. And the
URL encoding, if it is ever reintroduced, is just boilerplate and is unlikely to
change later or cause a significant maintainance burden.

The essence of the repeated filter line is that we need additional high-level
machinery just for the sake of making the lower-level machinery... marginally
simpler, hopefully? And if we ever need to add new filter combinations (like OR
or XOR rather than AND) this repeated filter line thing will be a legacy
annoyance (users will wonder why does repeated "filter" mean AND rather than
one of the other supported combination methods?). Repeating filter lines seems
like a leaky abstraction to me.

I would be helped if someone re-iterated why the repeated filter lines are a
good idea in light of the fact that URL escaping is no longer required to make
it work.

Jeff Hostetler May 29, 2019, 9:29 p.m. UTC | #5

On 5/29/2019 11:02 AM, Matthew DeVore wrote:
> On Tue, May 28, 2019 at 10:59:31AM -0700, Junio C Hamano wrote:
>> Jeff Hostetler <git@jeffhostetler.com> writes:
>>
>>> In the RFC version, there was discussion [2] of the wire format
>>> and the need to be backwards compatible with existing servers and
>>> so use the "combine:" syntax so that we only have a single filter
>>> line on the wire.  Would it be better to have compliant servers
>>> advertise a "filters" (plural) capability in addition to the
> 
> This is a good idea and I hadn't considered it. It does seem to make the
> repeated filter lines a safer bet than I though.
> 
>>> existing "filter" (singular) capability?  Then the client would
>>> know that it could send a series of filter lines using the existing
>>> syntax.  Likewise, if the "filters" capability was omitted, the
>>> client could error out without the extra round-trip.
>>
>> All good ideas.
> 
> After hacking the code halfway together to make the above idea work, and
> learning quite a lot in the process, I saw set_git_option in transport.c and
> realized that all existing transport options are assumed to be ? (0 or 1) rather
> than * (0 or more). So "filter" would be the first transport option that is
> repeated.
> 
> Even though multiple reviewers have weighed in supporting repeated filter lines,
> I'm still conflicted about it. It seems the drawback to the + syntax is the
> requirement for encoding the individual filters, but this encoding is no longer
> required since the sparse:path=... filter no longer has to be supported. And the
> URL encoding, if it is ever reintroduced, is just boilerplate and is unlikely to
> change later or cause a significant maintainance burden.

Was sparse:path filter the only reason for needing all the URL encoding?
The sparse:oid form allows values <ref>:<path> and these (or at least
the <path> portion) may contain special characters.  So don't we need to
URL encode this form too?


> 
> The essence of the repeated filter line is that we need additional high-level
> machinery just for the sake of making the lower-level machinery... marginally
> simpler, hopefully? And if we ever need to add new filter combinations (like OR
> or XOR rather than AND) this repeated filter line thing will be a legacy
> annoyance (users will wonder why does repeated "filter" mean AND rather than
> one of the other supported combination methods?). Repeating filter lines seems
> like a leaky abstraction to me.
> 
> I would be helped if someone re-iterated why the repeated filter lines are a
> good idea in light of the fact that URL escaping is no longer required to make
> it work.
>

Matthew DeVore May 29, 2019, 11:27 p.m. UTC | #6

On Wed, May 29, 2019 at 05:29:14PM -0400, Jeff Hostetler wrote:
> Was sparse:path filter the only reason for needing all the URL encoding?
> The sparse:oid form allows values <ref>:<path> and these (or at least
> the <path> portion) may contain special characters.  So don't we need to
> URL encode this form too?

Oh, I missed this. I was only thinking an oid was allowed after "sparse:". So as
I suspected I was overlooking something obvious.

Now I just want to understand the objection to URL encoding a little better. I
haven't worked with in a project that requires a lot of boilerplate before, so I
may be asking obvious things again. If so, sorry in advance.

So the objections, as I interpret them so far, are that:

 a the URL encoding/decoding complicates the code base
 b explaining the URL encoding, while it allows for future expansion, requires
   some verbose documentation in git-rev-list that is potentially distracting or
   confusing
 c there may be a better way to allow for future expansion that does not require
   URL encoding
 d the URL encoding is unpleasant to use (note that my patchset makes it
   optional for the user to use and it is only mandatory in sending it over the
   wire)

I think these are reasonable and I'm willing to stop digging my heels in :) Does
the above sum everything up?

Jeff Hostetler May 30, 2019, 2:01 p.m. UTC | #7

On 5/29/2019 7:27 PM, Matthew DeVore wrote:
> On Wed, May 29, 2019 at 05:29:14PM -0400, Jeff Hostetler wrote:
>> Was sparse:path filter the only reason for needing all the URL encoding?
>> The sparse:oid form allows values <ref>:<path> and these (or at least
>> the <path> portion) may contain special characters.  So don't we need to
>> URL encode this form too?
> 
> Oh, I missed this. I was only thinking an oid was allowed after "sparse:". So as
> I suspected I was overlooking something obvious.
> 
> Now I just want to understand the objection to URL encoding a little better. I
> haven't worked with in a project that requires a lot of boilerplate before, so I
> may be asking obvious things again. If so, sorry in advance.
> 
> So the objections, as I interpret them so far, are that:
> 
>   a the URL encoding/decoding complicates the code base
>   b explaining the URL encoding, while it allows for future expansion, requires
>     some verbose documentation in git-rev-list that is potentially distracting or
>     confusing
>   c there may be a better way to allow for future expansion that does not require
>     URL encoding
>   d the URL encoding is unpleasant to use (note that my patchset makes it
>     optional for the user to use and it is only mandatory in sending it over the
>     wire)
> 
> I think these are reasonable and I'm willing to stop digging my heels in :) Does
> the above sum everything up?
> 

My primary concern was how awkward it would be to use the URL
encoding syntax on the command line, but as you say, that can be
avoided by using the multiple --filter args.

And to be honest, the wire format is hidden from user view, so it
doesn't really matter there.  So either approach is fine.  I was
hoping that the "filters (plural)" approach would let us avoid URL
encoding, but that comes with its own baggage as you suggested.
And besides, URL encoding is well-understood.

And I don't want to prematurely complicate this with ANDs ORs and
XORs as you mention in another thread.

So don't let me stop this effort.

BTW, I don't think I've seen this mentioned anywhere and I don't
remember if this got into the code or not.  But we discussed having
a repo-local config setting to remember the filter-spec used by the
partial clone that would be inherited by a subsequent (partial) fetch.
Or would be set by the first partial fetch following a normal clone.
Having a single composite filter spec would help with this.

Jeff

Matthew DeVore May 31, 2019, 8:48 p.m. UTC | #8

On Tue, May 28, 2019 at 02:53:59PM -0700, Emily Shaffer wrote:
> > +	} else if (skip_prefix(arg, "combine:", &v0)) {
> > +		int sub_parse_res = parse_combine_filter(
> > +			filter_options, v0, errbuf);
> > +		if (sub_parse_res)
> > +			return sub_parse_res;
> > +		return 0;
> 
> Couldn't the three lines above be said more succinctly as "return
> sub_parse_res;"?

Oh yes, that's much better. Don't even need the sub_parse_res variable.

> > +static int digit_value(int c, struct strbuf *errbuf) {
> > +	if (c >= '0' && c <= '9')
> > +		return c - '0';
> > +	if (c >= 'a' && c <= 'f')
> > +		return c - 'a' + 10;
> > +	if (c >= 'A' && c <= 'F')
> > +		return c - 'A' + 10;
> 
> I'm sure there's something I'm missing here. But why are you manually
> decoding hex instead of using strtol or sscanf or something?
> 

I'll have to give this a try. Thank you for the suggestion.

> > +static int has_reserved_character(
> > +	struct strbuf *sub_spec, struct strbuf *errbuf)
> > +{
> > +	const char *c = sub_spec->buf;
> > +	while (*c) {
> > +		if (*c <= ' ' || strchr(RESERVED_NON_WS, *c))
> > +			goto found_reserved;
> > +		c++;
> > +	}
> > +
> > +	return 0;
> > +
> > +found_reserved:
> 
> What's the value of doing this in a goto instead of embedded in the
> while loop?
> 

That's to reduce indentation. Note that if I "inlined" the goto logic in the
while loop, I'd get at least 5 tabs of indentation, and the error message would
be split across a couple lines.

> > +
> > +	result = gently_parse_list_objects_filter(filter_options->lhs,
> > +						  sub_specs[0]->buf,
> > +						  errbuf) ||
> > +		parse_combine_filter(filter_options->rhs,
> > +				      sub_specs[1]->buf,
> > +				      errbuf);
> 
> I guess you're recursing to combine filter 2 onto filter 1 which has
> been combined onto filter 0 here. But why not just use a list or array?
> 

I switched this to use an array at your and Jeff's proddings, and it's much
better now. Thanks! It will be in the next roll-up.

> >  
> >  void list_objects_filter_release(
> >  	struct list_objects_filter_options *filter_options)
> >  {
> > +	if (!filter_options)
> > +		return;
> >  	free(filter_options->filter_spec);
> >  	free(filter_options->sparse_oid_value);
> >  	free(filter_options->sparse_path_value);
> > +	list_objects_filter_release(filter_options->lhs);
> > +	free(filter_options->lhs);
> > +	list_objects_filter_release(filter_options->rhs);
> > +	free(filter_options->rhs);
> 
> Is there a reason that the free shouldn't be included in
> list_objects_filter_release()? Maybe this is a common style guideline
> I've missed, but it seems to me like I'd expect a magic memory cleanup
> function to do it all, and not leave it to me to free.
> 

Because there are a couple times the list_objects_filter_options struct is
allocated on the stack or inline in some other struct. This is similar to how
strbuf and other such utility structs are used.

> Jeff H had a comment about this too, but this seems unwieldy for >2
> filters. (I also personally don't like using set index to incidate
> lhs/rhs.) Why not an array of multiple `struct sub`? There's a macro
> utility to generate types and helpers for an array of arbitrary struct
> that may suit...
> 

This code is now cleaner that it's using an array.

> > +static enum list_objects_filter_result filter_combine(
> > +	struct repository *r,
> > +	enum list_objects_filter_situation filter_situation,
> > +	struct object *obj,
> > +	const char *pathname,
> > +	const char *filename,
> > +	struct filter_context *ctx)
> > +{
> > +	struct filter_combine_data *d = ctx->data;
> > +	enum list_objects_filter_result result[2];
> > +	enum list_objects_filter_result combined_result = LOFR_ZERO;
> > +	int i;
> > +
> > +	for (i = 0; i < 2; i++) {
> 
> I suppose your lhs and rhs are in sub[0] and sub[1] in part for the sake
> of this loop. But I think it would be easier to understand what is going
> on if you were to perform the loop contents in a helper function (as the
> name of the function would provide some more documentation).
> 

Agreed, this is how it will be done in the next roll-up.

> I see that you tested that >2 filters works okay. But by doing it the
> way you have it seems like you're setting up to need recursion all over
> the place to check against all the filters. I suppose I don't see the
> benefit of doing all this recursively, as compared to doing it
> iteratively.

Somehow, the recursive appraoch made more sense to me when I was first writing
the code. But using an array is nicer.

Matthew DeVore May 31, 2019, 8:53 p.m. UTC | #9

On Thu, May 30, 2019 at 10:01:47AM -0400, Jeff Hostetler wrote:
> BTW, I don't think I've seen this mentioned anywhere and I don't
> remember if this got into the code or not.  But we discussed having
> a repo-local config setting to remember the filter-spec used by the
> partial clone that would be inherited by a subsequent (partial) fetch.
> Or would be set by the first partial fetch following a normal clone.
> Having a single composite filter spec would help with this.

Isn't that what the partial_clone_get_default_filter_spec function is for? I
forgot about that. Perhaps with Emily's suggestion to use parsing functions in
the C library and the other cleanups I've applied since the first roll-up, using
the URL encoding will seem nicer. Let me try that...

Jeff King May 31, 2019, 9:10 p.m. UTC | #10

On Fri, May 31, 2019 at 01:48:21PM -0700, Matthew DeVore wrote:

> > > +static int digit_value(int c, struct strbuf *errbuf) {
> > > +	if (c >= '0' && c <= '9')
> > > +		return c - '0';
> > > +	if (c >= 'a' && c <= 'f')
> > > +		return c - 'a' + 10;
> > > +	if (c >= 'A' && c <= 'F')
> > > +		return c - 'A' + 10;
> > 
> > I'm sure there's something I'm missing here. But why are you manually
> > decoding hex instead of using strtol or sscanf or something?
> > 
> 
> I'll have to give this a try. Thank you for the suggestion.

Try our hex_to_bytes() helper (or if you really want to go low-level,
your conditionals can be replaced by lookups in the hexval table).

-Peff

Matthew DeVore June 1, 2019, 12:11 a.m. UTC | #11

On Fri, May 24, 2019 at 05:01:15PM -0400, Jeff Hostetler wrote:
> We are allowing an unlimited number of filters in the composition.
> In the code, the compose filter data has space for a LHS and RHS, so
> I'm assuming we're mapping
> 
>     --filter=f1 --filter=f2 --filter=f3 --filter=f4
> or  --filter=combine:f1+f2+f3+f4
> into basically
>     (compose f1 (compose f2 (compose (f3 f4)))
> 
> I wonder if it would be easier to understand if we just built an array
> or linked list, but I'll read on.

As I mentioned in earlier messages, I have changed this to use an array. It's
nicer now.

(nit: the filters were left-associative rather than right-associative)

> Should we swap the order of the terms in the || so that we always
> clear the d->sub[i].is_skipping_tree on LOFS_END_TREE ?
> 

Done, and added a comment:

	/*
	 * Check should_delegate before oidset_contains so that
	 * is_skipping_tree gets unset even when the object is marked as seen.
	 * As of this writing, no filter uses LOFR_MARK_SEEN on trees that also
	 * uses LOFR_SKIP_TREE, so the ordering is only theoretically
	 * important. Be cautious if you change the order of the below checks
	 * and more filters have been added!
	 */

> 
> > +			result[i] = LOFR_ZERO;
> > +			continue;
> > +		}
> > +
> > +		result[i] = d->sub[i].ctx.filter_fn(
> > +			r, filter_situation, obj, pathname, filename,
> > +			&d->sub[i].ctx);
> > +
> > +		if (result[i] & LOFR_MARK_SEEN)
> > +			oidset_insert(&d->sub[i].seen, &obj->oid);
> 
> So filter[i] has said it never wants to show this object (hard omit).
> And the guard at the top of the loop will prevent future invocations
> from checking it again if the object is revisited.
> 

Yes.

> > +
> > +		if (result[i] & LOFR_SKIP_TREE) {
> > +			d->sub[i].is_skipping_tree = 1;
> > +			d->sub[i].skip_tree = obj->oid;
> 
> So this marks the tree object at the top of the skip as far as
> filter[i] is concerned.
> 

Yes.

> > +		}
> > +	}
> > +
> > +	if ((result[0] & LOFR_DO_SHOW) && (result[1] & LOFR_DO_SHOW))
> > +		combined_result |= LOFR_DO_SHOW;
> > +	if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree)
> > +		combined_result |= LOFR_SKIP_TREE;
> 
> Something about the above bothers me, but I can't quite say what
> it is.
> 

It looks nicer now that it's array-based. Let me know what you think after I
send the next roll-up.

> Do we need to do:
>     if ((result[0] & LOFR_MARK_SEEN) && (result[1] & LOFR_MARK_SEEN))
>         combined_result |= LOFR_MARK_SEEN;

This should be a O(1) sort of optimization, since if we don't set it, the top
filter will still be called, but won't delegate to any sub-filters. It doesn't
complicate the code much, so it seems worth it to add. Done.

> I'm out of time now, will pick this up again next week.

Thank you for taking a look and for your patience so far.

Matthew DeVore June 1, 2019, 12:12 a.m. UTC | #12

On Fri, May 31, 2019 at 05:10:42PM -0400, Jeff King wrote:
> On Fri, May 31, 2019 at 01:48:21PM -0700, Matthew DeVore wrote:
> 
> > > > +static int digit_value(int c, struct strbuf *errbuf) {
> > > > +	if (c >= '0' && c <= '9')
> > > > +		return c - '0';
> > > > +	if (c >= 'a' && c <= 'f')
> > > > +		return c - 'a' + 10;
> > > > +	if (c >= 'A' && c <= 'F')
> > > > +		return c - 'A' + 10;
> > > 
> > > I'm sure there's something I'm missing here. But why are you manually
> > > decoding hex instead of using strtol or sscanf or something?
> > > 
> > 
> > I'll have to give this a try. Thank you for the suggestion.
> 
> Try our hex_to_bytes() helper (or if you really want to go low-level,
> your conditionals can be replaced by lookups in the hexval table).
> 
> -Peff

Using hex_to_bytes worked out quite nicely, thanks!

Jeff King June 3, 2019, 12:34 p.m. UTC | #13

On Fri, May 31, 2019 at 05:12:31PM -0700, Matthew DeVore wrote:

> On Fri, May 31, 2019 at 05:10:42PM -0400, Jeff King wrote:
> > On Fri, May 31, 2019 at 01:48:21PM -0700, Matthew DeVore wrote:
> > 
> > > > > +static int digit_value(int c, struct strbuf *errbuf) {
> > > > > +	if (c >= '0' && c <= '9')
> > > > > +		return c - '0';
> > > > > +	if (c >= 'a' && c <= 'f')
> > > > > +		return c - 'a' + 10;
> > > > > +	if (c >= 'A' && c <= 'F')
> > > > > +		return c - 'A' + 10;
> > > > 
> > > > I'm sure there's something I'm missing here. But why are you manually
> > > > decoding hex instead of using strtol or sscanf or something?
> > > > 
> > > 
> > > I'll have to give this a try. Thank you for the suggestion.
> > 
> > Try our hex_to_bytes() helper (or if you really want to go low-level,
> > your conditionals can be replaced by lookups in the hexval table).
> 
> Using hex_to_bytes worked out quite nicely, thanks!

Great. We might want to stop there, but it's possible could reuse even
more code. I didn't look closely before, but it seems this code is
decoding a URL. We already have a url_decode() routine in url.c. Could
it be reused?

-Peff

Jeff Hostetler June 3, 2019, 9:04 p.m. UTC | #14

On 5/31/2019 4:53 PM, Matthew DeVore wrote:
> On Thu, May 30, 2019 at 10:01:47AM -0400, Jeff Hostetler wrote:
>> BTW, I don't think I've seen this mentioned anywhere and I don't
>> remember if this got into the code or not.  But we discussed having
>> a repo-local config setting to remember the filter-spec used by the
>> partial clone that would be inherited by a subsequent (partial) fetch.
>> Or would be set by the first partial fetch following a normal clone.
>> Having a single composite filter spec would help with this.
> 
> Isn't that what the partial_clone_get_default_filter_spec function is for? I
> forgot about that. Perhaps with Emily's suggestion to use parsing functions in
> the C library and the other cleanups I've applied since the first roll-up, using
> the URL encoding will seem nicer. Let me try that...
> 

Yes, thanks.  That's what I was thinking about.

Jeff

Matthew DeVore June 3, 2019, 10:22 p.m. UTC | #15

On Mon, Jun 03, 2019 at 08:34:35AM -0400, Jeff King wrote:
> Great. We might want to stop there, but it's possible could reuse even
> more code. I didn't look closely before, but it seems this code is
> decoding a URL. We already have a url_decode() routine in url.c. Could
> it be reused?

Very nice. Here is an interdiff and the changes will be included in v3 of my
patchset:

diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index ed02c88eb6..0f135602a7 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -1,19 +1,20 @@
 #include "cache.h"
 #include "commit.h"
 #include "config.h"
 #include "revision.h"
 #include "argv-array.h"
 #include "list-objects.h"
 #include "list-objects-filter.h"
 #include "list-objects-filter-options.h"
 #include "trace.h"
+#include "url.h"
 
 static int parse_combine_filter(
 	struct list_objects_filter_options *filter_options,
 	const char *arg,
 	struct strbuf *errbuf);
 
 /*
  * Parse value of the argument to the "filter" keyword.
  * On the command line this looks like:
  *       --filter=<arg>
@@ -84,54 +85,20 @@ static int gently_parse_list_objects_filter(
 	 * Please update _git_fetch() in git-completion.bash when you
 	 * add new filters
 	 */
 
 	strbuf_addf(errbuf, "invalid filter-spec '%s'", arg);
 
 	memset(filter_options, 0, sizeof(*filter_options));
 	return 1;
 }
 
-static int url_decode(struct strbuf *s, struct strbuf *errbuf)
-{
-	char *dest = s->buf;
-	char *src = s->buf;
-	size_t new_len;
-
-	while (*src) {
-		if (src[0] != '%') {
-			*dest++ = *src++;
-			continue;
-		}
-
-		if (hex_to_bytes((unsigned char *)dest, src + 1, 1)) {
-			strbuf_addstr(errbuf,
-				      "error in filter-spec - "
-				      "invalid hex sequence after %");
-			return 1;
-		}
-
-		if (!*dest) {
-			strbuf_addstr(errbuf,
-				      "error in filter-spec - unexpected %00");
-			return 1;
-		}
-
-		src += 3;
-		dest++;
-	}
-	new_len = dest - s->buf;
-	strbuf_remove(s, new_len, s->len - new_len);
-
-	return 0;
-}
-
 static const char *RESERVED_NON_WS = "~`!@#$^&*()[]{}\\;'\",<>?";
 
 static int has_reserved_character(
 	struct strbuf *sub_spec, struct strbuf *errbuf)
 {
 	const char *c = sub_spec->buf;
 	while (*c) {
 		if (*c <= ' ' || strchr(RESERVED_NON_WS, *c)) {
 			strbuf_addf(errbuf,
 				    "must escape char in sub-filter-spec: '%c'",
@@ -147,56 +114,57 @@ static int has_reserved_character(
 static int parse_combine_subfilter(
 	struct list_objects_filter_options *filter_options,
 	struct strbuf *subspec,
 	struct strbuf *errbuf)
 {
 	size_t new_index = filter_options->sub_nr;
 
 	ALLOC_GROW_BY(filter_options->sub, filter_options->sub_nr, 1,
 		      filter_options->sub_alloc);
 
-	return has_reserved_character(subspec, errbuf) ||
-		url_decode(subspec, errbuf) ||
-		gently_parse_list_objects_filter(
-			&filter_options->sub[new_index], subspec->buf, errbuf);
+	decoded = url_percent_decode(subspec->buf);
+
+	result = gently_parse_list_objects_filter(
+		&filter_options->sub[new_index], decoded, errbuf);
+
+	free(decoded);
+	return result;
 }
 
 static int parse_combine_filter(
 	struct list_objects_filter_options *filter_options,
 	const char *arg,
 	struct strbuf *errbuf)
 {
 	struct strbuf **subspecs = strbuf_split_str(arg, '+', 0);
 	size_t sub;
-	int result;
+	int result = 0;
 
 	if (!subspecs[0]) {
 		strbuf_addf(errbuf,
 			    _("expected something after combine:"));
 		result = 1;
 		goto cleanup;
 	}
 
-	for (sub = 0; subspecs[sub]; sub++) {
+	for (sub = 0; subspecs[sub] && !result; sub++) {
 		if (subspecs[sub + 1]) {
 			/*
 			 * This is not the last subspec. Remove trailing "+" so
 			 * we can parse it.
 			 */
 			size_t last = subspecs[sub]->len - 1;
 			assert(subspecs[sub]->buf[last] == '+');
 			strbuf_remove(subspecs[sub], last, 1);
 		}
 		result = parse_combine_subfilter(
 			filter_options, subspecs[sub], errbuf);
-		if (result)
-			goto cleanup;
 	}
 
 	filter_options->choice = LOFC_COMBINE;
 
 cleanup:
 	strbuf_list_free(subspecs);
 	if (result) {
 		list_objects_filter_release(filter_options);
 		memset(filter_options, 0, sizeof(*filter_options));
 	}
diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
index 7fb5e50cde..e1bf3ed038 100755
--- a/t/t6112-rev-list-filters-objects.sh
+++ b/t/t6112-rev-list-filters-objects.sh
@@ -405,32 +405,20 @@ test_expect_success 'combine:... while URL-encoding things that should not be' '
 
 test_expect_success 'combine: with nothing after the :' '
 	expect_invalid_filter_spec combine: "expected something after combine:"
 '
 
 test_expect_success 'parse error in first sub-filter in combine:' '
 	expect_invalid_filter_spec combine:tree:asdf+blob:none \
 		"expected .tree:<depth>."
 '
 
-test_expect_success 'combine:... with invalid URL-encoded sequences' '
-	# Not enough hex chars
-	expect_invalid_filter_spec combine:tree:2+blob:non%a \
-		"error in filter-spec - invalid hex sequence after %" &&
-	# Non-hex digit after %
-	expect_invalid_filter_spec combine:tree:2+blob%G5none \
-		"error in filter-spec - invalid hex sequence after %" &&
-	# Null byte encoded by %
-	expect_invalid_filter_spec combine:tree:2+blob%00none \
-		"error in filter-spec - unexpected %00"
-'
-
 test_expect_success 'combine:... with non-encoded reserved chars' '
 	expect_invalid_filter_spec combine:tree:2+sparse:@xyz \
 		"must escape char in sub-filter-spec: .@." &&
 	expect_invalid_filter_spec combine:tree:2+sparse:\` \
 		"must escape char in sub-filter-spec: .\`." &&
 	expect_invalid_filter_spec combine:tree:2+sparse:~abc \
 		"must escape char in sub-filter-spec: .\~."
 '
 
 test_expect_success 'validate err msg for "combine:<valid-filter>+"' '
diff --git a/url.c b/url.c
index 25576c390b..bdede647bc 100644
--- a/url.c
+++ b/url.c
@@ -79,20 +79,26 @@ char *url_decode_mem(const char *url, int len)
 
 	/* Skip protocol part if present */
 	if (colon && url < colon) {
 		strbuf_add(&out, url, colon - url);
 		len -= colon - url;
 		url = colon;
 	}
 	return url_decode_internal(&url, len, NULL, &out, 0);
 }
 
+char *url_percent_decode(const char *encoded)
+{
+	struct strbuf out = STRBUF_INIT;
+	return url_decode_internal(&encoded, strlen(encoded), NULL, &out, 0);
+}
+
 char *url_decode_parameter_name(const char **query)
 {
 	struct strbuf out = STRBUF_INIT;
 	return url_decode_internal(query, -1, "&=", &out, 1);
 }
 
 char *url_decode_parameter_value(const char **query)
 {
 	struct strbuf out = STRBUF_INIT;
 	return url_decode_internal(query, -1, "&", &out, 1);
diff --git a/url.h b/url.h
index 00b7d58c33..2a27c34277 100644
--- a/url.h
+++ b/url.h
@@ -1,16 +1,24 @@
 #ifndef URL_H
 #define URL_H
 
 struct strbuf;
 
 int is_url(const char *url);
 int is_urlschemechar(int first_flag, int ch);
 char *url_decode(const char *url);
 char *url_decode_mem(const char *url, int len);
+
+/*
+ * Similar to the url_decode_{,mem} methods above, but doesn't assume there
+ * is a scheme followed by a : at the start of the string. Instead, %-sequences
+ * before any : are also parsed.
+ */
+char *url_percent_decode(const char *encoded);
+
 char *url_decode_parameter_name(const char **query);
 char *url_decode_parameter_value(const char **query);
 
 void end_url_with_slash(struct strbuf *buf, const char *url);
 void str_end_url_with_slash(const char *url, char **dest);
 
 #endif /* URL_H */

Jeff King June 4, 2019, 4:13 p.m. UTC | #16

On Mon, Jun 03, 2019 at 03:22:47PM -0700, Matthew DeVore wrote:

> On Mon, Jun 03, 2019 at 08:34:35AM -0400, Jeff King wrote:
> > Great. We might want to stop there, but it's possible could reuse even
> > more code. I didn't look closely before, but it seems this code is
> > decoding a URL. We already have a url_decode() routine in url.c. Could
> > it be reused?
> 
> Very nice. Here is an interdiff and the changes will be included in v3 of my
> patchset:

Nice to see a reduction in duplication (and I see you found some
problems in the existing code elsewhere; thanks for cleaning that up).

> -	return has_reserved_character(subspec, errbuf) ||
> -		url_decode(subspec, errbuf) ||
> -		gently_parse_list_objects_filter(
> -			&filter_options->sub[new_index], subspec->buf, errbuf);
> +	decoded = url_percent_decode(subspec->buf);

I think you can get rid of has_reserved_character() now, too.

The reserved character list is still used on the encoding side. But I
think you could switch to strbuf_add_urlencode() there?

-Peff

Matthew DeVore June 4, 2019, 5:19 p.m. UTC | #17

On Tue, Jun 04, 2019 at 12:13:32PM -0400, Jeff King wrote:
> > -	return has_reserved_character(subspec, errbuf) ||
> > -		url_decode(subspec, errbuf) ||
> > -		gently_parse_list_objects_filter(
> > -			&filter_options->sub[new_index], subspec->buf, errbuf);
> > +	decoded = url_percent_decode(subspec->buf);
> 
> I think you can get rid of has_reserved_character() now, too.

The purpose of has_reserved_character is to allow for future
extensibility if someone decides to implement a more sophisticated DSL
and give meaning to these characters. That may be a long-shot, but it
seems worth it.

> The reserved character list is still used on the encoding side. But I
> think you could switch to strbuf_add_urlencode() there?

strbuf_addstr_urlencode will either escape or not escape all rfc3986
reserved characters, and that set includes both : and +. The former
should not require escaping since it's a common character in filter
specs, and I would like the hand-encoded combine specs to be relatively
easy to type and read. The + must be escaped since it is used as part of
the combine:... syntax to delimit sub filters. So
strbuf_addstr_url_encode would have to be more customizable to make it
work for this context. I'd like to add a parameterizable should_escape
predicate (iow function pointer) which strbuf_addstr_urlencode accepts.
I actually think this will be more readable than the current strbuf API.

Jeff King June 4, 2019, 6:51 p.m. UTC | #18

On Tue, Jun 04, 2019 at 10:19:52AM -0700, Matthew DeVore wrote:

> On Tue, Jun 04, 2019 at 12:13:32PM -0400, Jeff King wrote:
> > > -	return has_reserved_character(subspec, errbuf) ||
> > > -		url_decode(subspec, errbuf) ||
> > > -		gently_parse_list_objects_filter(
> > > -			&filter_options->sub[new_index], subspec->buf, errbuf);
> > > +	decoded = url_percent_decode(subspec->buf);
> > 
> > I think you can get rid of has_reserved_character() now, too.
> 
> The purpose of has_reserved_character is to allow for future
> extensibility if someone decides to implement a more sophisticated DSL
> and give meaning to these characters. That may be a long-shot, but it
> seems worth it.

I think you'll find that -Wunused-function complains, though, if nobody
is calling it. I wasn't sure if what you showed in the interdiff was
meant to be final (I had to add a few other variable declarations to
make it compile, too).

> > The reserved character list is still used on the encoding side. But I
> > think you could switch to strbuf_add_urlencode() there?
> 
> strbuf_addstr_urlencode will either escape or not escape all rfc3986
> reserved characters, and that set includes both : and +. The former
> should not require escaping since it's a common character in filter
> specs, and I would like the hand-encoded combine specs to be relatively
> easy to type and read. The + must be escaped since it is used as part of
> the combine:... syntax to delimit sub filters. So
> strbuf_addstr_url_encode would have to be more customizable to make it
> work for this context. I'd like to add a parameterizable should_escape
> predicate (iow function pointer) which strbuf_addstr_urlencode accepts.
> I actually think this will be more readable than the current strbuf API.

That makes some sense, and I agree that readability is a good goal. Do
we not need to be escaping colons in other URLs? Or are the strings
you're generating not true by-the-book URLs? I'm just wondering if we
could take this opportunity to improve the URLs we output elsewhere,
too.

-Peff

Matthew DeVore June 4, 2019, 10:59 p.m. UTC | #19

On Tue, Jun 04, 2019 at 02:51:08PM -0400, Jeff King wrote:
> > The purpose of has_reserved_character is to allow for future
> > extensibility if someone decides to implement a more sophisticated DSL
> > and give meaning to these characters. That may be a long-shot, but it
> > seems worth it.
> 
> I think you'll find that -Wunused-function complains, though, if nobody
> is calling it. I wasn't sure if what you showed in the interdiff was
> meant to be final (I had to add a few other variable declarations to
> make it compile, too).

Sorry, my last interdiff was a mess because I made a mistake during git rebase
-i. It was missing a call to has_reserved_char. Below is another diff that
fixes the problems:

diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index 0f135602a7..6b206dc58b 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -110,28 +110,31 @@ static int has_reserved_character(
 
 	return 0;
 }
 
 static int parse_combine_subfilter(
 	struct list_objects_filter_options *filter_options,
 	struct strbuf *subspec,
 	struct strbuf *errbuf)
 {
 	size_t new_index = filter_options->sub_nr;
+	char *decoded;
+	int result;
 
 	ALLOC_GROW_BY(filter_options->sub, filter_options->sub_nr, 1,
 		      filter_options->sub_alloc);
 
 	decoded = url_percent_decode(subspec->buf);
 
-	result = gently_parse_list_objects_filter(
-		&filter_options->sub[new_index], decoded, errbuf);
+	result = has_reserved_character(subspec, errbuf) ||
+		gently_parse_list_objects_filter(
+			&filter_options->sub[new_index], decoded, errbuf);
 
 	free(decoded);
 	return result;
 }
 
 static int parse_combine_filter(
 	struct list_objects_filter_options *filter_options,
 	const char *arg,
 	struct strbuf *errbuf)
 {

> > strbuf_addstr_urlencode will either escape or not escape all rfc3986
> > reserved characters, and that set includes both : and +. The former
> > should not require escaping since it's a common character in filter
> > specs, and I would like the hand-encoded combine specs to be relatively
> > easy to type and read. The + must be escaped since it is used as part of
> > the combine:... syntax to delimit sub filters. So
> > strbuf_addstr_url_encode would have to be more customizable to make it
> > work for this context. I'd like to add a parameterizable should_escape
> > predicate (iow function pointer) which strbuf_addstr_urlencode accepts.
> > I actually think this will be more readable than the current strbuf API.
> 
> That makes some sense, and I agree that readability is a good goal. Do
> we not need to be escaping colons in other URLs? Or are the strings
> you're generating not true by-the-book URLs? I'm just wondering if we
> could take this opportunity to improve the URLs we output elsewhere,
> too.

The strings I'm generating are not URLs. Also, in http.c, we have to use : to
delimit a username and password:

	strbuf_addstr_urlencode(&s, proxy_auth.username, 1);
	strbuf_addch(&s, ':');
	strbuf_addstr_urlencode(&s, proxy_auth.password, 1);

I think this is dictated by libcurl and is not flexible.

Jeff King June 4, 2019, 11:14 p.m. UTC | #20

On Tue, Jun 04, 2019 at 03:59:21PM -0700, Matthew DeVore wrote:

> > I think you'll find that -Wunused-function complains, though, if nobody
> > is calling it. I wasn't sure if what you showed in the interdiff was
> > meant to be final (I had to add a few other variable declarations to
> > make it compile, too).
> 
> Sorry, my last interdiff was a mess because I made a mistake during git rebase
> -i. It was missing a call to has_reserved_char. Below is another diff that
> fixes the problems:

Ah, OK, that makes sense then (and keeping the function is obviously the
right thing to do).

> > That makes some sense, and I agree that readability is a good goal. Do
> > we not need to be escaping colons in other URLs? Or are the strings
> > you're generating not true by-the-book URLs? I'm just wondering if we
> > could take this opportunity to improve the URLs we output elsewhere,
> > too.
> 
> The strings I'm generating are not URLs. Also, in http.c, we have to use : to
> delimit a username and password:
> 
> 	strbuf_addstr_urlencode(&s, proxy_auth.username, 1);
> 	strbuf_addch(&s, ':');
> 	strbuf_addstr_urlencode(&s, proxy_auth.password, 1);
> 
> I think this is dictated by libcurl and is not flexible.

Right, that has to be a real colon because it's syntactically
significant (but a colon in the username _must_ be encoded). That strbuf
function doesn't really understand whole URLs, and it's up to the caller
to assemble the parts.

Anyway, we've veered off of your patch series enough. Yeah, it sounds
like using strbuf's url-encoding is not quite what you want.

-Peff

Matthew DeVore June 4, 2019, 11:49 p.m. UTC | #21

On Tue, Jun 04, 2019 at 07:14:18PM -0400, Jeff King wrote:
> Right, that has to be a real colon because it's syntactically
> significant (but a colon in the username _must_ be encoded). That strbuf
> function doesn't really understand whole URLs, and it's up to the caller
> to assemble the parts.
> 
> Anyway, we've veered off of your patch series enough. Yeah, it sounds
> like using strbuf's url-encoding is not quite what you want.

I tried to do it anyway :) I think this makes the strbuf API a bit easier to
reason about, and strbuf.h is a bit more self-documenting. WDYT?

diff --git a/credential-store.c b/credential-store.c
index ac295420dd..c010497cb2 100644
--- a/credential-store.c
+++ b/credential-store.c
@@ -65,29 +65,30 @@ static void rewrite_credential_file(const char *fn, struct credential *c,
 	parse_credential_file(fn, c, NULL, print_line);
 	if (commit_lock_file(&credential_lock) < 0)
 		die_errno("unable to write credential store");
 }
 
 static void store_credential_file(const char *fn, struct credential *c)
 {
 	struct strbuf buf = STRBUF_INIT;
 
 	strbuf_addf(&buf, "%s://", c->protocol);
-	strbuf_addstr_urlencode(&buf, c->username, 1);
+	strbuf_addstr_urlencode(&buf, c->username, is_rfc3986_unreserved);
 	strbuf_addch(&buf, ':');
-	strbuf_addstr_urlencode(&buf, c->password, 1);
+	strbuf_addstr_urlencode(&buf, c->password, is_rfc3986_unreserved);
 	strbuf_addch(&buf, '@');
 	if (c->host)
-		strbuf_addstr_urlencode(&buf, c->host, 1);
+		strbuf_addstr_urlencode(&buf, c->host, is_rfc3986_unreserved);
 	if (c->path) {
 		strbuf_addch(&buf, '/');
-		strbuf_addstr_urlencode(&buf, c->path, 0);
+		strbuf_addstr_urlencode(&buf, c->path,
+					is_rfc3986_reserved_or_unreserved);
 	}
 
 	rewrite_credential_file(fn, c, &buf);
 	strbuf_release(&buf);
 }
 
 static void store_credential(const struct string_list *fns, struct credential *c)
 {
 	struct string_list_item *fn;
 
diff --git a/http.c b/http.c
index 27aa0a3192..938b9e55af 100644
--- a/http.c
+++ b/http.c
@@ -506,23 +506,25 @@ static void var_override(const char **var, char *value)
 static void set_proxyauth_name_password(CURL *result)
 {
 #if LIBCURL_VERSION_NUM >= 0x071301
 		curl_easy_setopt(result, CURLOPT_PROXYUSERNAME,
 			proxy_auth.username);
 		curl_easy_setopt(result, CURLOPT_PROXYPASSWORD,
 			proxy_auth.password);
 #else
 		struct strbuf s = STRBUF_INIT;
 
-		strbuf_addstr_urlencode(&s, proxy_auth.username, 1);
+		strbuf_addstr_urlencode(&s, proxy_auth.username,
+					is_rfc3986_unreserved);
 		strbuf_addch(&s, ':');
-		strbuf_addstr_urlencode(&s, proxy_auth.password, 1);
+		strbuf_addstr_urlencode(&s, proxy_auth.password,
+					is_rfc3986_unreserved);
 		curl_proxyuserpwd = strbuf_detach(&s, NULL);
 		curl_easy_setopt(result, CURLOPT_PROXYUSERPWD, curl_proxyuserpwd);
 #endif
 }
 
 static void init_curl_proxy_auth(CURL *result)
 {
 	if (proxy_auth.username) {
 		if (!proxy_auth.password)
 			credential_fill(&proxy_auth);
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index 6b206dc58b..9a5677c2c8 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -167,30 +167,25 @@ static int parse_combine_filter(
 
 cleanup:
 	strbuf_list_free(subspecs);
 	if (result) {
 		list_objects_filter_release(filter_options);
 		memset(filter_options, 0, sizeof(*filter_options));
 	}
 	return result;
 }
 
-static void add_url_encoded(struct strbuf *dest, const char *s)
+static int allow_unencoded(char ch)
 {
-	while (*s) {
-		if (*s <= ' ' || strchr(RESERVED_NON_WS, *s) ||
-			*s == '%' || *s == '+')
-			strbuf_addf(dest, "%%%02X", (int)*s);
-		else
-			strbuf_addf(dest, "%c", *s);
-		s++;
-	}
+	if (ch <= ' ' || ch == '%' || ch == '+')
+		return 0;
+	return !strchr(RESERVED_NON_WS, ch);
 }
 
 /*
  * Changes filter_options into an equivalent LOFC_COMBINE filter options
  * instance. Does not do anything if filter_options is already LOFC_COMBINE.
  */
 static void transform_to_combine_type(
 	struct list_objects_filter_options *filter_options)
 {
 	assert(filter_options->choice);
@@ -202,22 +197,23 @@ static void transform_to_combine_type(
 			xcalloc(initial_sub_alloc, sizeof(*sub_array));
 		sub_array[0] = *filter_options;
 		memset(filter_options, 0, sizeof(*filter_options));
 		filter_options->sub = sub_array;
 		filter_options->sub_alloc = initial_sub_alloc;
 	}
 	filter_options->sub_nr = 1;
 	filter_options->choice = LOFC_COMBINE;
 	strbuf_init(&filter_options->filter_spec, 0);
 	strbuf_addstr(&filter_options->filter_spec, "combine:");
-	add_url_encoded(&filter_options->filter_spec,
-			filter_options->sub[0].filter_spec.buf);
+	strbuf_addstr_urlencode(&filter_options->filter_spec,
+				filter_options->sub[0].filter_spec.buf,
+				allow_unencoded);
 	/*
 	 * We don't need the filter_spec strings for subfilter specs, only the
 	 * top level.
 	 */
 	strbuf_release(&filter_options->sub[0].filter_spec);
 }
 
 void list_objects_filter_die_if_populated(
 	struct list_objects_filter_options *filter_options)
 {
@@ -239,21 +235,22 @@ void parse_list_objects_filter(
 		parse_error = gently_parse_list_objects_filter(
 			filter_options, arg, &errbuf);
 	} else {
 		/*
 		 * Make filter_options an LOFC_COMBINE spec so we can trivially
 		 * add subspecs to it.
 		 */
 		transform_to_combine_type(filter_options);
 
 		strbuf_addstr(&filter_options->filter_spec, "+");
-		add_url_encoded(&filter_options->filter_spec, arg);
+		strbuf_addstr_urlencode(&filter_options->filter_spec, arg,
+					allow_unencoded);
 		trace_printf("Generated composite filter-spec: %s\n",
 			     filter_options->filter_spec.buf);
 		ALLOC_GROW_BY(filter_options->sub, filter_options->sub_nr, 1,
 			      filter_options->sub_alloc);
 
 		parse_error = gently_parse_list_objects_filter(
 			&filter_options->sub[filter_options->sub_nr - 1], arg,
 			&errbuf);
 	}
 	if (parse_error)
diff --git a/strbuf.c b/strbuf.c
index 0e18b259ce..60ab5144f2 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -767,55 +767,56 @@ void strbuf_addstr_xml_quoted(struct strbuf *buf, const char *s)
 		case '&':
 			strbuf_addstr(buf, "&amp;");
 			break;
 		case 0:
 			return;
 		}
 		s++;
 	}
 }
 
-static int is_rfc3986_reserved(char ch)
+int is_rfc3986_reserved_or_unreserved(char ch)
 {
+	if (is_rfc3986_unreserved(ch))
+		return 1;
 	switch (ch) {
 		case '!': case '*': case '\'': case '(': case ')': case ';':
 		case ':': case '@': case '&': case '=': case '+': case '$':
 		case ',': case '/': case '?': case '#': case '[': case ']':
 			return 1;
 	}
 	return 0;
 }
 
-static int is_rfc3986_unreserved(char ch)
+int is_rfc3986_unreserved(char ch)
 {
 	return isalnum(ch) ||
 		ch == '-' || ch == '_' || ch == '.' || ch == '~';
 }
 
 static void strbuf_add_urlencode(struct strbuf *sb, const char *s, size_t len,
-				 int reserved)
+				 char_predicate allow_unencoded_fn)
 {
 	strbuf_grow(sb, len);
 	while (len--) {
 		char ch = *s++;
-		if (is_rfc3986_unreserved(ch) ||
-		    (!reserved && is_rfc3986_reserved(ch)))
+		if (allow_unencoded_fn(ch))
 			strbuf_addch(sb, ch);
 		else
 			strbuf_addf(sb, "%%%02x", (unsigned char)ch);
 	}
 }
 
 void strbuf_addstr_urlencode(struct strbuf *sb, const char *s,
-			     int reserved)
+			     char_predicate allow_unencoded_fn)
 {
-	strbuf_add_urlencode(sb, s, strlen(s), reserved);
+	strbuf_add_urlencode(sb, s, strlen(s), allow_unencoded_fn);
 }
 
 void strbuf_humanise_bytes(struct strbuf *buf, off_t bytes)
 {
 	if (bytes > 1 << 30) {
 		strbuf_addf(buf, "%u.%2.2u GiB",
 			    (unsigned)(bytes >> 30),
 			    (unsigned)(bytes & ((1 << 30) - 1)) / 10737419);
 	} else if (bytes > 1 << 20) {
 		unsigned x = bytes + 5243;  /* for rounding */
diff --git a/strbuf.h b/strbuf.h
index c8d98dfb95..346d722492 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -659,22 +659,27 @@ void strbuf_branchname(struct strbuf *sb, const char *name,
 		       unsigned allowed);
 
 /*
  * Like strbuf_branchname() above, but confirm that the result is
  * syntactically valid to be used as a local branch name in refs/heads/.
  *
  * The return value is "0" if the result is valid, and "-1" otherwise.
  */
 int strbuf_check_branch_ref(struct strbuf *sb, const char *name);
 
+typedef int (*char_predicate)(char ch);
+
+int is_rfc3986_unreserved(char ch);
+int is_rfc3986_reserved_or_unreserved(char ch);
+
 void strbuf_addstr_urlencode(struct strbuf *sb, const char *name,
-			     int reserved);
+			     char_predicate allow_unencoded_fn);
 
 __attribute__((format (printf,1,2)))
 int printf_ln(const char *fmt, ...);
 __attribute__((format (printf,2,3)))
 int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
 char *xstrdup_toupper(const char *);
 
 /**

Jeff King June 9, 2019, 12:36 p.m. UTC | #22

On Tue, Jun 04, 2019 at 04:49:51PM -0700, Matthew DeVore wrote:

> I tried to do it anyway :) I think this makes the strbuf API a bit easier to
> reason about, and strbuf.h is a bit more self-documenting. WDYT?
>
> [...]
>
> +typedef int (*char_predicate)(char ch);
> +
> +int is_rfc3986_unreserved(char ch);
> +int is_rfc3986_reserved_or_unreserved(char ch);
> +
>  void strbuf_addstr_urlencode(struct strbuf *sb, const char *name,
> -			     int reserved);
> +			     char_predicate allow_unencoded_fn);

Yeah, that seems reasonable. I worry slightly about adding function-call
overhead to something that's processing a string character-by-character,
but these strings tend to be short and infrequent.

-Peff

[v1,3/5] list-objects-filter: implement composite filters

Commit Message

Comments

Patch