diff mbox series

[v4,05/10] userdiff: add and use for_each_userdiff_driver()

Message ID patch-05.11-64ea5e8443f-20210324T014604Z-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series userdiff: refactor + test improvements | expand

Commit Message

Ævar Arnfjörð Bjarmason March 24, 2021, 1:48 a.m. UTC
Refactor the userdiff_find_by_namelen() function so that a new
for_each_userdiff_driver() API function does most of the work.

This will be useful for the same reason we've got other for_each_*()
API functions as part of various APIs, and will be used in a follow-up
commit.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 userdiff.c | 61 +++++++++++++++++++++++++++++++++++++++++++-----------
 userdiff.h | 15 ++++++++++++++
 2 files changed, 64 insertions(+), 12 deletions(-)

Comments

Junio C Hamano March 24, 2021, 4:50 a.m. UTC | #1
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> +struct for_each_userdiff_driver_cb {
> +	const char *k;
> +	size_t len;
> +	struct userdiff_driver *driver;
> +};

Makes me wonder if we want to rename s/k/name/;

> +static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
> +				       enum userdiff_driver_type type, void *priv)
>  {
> +	struct for_each_userdiff_driver_cb *cb_data = priv;
> +
> +	if (!strncmp(driver->name, cb_data->k, cb_data->len) &&
> +	    !driver->name[cb_data->len]) {
> +		cb_data->driver = driver;
> +		return -1; /* found it! */
>  	}
> -	return NULL;
> +	return 0;
> +}

Makes sense.

> +static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len)
> +{
> +	struct for_each_userdiff_driver_cb udcbdata = { .k = k, .len = len, .driver = NULL };

No need to explicitly spell the zero initialization.  Wrapping it
like this:

	struct for_each_userdiff_driver_cb udcbdata = {
		.k = k, .len = len
	};

would avoid the overlong line.

> +
> +	for_each_userdiff_driver(userdiff_find_by_namelen_cb,
> +				 USERDIFF_DRIVER_TYPE_UNSPECIFIED, &udcbdata);
> +	return udcbdata.driver;


> @@ -373,3 +385,28 @@ struct userdiff_driver *userdiff_get_textconv(struct repository *r,
>  
>  	return driver;
>  }
> +
> +int for_each_userdiff_driver(each_userdiff_driver_fn fn,
> +			     enum userdiff_driver_type type, void *cb_data)
> +{
> +	int i, ret;
> +	if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_CUSTOM)) {

I presume that the concrete ones are bitmask (i.e. BUILTIN occupies
bit #0 while CUSTOM occupies bit #1, or something like that).  Then

    #define USERDIFF_DRIVER_TYPE_UNSPECIFIED (-1)

would make this (and the other) condition far easier to read, i.e.

	if (type & USERDIFF_DRIVER_TYPE_CUSTOM) {
		... if the caller wants to iterate over "custom" drivers
		... do these things
Jeff King March 24, 2021, 7:12 p.m. UTC | #2
On Wed, Mar 24, 2021 at 02:48:47AM +0100, Ævar Arnfjörð Bjarmason wrote:

> Refactor the userdiff_find_by_namelen() function so that a new
> for_each_userdiff_driver() API function does most of the work.
> 
> This will be useful for the same reason we've got other for_each_*()
> API functions as part of various APIs, and will be used in a follow-up
> commit.

The refactorings up to here all made sense, but TBH this one makes the
code more confusing to follow to me.

Perhaps part of it is just that the diff is messy, but I had to read it
several times to understand what's going on. Here's what I think were
the tricky parts:

> -static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len)
> +struct for_each_userdiff_driver_cb {
> +	const char *k;
> +	size_t len;
> +	struct userdiff_driver *driver;
> +};

Our callback function does _one_ type of selection (based on a "type"
parameter), but not another (based on the name). That feels
inconsistent, but is also the reason we have this awkward struct.  Part
of my confusion is the name: this is not something to be generically
used with for_each_userdiff_driver(), but rather a type unique to
find_by_namelen() to be passed through the opaque void pointer.

So "struct find_by_namelen_data" would have been a lot more
enlightening.

The fact that callbacks are awkward in general in C might not be
solvable, at least not without duplicating some iteration code.

> +static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
> +				       enum userdiff_driver_type type, void *priv)
>  {
> [...]
> +	if (!strncmp(driver->name, cb_data->k, cb_data->len) &&
> +	    !driver->name[cb_data->len]) {
> +		cb_data->driver = driver;
> +		return -1; /* found it! */
>  	}

This "return -1" took me a while to grok, and the comment didn't help
all that much. The point is to stop traversing the list, but "-1" to me
signals error. I think returning "1" might be a bit more idiomatic, but
also a comment that says "tell the caller to stop iterating" would have
been more clear.

> +int for_each_userdiff_driver(each_userdiff_driver_fn fn,
> +			     enum userdiff_driver_type type, void *cb_data)
> +{
> +	int i, ret;
> +	if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_CUSTOM)) {
> +
> +		for (i = 0; i < ndrivers; i++) {
> +			struct userdiff_driver *drv = drivers + i;
> +			ret = fn(drv, USERDIFF_DRIVER_TYPE_CUSTOM, cb_data);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +	if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_BUILTIN)) {
> +
> +		for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) {
> +			struct userdiff_driver *drv = builtin_drivers + i;
> +			ret = fn(drv, USERDIFF_DRIVER_TYPE_BUILTIN, cb_data);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +	return 0;
> +}

I spent a while scratching my head at these types, and what they would
be used for, since this caller doesn't introduce any. Looking at patch 7
helped, though it's unclear to me why we need to distinguish between
custom and builtin drivers there. As you note there, nobody calls
list-custom-drivers nor list-drivers. And if we haven't configured
anything, then wouldn't list-drivers be the same as list-builtin-drivers?
Or for the purposes of that test, if we _did_ configure something,

  As an aside, it feels like this is something we ought to be able to
  ask git-config about, rather than having a test-helper. This is
  basically "baked-in" config, and if we represented it as such, and
  parsed it into a struct just like regular config, then probably "git
  config --list --source" could be used to find it (and differentiate it
  from user-provided config). Possible downsides:

    1. Would people find it confusing that "git config --list" suddenly
       gets way bigger? Maybe we'd want an "--include-baked-in" option
       or something.

    2. Is the cost of parsing the config measurably bad? Obviously a
       user could provide the same content and we'd have to parse it,
       but there's a lot more rules here than most users would probably
       provide.

> +enum userdiff_driver_type {
> +	USERDIFF_DRIVER_TYPE_UNSPECIFIED = 1<<0,
> +	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<1,
> +	USERDIFF_DRIVER_TYPE_CUSTOM = 1<<2,
> +};

I was confused by these being bits, because some of them seem mutually
exclusive (e.g., UNSPECIFIED and anything else).

Perhaps it would make more sense as:

  USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
  USERDIFF_DRIVER_TYPE_CUSTOM = 1<<0,
  USERDIFF_DRIVER_TYPE_ALL = USERDIFF_DRIVER_TYPE_BUILTIN | USERDIFF_DRIVER_TYPE_CUSTOM

Or the one caller who wants "ALL" could even do the OR themselves.

I do kind of wonder if there's much value in having a single function
with a type field at all, though, given that there's no overlap in the
implementation. Would separate "for_each_custom" and "for_each_builtin"
functions make sense? And then the existing caller would just call them
sequentially.

I dunno. I know a lot of this is nit-picking, and I don't think there's
anything incorrect in this patch. I just found it surprisingly hard to
read for something that purports to be refactoring / cleaning the code.

-Peff
Junio C Hamano March 24, 2021, 7:57 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> Our callback function does _one_ type of selection (based on a "type"
> parameter), but not another (based on the name). That feels
> inconsistent, but is also the reason we have this awkward struct.

I wrote a very similar review but did not send it out, as I couldn't
pinpoint where the awkwardness come from exactly.

> Part
> of my confusion is the name: this is not something to be generically
> used with for_each_userdiff_driver(), but rather a type unique to
> find_by_namelen() to be passed through the opaque void pointer.
>
> So "struct find_by_namelen_data" would have been a lot more
> enlightening.

Yes.  The callback parameter being "void *" is the API, and it is
just this user that uses this particular structure.

And while "type" could also be made a part of this structure and
have the API always iterate over all entries regardless of "type",
i.e. the callback function could be the one responsible for finding
the entries in the table for one particular type, it is understandable
that potential callers can be helped by having the pre-filtering
based on the "type" thing on the API side.

>> +	if (!strncmp(driver->name, cb_data->k, cb_data->len) &&
>> +	    !driver->name[cb_data->len]) {
>> +		cb_data->driver = driver;
>> +		return -1; /* found it! */
>>  	}
>
> This "return -1" took me a while to grok, and the comment didn't help
> all that much. The point is to stop traversing the list, but "-1" to me
> signals error. I think returning "1" might be a bit more idiomatic, but
> also a comment that says "tell the caller to stop iterating" would have
> been more clear.

And the fact that it becomes the return value of "for_each_()" iterator
does not quite help to use a negative value, implying there was some
error condition X-<.

> Perhaps it would make more sense as:
>
>   USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
>   USERDIFF_DRIVER_TYPE_CUSTOM = 1<<0,
>   USERDIFF_DRIVER_TYPE_ALL = USERDIFF_DRIVER_TYPE_BUILTIN | USERDIFF_DRIVER_TYPE_CUSTOM
>
> Or the one caller who wants "ALL" could even do the OR themselves.

Great minds think alike ;-)

> I do kind of wonder if there's much value in having a single function
> with a type field at all, though, given that there's no overlap in the
> implementation. Would separate "for_each_custom" and "for_each_builtin"
> functions make sense? And then the existing caller would just call them
> sequentially.

Or a single for_each_driver() that gives <name, length, type> tuple
to the callback function, iterating over all drivers regardless of
the type.
Jeff King March 24, 2021, 8:43 p.m. UTC | #4
On Wed, Mar 24, 2021 at 12:57:15PM -0700, Junio C Hamano wrote:

> And while "type" could also be made a part of this structure and
> have the API always iterate over all entries regardless of "type",
> i.e. the callback function could be the one responsible for finding
> the entries in the table for one particular type, it is understandable
> that potential callers can be helped by having the pre-filtering
> based on the "type" thing on the API side.

Yeah. We _could_ also have the for-each function filter by name (if
present). And then quit when it finds a matching name (because it knows
the names are supposed to be unique).

That is the opposite of:

> Or a single for_each_driver() that gives <name, length, type> tuple
> to the callback function, iterating over all drivers regardless of
> the type.

I'd be fine with that, too. I don't have a huge preference either way,
but it does feel like it would fix the weird asymmetry (though as I
said, I don't think the asymmetry is _wrong_, and it may not be worth
over-engineering this tiny corner of the interface).

Thanks for spelling out both of these directions clearly. My response to
Ævar was a little muddled as I was having trouble figuring out just what
it was that left me so puzzled by the patch. :)


By the way, one thing I wondered about all of this: could we have an
entry in both the custom and builtin lists? I think the answer is "no",
because any custom config for a builtin type would get placed inside the
existing struct in the builtin_drivers array (which is sadly a reason we
must leak any old string values when we parse diff.*.funcname, etc; we
don't know if they are string literals or heap values from previously
parsed config).

I also notice that the "builtin" for-each does not count the boolean
"diff" or "!diff" attribute structs. That is perhaps reasonable, as they
do not have interesting names nor values in the first place. I do still
wonder whether "builtin vs custom" is even a useful distinction (i.e.,
those builtins are less builtin_drivers[] than drivers[]).

-Peff
Ævar Arnfjörð Bjarmason March 24, 2021, 11:05 p.m. UTC | #5
On Wed, Mar 24 2021, Jeff King wrote:

> On Wed, Mar 24, 2021 at 02:48:47AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> Refactor the userdiff_find_by_namelen() function so that a new
>> for_each_userdiff_driver() API function does most of the work.
>> 
>> This will be useful for the same reason we've got other for_each_*()
>> API functions as part of various APIs, and will be used in a follow-up
>> commit.
>
> The refactorings up to here all made sense, but TBH this one makes the
> code more confusing to follow to me.
>
> Perhaps part of it is just that the diff is messy, but I had to read it
> several times to understand what's going on. Here's what I think were
> the tricky parts:
>
>> -static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len)
>> +struct for_each_userdiff_driver_cb {
>> +	const char *k;
>> +	size_t len;
>> +	struct userdiff_driver *driver;
>> +};
>
> Our callback function does _one_ type of selection (based on a "type"
> parameter), but not another (based on the name). That feels
> inconsistent, but is also the reason we have this awkward struct.  Part
> of my confusion is the name: this is not something to be generically
> used with for_each_userdiff_driver(), but rather a type unique to
> find_by_namelen() to be passed through the opaque void pointer.
>
> So "struct find_by_namelen_data" would have been a lot more
> enlightening.
>
> The fact that callbacks are awkward in general in C might not be
> solvable, at least not without duplicating some iteration code.
>
>> +static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
>> +				       enum userdiff_driver_type type, void *priv)
>>  {
>> [...]
>> +	if (!strncmp(driver->name, cb_data->k, cb_data->len) &&
>> +	    !driver->name[cb_data->len]) {
>> +		cb_data->driver = driver;
>> +		return -1; /* found it! */
>>  	}
>
> This "return -1" took me a while to grok, and the comment didn't help
> all that much. The point is to stop traversing the list, but "-1" to me
> signals error. I think returning "1" might be a bit more idiomatic, but
> also a comment that says "tell the caller to stop iterating" would have
> been more clear.

*nod*

Also thanks for all the reviewing so far both, I'm not replying to all
of it point-by-point here, will respond with a re-roll at some point.

>> +int for_each_userdiff_driver(each_userdiff_driver_fn fn,
>> +			     enum userdiff_driver_type type, void *cb_data)
>> +{
>> +	int i, ret;
>> +	if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_CUSTOM)) {
>> +
>> +		for (i = 0; i < ndrivers; i++) {
>> +			struct userdiff_driver *drv = drivers + i;
>> +			ret = fn(drv, USERDIFF_DRIVER_TYPE_CUSTOM, cb_data);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +	if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_BUILTIN)) {
>> +
>> +		for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) {
>> +			struct userdiff_driver *drv = builtin_drivers + i;
>> +			ret = fn(drv, USERDIFF_DRIVER_TYPE_BUILTIN, cb_data);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +	return 0;
>> +}
>
> I spent a while scratching my head at these types, and what they would
> be used for, since this caller doesn't introduce any. Looking at patch 7
> helped, though it's unclear to me why we need to distinguish between
> custom and builtin drivers there. As you note there, nobody calls
> list-custom-drivers nor list-drivers. And if we haven't configured
> anything, then wouldn't list-drivers be the same as list-builtin-drivers?
> Or for the purposes of that test, if we _did_ configure something,
>
>   As an aside, it feels like this is something we ought to be able to
>   ask git-config about, rather than having a test-helper. This is
>   basically "baked-in" config, and if we represented it as such, and
>   parsed it into a struct just like regular config, then probably "git
>   config --list --source" could be used to find it (and differentiate it
>   from user-provided config). Possible downsides:
>
>     1. Would people find it confusing that "git config --list" suddenly
>        gets way bigger? Maybe we'd want an "--include-baked-in" option
>        or something.
>
>     2. Is the cost of parsing the config measurably bad? Obviously a
>        user could provide the same content and we'd have to parse it,
>        but there's a lot more rules here than most users would probably
>        provide.

Also:

 3. Only the PATTERNS() macro translates as-is to config syntax. We
    don't have a way to do what IPATTERN() does in the config syntax
    currently.

    We could add a ifuncname and xifuncname or whatever for it I guess,
    but currently the ICASE behavior in the C code is magic.

>> +enum userdiff_driver_type {
>> +	USERDIFF_DRIVER_TYPE_UNSPECIFIED = 1<<0,
>> +	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<1,
>> +	USERDIFF_DRIVER_TYPE_CUSTOM = 1<<2,
>> +};
>
> I was confused by these being bits, because some of them seem mutually
> exclusive (e.g., UNSPECIFIED and anything else).
>
> Perhaps it would make more sense as:
>
>   USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
>   USERDIFF_DRIVER_TYPE_CUSTOM = 1<<0,
>   USERDIFF_DRIVER_TYPE_ALL = USERDIFF_DRIVER_TYPE_BUILTIN | USERDIFF_DRIVER_TYPE_CUSTOM
>
> Or the one caller who wants "ALL" could even do the OR themselves.
>
> I do kind of wonder if there's much value in having a single function
> with a type field at all, though, given that there's no overlap in the
> implementation. Would separate "for_each_custom" and "for_each_builtin"
> functions make sense? And then the existing caller would just call them
> sequentially.
>
> I dunno. I know a lot of this is nit-picking, and I don't think there's
> anything incorrect in this patch. I just found it surprisingly hard to
> read for something that purports to be refactoring / cleaning the code.
>
> -Peff
Jeff King March 25, 2021, 12:33 a.m. UTC | #6
On Thu, Mar 25, 2021 at 12:05:09AM +0100, Ævar Arnfjörð Bjarmason wrote:

> >   As an aside, it feels like this is something we ought to be able to
> >   ask git-config about, rather than having a test-helper. This is
> >   basically "baked-in" config, and if we represented it as such, and
> >   parsed it into a struct just like regular config, then probably "git
> >   config --list --source" could be used to find it (and differentiate it
> >   from user-provided config). Possible downsides:
> >
> >     1. Would people find it confusing that "git config --list" suddenly
> >        gets way bigger? Maybe we'd want an "--include-baked-in" option
> >        or something.
> >
> >     2. Is the cost of parsing the config measurably bad? Obviously a
> >        user could provide the same content and we'd have to parse it,
> >        but there's a lot more rules here than most users would probably
> >        provide.
> 
> Also:
> 
>  3. Only the PATTERNS() macro translates as-is to config syntax. We
>     don't have a way to do what IPATTERN() does in the config syntax
>     currently.
> 
>     We could add a ifuncname and xifuncname or whatever for it I guess,
>     but currently the ICASE behavior in the C code is magic.

Good point. IMHO that is something we should consider fixing
independently. It was a mistake to add builtins that couldn't be
replicated via the config (though I notice it happened quite a while
ago, and nobody seems to have cared, so perhaps it isn't that
important).

I have also wondered if we should just ship a file which could be
installed as /etc/gitconfig.filetypes and include it the stock
/etc/gitconfig. That is effectively the same as "baked in", but
hopefully makes it more clear to users how they can modify things.  But
all of that is somewhat orthogonal to what you're doing here.

-Peff
Ævar Arnfjörð Bjarmason March 25, 2021, 1:26 a.m. UTC | #7
On Thu, Mar 25 2021, Jeff King wrote:

> On Thu, Mar 25, 2021 at 12:05:09AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> >   As an aside, it feels like this is something we ought to be able to
>> >   ask git-config about, rather than having a test-helper. This is
>> >   basically "baked-in" config, and if we represented it as such, and
>> >   parsed it into a struct just like regular config, then probably "git
>> >   config --list --source" could be used to find it (and differentiate it
>> >   from user-provided config). Possible downsides:
>> >
>> >     1. Would people find it confusing that "git config --list" suddenly
>> >        gets way bigger? Maybe we'd want an "--include-baked-in" option
>> >        or something.
>> >
>> >     2. Is the cost of parsing the config measurably bad? Obviously a
>> >        user could provide the same content and we'd have to parse it,
>> >        but there's a lot more rules here than most users would probably
>> >        provide.
>> 
>> Also:
>> 
>>  3. Only the PATTERNS() macro translates as-is to config syntax. We
>>     don't have a way to do what IPATTERN() does in the config syntax
>>     currently.
>> 
>>     We could add a ifuncname and xifuncname or whatever for it I guess,
>>     but currently the ICASE behavior in the C code is magic.
>
> Good point. IMHO that is something we should consider fixing
> independently. It was a mistake to add builtins that couldn't be
> replicated via the config (though I notice it happened quite a while
> ago, and nobody seems to have cared, so perhaps it isn't that
> important).

I'm conspiring to eventually optimistically replace these ERE patterns
with PCRE if we build with that at some point.

Then you could just prefix your pattern with (?i) here in this and other
things that want icase...

> I have also wondered if we should just ship a file which could be
> installed as /etc/gitconfig.filetypes and include it the stock
> /etc/gitconfig. That is effectively the same as "baked in", but
> hopefully makes it more clear to users how they can modify things.  But
> all of that is somewhat orthogonal to what you're doing here.

In theory I'm with you on that, in practice this is just the sort of
thing that requires opt-in effort from every person packaging git
(installing system-wide-config).

A good number of those people will decide "default config values? In
some ~200 line file included in /etc/gitconfig?? I don't need that! This
is a minimal install!".

And then their web UI that calls "git diff" under the hood won't do the
right thing with a .gitattributes config that expects these built-in
drivers anymore.

But yeah, there's no reason your earlier suggestion of injecting global
defaults into "git config -l" wouldn't work, and for our own C APIs such
as this one to rely on that happening.

It would even make git more discoverable, as all the "this is set to xyz
by default" docs in git-config(1) could become reflective, you could
just run "git config -l --show-origin | grep ^default:" or something to
see all default values.
Jeff King March 26, 2021, 3:27 a.m. UTC | #8
On Thu, Mar 25, 2021 at 02:26:21AM +0100, Ævar Arnfjörð Bjarmason wrote:

> >>     We could add a ifuncname and xifuncname or whatever for it I guess,
> >>     but currently the ICASE behavior in the C code is magic.
> >
> > Good point. IMHO that is something we should consider fixing
> > independently. It was a mistake to add builtins that couldn't be
> > replicated via the config (though I notice it happened quite a while
> > ago, and nobody seems to have cared, so perhaps it isn't that
> > important).
> 
> I'm conspiring to eventually optimistically replace these ERE patterns
> with PCRE if we build with that at some point.
> 
> Then you could just prefix your pattern with (?i) here in this and other
> things that want icase...

I really like PCRE myself, but is it portable/common enough for us to
start using it for baked-in funcname patterns? I sort of assumed there
were exotic platforms where libpcre wouldn't build (or at least it would
be inconvenient or uncommon to have it). And it would be nice to degrade
those gracefully (or at least better than "I guess you don't get any
builtin funcnames. Tough luck").

> > I have also wondered if we should just ship a file which could be
> > installed as /etc/gitconfig.filetypes and include it the stock
> > /etc/gitconfig. That is effectively the same as "baked in", but
> > hopefully makes it more clear to users how they can modify things.  But
> > all of that is somewhat orthogonal to what you're doing here.
> 
> In theory I'm with you on that, in practice this is just the sort of
> thing that requires opt-in effort from every person packaging git
> (installing system-wide-config).

I assumed that we'd install it with "make install", and packaging would
pick it up from there. But you're right that it is likely to create extra
headaches.

> But yeah, there's no reason your earlier suggestion of injecting global
> defaults into "git config -l" wouldn't work, and for our own C APIs such
> as this one to rely on that happening.
> 
> It would even make git more discoverable, as all the "this is set to xyz
> by default" docs in git-config(1) could become reflective, you could
> just run "git config -l --show-origin | grep ^default:" or something to
> see all default values.

Yeah, exactly. The discoverability is the real value IMHO. I just worry
about overwhelming the user who runs "git config" with --show-origin. I
guess it could avoid showing baked-in config unless an extra option is
given, but then that makes discoverability slightly harder (though still
better than it is today).

I dunno. This is mostly orthogonal to your patch series, so I don't mind
at all punting on it for now. Though if we _did_ do the baked-in config
then, then a lot of this userdiff code would want to be converted to it.

-Peff
Ævar Arnfjörð Bjarmason April 9, 2021, 7:44 p.m. UTC | #9
On Fri, Mar 26 2021, Jeff King wrote:

> On Thu, Mar 25, 2021 at 02:26:21AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> >>     We could add a ifuncname and xifuncname or whatever for it I guess,
>> >>     but currently the ICASE behavior in the C code is magic.
>> >
>> > Good point. IMHO that is something we should consider fixing
>> > independently. It was a mistake to add builtins that couldn't be
>> > replicated via the config (though I notice it happened quite a while
>> > ago, and nobody seems to have cared, so perhaps it isn't that
>> > important).
>> 
>> I'm conspiring to eventually optimistically replace these ERE patterns
>> with PCRE if we build with that at some point.
>> 
>> Then you could just prefix your pattern with (?i) here in this and other
>> things that want icase...
>
> I really like PCRE myself, but is it portable/common enough for us to
> start using it for baked-in funcname patterns? I sort of assumed there
> were exotic platforms where libpcre wouldn't build (or at least it would
> be inconvenient or uncommon to have it). And it would be nice to degrade
> those gracefully (or at least better than "I guess you don't get any
> builtin funcnames. Tough luck").

I think it's as portable as git, the JIT backend isn't, but PCRE itself
is portably C code, and e.g. everything that python, PHP and any other
number of things that depend on PCRE has had PCRE ported to it.

That plan involves an "git rm -r compat/regex" and a compat/pcre
instead, I have some long-left-over patches for that.
Jeff King April 9, 2021, 8:11 p.m. UTC | #10
On Fri, Apr 09, 2021 at 09:44:57PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I really like PCRE myself, but is it portable/common enough for us to
> > start using it for baked-in funcname patterns? I sort of assumed there
> > were exotic platforms where libpcre wouldn't build (or at least it would
> > be inconvenient or uncommon to have it). And it would be nice to degrade
> > those gracefully (or at least better than "I guess you don't get any
> > builtin funcnames. Tough luck").
> 
> I think it's as portable as git, the JIT backend isn't, but PCRE itself
> is portably C code, and e.g. everything that python, PHP and any other
> number of things that depend on PCRE has had PCRE ported to it.
> 
> That plan involves an "git rm -r compat/regex" and a compat/pcre
> instead, I have some long-left-over patches for that.

OK. I was more worried about platforms where it was cumbersome to
install pcre. If we are shipping it as a vendored library, then at least
the dependency management is on us, and not the user. I do worry a
little about running into complications with building or debugging it.
It would be importing a lot of new code (that in theory we never need to
look at, but that is not always how it works...). And while it may build
portably everywhere, that may involve extra work and not integrate with
our build knobs (e.g., it looks like it uses autoconf)

-Peff
Junio C Hamano April 9, 2021, 10:37 p.m. UTC | #11
Jeff King <peff@peff.net> writes:

>> That plan involves an "git rm -r compat/regex" and a compat/pcre
>> instead, I have some long-left-over patches for that.
>
> OK. I was more worried about platforms where it was cumbersome to
> install pcre. If we are shipping it as a vendored library, then at least
> the dependency management is on us, and not the user. I do worry a
> little about running into complications with building or debugging it.

Please don't.  I prefer not shipping compat/pcre ourselves; having
to worry about version skew and picking up upstream security fixes
in time, etc. for an external library that is well maintained is not
what we need.
Ævar Arnfjörð Bjarmason April 10, 2021, 12:30 p.m. UTC | #12
On Sat, Apr 10 2021, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
>>> That plan involves an "git rm -r compat/regex" and a compat/pcre
>>> instead, I have some long-left-over patches for that.
>>
>> OK. I was more worried about platforms where it was cumbersome to
>> install pcre. If we are shipping it as a vendored library, then at least
>> the dependency management is on us, and not the user. I do worry a
>> little about running into complications with building or debugging it.
>
> Please don't.  I prefer not shipping compat/pcre ourselves; having
> to worry about version skew and picking up upstream security fixes
> in time, etc. for an external library that is well maintained is not
> what we need.

I'm not submitting patches for this now, I personally don't care much if
PCRE became a required dep (as in you need the library) and we didn't
ship it.

I just assumed if I'd propose that that we'd want "make" to work on more
bare-bones systems, since we e.g. carry a copy of
sha1collisiondetection/ partially to facilitate that.

But FWIW, and to the extent I've got time etc. my rough plan for this
larger PCRE conpsiary was:

 1. Get the current pickaxe-prep-for-PCRE patches to land.

 2. I've got patches on top of that which move over to PCRE, so as with
    the existing grep.c code we could drop kwset entirely (declare that
    if you need optimized fast search, you can install PCRE).

 3. git rm kwset.[ch]

 4. Migrate more things to some light wrapper "do a regex match" API
    (just small modifications to grep.c) that'll optimistically be able
    to do BRE/ERE/PCRE matches. If you've got PCRE then PCRE can handle
    them all (it's got a RX dialecttranslation interface).

 5. All our regcomp/regexec should be gone in favor if this simple
    replacement API.

Which is where this "PCRE everywhere" would come in, at this point we'll
be carrying a compat/regex/ which we could just as well replace with a
compat/pcre/.

The reason it would matter to have a hard prereq on PCRE is because it
would simplify and speed up a lot of code in various places to be able
to make an assumption that it's there. E.g. for things that do "match a
line" like pickaxe we could do clever anchoring and match the whole diff
at once, and also things like streaming the diff out as we consume it
into PCRE's stream matching API.

We also have a lot of things purely on the C level that would be
simpler, e.g. (just one thing I remember) grep.c has 40 lines of code to
emulate \b<word>\b, but we could ... just use \b.

To Jeff's comment upthread:

> And while it may build portably everywhere, that may involve extra
> work and not integrate with our build knobs (e.g., it looks like it
> uses autoconf)

I already did that work in an old WIP branch I have. I haven't run it
again so maybe this summary is inaccurate, but here's the commit:
https://github.com/avar/git/commit/79256ee4a1

So basically it won't build with autoconf or whatever, just with our own
build system after we'd wget the relevant PCREv2 source files into
compat/pcre/.

The only things PCRE really needed autoconf or its own build system for
was to define 20-ish macros which are either options we can switch, or
(looking at this now) 4x probes like "HAVE_STDINT_H" which we could add
to compat.mak.uname and friends.

If we wanted it to Just Work we could run that script and "git commit"
it, or just leave it at a make warning saying "you don't have PCRE,
either install it using your pkg system, or run this handy shellscript".

But yeah, if Junio's paranoid about the security aspect we could skip
that "./get-pcre.sh && git commit" step.

I'm getting some deja-vu writing this next part, I'm pretty sure I've
written some version of the same E-Mail before, in any case:

 A. This wouldn't be anything new. We have some old bitrotting copy of
    GNU awk's copy of glibc's regex engine in compat/regex. It's trivial
    to e.g. get it to exhaust all memory on your system.

    Sure, PCRE could have bugs, but at least we *could* update it. As
    you may remember I abandoned my last attempt to update compat/regex/

 B. Even if you assume some bug like that, I think realistically no user
    anywhere would be vulnerable to a copy we'd ship in compat/pcre/,
    any vendor would ignore it and build the prereq themselves. It's
    purely a developer aid.

    Of course it could have some CVE, but it wouldn't be mean a git
    point release.

 C. Even if the regex engine has an CVE/RCE we have no feature anywhere
    to execute arbitrary user regexes, anyone who could inject one can
    already run arbitrary git commands.

 D. The *one* exception to C) is gitweb's not-on-by-default "grep the
    source tree" feature.

    So first its users would be better off than they are now now. You
    can trivially DoS any box that has it exposed, secondly the union of
    gitweb users + those that would turn that on + build their own git
    for running a public-facing site is probably 0 people these days.

 E. Even if you assume some combination of B..D isn't true I genuinely
    don't know why this concern of PCRE being a special
    security/vulnerability concern comes from, it seems to be brought up
    any time the topic is discussed.

    The implicit context is that we already process regexes using a
    plethora of engines, and are using anything from AIXs, SunOS's,
    Window's, OSX's, GNU libc's etc. regex engines to do so.

    I haven't seen anything to suggest that PCRE's security record is
    notably worse than even one of those pieces of software, and right
    now we're exposing any bugs in the union of all those engines.

    It's quite the opposite, unlike those engines PCRE actually gives
    you really good features to limit resource consumption and regex
    features in the event that you don't trust your regex or your
    input.

    I'd actually trust for things like processing a "git grep on the
    web" with appropriate resource constraints.

    The two solutions in that space are basically to use DFA engine as
    e.g. Google's re2 does. That means living with a severely restricted
    set of regex features, or to have bells and whistles as PCRE's NFA
    does, but one that has tweakable resource limits.
diff mbox series

Patch

diff --git a/userdiff.c b/userdiff.c
index 10a02d36209..55f4f769bd3 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -259,20 +259,32 @@  static struct userdiff_driver driver_false = {
 	{ NULL, 0 }
 };
 
-static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len)
+struct for_each_userdiff_driver_cb {
+	const char *k;
+	size_t len;
+	struct userdiff_driver *driver;
+};
+
+static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
+				       enum userdiff_driver_type type, void *priv)
 {
-	int i;
-	for (i = 0; i < ndrivers; i++) {
-		struct userdiff_driver *drv = drivers + i;
-		if (!strncmp(drv->name, k, len) && !drv->name[len])
-			return drv;
-	}
-	for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) {
-		struct userdiff_driver *drv = builtin_drivers + i;
-		if (!strncmp(drv->name, k, len) && !drv->name[len])
-			return drv;
+	struct for_each_userdiff_driver_cb *cb_data = priv;
+
+	if (!strncmp(driver->name, cb_data->k, cb_data->len) &&
+	    !driver->name[cb_data->len]) {
+		cb_data->driver = driver;
+		return -1; /* found it! */
 	}
-	return NULL;
+	return 0;
+}
+
+static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len)
+{
+	struct for_each_userdiff_driver_cb udcbdata = { .k = k, .len = len, .driver = NULL };
+
+	for_each_userdiff_driver(userdiff_find_by_namelen_cb,
+				 USERDIFF_DRIVER_TYPE_UNSPECIFIED, &udcbdata);
+	return udcbdata.driver;
 }
 
 static int parse_funcname(struct userdiff_funcname *f, const char *k,
@@ -373,3 +385,28 @@  struct userdiff_driver *userdiff_get_textconv(struct repository *r,
 
 	return driver;
 }
+
+int for_each_userdiff_driver(each_userdiff_driver_fn fn,
+			     enum userdiff_driver_type type, void *cb_data)
+{
+	int i, ret;
+	if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_CUSTOM)) {
+
+		for (i = 0; i < ndrivers; i++) {
+			struct userdiff_driver *drv = drivers + i;
+			ret = fn(drv, USERDIFF_DRIVER_TYPE_CUSTOM, cb_data);
+			if (ret)
+				return ret;
+		}
+	}
+	if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_BUILTIN)) {
+
+		for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) {
+			struct userdiff_driver *drv = builtin_drivers + i;
+			ret = fn(drv, USERDIFF_DRIVER_TYPE_BUILTIN, cb_data);
+			if (ret)
+				return ret;
+		}
+	}
+	return 0;
+}
diff --git a/userdiff.h b/userdiff.h
index 203057e13e5..fe14014a775 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -21,6 +21,13 @@  struct userdiff_driver {
 	struct notes_cache *textconv_cache;
 	int textconv_want_cache;
 };
+enum userdiff_driver_type {
+	USERDIFF_DRIVER_TYPE_UNSPECIFIED = 1<<0,
+	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<1,
+	USERDIFF_DRIVER_TYPE_CUSTOM = 1<<2,
+};
+typedef int (*each_userdiff_driver_fn)(struct userdiff_driver *,
+				       enum userdiff_driver_type, void *);
 
 int userdiff_config(const char *k, const char *v);
 struct userdiff_driver *userdiff_find_by_name(const char *name);
@@ -34,4 +41,12 @@  struct userdiff_driver *userdiff_find_by_path(struct index_state *istate,
 struct userdiff_driver *userdiff_get_textconv(struct repository *r,
 					      struct userdiff_driver *driver);
 
+/*
+ * Iterate over each driver of type userdiff_driver_type, or
+ * USERDIFF_DRIVER_TYPE_UNSPECIFIED for all of them. Return non-zero
+ * to exit from the loop.
+ */
+int for_each_userdiff_driver(each_userdiff_driver_fn,
+			     enum userdiff_driver_type, void *);
+
 #endif /* USERDIFF */