Message ID | patch-05.11-64ea5e8443f-20210324T014604Z-avarab@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | userdiff: refactor + test improvements | expand |
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > +struct for_each_userdiff_driver_cb { > + const char *k; > + size_t len; > + struct userdiff_driver *driver; > +}; Makes me wonder if we want to rename s/k/name/; > +static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver, > + enum userdiff_driver_type type, void *priv) > { > + struct for_each_userdiff_driver_cb *cb_data = priv; > + > + if (!strncmp(driver->name, cb_data->k, cb_data->len) && > + !driver->name[cb_data->len]) { > + cb_data->driver = driver; > + return -1; /* found it! */ > } > - return NULL; > + return 0; > +} Makes sense. > +static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len) > +{ > + struct for_each_userdiff_driver_cb udcbdata = { .k = k, .len = len, .driver = NULL }; No need to explicitly spell the zero initialization. Wrapping it like this: struct for_each_userdiff_driver_cb udcbdata = { .k = k, .len = len }; would avoid the overlong line. > + > + for_each_userdiff_driver(userdiff_find_by_namelen_cb, > + USERDIFF_DRIVER_TYPE_UNSPECIFIED, &udcbdata); > + return udcbdata.driver; > @@ -373,3 +385,28 @@ struct userdiff_driver *userdiff_get_textconv(struct repository *r, > > return driver; > } > + > +int for_each_userdiff_driver(each_userdiff_driver_fn fn, > + enum userdiff_driver_type type, void *cb_data) > +{ > + int i, ret; > + if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_CUSTOM)) { I presume that the concrete ones are bitmask (i.e. BUILTIN occupies bit #0 while CUSTOM occupies bit #1, or something like that). Then #define USERDIFF_DRIVER_TYPE_UNSPECIFIED (-1) would make this (and the other) condition far easier to read, i.e. if (type & USERDIFF_DRIVER_TYPE_CUSTOM) { ... if the caller wants to iterate over "custom" drivers ... do these things
On Wed, Mar 24, 2021 at 02:48:47AM +0100, Ævar Arnfjörð Bjarmason wrote: > Refactor the userdiff_find_by_namelen() function so that a new > for_each_userdiff_driver() API function does most of the work. > > This will be useful for the same reason we've got other for_each_*() > API functions as part of various APIs, and will be used in a follow-up > commit. The refactorings up to here all made sense, but TBH this one makes the code more confusing to follow to me. Perhaps part of it is just that the diff is messy, but I had to read it several times to understand what's going on. Here's what I think were the tricky parts: > -static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len) > +struct for_each_userdiff_driver_cb { > + const char *k; > + size_t len; > + struct userdiff_driver *driver; > +}; Our callback function does _one_ type of selection (based on a "type" parameter), but not another (based on the name). That feels inconsistent, but is also the reason we have this awkward struct. Part of my confusion is the name: this is not something to be generically used with for_each_userdiff_driver(), but rather a type unique to find_by_namelen() to be passed through the opaque void pointer. So "struct find_by_namelen_data" would have been a lot more enlightening. The fact that callbacks are awkward in general in C might not be solvable, at least not without duplicating some iteration code. > +static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver, > + enum userdiff_driver_type type, void *priv) > { > [...] > + if (!strncmp(driver->name, cb_data->k, cb_data->len) && > + !driver->name[cb_data->len]) { > + cb_data->driver = driver; > + return -1; /* found it! */ > } This "return -1" took me a while to grok, and the comment didn't help all that much. The point is to stop traversing the list, but "-1" to me signals error. I think returning "1" might be a bit more idiomatic, but also a comment that says "tell the caller to stop iterating" would have been more clear. > +int for_each_userdiff_driver(each_userdiff_driver_fn fn, > + enum userdiff_driver_type type, void *cb_data) > +{ > + int i, ret; > + if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_CUSTOM)) { > + > + for (i = 0; i < ndrivers; i++) { > + struct userdiff_driver *drv = drivers + i; > + ret = fn(drv, USERDIFF_DRIVER_TYPE_CUSTOM, cb_data); > + if (ret) > + return ret; > + } > + } > + if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_BUILTIN)) { > + > + for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) { > + struct userdiff_driver *drv = builtin_drivers + i; > + ret = fn(drv, USERDIFF_DRIVER_TYPE_BUILTIN, cb_data); > + if (ret) > + return ret; > + } > + } > + return 0; > +} I spent a while scratching my head at these types, and what they would be used for, since this caller doesn't introduce any. Looking at patch 7 helped, though it's unclear to me why we need to distinguish between custom and builtin drivers there. As you note there, nobody calls list-custom-drivers nor list-drivers. And if we haven't configured anything, then wouldn't list-drivers be the same as list-builtin-drivers? Or for the purposes of that test, if we _did_ configure something, As an aside, it feels like this is something we ought to be able to ask git-config about, rather than having a test-helper. This is basically "baked-in" config, and if we represented it as such, and parsed it into a struct just like regular config, then probably "git config --list --source" could be used to find it (and differentiate it from user-provided config). Possible downsides: 1. Would people find it confusing that "git config --list" suddenly gets way bigger? Maybe we'd want an "--include-baked-in" option or something. 2. Is the cost of parsing the config measurably bad? Obviously a user could provide the same content and we'd have to parse it, but there's a lot more rules here than most users would probably provide. > +enum userdiff_driver_type { > + USERDIFF_DRIVER_TYPE_UNSPECIFIED = 1<<0, > + USERDIFF_DRIVER_TYPE_BUILTIN = 1<<1, > + USERDIFF_DRIVER_TYPE_CUSTOM = 1<<2, > +}; I was confused by these being bits, because some of them seem mutually exclusive (e.g., UNSPECIFIED and anything else). Perhaps it would make more sense as: USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, USERDIFF_DRIVER_TYPE_CUSTOM = 1<<0, USERDIFF_DRIVER_TYPE_ALL = USERDIFF_DRIVER_TYPE_BUILTIN | USERDIFF_DRIVER_TYPE_CUSTOM Or the one caller who wants "ALL" could even do the OR themselves. I do kind of wonder if there's much value in having a single function with a type field at all, though, given that there's no overlap in the implementation. Would separate "for_each_custom" and "for_each_builtin" functions make sense? And then the existing caller would just call them sequentially. I dunno. I know a lot of this is nit-picking, and I don't think there's anything incorrect in this patch. I just found it surprisingly hard to read for something that purports to be refactoring / cleaning the code. -Peff
Jeff King <peff@peff.net> writes: > Our callback function does _one_ type of selection (based on a "type" > parameter), but not another (based on the name). That feels > inconsistent, but is also the reason we have this awkward struct. I wrote a very similar review but did not send it out, as I couldn't pinpoint where the awkwardness come from exactly. > Part > of my confusion is the name: this is not something to be generically > used with for_each_userdiff_driver(), but rather a type unique to > find_by_namelen() to be passed through the opaque void pointer. > > So "struct find_by_namelen_data" would have been a lot more > enlightening. Yes. The callback parameter being "void *" is the API, and it is just this user that uses this particular structure. And while "type" could also be made a part of this structure and have the API always iterate over all entries regardless of "type", i.e. the callback function could be the one responsible for finding the entries in the table for one particular type, it is understandable that potential callers can be helped by having the pre-filtering based on the "type" thing on the API side. >> + if (!strncmp(driver->name, cb_data->k, cb_data->len) && >> + !driver->name[cb_data->len]) { >> + cb_data->driver = driver; >> + return -1; /* found it! */ >> } > > This "return -1" took me a while to grok, and the comment didn't help > all that much. The point is to stop traversing the list, but "-1" to me > signals error. I think returning "1" might be a bit more idiomatic, but > also a comment that says "tell the caller to stop iterating" would have > been more clear. And the fact that it becomes the return value of "for_each_()" iterator does not quite help to use a negative value, implying there was some error condition X-<. > Perhaps it would make more sense as: > > USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, > USERDIFF_DRIVER_TYPE_CUSTOM = 1<<0, > USERDIFF_DRIVER_TYPE_ALL = USERDIFF_DRIVER_TYPE_BUILTIN | USERDIFF_DRIVER_TYPE_CUSTOM > > Or the one caller who wants "ALL" could even do the OR themselves. Great minds think alike ;-) > I do kind of wonder if there's much value in having a single function > with a type field at all, though, given that there's no overlap in the > implementation. Would separate "for_each_custom" and "for_each_builtin" > functions make sense? And then the existing caller would just call them > sequentially. Or a single for_each_driver() that gives <name, length, type> tuple to the callback function, iterating over all drivers regardless of the type.
On Wed, Mar 24, 2021 at 12:57:15PM -0700, Junio C Hamano wrote: > And while "type" could also be made a part of this structure and > have the API always iterate over all entries regardless of "type", > i.e. the callback function could be the one responsible for finding > the entries in the table for one particular type, it is understandable > that potential callers can be helped by having the pre-filtering > based on the "type" thing on the API side. Yeah. We _could_ also have the for-each function filter by name (if present). And then quit when it finds a matching name (because it knows the names are supposed to be unique). That is the opposite of: > Or a single for_each_driver() that gives <name, length, type> tuple > to the callback function, iterating over all drivers regardless of > the type. I'd be fine with that, too. I don't have a huge preference either way, but it does feel like it would fix the weird asymmetry (though as I said, I don't think the asymmetry is _wrong_, and it may not be worth over-engineering this tiny corner of the interface). Thanks for spelling out both of these directions clearly. My response to Ævar was a little muddled as I was having trouble figuring out just what it was that left me so puzzled by the patch. :) By the way, one thing I wondered about all of this: could we have an entry in both the custom and builtin lists? I think the answer is "no", because any custom config for a builtin type would get placed inside the existing struct in the builtin_drivers array (which is sadly a reason we must leak any old string values when we parse diff.*.funcname, etc; we don't know if they are string literals or heap values from previously parsed config). I also notice that the "builtin" for-each does not count the boolean "diff" or "!diff" attribute structs. That is perhaps reasonable, as they do not have interesting names nor values in the first place. I do still wonder whether "builtin vs custom" is even a useful distinction (i.e., those builtins are less builtin_drivers[] than drivers[]). -Peff
On Wed, Mar 24 2021, Jeff King wrote: > On Wed, Mar 24, 2021 at 02:48:47AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> Refactor the userdiff_find_by_namelen() function so that a new >> for_each_userdiff_driver() API function does most of the work. >> >> This will be useful for the same reason we've got other for_each_*() >> API functions as part of various APIs, and will be used in a follow-up >> commit. > > The refactorings up to here all made sense, but TBH this one makes the > code more confusing to follow to me. > > Perhaps part of it is just that the diff is messy, but I had to read it > several times to understand what's going on. Here's what I think were > the tricky parts: > >> -static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len) >> +struct for_each_userdiff_driver_cb { >> + const char *k; >> + size_t len; >> + struct userdiff_driver *driver; >> +}; > > Our callback function does _one_ type of selection (based on a "type" > parameter), but not another (based on the name). That feels > inconsistent, but is also the reason we have this awkward struct. Part > of my confusion is the name: this is not something to be generically > used with for_each_userdiff_driver(), but rather a type unique to > find_by_namelen() to be passed through the opaque void pointer. > > So "struct find_by_namelen_data" would have been a lot more > enlightening. > > The fact that callbacks are awkward in general in C might not be > solvable, at least not without duplicating some iteration code. > >> +static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver, >> + enum userdiff_driver_type type, void *priv) >> { >> [...] >> + if (!strncmp(driver->name, cb_data->k, cb_data->len) && >> + !driver->name[cb_data->len]) { >> + cb_data->driver = driver; >> + return -1; /* found it! */ >> } > > This "return -1" took me a while to grok, and the comment didn't help > all that much. The point is to stop traversing the list, but "-1" to me > signals error. I think returning "1" might be a bit more idiomatic, but > also a comment that says "tell the caller to stop iterating" would have > been more clear. *nod* Also thanks for all the reviewing so far both, I'm not replying to all of it point-by-point here, will respond with a re-roll at some point. >> +int for_each_userdiff_driver(each_userdiff_driver_fn fn, >> + enum userdiff_driver_type type, void *cb_data) >> +{ >> + int i, ret; >> + if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_CUSTOM)) { >> + >> + for (i = 0; i < ndrivers; i++) { >> + struct userdiff_driver *drv = drivers + i; >> + ret = fn(drv, USERDIFF_DRIVER_TYPE_CUSTOM, cb_data); >> + if (ret) >> + return ret; >> + } >> + } >> + if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_BUILTIN)) { >> + >> + for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) { >> + struct userdiff_driver *drv = builtin_drivers + i; >> + ret = fn(drv, USERDIFF_DRIVER_TYPE_BUILTIN, cb_data); >> + if (ret) >> + return ret; >> + } >> + } >> + return 0; >> +} > > I spent a while scratching my head at these types, and what they would > be used for, since this caller doesn't introduce any. Looking at patch 7 > helped, though it's unclear to me why we need to distinguish between > custom and builtin drivers there. As you note there, nobody calls > list-custom-drivers nor list-drivers. And if we haven't configured > anything, then wouldn't list-drivers be the same as list-builtin-drivers? > Or for the purposes of that test, if we _did_ configure something, > > As an aside, it feels like this is something we ought to be able to > ask git-config about, rather than having a test-helper. This is > basically "baked-in" config, and if we represented it as such, and > parsed it into a struct just like regular config, then probably "git > config --list --source" could be used to find it (and differentiate it > from user-provided config). Possible downsides: > > 1. Would people find it confusing that "git config --list" suddenly > gets way bigger? Maybe we'd want an "--include-baked-in" option > or something. > > 2. Is the cost of parsing the config measurably bad? Obviously a > user could provide the same content and we'd have to parse it, > but there's a lot more rules here than most users would probably > provide. Also: 3. Only the PATTERNS() macro translates as-is to config syntax. We don't have a way to do what IPATTERN() does in the config syntax currently. We could add a ifuncname and xifuncname or whatever for it I guess, but currently the ICASE behavior in the C code is magic. >> +enum userdiff_driver_type { >> + USERDIFF_DRIVER_TYPE_UNSPECIFIED = 1<<0, >> + USERDIFF_DRIVER_TYPE_BUILTIN = 1<<1, >> + USERDIFF_DRIVER_TYPE_CUSTOM = 1<<2, >> +}; > > I was confused by these being bits, because some of them seem mutually > exclusive (e.g., UNSPECIFIED and anything else). > > Perhaps it would make more sense as: > > USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, > USERDIFF_DRIVER_TYPE_CUSTOM = 1<<0, > USERDIFF_DRIVER_TYPE_ALL = USERDIFF_DRIVER_TYPE_BUILTIN | USERDIFF_DRIVER_TYPE_CUSTOM > > Or the one caller who wants "ALL" could even do the OR themselves. > > I do kind of wonder if there's much value in having a single function > with a type field at all, though, given that there's no overlap in the > implementation. Would separate "for_each_custom" and "for_each_builtin" > functions make sense? And then the existing caller would just call them > sequentially. > > I dunno. I know a lot of this is nit-picking, and I don't think there's > anything incorrect in this patch. I just found it surprisingly hard to > read for something that purports to be refactoring / cleaning the code. > > -Peff
On Thu, Mar 25, 2021 at 12:05:09AM +0100, Ævar Arnfjörð Bjarmason wrote: > > As an aside, it feels like this is something we ought to be able to > > ask git-config about, rather than having a test-helper. This is > > basically "baked-in" config, and if we represented it as such, and > > parsed it into a struct just like regular config, then probably "git > > config --list --source" could be used to find it (and differentiate it > > from user-provided config). Possible downsides: > > > > 1. Would people find it confusing that "git config --list" suddenly > > gets way bigger? Maybe we'd want an "--include-baked-in" option > > or something. > > > > 2. Is the cost of parsing the config measurably bad? Obviously a > > user could provide the same content and we'd have to parse it, > > but there's a lot more rules here than most users would probably > > provide. > > Also: > > 3. Only the PATTERNS() macro translates as-is to config syntax. We > don't have a way to do what IPATTERN() does in the config syntax > currently. > > We could add a ifuncname and xifuncname or whatever for it I guess, > but currently the ICASE behavior in the C code is magic. Good point. IMHO that is something we should consider fixing independently. It was a mistake to add builtins that couldn't be replicated via the config (though I notice it happened quite a while ago, and nobody seems to have cared, so perhaps it isn't that important). I have also wondered if we should just ship a file which could be installed as /etc/gitconfig.filetypes and include it the stock /etc/gitconfig. That is effectively the same as "baked in", but hopefully makes it more clear to users how they can modify things. But all of that is somewhat orthogonal to what you're doing here. -Peff
On Thu, Mar 25 2021, Jeff King wrote: > On Thu, Mar 25, 2021 at 12:05:09AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> > As an aside, it feels like this is something we ought to be able to >> > ask git-config about, rather than having a test-helper. This is >> > basically "baked-in" config, and if we represented it as such, and >> > parsed it into a struct just like regular config, then probably "git >> > config --list --source" could be used to find it (and differentiate it >> > from user-provided config). Possible downsides: >> > >> > 1. Would people find it confusing that "git config --list" suddenly >> > gets way bigger? Maybe we'd want an "--include-baked-in" option >> > or something. >> > >> > 2. Is the cost of parsing the config measurably bad? Obviously a >> > user could provide the same content and we'd have to parse it, >> > but there's a lot more rules here than most users would probably >> > provide. >> >> Also: >> >> 3. Only the PATTERNS() macro translates as-is to config syntax. We >> don't have a way to do what IPATTERN() does in the config syntax >> currently. >> >> We could add a ifuncname and xifuncname or whatever for it I guess, >> but currently the ICASE behavior in the C code is magic. > > Good point. IMHO that is something we should consider fixing > independently. It was a mistake to add builtins that couldn't be > replicated via the config (though I notice it happened quite a while > ago, and nobody seems to have cared, so perhaps it isn't that > important). I'm conspiring to eventually optimistically replace these ERE patterns with PCRE if we build with that at some point. Then you could just prefix your pattern with (?i) here in this and other things that want icase... > I have also wondered if we should just ship a file which could be > installed as /etc/gitconfig.filetypes and include it the stock > /etc/gitconfig. That is effectively the same as "baked in", but > hopefully makes it more clear to users how they can modify things. But > all of that is somewhat orthogonal to what you're doing here. In theory I'm with you on that, in practice this is just the sort of thing that requires opt-in effort from every person packaging git (installing system-wide-config). A good number of those people will decide "default config values? In some ~200 line file included in /etc/gitconfig?? I don't need that! This is a minimal install!". And then their web UI that calls "git diff" under the hood won't do the right thing with a .gitattributes config that expects these built-in drivers anymore. But yeah, there's no reason your earlier suggestion of injecting global defaults into "git config -l" wouldn't work, and for our own C APIs such as this one to rely on that happening. It would even make git more discoverable, as all the "this is set to xyz by default" docs in git-config(1) could become reflective, you could just run "git config -l --show-origin | grep ^default:" or something to see all default values.
On Thu, Mar 25, 2021 at 02:26:21AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> We could add a ifuncname and xifuncname or whatever for it I guess, > >> but currently the ICASE behavior in the C code is magic. > > > > Good point. IMHO that is something we should consider fixing > > independently. It was a mistake to add builtins that couldn't be > > replicated via the config (though I notice it happened quite a while > > ago, and nobody seems to have cared, so perhaps it isn't that > > important). > > I'm conspiring to eventually optimistically replace these ERE patterns > with PCRE if we build with that at some point. > > Then you could just prefix your pattern with (?i) here in this and other > things that want icase... I really like PCRE myself, but is it portable/common enough for us to start using it for baked-in funcname patterns? I sort of assumed there were exotic platforms where libpcre wouldn't build (or at least it would be inconvenient or uncommon to have it). And it would be nice to degrade those gracefully (or at least better than "I guess you don't get any builtin funcnames. Tough luck"). > > I have also wondered if we should just ship a file which could be > > installed as /etc/gitconfig.filetypes and include it the stock > > /etc/gitconfig. That is effectively the same as "baked in", but > > hopefully makes it more clear to users how they can modify things. But > > all of that is somewhat orthogonal to what you're doing here. > > In theory I'm with you on that, in practice this is just the sort of > thing that requires opt-in effort from every person packaging git > (installing system-wide-config). I assumed that we'd install it with "make install", and packaging would pick it up from there. But you're right that it is likely to create extra headaches. > But yeah, there's no reason your earlier suggestion of injecting global > defaults into "git config -l" wouldn't work, and for our own C APIs such > as this one to rely on that happening. > > It would even make git more discoverable, as all the "this is set to xyz > by default" docs in git-config(1) could become reflective, you could > just run "git config -l --show-origin | grep ^default:" or something to > see all default values. Yeah, exactly. The discoverability is the real value IMHO. I just worry about overwhelming the user who runs "git config" with --show-origin. I guess it could avoid showing baked-in config unless an extra option is given, but then that makes discoverability slightly harder (though still better than it is today). I dunno. This is mostly orthogonal to your patch series, so I don't mind at all punting on it for now. Though if we _did_ do the baked-in config then, then a lot of this userdiff code would want to be converted to it. -Peff
On Fri, Mar 26 2021, Jeff King wrote: > On Thu, Mar 25, 2021 at 02:26:21AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> >> We could add a ifuncname and xifuncname or whatever for it I guess, >> >> but currently the ICASE behavior in the C code is magic. >> > >> > Good point. IMHO that is something we should consider fixing >> > independently. It was a mistake to add builtins that couldn't be >> > replicated via the config (though I notice it happened quite a while >> > ago, and nobody seems to have cared, so perhaps it isn't that >> > important). >> >> I'm conspiring to eventually optimistically replace these ERE patterns >> with PCRE if we build with that at some point. >> >> Then you could just prefix your pattern with (?i) here in this and other >> things that want icase... > > I really like PCRE myself, but is it portable/common enough for us to > start using it for baked-in funcname patterns? I sort of assumed there > were exotic platforms where libpcre wouldn't build (or at least it would > be inconvenient or uncommon to have it). And it would be nice to degrade > those gracefully (or at least better than "I guess you don't get any > builtin funcnames. Tough luck"). I think it's as portable as git, the JIT backend isn't, but PCRE itself is portably C code, and e.g. everything that python, PHP and any other number of things that depend on PCRE has had PCRE ported to it. That plan involves an "git rm -r compat/regex" and a compat/pcre instead, I have some long-left-over patches for that.
On Fri, Apr 09, 2021 at 09:44:57PM +0200, Ævar Arnfjörð Bjarmason wrote: > > I really like PCRE myself, but is it portable/common enough for us to > > start using it for baked-in funcname patterns? I sort of assumed there > > were exotic platforms where libpcre wouldn't build (or at least it would > > be inconvenient or uncommon to have it). And it would be nice to degrade > > those gracefully (or at least better than "I guess you don't get any > > builtin funcnames. Tough luck"). > > I think it's as portable as git, the JIT backend isn't, but PCRE itself > is portably C code, and e.g. everything that python, PHP and any other > number of things that depend on PCRE has had PCRE ported to it. > > That plan involves an "git rm -r compat/regex" and a compat/pcre > instead, I have some long-left-over patches for that. OK. I was more worried about platforms where it was cumbersome to install pcre. If we are shipping it as a vendored library, then at least the dependency management is on us, and not the user. I do worry a little about running into complications with building or debugging it. It would be importing a lot of new code (that in theory we never need to look at, but that is not always how it works...). And while it may build portably everywhere, that may involve extra work and not integrate with our build knobs (e.g., it looks like it uses autoconf) -Peff
Jeff King <peff@peff.net> writes: >> That plan involves an "git rm -r compat/regex" and a compat/pcre >> instead, I have some long-left-over patches for that. > > OK. I was more worried about platforms where it was cumbersome to > install pcre. If we are shipping it as a vendored library, then at least > the dependency management is on us, and not the user. I do worry a > little about running into complications with building or debugging it. Please don't. I prefer not shipping compat/pcre ourselves; having to worry about version skew and picking up upstream security fixes in time, etc. for an external library that is well maintained is not what we need.
On Sat, Apr 10 2021, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >>> That plan involves an "git rm -r compat/regex" and a compat/pcre >>> instead, I have some long-left-over patches for that. >> >> OK. I was more worried about platforms where it was cumbersome to >> install pcre. If we are shipping it as a vendored library, then at least >> the dependency management is on us, and not the user. I do worry a >> little about running into complications with building or debugging it. > > Please don't. I prefer not shipping compat/pcre ourselves; having > to worry about version skew and picking up upstream security fixes > in time, etc. for an external library that is well maintained is not > what we need. I'm not submitting patches for this now, I personally don't care much if PCRE became a required dep (as in you need the library) and we didn't ship it. I just assumed if I'd propose that that we'd want "make" to work on more bare-bones systems, since we e.g. carry a copy of sha1collisiondetection/ partially to facilitate that. But FWIW, and to the extent I've got time etc. my rough plan for this larger PCRE conpsiary was: 1. Get the current pickaxe-prep-for-PCRE patches to land. 2. I've got patches on top of that which move over to PCRE, so as with the existing grep.c code we could drop kwset entirely (declare that if you need optimized fast search, you can install PCRE). 3. git rm kwset.[ch] 4. Migrate more things to some light wrapper "do a regex match" API (just small modifications to grep.c) that'll optimistically be able to do BRE/ERE/PCRE matches. If you've got PCRE then PCRE can handle them all (it's got a RX dialecttranslation interface). 5. All our regcomp/regexec should be gone in favor if this simple replacement API. Which is where this "PCRE everywhere" would come in, at this point we'll be carrying a compat/regex/ which we could just as well replace with a compat/pcre/. The reason it would matter to have a hard prereq on PCRE is because it would simplify and speed up a lot of code in various places to be able to make an assumption that it's there. E.g. for things that do "match a line" like pickaxe we could do clever anchoring and match the whole diff at once, and also things like streaming the diff out as we consume it into PCRE's stream matching API. We also have a lot of things purely on the C level that would be simpler, e.g. (just one thing I remember) grep.c has 40 lines of code to emulate \b<word>\b, but we could ... just use \b. To Jeff's comment upthread: > And while it may build portably everywhere, that may involve extra > work and not integrate with our build knobs (e.g., it looks like it > uses autoconf) I already did that work in an old WIP branch I have. I haven't run it again so maybe this summary is inaccurate, but here's the commit: https://github.com/avar/git/commit/79256ee4a1 So basically it won't build with autoconf or whatever, just with our own build system after we'd wget the relevant PCREv2 source files into compat/pcre/. The only things PCRE really needed autoconf or its own build system for was to define 20-ish macros which are either options we can switch, or (looking at this now) 4x probes like "HAVE_STDINT_H" which we could add to compat.mak.uname and friends. If we wanted it to Just Work we could run that script and "git commit" it, or just leave it at a make warning saying "you don't have PCRE, either install it using your pkg system, or run this handy shellscript". But yeah, if Junio's paranoid about the security aspect we could skip that "./get-pcre.sh && git commit" step. I'm getting some deja-vu writing this next part, I'm pretty sure I've written some version of the same E-Mail before, in any case: A. This wouldn't be anything new. We have some old bitrotting copy of GNU awk's copy of glibc's regex engine in compat/regex. It's trivial to e.g. get it to exhaust all memory on your system. Sure, PCRE could have bugs, but at least we *could* update it. As you may remember I abandoned my last attempt to update compat/regex/ B. Even if you assume some bug like that, I think realistically no user anywhere would be vulnerable to a copy we'd ship in compat/pcre/, any vendor would ignore it and build the prereq themselves. It's purely a developer aid. Of course it could have some CVE, but it wouldn't be mean a git point release. C. Even if the regex engine has an CVE/RCE we have no feature anywhere to execute arbitrary user regexes, anyone who could inject one can already run arbitrary git commands. D. The *one* exception to C) is gitweb's not-on-by-default "grep the source tree" feature. So first its users would be better off than they are now now. You can trivially DoS any box that has it exposed, secondly the union of gitweb users + those that would turn that on + build their own git for running a public-facing site is probably 0 people these days. E. Even if you assume some combination of B..D isn't true I genuinely don't know why this concern of PCRE being a special security/vulnerability concern comes from, it seems to be brought up any time the topic is discussed. The implicit context is that we already process regexes using a plethora of engines, and are using anything from AIXs, SunOS's, Window's, OSX's, GNU libc's etc. regex engines to do so. I haven't seen anything to suggest that PCRE's security record is notably worse than even one of those pieces of software, and right now we're exposing any bugs in the union of all those engines. It's quite the opposite, unlike those engines PCRE actually gives you really good features to limit resource consumption and regex features in the event that you don't trust your regex or your input. I'd actually trust for things like processing a "git grep on the web" with appropriate resource constraints. The two solutions in that space are basically to use DFA engine as e.g. Google's re2 does. That means living with a severely restricted set of regex features, or to have bells and whistles as PCRE's NFA does, but one that has tweakable resource limits.
diff --git a/userdiff.c b/userdiff.c index 10a02d36209..55f4f769bd3 100644 --- a/userdiff.c +++ b/userdiff.c @@ -259,20 +259,32 @@ static struct userdiff_driver driver_false = { { NULL, 0 } }; -static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len) +struct for_each_userdiff_driver_cb { + const char *k; + size_t len; + struct userdiff_driver *driver; +}; + +static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver, + enum userdiff_driver_type type, void *priv) { - int i; - for (i = 0; i < ndrivers; i++) { - struct userdiff_driver *drv = drivers + i; - if (!strncmp(drv->name, k, len) && !drv->name[len]) - return drv; - } - for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) { - struct userdiff_driver *drv = builtin_drivers + i; - if (!strncmp(drv->name, k, len) && !drv->name[len]) - return drv; + struct for_each_userdiff_driver_cb *cb_data = priv; + + if (!strncmp(driver->name, cb_data->k, cb_data->len) && + !driver->name[cb_data->len]) { + cb_data->driver = driver; + return -1; /* found it! */ } - return NULL; + return 0; +} + +static struct userdiff_driver *userdiff_find_by_namelen(const char *k, size_t len) +{ + struct for_each_userdiff_driver_cb udcbdata = { .k = k, .len = len, .driver = NULL }; + + for_each_userdiff_driver(userdiff_find_by_namelen_cb, + USERDIFF_DRIVER_TYPE_UNSPECIFIED, &udcbdata); + return udcbdata.driver; } static int parse_funcname(struct userdiff_funcname *f, const char *k, @@ -373,3 +385,28 @@ struct userdiff_driver *userdiff_get_textconv(struct repository *r, return driver; } + +int for_each_userdiff_driver(each_userdiff_driver_fn fn, + enum userdiff_driver_type type, void *cb_data) +{ + int i, ret; + if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_CUSTOM)) { + + for (i = 0; i < ndrivers; i++) { + struct userdiff_driver *drv = drivers + i; + ret = fn(drv, USERDIFF_DRIVER_TYPE_CUSTOM, cb_data); + if (ret) + return ret; + } + } + if (type & (USERDIFF_DRIVER_TYPE_UNSPECIFIED | USERDIFF_DRIVER_TYPE_BUILTIN)) { + + for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) { + struct userdiff_driver *drv = builtin_drivers + i; + ret = fn(drv, USERDIFF_DRIVER_TYPE_BUILTIN, cb_data); + if (ret) + return ret; + } + } + return 0; +} diff --git a/userdiff.h b/userdiff.h index 203057e13e5..fe14014a775 100644 --- a/userdiff.h +++ b/userdiff.h @@ -21,6 +21,13 @@ struct userdiff_driver { struct notes_cache *textconv_cache; int textconv_want_cache; }; +enum userdiff_driver_type { + USERDIFF_DRIVER_TYPE_UNSPECIFIED = 1<<0, + USERDIFF_DRIVER_TYPE_BUILTIN = 1<<1, + USERDIFF_DRIVER_TYPE_CUSTOM = 1<<2, +}; +typedef int (*each_userdiff_driver_fn)(struct userdiff_driver *, + enum userdiff_driver_type, void *); int userdiff_config(const char *k, const char *v); struct userdiff_driver *userdiff_find_by_name(const char *name); @@ -34,4 +41,12 @@ struct userdiff_driver *userdiff_find_by_path(struct index_state *istate, struct userdiff_driver *userdiff_get_textconv(struct repository *r, struct userdiff_driver *driver); +/* + * Iterate over each driver of type userdiff_driver_type, or + * USERDIFF_DRIVER_TYPE_UNSPECIFIED for all of them. Return non-zero + * to exit from the loop. + */ +int for_each_userdiff_driver(each_userdiff_driver_fn, + enum userdiff_driver_type, void *); + #endif /* USERDIFF */
Refactor the userdiff_find_by_namelen() function so that a new for_each_userdiff_driver() API function does most of the work. This will be useful for the same reason we've got other for_each_*() API functions as part of various APIs, and will be used in a follow-up commit. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- userdiff.c | 61 +++++++++++++++++++++++++++++++++++++++++++----------- userdiff.h | 15 ++++++++++++++ 2 files changed, 64 insertions(+), 12 deletions(-)