[RFC] userdiff: ship built-in driver config file

Message ID 20190617165450.81916-1-liboxuan@connect.hku.hk (mailing list archive)
State New, archived

Commit Message

LI, BO XUAN June 17, 2019, 4:54 p.m. UTC
The userdiff.c has been rewritten to avoid hard-coded built-in
driver patterns. Now we ship
$(sharedir)/git-core/templates/userdiff that can be read using
git_config_from_file() interface, using a very narrow callback
function that understands only diff.*.xfuncname,
diff.*.wordregex, and diff.*.regIcase.

Signed-off-by: Boxuan Li <liboxuan@connect.hku.hk>
---
A few notes and questions:
1. In [diff "tex"] section, \x80 and \xff cannot be parsed by git config parser.
I have no idea why this is happening. I changed them to \\x80 and \\xff as a workaround, which
resulted in t4034 failure (See https://travis-ci.org/li-boxuan/git/jobs/546729906#L4679).
2. I am not sure how and where I can free the memory allocated to "builtin_drivers".
3. When I run `git format-patch HEAD~1`, core dump happens occasionally. Seems
no test case caught this problem. Till now, I have no luck finding out the reason.

Any hint or review would be appreciated.
---
 templates/this--userdiff | 164 ++++++++++++++++++++++
 userdiff.c               | 284 +++++++++++++++------------------------
 2 files changed, 275 insertions(+), 173 deletions(-)
 create mode 100644 templates/this--userdiff

Comments

Johannes Sixt June 18, 2019, 8:32 p.m. UTC | #1
Am 17.06.19 um 18:54 schrieb Boxuan Li:
> The userdiff.c has been rewritten to avoid hard-coded built-in
> driver patterns. Now we ship
> $(sharedir)/git-core/templates/userdiff that can be read using
> git_config_from_file() interface, using a very narrow callback
> function that understands only diff.*.xfuncname,
> diff.*.wordregex, and diff.*.regIcase.
> 
> Signed-off-by: Boxuan Li <liboxuan@connect.hku.hk>
> ---
> A few notes and questions:
> 1. In [diff "tex"] section, \x80 and \xff cannot be parsed by git config parser.
> I have no idea why this is happening. I changed them to \\x80 and \\xff as a workaround, which
> resulted in t4034 failure (See https://travis-ci.org/li-boxuan/git/jobs/546729906#L4679).

I guess, the idea is to catch bytes of UTF-8 encoded characters as
regular words.

The problem is to write such bytes literally into a git-config file and
still keep the file editable in a portable way. Perhaps it is necessary to
declare the file as CP1252 encoded via .gitattributes, write that part
of the regexp as [a-zA-Z0-9€-þ], and hope that your text editor actually
writes the file as CP1252. ISO8859-1 does not work because \x80 is not
occupied.

> 2. I am not sure how and where I can free the memory allocated to "builtin_drivers".
> 3. When I run `git format-patch HEAD~1`, core dump happens occasionally. Seems
> no test case caught this problem. Till now, I have no luck finding out the reason.

I admit that I haven't tested the driver beyond running t4018 and t4034.

> 
> Any hint or review would be appreciated.
> ---
>  templates/this--userdiff | 164 ++++++++++++++++++++++
>  userdiff.c               | 284 +++++++++++++++------------------------
>  2 files changed, 275 insertions(+), 173 deletions(-)
>  create mode 100644 templates/this--userdiff
> 
> diff --git a/templates/this--userdiff b/templates/this--userdiff
> new file mode 100644
> index 0000000000..85114a7229
> --- /dev/null
> +++ b/templates/this--userdiff

Why place this file in .git? To have per-repository diff drivers, we can
already specify them via 'git config'. This file should be installed in
the system.

> @@ -0,0 +1,164 @@
> +[diff "ada"]

etc... Please be aware that there are a few changes in 'next' that
affect this patch, in particular, the matlab pattern and new rust patterns.

> diff --git a/userdiff.c b/userdiff.c
> index 3a78fbf504..3e7052e13c 100644
> --- a/userdiff.c
> +++ b/userdiff.c

>  static struct userdiff_driver *drivers;
>  static int ndrivers;
>  static int drivers_alloc;
> +static struct config_set gm_config;
> +static int config_init;
> +struct userdiff_driver *builtin_drivers;
> +static int builtin_drivers_size;

Why do you not merge the builtin drivers with the other drivers? If
there is a reason to separate the two classes, please follow the
existing pattern to use ALLOC_GROW to reallocate the array.

> +static int userdiff_config_init(void)
> +{
> +	int ret = -1;
> +	if (!config_init) {

Please make this an early return to reduce the indentation of the
subsequent code.

> +		git_configset_init(&gm_config);
> +		if (the_repository && the_repository->gitdir)
> +			ret = git_configset_add_file(&gm_config, git_pathdup("userdiff"));
> +
> +		// if .git/userdiff does not exist, set config_init to be -1

Please do not use C++ style comments.

> +		if (ret == 0)
> +			config_init = 1;
> +		else
> +			config_init = -1;

After having done the initialization, it should be irrelevant whether
the driver list was not found. So, config_init = 1; should be the only
relevant case. Am I missing something?

> +
> +		builtin_drivers = (struct userdiff_driver *) malloc(sizeof(struct userdiff_driver));

Please do not use a cast here. It is unnecessary.
Please use xmalloc, which checks for an allocation failure.
I'm not going to repeat this for all other occurrences.

> +		*builtin_drivers = (struct userdiff_driver) { "default", NULL, -1, { NULL, 0 } };

I don't think we use this modern (GNU?) form of struct constants
anywhere already.

> +		builtin_drivers_size = 1;
> +	}
> +	return 0;
> +}
> +
> +static char* join_strings(const struct string_list *strings)
> +{
> +	char* str;
> +	int i, len, length = 0;
> +	if (!strings)
> +		return NULL;
> +
> +	for (i = 0; i < strings->nr; i++)
> +		length += strlen(strings->items[i].string);
> +
> +	str = (char *) malloc(length + 1);
> +	length = 0;
> +
> +	for (i = 0; i < strings->nr; i++) {
> +		len = strlen(strings->items[i].string);
> +		memcpy(str + length, strings->items[i].string, len);
> +		length += len;
> +	}
> +	str[length] = '\0';
> +	return str;
> +}

If you use the strbuf API instead of raw strings and
for_each_string_list_item, I'm sure you can boil this down to just a
handful of lines.
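
To illustrate the suggestion above, here is a minimal standalone sketch of joining strings through a growable buffer. It does not use Git's headers: `struct buf` and `buf_addstr()` are hypothetical stand-ins for `struct strbuf` and `strbuf_addstr()`, and the array-plus-count argument stands in for a `string_list`. Unlike the patch, it returns an empty string rather than NULL when there is nothing to join.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a strbuf: growable and always NUL-terminated. */
struct buf {
	char *data;
	size_t len, alloc;
};

static void buf_addstr(struct buf *b, const char *s)
{
	size_t n = strlen(s);

	if (b->len + n + 1 > b->alloc) {
		size_t want = (b->alloc + 16) * 3 / 2;

		if (want < b->len + n + 1)
			want = b->len + n + 1;
		b->data = realloc(b->data, want);
		assert(b->data); /* sketch only: Git would die() instead */
		b->alloc = want;
	}
	memcpy(b->data + b->len, s, n + 1); /* copies the NUL as well */
	b->len += n;
}

/* Concatenate nr strings; the caller owns and frees the result. */
static char *join_strings(const char **items, size_t nr)
{
	struct buf b = { NULL, 0, 0 };
	size_t i;

	buf_addstr(&b, ""); /* guarantees a valid empty string for nr == 0 */
	for (i = 0; i < nr; i++)
		buf_addstr(&b, items[i]);
	return b.data;
}
```

With the real API the loop body would presumably shrink to a single strbuf_addstr() call inside for_each_string_list_item, followed by strbuf_detach().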

> +
> +static struct userdiff_driver *userdiff_find_builtin_by_namelen(const char *k, int len)
> +{
> +	int i, key_length, word_regex_size, ret, reg_icase, cflags;
> +	char *xfuncname_key, *word_regex_key, *ipattern_key;
> +	char *xfuncname_value, *word_regex_value, *word_regex, *name;
> +	struct userdiff_driver *builtin_driver;
> +	char word_regex_extra[] = "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+";

Aha! Have a look at 664d44ee7fb1 ("userdiff: simplify word-diff
safeguard", 2011-01-11). Perhaps the special part in the TeX pattern
should just be removed (in a preparatory patch). It would change the
meaning because it would treat runs of digits and letters as separate
words, but I don't think that will hurt.

> +	userdiff_config_init();
> +	name = (char *) malloc(len + 1);
> +	memcpy(name, k, len);
> +	name[len] = '\0';

xmemdupz?
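
For readers unfamiliar with that helper, the three lines in the patch amount to "duplicate len bytes and NUL-terminate". A standalone sketch of that behaviour follows; `memdupz()` is a hypothetical name used for illustration and is not Git's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes from data and append a NUL; the caller frees the result. */
static char *memdupz(const void *data, size_t len)
{
	char *p = malloc(len + 1);

	assert(p); /* sketch only: Git's x*() wrappers die() on failure */
	memcpy(p, data, len);
	p[len] = '\0';
	return p;
}
```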

> +
> +	// look up builtin_driver
> +	for (i = 0; i < builtin_drivers_size; i++) {
> +		struct userdiff_driver *drv = builtin_drivers + i;
> +		if (!strncmp(drv->name, name, len) && !drv->name[len])
> +			return drv;
> +	}
> +
> +	// if .git/userdiff does not exist and name is not "default", return NULL
> +	if (config_init == -1) {
> +		return NULL;
> +	}

I wonder why you look up a driver before initialization.

> +
> +	// load xfuncname and wordRegex from userdiff config file
> +	key_length = len + 16;
> +	xfuncname_key = (char *) malloc(key_length);
> +	word_regex_key = (char *) malloc(key_length);
> +	ipattern_key = (char *) malloc(key_length - 1);
> +	snprintf(xfuncname_key, key_length, "diff.%s.xfuncname", name);
> +	snprintf(word_regex_key, key_length, "diff.%s.wordRegex", name);
> +	snprintf(ipattern_key, key_length - 1, "diff.%s.regIcase", name);
> +
> +	xfuncname_value = join_strings(git_configset_get_value_multi(&gm_config, xfuncname_key));
> +	word_regex_value = join_strings(git_configset_get_value_multi(&gm_config, word_regex_key));

I'm not familiar with the git_config API. Can't comment on what this is
all about.

> +
> +	ret = git_configset_get_bool(&gm_config, ipattern_key, &reg_icase);
> +	// if "regIcase" is not found, do not use REG_ICASE flag
> +	if (ret == 1)
> +		reg_icase = 0;
> +	cflags = reg_icase ? REG_EXTENDED | REG_ICASE : REG_EXTENDED;
> +
> +	free(xfuncname_key);
> +	free(word_regex_key);
> +	free(ipattern_key);
> +
> +	if (!xfuncname_value || !word_regex_value)
> +		return NULL;
> +
> +	word_regex_size = strlen(word_regex_value) + strlen(word_regex_extra) + 1;
> +	word_regex = (char *) malloc(word_regex_size);
> +	snprintf(word_regex, word_regex_size,
> +			"%s%s", word_regex_value, word_regex_extra);
> +
> +	builtin_drivers_size++;
> +	builtin_drivers = realloc(builtin_drivers, builtin_drivers_size * sizeof(struct userdiff_driver));

This is where you should use ALLOC_GROW.
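
The idea behind that pattern is to track the element count and the allocated capacity separately, and to grow the capacity geometrically so that repeated appends do not reallocate every time. A standalone sketch, assuming a reduced hypothetical `struct driver` and helper names (`list_grow`, `list_append`) that do not exist in Git:

```c
#include <assert.h>
#include <stdlib.h>

struct driver {			/* hypothetical, reduced to one field */
	const char *name;
};

struct driver_list {
	struct driver *items;
	size_t nr, alloc;	/* used entries vs. allocated capacity */
};

/* Make room for at least `want` items, growing capacity geometrically. */
static void list_grow(struct driver_list *l, size_t want)
{
	if (want <= l->alloc)
		return;
	l->alloc = (l->alloc + 16) * 3 / 2;
	if (l->alloc < want)
		l->alloc = want;
	l->items = realloc(l->items, l->alloc * sizeof(*l->items));
	assert(l->items); /* sketch only: no graceful out-of-memory path */
}

/* Append one entry and return a pointer to it. */
static struct driver *list_append(struct driver_list *l, const char *name)
{
	list_grow(l, l->nr + 1);
	l->items[l->nr].name = name;
	return &l->items[l->nr++];
}
```

Note that a pointer returned by list_append() is only valid until the next append, because realloc() may move the array.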

> +	builtin_driver = builtin_drivers + builtin_drivers_size - 1;
> +	*builtin_driver = (struct userdiff_driver) {
> +			name, NULL, -1, { xfuncname_value, cflags }, word_regex };
> +	return builtin_driver;
> +}

So, after having read through the whole patch, I understand that you are
using the builtin_drivers just as a cache.

That is not how I initially thought it would work. IMO, you should just
slurp in all of the builtin drivers and stash them away once during
initialization. Then it is not necessary to parse the file more than once.

>  
>  static struct userdiff_driver driver_true = {
>  	"diff=true",
> @@ -197,12 +140,7 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *k, int len)
>  		if (!strncmp(drv->name, k, len) && !drv->name[len])
>  			return drv;
>  	}
> -	for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) {
> -		struct userdiff_driver *drv = builtin_drivers + i;
> -		if (!strncmp(drv->name, k, len) && !drv->name[len])
> -			return drv;
> -	}
> -	return NULL;
> +	return userdiff_find_builtin_by_namelen(k, len);
>  }

I hate functions with this layout:

fun()
{
    loop {
       stuff;
    }
    something_else();
}

The preferred layout is, IMO:

do_stuff()
{
    loop {
       stuff;
    }
}
fun()
{
     do_stuff();
     something_else();
}

Or (less preferable) expand something_else() in the function.

In this case, the goal could be:

static struct userdiff_driver *userdiff_find_by_namelen1(...)
{
	...lookup loop comes here...
}

static struct userdiff_driver *userdiff_find_by_namelen(const char *k,
							int len)
{
	struct userdiff_driver *drv;
	drv = userdiff_find_by_namelen1(drivers, ndrivers, k, len);
	if (drv)
		return drv;
	if (!config_init)
		userdiff_config_init();
	return userdiff_find_by_namelen1(builtin_drivers, builtin_drivers_size,
					 k, len);
}

Needless to say that userdiff_config_init() should parse the file and
stash away all the drivers it finds.
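
A standalone sketch of that layout, for illustration only: user-defined drivers are searched first, and the built-in list is loaded exactly once, on the first miss. All names and data here are hypothetical (`load_builtin_drivers()` just points at a static table instead of parsing a file, and `load_count` exists only to make the single load observable).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct driver {
	const char *name;
	const char *funcname;
};

/* Hypothetical data standing in for user config and the shipped file. */
static struct driver user_drivers[] = { { "tex", "user pattern" } };
static struct driver builtin_table[] = {
	{ "tex", "builtin" },
	{ "ada", "builtin" },
};

static struct driver *builtin_drivers;
static size_t builtin_nr;
static int builtin_loaded;
static int load_count;

/* Stand-in for "parse the file once and stash away all drivers". */
static void load_builtin_drivers(void)
{
	builtin_drivers = builtin_table;
	builtin_nr = sizeof(builtin_table) / sizeof(builtin_table[0]);
	builtin_loaded = 1;
	load_count++;
}

/* The lookup loop; k must hold at least len characters. */
static struct driver *find_in(struct driver *list, size_t nr,
			      const char *k, size_t len)
{
	size_t i;

	assert(k);
	for (i = 0; i < nr; i++)
		if (!strncmp(list[i].name, k, len) && !list[i].name[len])
			return &list[i];
	return NULL;
}

static struct driver *find_by_namelen(const char *k, size_t len)
{
	struct driver *drv = find_in(user_drivers, 1, k, len);

	if (drv)
		return drv;
	if (!builtin_loaded)
		load_builtin_drivers();
	return find_in(builtin_drivers, builtin_nr, k, len);
}
```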

-- Hannes
Johannes Sixt June 18, 2019, 8:50 p.m. UTC | #2
Am 18.06.19 um 22:32 schrieb Johannes Sixt:
> Am 17.06.19 um 18:54 schrieb Boxuan Li:
>> The userdiff.c has been rewritten to avoid hard-coded built-in
>> driver patterns. Now we ship
>> $(sharedir)/git-core/templates/userdiff that can be read using
>> git_config_from_file() interface, using a very narrow callback
>> function that understands only diff.*.xfuncname,
>> diff.*.wordregex, and diff.*.regIcase.

I forgot to thank you for working on this. It is much appreciated.

>>
>> Signed-off-by: Boxuan Li <liboxuan@connect.hku.hk>
>> ---
>> A few notes and questions:
>> 1. In [diff "tex"] section, \x80 and \xff cannot be parsed by git config parser.
>> I have no idea why this is happening. I changed them to \\x80 and \\xff as a workaround, which
>> resulted in t4034 failure (See https://travis-ci.org/li-boxuan/git/jobs/546729906#L4679).
> 
> I guess, the idea is...

I think you noticed that I read the patch top-to-bottom and understood
the gist of it only near the end.

-- Hannes
Jeff King June 19, 2019, 3:49 a.m. UTC | #3
On Tue, Jun 18, 2019 at 10:32:47PM +0200, Johannes Sixt wrote:

> Am 17.06.19 um 18:54 schrieb Boxuan Li:
> > The userdiff.c has been rewritten to avoid hard-coded built-in
> > driver patterns. Now we ship
> > $(sharedir)/git-core/templates/userdiff that can be read using
> > git_config_from_file() interface, using a very narrow callback
> > function that understands only diff.*.xfuncname,
> > diff.*.wordregex, and diff.*.regIcase.
> > 
> > Signed-off-by: Boxuan Li <liboxuan@connect.hku.hk>
> > ---
> > A few notes and questions:
> > 1. In [diff "tex"] section, \x80 and \xff cannot be parsed by git config parser.
> > I have no idea why this is happening. I changed them to \\x80 and \\xff as a workaround, which
> > resulted in t4034 failure (See https://travis-ci.org/li-boxuan/git/jobs/546729906#L4679).
> 
> I guess, the idea is to catch bytes of UTF-8 encoded characters as
> regular words.
> 
> The problem is to write such bytes literally into a git-config file and
> still keep the file editable in a portable way. Perhaps it is necessary to
> declare the file as CP1252 encoded via .gitattributes, write that part
> of the regexp as [a-zA-Z0-9€-þ], and hope that your text editor actually
> writes the file as CP1252. ISO8859-1 does not work because \x80 is not
> occupied.

We don't allow octal or hex escapes in config values, though we do allow
common C ones like "\n". Maybe we should support them.

I didn't check whether we actually want the raw bytes here ourselves,
though, or are happy making sure the backslashed forms make it into the
regex parser (it _seems_ like the latter would be what we want, but that
does mean escaping the backslashes so they make it through the config
parser literally).
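
To make the first option concrete, here is a standalone sketch of what decoding `\xNN` escapes into raw bytes could look like. It is purely hypothetical: `unescape_hex()` is not part of Git's config parser, and it leaves every other backslash sequence (including an escaped backslash) untouched.

```c
#include <assert.h>
#include <stddef.h>

static int hexval(unsigned char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Decode \xNN escapes in place. Returns the new length, or -1 on a
 * malformed escape. A decoded \x00 would truncate the C string, which
 * this sketch does not try to handle.
 */
static long unescape_hex(char *s)
{
	char *in = s, *out = s;

	assert(s);
	while (*in) {
		if (in[0] == '\\' && in[1] == 'x') {
			int hi = hexval((unsigned char)in[2]);
			int lo = hi < 0 ? -1 : hexval((unsigned char)in[3]);

			if (lo < 0)
				return -1;
			*out++ = (char)(hi * 16 + lo);
			in += 4;
		} else if (in[0] == '\\' && in[1]) {
			/* some other escape: copy both characters as-is */
			*out++ = *in++;
			*out++ = *in++;
		} else {
			*out++ = *in++;
		}
	}
	*out = '\0';
	return (long)(out - s);
}
```

The second option needs no new parser code: the config file would carry doubled backslashes, and the regex engine, not the config parser, would have to understand `\x80`.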

> > diff --git a/templates/this--userdiff b/templates/this--userdiff
> > new file mode 100644
> > index 0000000000..85114a7229
> > --- /dev/null
> > +++ b/templates/this--userdiff
> 
> Why place this file in .git? To have per-repository diff drivers, we can
> already specify them via 'git config'. This file should be installed in
> the system.

I think it _could_ actually just be part of the system /etc/gitconfig,
though it is kind of big, and Git has a tendency to parse the config
more than necessary. I wonder if it would add a noticeable slowdown.

-Peff
Jeff King June 19, 2019, 3:58 a.m. UTC | #4
On Tue, Jun 18, 2019 at 12:54:50AM +0800, Boxuan Li wrote:

> A few notes and questions:
> 1. In [diff "tex"] section, \x80 and \xff cannot be parsed by git config parser.
> I have no idea why this is happening. I changed them to \\x80 and \\xff as a workaround, which
> resulted in t4034 failure (See https://travis-ci.org/li-boxuan/git/jobs/546729906#L4679).
> 2. I am not sure how and where I can free the memory allocated to "builtin_drivers".
> 3. When I run `git format-patch HEAD~1`, core dump happens occasionally. Seems
> no test case caught this problem. Till now, I have no luck finding out the reason.

I couldn't replicate it with a simple test, but perhaps running under
valgrind or "make SANITIZE=address" would help?

> diff --git a/templates/this--userdiff b/templates/this--userdiff
> new file mode 100644
> index 0000000000..85114a7229
> --- /dev/null
> +++ b/templates/this--userdiff
> @@ -0,0 +1,164 @@
> +[diff "ada"]
> +	xfuncname = "!^(.*[ \t])?(is[ \t]+new|renames|is[ \t]+separate)([ \t].*)?$\n"
> +	xfuncname = "!^[ \t]*with[ \t].*$\n"
> +	xfuncname = "^[ \t]*((procedure|function)[ \t]+.*)$\n"
> +	xfuncname = "^[ \t]*((package|protected|task)[ \t]+.*)$"

While having separate lines that get joined here does make the result
easier to read, I think it creates some confusion. diff.*.xfuncname in a
regular config file _doesn't_ behave this way (it's the usual
last-one-wins, so we expect a single string). You've handled this
specially in your code to read this file, but it's confusing because
this test otherwise looks exactly like a config file. And thus somebody
might be tempted to copy it to their config file and modify it, but it
would not do what they expected.
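
The difference can be shown with a standalone sketch. The `struct kv` list and both helper names are hypothetical and only model the two readings of a repeated key: the usual config rule keeps the last value, while the patch concatenates all of them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct kv {
	const char *key, *value;
};

/* Usual config semantics: the last definition of a key wins. */
static const char *last_one_wins(const struct kv *e, size_t nr,
				 const char *key)
{
	const char *v = NULL;
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(e[i].key, key))
			v = e[i].value;
	return v;
}

/* What the patch does instead: concatenate every value of the key. */
static size_t join_all(const struct kv *e, size_t nr, const char *key,
		       char *out, size_t outsz)
{
	size_t used = 0, i;

	assert(outsz > 0);
	out[0] = '\0';
	for (i = 0; i < nr; i++) {
		size_t n;

		if (strcmp(e[i].key, key))
			continue;
		n = strlen(e[i].value);
		assert(used + n < outsz); /* sketch: no truncation handling */
		memcpy(out + used, e[i].value, n + 1);
		used += n;
	}
	return used;
}
```

Fed the same two xfuncname lines, the first helper yields only the second pattern, which is what a user copying the section into a regular config file would get.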

I don't recall how well our config parser copes with embedded newlines
in values.  I.e., if it would be possible to write:

  [diff "foo"]
  xfuncname = "the pattern starts here...
  and continues through newlines!"

I think it doesn't work, but perhaps it would be a nice feature to add
it. It would make the format slightly more complex, though (and make
diagnosing a missing double-quote much harder). I dunno.

-Peff
Johannes Sixt June 19, 2019, 6:30 a.m. UTC | #5
Am 19.06.19 um 05:49 schrieb Jeff King:
> On Tue, Jun 18, 2019 at 10:32:47PM +0200, Johannes Sixt wrote:
> 
>> Am 17.06.19 um 18:54 schrieb Boxuan Li:
>>> diff --git a/templates/this--userdiff b/templates/this--userdiff
>>> new file mode 100644
>>> index 0000000000..85114a7229
>>> --- /dev/null
>>> +++ b/templates/this--userdiff
>>
>> Why place this file in .git? To have per-repository diff drivers, we can
>> already specify them via 'git config'. This file should be installed in
>> the system.
> 
> I think it _could_ actually just be part of the system /etc/gitconfig,
> though it is kind of big, and Git has a tendency to parse the config
> more than necessary. I wonder if it would add a noticeable slowdown.

But /etc/gitconfig would be the wrong place, because it would not be
updated when a new version ships with new patterns.

I would suggest to install the file as $prefix/share/git-core/userdiff
although the name "userdiff" sounds like an accident. How about
.../filetypes?

-- Hannes
Junio C Hamano June 19, 2019, 2:50 p.m. UTC | #6
Jeff King <peff@peff.net> writes:

> I think it _could_ actually just be part of the system /etc/gitconfig,
> though it is kind of big, and Git has a tendency to parse the config
> more than necessary. I wonder if it would add a noticeable slowdown.

Yeah, that was what I was wondering too when somebody made a casual
mention of the "why do we want to pile these settings hardcoded in
the code for each and every language?", which I think led to this
patch.  I was actually imagining that this would be totally outside
the normal config mechanism (there is no strong reason to even share
the syntax, but there is no need to come up with a different syntax,
either), treated like other fixed data files (like *.mo files), and
is read lazily only when diff driver needs to be instantiated.

I also agree with you that changing how multiply-defined variables
are handled with the patch is a no-go, if we are going to share the
parser.

Thanks for a patch and a review.
Junio C Hamano June 19, 2019, 3:02 p.m. UTC | #7
Johannes Sixt <j6t@kdbg.org> writes:

> But /etc/gitconfig would be the wrong place, because it would not be
> updated when a new version ships with new patterns.
>
> I would suggest to install the file as $prefix/share/git-core/userdiff
> although the name "userdiff" sounds like an accident. How about
> .../filetypes?

I am not sure about the filename, but I do agree with your choice of
"somewhere under $prefix/share/".

Thanks.
LI, BO XUAN June 19, 2019, 3:32 p.m. UTC | #8
On Wed, Jun 19, 2019 at 11:58 AM Jeff King <peff@peff.net> wrote:
>
> While having separate lines that get joined here does make the result
> easier to read, I think it creates some confusion. diff.*.xfuncname in a
> regular config file _doesn't_ behave this way (it's the usual
> last-one-wins, so we expect a single string). You've handled this
> specially in your code to read this file, but it's confusing because
> this test otherwise looks exactly like a config file. And thus somebody
> might be tempted to copy it to their config file and modify it, but it
> would not do what they expected.
>
> I don't recall how well our config parser copes with embedded newlines
> in values.  I.e., if it would be possible to write:
>
>   [diff "foo"]
>   xfuncname = "the pattern starts here...
>   and continues through newlines!"
>

If I recall correctly, the above version wouldn't work, but the
following version would:

[diff "foo"]
    xfuncname = "The pattern starts here..."
"and continues here! But the indentation looks ugly,"
"and we lose the ability to add comments inline (within pattern)"

Actually, at the very beginning, I was imagining some syntax like this:

[diff "foo"]
    xfuncname = "The pattern starts here..."
    ; using '+=' will continue the pattern above
    xfuncname += "and continues here!"
    ; a '=' symbol will start a new pattern
    xfuncname = "This is another pattern.."
    xfuncname += "and remember, last one always wins"

The existing config parser does not support "+=" though, which is a
nice feature to have in my opinion. Maybe there is a reason?

By the way, thanks for all the reviews! Especially when I found more
lines of reviews than lines of my code.

Best regards,
Boxuan
Jeff King June 19, 2019, 6:39 p.m. UTC | #9
On Wed, Jun 19, 2019 at 08:30:25AM +0200, Johannes Sixt wrote:

> >> Why place this file in .git? To have per-repository diff drivers, we can
> >> already specify them via 'git config'. This file should be installed in
> >> the system.
> > 
> > I think it _could_ actually just be part of the system /etc/gitconfig,
> > though it is kind of big, and Git has a tendency to parse the config
> > more than necessary. I wonder if it would add a noticeable slowdown.
> 
> But /etc/gitconfig would be the wrong place, because it would not be
> updated when a new version ships with new patterns.

I was thinking it would be, but I guess there is a merging problem if
the admin has made their own changes.

> I would suggest to install the file as $prefix/share/git-core/userdiff
> although the name "userdiff" sounds like an accident. How about
> .../filetypes?

Does it need to be specific to userdiff or filetypes? Could this be a
generic fourth level of config: repo, user, system, builtin? We
effectively already have that, except the "builtin" ones are truly baked
into the binary, which means they are not visible. So right now you
cannot say "git config diff.tex.xfuncname" and get any useful
information, even though we clearly are going to respect it.

On the other hand, this would make "git config --list" quite a bit
longer. And any solution that involves putting it into the generic
config paths may suffer from the bloating/slowdown problem I mentioned.

But without that, I have to wonder what problem we are really solving.
Now it's baked into the binary. Later it will be baked into the
distribution, but we still don't want anybody to touch it because their
changes will be overwritten. I guess it's a little easier for somebody
to find .../share/git-core/userdiff and use it as a template than it is
to find the definitions in the source. But it's not exactly easy.

-Peff
Jeff King June 19, 2019, 6:42 p.m. UTC | #10
On Wed, Jun 19, 2019 at 11:32:19PM +0800, LI, BO XUAN wrote:

> >   [diff "foo"]
> >   xfuncname = "the pattern starts here...
> >   and continues through newlines!"
> >
> 
> If I recall correctly, the above version wouldn't work, but the
> following version would:
> 
> [diff "foo"]
>     xfuncname = "The pattern starts here..."
> "and continues here! But the indentation looks ugly,"
> "and we lose the ability to add comments inline (within pattern)"

I don't think that works for our config files either (it does in C, of
course).

> Actually, at the very beginning, I was imagining some syntax like this:
> 
> [diff "foo"]
>     xfuncname = "The pattern starts here..."
>     ; using '+=' will continue the pattern above
>     xfuncname += "and continues here!"
>     ; a '=' symbol will start a new pattern
>     xfuncname = "This is another pattern.."
>     xfuncname += "and remember, last one always wins"
> 
> The existing config parser does not support "+=" though, which is a
> nice feature to have in my opinion. Maybe there is a reason?

There's no particular reason that feature isn't there, but in general
we've been very hesitant to add syntactic changes to the config files,
since there are many parsers in the wild. In general, today's files are
compatible with ones from 2005.

-Peff
Johannes Sixt June 20, 2019, 7:41 a.m. UTC | #11
Am 19.06.19 um 20:39 schrieb Jeff King:
> But without that, I have to wonder what problem we are really solving.

You have a point here.

> Now it's baked into the binary. Later it will be baked into the
> distribution, but we still don't want anybody to touch it because their
> changes will be overwritten.

Having the patterns outside the binary reduces our mental barrier to
accept additional patterns or slightly modified patterns for new file
types, like we had recently for octave that are just a small extension
to the matlab patterns.

But we are not solving a technical problem or a problem our users have,
AFAICS.

> I guess it's a little easier for somebody
> to find .../share/git-core/userdiff and use it as a template than it is
> to find the definitions in the source. But it's not exactly easy.

Having a slightly easier to discover database may count as an advantage. ;)

-- Hannes
Patch

diff --git a/templates/this--userdiff b/templates/this--userdiff
new file mode 100644
index 0000000000..85114a7229
--- /dev/null
+++ b/templates/this--userdiff
@@ -0,0 +1,164 @@ 
+[diff "ada"]
+	xfuncname = "!^(.*[ \t])?(is[ \t]+new|renames|is[ \t]+separate)([ \t].*)?$\n"
+	xfuncname = "!^[ \t]*with[ \t].*$\n"
+	xfuncname = "^[ \t]*((procedure|function)[ \t]+.*)$\n"
+	xfuncname = "^[ \t]*((package|protected|task)[ \t]+.*)$"
+	wordRegex = "[a-zA-Z][a-zA-Z0-9_]*"
+	wordRegex = "|[-+]?[0-9][0-9#_.aAbBcCdDeEfF]*([eE][+-]?[0-9_]+)?"
+	wordRegex = "|=>|\\.\\.|\\*\\*|:=|/=|>=|<=|<<|>>|<>"
+	regIcase = true
+
+[diff "fortran"]
+	xfuncname = "!^([C*]|[ \t]*!)\n"
+	xfuncname = "!^[ \t]*MODULE[ \t]+PROCEDURE[ \t]\n"
+	xfuncname = "^[ \t]*((END[ \t]+)?(PROGRAM|MODULE|BLOCK[ \t]+DATA"
+	xfuncname = "|([^'\" \t]+[ \t]+)*(SUBROUTINE|FUNCTION))[ \t]+[A-Z].*)$"
+	wordRegex = "[a-zA-Z][a-zA-Z0-9_]*"
+	wordRegex = "|\\.([Ee][Qq]|[Nn][Ee]|[Gg][TtEe]|[Ll][TtEe]|[Tt][Rr][Uu][Ee]|[Ff][Aa][Ll][Ss][Ee]|[Aa][Nn][Dd]|[Oo][Rr]|[Nn]?[Ee][Qq][Vv]|[Nn][Oo][Tt])\\."
+	; numbers and format statements like 2E14.4, or ES12.6, 9X.
+	; Don't worry about format statements without leading digits since
+	; they would have been matched above as a variable anyway.
+	wordRegex = "|[-+]?[0-9.]+([AaIiDdEeFfLlTtXx][Ss]?[-+]?[0-9.]*)?(_[a-zA-Z0-9][a-zA-Z0-9_]*)?"
+	wordRegex = "|//|\\*\\*|::|[/<>=]="
+	regIcase = true
+
+[diff "fountain"]
+	xfuncname = "^((\\.[^.]|(int|ext|est|int\\.?/ext|i/e)[. ]).*)$"
+	wordRegex = "[^ \t-]+"
+	regIcase = true
+
+[diff "golang"]
+	; Functions
+	xfuncname = "^[ \t]*(func[ \t]*.*(\\{[ \t]*)?)\n"
+	; Structs and interfaces
+	xfuncname = "^[ \t]*(type[ \t].*(struct|interface)[ \t]*(\\{[ \t]*)?)"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.eE]+i?|0[xX]?[0-9a-fA-F]+i?"
+	wordRegex = "|[-+*/<>%&^|=!:]=|--|\\+\\+|<<=?|>>=?|&\\^=?|&&|\\|\\||<-|\\.{3}"
+
+[diff "html"]
+	xfuncname = "^[ \t]*(<[Hh][1-6]([ \t].*)?>.*)$"
+	wordRegex = "[^<>= \t]+"
+
+[diff "java"]
+	xfuncname = "!^[ \t]*(catch|do|for|if|instanceof|new|return|switch|throw|while)\n"
+	xfuncname = "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
+	wordRegex = "|[-+*/<>%&^|=!]="
+	wordRegex = "|--|\\+\\+|<<=?|>>>?=?|&&|\\|\\|"
+
+[diff "matlab"]
+	xfuncname = "^[[:space:]]*((classdef|function)[[:space:]].*)$|^%%[[:space:]].*$"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*|[-+0-9.e]+|[=~<>]=|\\.[*/\\^']|\\|\\||&&"
+
+[diff "objc"]
+	; Negate C statements that can look like functions
+	xfuncname = "!^[ \t]*(do|for|if|else|return|switch|while)\n"
+	; Objective-C methods
+	xfuncname = "^[ \t]*([-+][ \t]*\\([ \t]*[A-Za-z_][A-Za-z_0-9* \t]*\\)[ \t]*[A-Za-z_].*)$\n"
+	; C functions
+	xfuncname = "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$\n"
+	; Objective-C class/protocol definitions
+	xfuncname = "^(@(implementation|interface|protocol)[ \t].*)$"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
+	wordRegex = "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"
+
+[diff "pascal"]
+	xfuncname = "^(((class[ \t]+)?(procedure|function)|constructor|destructor|interface|"
+	xfuncname = "implementation|initialization|finalization)[ \t]*.*)$"
+	xfuncname = "\n"
+	xfuncname = "^(.*=[ \t]*(class|record).*)$"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+"
+	wordRegex = "|<>|<=|>=|:=|\\.\\."
+
+[diff "perl"]
+	xfuncname = "^package .*\n"
+	xfuncname = "^sub [[:alnum:]_':]+[ \t]*"
+		xfuncname = "(\\([^)]*\\)[ \t]*)?" ; prototype
+		; Attributes.  A regex can't count nested parentheses,
+		; so just slurp up whatever we see, taking care not
+		; to accept lines like "sub foo; # defined elsewhere".
+		;
+		; An attribute could contain a semicolon, but at that
+		; point it seems reasonable enough to give up.
+		xfuncname = "(:[^;#]*)?"
+		xfuncname = "(\\{[ \t]*)?" ; brace can come here or on the next line
+		xfuncname = "(#.*)?$\n" ; comment
+	xfuncname = "^(BEGIN|END|INIT|CHECK|UNITCHECK|AUTOLOAD|DESTROY)[ \t]*"
+		xfuncname = "(\\{[ \t]*)?" ; brace can come here or on the next line
+		xfuncname = "(#.*)?$\n"
+	xfuncname = "^=head[0-9] .*" ; POD
+	wordRegex = "[[:alpha:]_'][[:alnum:]_']*"
+	wordRegex = "|0[xb]?[0-9a-fA-F_]*"
+	; taking care not to interpret 3..5 as (3.)(.5)
+	wordRegex = "|[0-9a-fA-F_]+(\\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?"
+	wordRegex = "|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::"
+	wordRegex = "|&&=|\\|\\|=|//=|\\*\\*="
+	wordRegex = "|&&|\\|\\||//|\\+\\+|--|\\*\\*|\\.\\.\\.?"
+	wordRegex = "|[-+*/%.^&<>=!|]="
+	wordRegex = "|=~|!~"
+	wordRegex = "|<<|<>|<=>|>>"
+
+[diff "php"]
+	xfuncname = "^[\t ]*(((public|protected|private|static)[\t ]+)*function.*)$\n"
+	xfuncname = "^[\t ]*((((final|abstract)[\t ]+)?class|interface|trait).*)$"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+"
+	wordRegex = "|[-+*/<>%&^|=!.]=|--|\\+\\+|<<=?|>>=?|===|&&|\\|\\||::|->"
+
+[diff "python"]
+	xfuncname = "^[ \t]*((class|def)[ \t].*)$"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
+	wordRegex = "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"
+
+[diff "ruby"]
+	xfuncname = "^[ \t]*((class|module|def)[ \t].*)$"
+	wordRegex = "(@|@@|\\$)?[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+|\\?(\\\\C-)?(\\\\M-)?."
+	wordRegex = "|//=?|[-+*/<>%&^|=!]=|<<=?|>>=?|===|\\.{1,3}|::|[!=]~"
+
+[diff "bibtex"]
+	xfuncname = "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$"
+	wordRegex = "[={}\"]|[^={}\" \t]+"
+
+[diff "tex"]
+	xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
+	wordRegex = "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\\x80-\\xff]+"
+
+[diff "cpp"]
+	; Jump targets or access declarations
+	xfuncname = "!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:[[:space:]]*($|/[/*])\n"
+	; functions/methods, variables, and compounds at top level
+	xfuncname = "^((::[[:space:]]*)?[A-Za-z_].*)$"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*"
+	wordRegex = "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"
+
+[diff "csharp"]
+	; Keywords
+	xfuncname = "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n"
+	; Methods and constructors
+	xfuncname = "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe|async)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[<>@._[:alnum:]]+[ \t]*\\(.*\\))[ \t]*$\n"
+	; Properties
+	xfuncname = "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[@._[:alnum:]]+)[ \t]*$\n"
+	; Type definitions
+	xfuncname = "^[ \t]*(((static|public|internal|private|protected|new|unsafe|sealed|abstract|partial)[ \t]+)*(class|enum|interface|struct)[ \t]+.*)$\n"
+	; Namespace
+	xfuncname = "^[ \t]*(namespace[ \t]+.*)$"
+	wordRegex = "[a-zA-Z_][a-zA-Z0-9_]*"
+	wordRegex = "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
+	wordRegex = "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"
+
+[diff "css"]
+	xfuncname = "![:;][[:space:]]*$\n"
+	xfuncname = "^[_a-z0-9].*$"
+	; This regex comes from W3C CSS specs. Should theoretically also
+	; allow ISO 10646 characters U+00A0 and higher,
+	; but they are not handled in this regex.
+	wordRegex = "-?[_a-zA-Z][-_a-zA-Z0-9]*" ; identifiers
+	wordRegex = "|-?[0-9]+|\\#[0-9a-fA-F]+" ; numbers
+	regIcase = true
diff --git a/userdiff.c b/userdiff.c
index 3a78fbf504..3e7052e13c 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -2,178 +2,121 @@ 
 #include "config.h"
 #include "userdiff.h"
 #include "attr.h"
+#include "exec-cmd.h"
+#include "repository.h"
 
 static struct userdiff_driver *drivers;
 static int ndrivers;
 static int drivers_alloc;
+static struct config_set gm_config;
+static int config_init;
+struct userdiff_driver *builtin_drivers;
+static int builtin_drivers_size;
 
-#define PATTERNS(name, pattern, word_regex)			\
-	{ name, NULL, -1, { pattern, REG_EXTENDED },		\
-	  word_regex "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" }
-#define IPATTERN(name, pattern, word_regex)			\
-	{ name, NULL, -1, { pattern, REG_EXTENDED | REG_ICASE }, \
-	  word_regex "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" }
-static struct userdiff_driver builtin_drivers[] = {
-IPATTERN("ada",
-	 "!^(.*[ \t])?(is[ \t]+new|renames|is[ \t]+separate)([ \t].*)?$\n"
-	 "!^[ \t]*with[ \t].*$\n"
-	 "^[ \t]*((procedure|function)[ \t]+.*)$\n"
-	 "^[ \t]*((package|protected|task)[ \t]+.*)$",
-	 /* -- */
-	 "[a-zA-Z][a-zA-Z0-9_]*"
-	 "|[-+]?[0-9][0-9#_.aAbBcCdDeEfF]*([eE][+-]?[0-9_]+)?"
-	 "|=>|\\.\\.|\\*\\*|:=|/=|>=|<=|<<|>>|<>"),
-IPATTERN("fortran",
-	 "!^([C*]|[ \t]*!)\n"
-	 "!^[ \t]*MODULE[ \t]+PROCEDURE[ \t]\n"
-	 "^[ \t]*((END[ \t]+)?(PROGRAM|MODULE|BLOCK[ \t]+DATA"
-		"|([^'\" \t]+[ \t]+)*(SUBROUTINE|FUNCTION))[ \t]+[A-Z].*)$",
-	 /* -- */
-	 "[a-zA-Z][a-zA-Z0-9_]*"
-	 "|\\.([Ee][Qq]|[Nn][Ee]|[Gg][TtEe]|[Ll][TtEe]|[Tt][Rr][Uu][Ee]|[Ff][Aa][Ll][Ss][Ee]|[Aa][Nn][Dd]|[Oo][Rr]|[Nn]?[Ee][Qq][Vv]|[Nn][Oo][Tt])\\."
-	 /* numbers and format statements like 2E14.4, or ES12.6, 9X.
-	  * Don't worry about format statements without leading digits since
-	  * they would have been matched above as a variable anyway. */
-	 "|[-+]?[0-9.]+([AaIiDdEeFfLlTtXx][Ss]?[-+]?[0-9.]*)?(_[a-zA-Z0-9][a-zA-Z0-9_]*)?"
-	 "|//|\\*\\*|::|[/<>=]="),
-IPATTERN("fountain", "^((\\.[^.]|(int|ext|est|int\\.?/ext|i/e)[. ]).*)$",
-	 "[^ \t-]+"),
-PATTERNS("golang",
-	 /* Functions */
-	 "^[ \t]*(func[ \t]*.*(\\{[ \t]*)?)\n"
-	 /* Structs and interfaces */
-	 "^[ \t]*(type[ \t].*(struct|interface)[ \t]*(\\{[ \t]*)?)",
-	 /* -- */
-	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.eE]+i?|0[xX]?[0-9a-fA-F]+i?"
-	 "|[-+*/<>%&^|=!:]=|--|\\+\\+|<<=?|>>=?|&\\^=?|&&|\\|\\||<-|\\.{3}"),
-PATTERNS("html", "^[ \t]*(<[Hh][1-6]([ \t].*)?>.*)$",
-	 "[^<>= \t]+"),
-PATTERNS("java",
-	 "!^[ \t]*(catch|do|for|if|instanceof|new|return|switch|throw|while)\n"
-	 "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$",
-	 /* -- */
-	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
-	 "|[-+*/<>%&^|=!]="
-	 "|--|\\+\\+|<<=?|>>>?=?|&&|\\|\\|"),
-PATTERNS("matlab",
-	 "^[[:space:]]*((classdef|function)[[:space:]].*)$|^%%[[:space:]].*$",
-	 "[a-zA-Z_][a-zA-Z0-9_]*|[-+0-9.e]+|[=~<>]=|\\.[*/\\^']|\\|\\||&&"),
-PATTERNS("objc",
-	 /* Negate C statements that can look like functions */
-	 "!^[ \t]*(do|for|if|else|return|switch|while)\n"
-	 /* Objective-C methods */
-	 "^[ \t]*([-+][ \t]*\\([ \t]*[A-Za-z_][A-Za-z_0-9* \t]*\\)[ \t]*[A-Za-z_].*)$\n"
-	 /* C functions */
-	 "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$\n"
-	 /* Objective-C class/protocol definitions */
-	 "^(@(implementation|interface|protocol)[ \t].*)$",
-	 /* -- */
-	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
-	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
-PATTERNS("pascal",
-	 "^(((class[ \t]+)?(procedure|function)|constructor|destructor|interface|"
-		"implementation|initialization|finalization)[ \t]*.*)$"
-	 "\n"
-	 "^(.*=[ \t]*(class|record).*)$",
-	 /* -- */
-	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+"
-	 "|<>|<=|>=|:=|\\.\\."),
-PATTERNS("perl",
-	 "^package .*\n"
-	 "^sub [[:alnum:]_':]+[ \t]*"
-		"(\\([^)]*\\)[ \t]*)?" /* prototype */
-		/*
-		 * Attributes.  A regex can't count nested parentheses,
-		 * so just slurp up whatever we see, taking care not
-		 * to accept lines like "sub foo; # defined elsewhere".
-		 *
-		 * An attribute could contain a semicolon, but at that
-		 * point it seems reasonable enough to give up.
-		 */
-		"(:[^;#]*)?"
-		"(\\{[ \t]*)?" /* brace can come here or on the next line */
-		"(#.*)?$\n" /* comment */
-	 "^(BEGIN|END|INIT|CHECK|UNITCHECK|AUTOLOAD|DESTROY)[ \t]*"
-		"(\\{[ \t]*)?" /* brace can come here or on the next line */
-		"(#.*)?$\n"
-	 "^=head[0-9] .*",	/* POD */
-	 /* -- */
-	 "[[:alpha:]_'][[:alnum:]_']*"
-	 "|0[xb]?[0-9a-fA-F_]*"
-	 /* taking care not to interpret 3..5 as (3.)(.5) */
-	 "|[0-9a-fA-F_]+(\\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?"
-	 "|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::"
-	 "|&&=|\\|\\|=|//=|\\*\\*="
-	 "|&&|\\|\\||//|\\+\\+|--|\\*\\*|\\.\\.\\.?"
-	 "|[-+*/%.^&<>=!|]="
-	 "|=~|!~"
-	 "|<<|<>|<=>|>>"),
-PATTERNS("php",
-	 "^[\t ]*(((public|protected|private|static)[\t ]+)*function.*)$\n"
-	 "^[\t ]*((((final|abstract)[\t ]+)?class|interface|trait).*)$",
-	 /* -- */
-	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+"
-	 "|[-+*/<>%&^|=!.]=|--|\\+\\+|<<=?|>>=?|===|&&|\\|\\||::|->"),
-PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$",
-	 /* -- */
-	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
-	 "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"),
-	 /* -- */
-PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
-	 /* -- */
-	 "(@|@@|\\$)?[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+|\\?(\\\\C-)?(\\\\M-)?."
-	 "|//=?|[-+*/<>%&^|=!]=|<<=?|>>=?|===|\\.{1,3}|::|[!=]~"),
-PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
-	 "[={}\"]|[^={}\" \t]+"),
-PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
-	 "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+"),
-PATTERNS("cpp",
-	 /* Jump targets or access declarations */
-	 "!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:[[:space:]]*($|/[/*])\n"
-	 /* functions/methods, variables, and compounds at top level */
-	 "^((::[[:space:]]*)?[A-Za-z_].*)$",
-	 /* -- */
-	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*"
-	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
-PATTERNS("csharp",
-	 /* Keywords */
-	 "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n"
-	 /* Methods and constructors */
-	 "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe|async)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[<>@._[:alnum:]]+[ \t]*\\(.*\\))[ \t]*$\n"
-	 /* Properties */
-	 "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[@._[:alnum:]]+)[ \t]*$\n"
-	 /* Type definitions */
-	 "^[ \t]*(((static|public|internal|private|protected|new|unsafe|sealed|abstract|partial)[ \t]+)*(class|enum|interface|struct)[ \t]+.*)$\n"
-	 /* Namespace */
-	 "^[ \t]*(namespace[ \t]+.*)$",
-	 /* -- */
-	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
-	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
-IPATTERN("css",
-	 "![:;][[:space:]]*$\n"
-	 "^[_a-z0-9].*$",
-	 /* -- */
-	 /*
-	  * This regex comes from W3C CSS specs. Should theoretically also
-	  * allow ISO 10646 characters U+00A0 and higher,
-	  * but they are not handled in this regex.
-	  */
-	 "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
-	 "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
-),
-{ "default", NULL, -1, { NULL, 0 } },
-};
-#undef PATTERNS
-#undef IPATTERN
+static int userdiff_config_init(void)
+{
+	int ret = -1;
+	if (!config_init) {
+		git_configset_init(&gm_config);
+		if (the_repository && the_repository->gitdir)
+			ret = git_configset_add_file(&gm_config, git_pathdup("userdiff"));
+
+		/* if .git/userdiff could not be read, set config_init to -1 */
+		if (ret == 0)
+			config_init = 1;
+		else
+			config_init = -1;
+
+		builtin_drivers = xmalloc(sizeof(*builtin_drivers));
+		*builtin_drivers = (struct userdiff_driver) { "default", NULL, -1, { NULL, 0 } };
+		builtin_drivers_size = 1;
+	}
+	return 0;
+}
+
+static char *join_strings(const struct string_list *strings)
+{
+	char *str;
+	int i, len, length = 0;
+	if (!strings)
+		return NULL;
+
+	for (i = 0; i < strings->nr; i++)
+		length += strlen(strings->items[i].string);
+
+	str = (char *) malloc(length + 1);
+	length = 0;
+
+	for (i = 0; i < strings->nr; i++) {
+		len = strlen(strings->items[i].string);
+		memcpy(str + length, strings->items[i].string, len);
+		length += len;
+	}
+	str[length] = '\0';
+	return str;
+}
+
+static struct userdiff_driver *userdiff_find_builtin_by_namelen(const char *k, int len)
+{
+	int i, key_length, word_regex_size, ret, reg_icase, cflags;
+	char *xfuncname_key, *word_regex_key, *ipattern_key;
+	char *xfuncname_value, *word_regex_value, *word_regex, *name;
+	struct userdiff_driver *builtin_driver;
+	char word_regex_extra[] = "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+";
+	userdiff_config_init();
+
+	/* look up drivers that have already been loaded */
+	for (i = 0; i < builtin_drivers_size; i++) {
+		struct userdiff_driver *drv = builtin_drivers + i;
+		if (!strncmp(drv->name, k, len) && !drv->name[len])
+			return drv;
+	}
+
+	/* if .git/userdiff could not be read, only "default" is available */
+	if (config_init == -1) {
+		return NULL;
+	}
+
+	/* copy the name only now, so the early returns above do not leak it */
+	name = xmemdupz(k, len);
+
+	/* load xfuncname, wordRegex and regIcase from the userdiff config file */
+	key_length = len + 16;
+	xfuncname_key = (char *) malloc(key_length);
+	word_regex_key = (char *) malloc(key_length);
+	ipattern_key = (char *) malloc(key_length - 1);
+	snprintf(xfuncname_key, key_length, "diff.%s.xfuncname", name);
+	snprintf(word_regex_key, key_length, "diff.%s.wordRegex", name);
+	snprintf(ipattern_key, key_length - 1, "diff.%s.regIcase", name);
+
+	xfuncname_value = join_strings(git_configset_get_value_multi(&gm_config, xfuncname_key));
+	word_regex_value = join_strings(git_configset_get_value_multi(&gm_config, word_regex_key));
+
+	ret = git_configset_get_bool(&gm_config, ipattern_key, &reg_icase);
+	/* if "regIcase" is not set, do not use the REG_ICASE flag */
+	if (ret == 1)
+		reg_icase = 0;
+	cflags = reg_icase ? REG_EXTENDED | REG_ICASE : REG_EXTENDED;
+
+	free(xfuncname_key);
+	free(word_regex_key);
+	free(ipattern_key);
+
+	if (!xfuncname_value || !word_regex_value)
+		return NULL;
+
+	word_regex_size = strlen(word_regex_value) + strlen(word_regex_extra) + 1;
+	word_regex = (char *) malloc(word_regex_size);
+	snprintf(word_regex, word_regex_size,
+			"%s%s", word_regex_value, word_regex_extra);
+
+	builtin_drivers_size++;
+	builtin_drivers = xrealloc(builtin_drivers, builtin_drivers_size * sizeof(struct userdiff_driver));
+	builtin_driver = builtin_drivers + builtin_drivers_size - 1;
+	*builtin_driver = (struct userdiff_driver) {
+			name, NULL, -1, { xfuncname_value, cflags }, word_regex };
+	return builtin_driver;
+}
 
 static struct userdiff_driver driver_true = {
 	"diff=true",
@@ -197,12 +140,7 @@  static struct userdiff_driver *userdiff_find_by_namelen(const char *k, int len)
 		if (!strncmp(drv->name, k, len) && !drv->name[len])
 			return drv;
 	}
-	for (i = 0; i < ARRAY_SIZE(builtin_drivers); i++) {
-		struct userdiff_driver *drv = builtin_drivers + i;
-		if (!strncmp(drv->name, k, len) && !drv->name[len])
-			return drv;
-	}
-	return NULL;
+	return userdiff_find_builtin_by_namelen(k, len);
 }
 
 static int parse_funcname(struct userdiff_funcname *f, const char *k,