diff mbox series

merge-file: warn for implicit 'myers' algorithm

Message ID pull.1741.git.git.1720016469254.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series merge-file: warn for implicit 'myers' algorithm | expand

Commit Message

Antonin Delpeuch July 3, 2024, 2:21 p.m. UTC
From: Antonin Delpeuch <antonin@delpeuch.eu>

The current default diff algorithm for the merge-file command is
'myers', despite the default for the 'ort' strategy being 'histogram'.
Since 2.44.0 it is possible to specify a different diff algorithm via
the --diff-algorithm option. As a preparation for changing the default
to 'histogram', we warn the user about the different behaviour this
may cause.

Signed-off-by: Antonin Delpeuch <antonin@delpeuch.eu>
---
    merge-file: warn for implicit 'myers' algorithm

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1741%2Fwetneb%2Fexplicit_diff_algorithm-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1741/wetneb/explicit_diff_algorithm-v1
Pull-Request: https://github.com/git/git/pull/1741

 builtin/merge-file.c  | 8 ++++++++
 t/t6403-merge-file.sh | 5 +++--
 2 files changed, 11 insertions(+), 2 deletions(-)


base-commit: 06e570c0dfb2a2deb64d217db78e2ec21672f558

Comments

Junio C Hamano July 3, 2024, 5:30 p.m. UTC | #1
"Antonin Delpeuch via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Antonin Delpeuch <antonin@delpeuch.eu>
>
> The current default diff algorithm for the merge-file command is
> 'myers', despite the default for the 'ort' strategy being 'histogram'.

It is unclear only from the above description if `ort` should match
by adopting `myers` as its default, or vice versa.  I'll go into
more detail why I think this whole thing is done in a wrong order
later.

> Since 2.44.0 it is possible to specify a different diff algorithm via
> the --diff-algorithm option.

"2.44.0" -> "4f7fd79e (merge-file: add --diff-algorithm option,
2023-11-20)".  Thanks for that patch, by the way.

> As a preparation for changing the default
> to 'histogram', we warn the user about the different behaviour this
> may cause.

There is a huge leap in logic here.  Nobody proposed to change the
default and we had no such discussion here.  That needs to happen
before anybody can warn users that "the default will change".

Once everybody agrees that such a change is a good idea, we'd need
to devise transition plan, and one of the tasks _might_ be to add
this warning to the code, among other things we may do.  The whole
process has to be designed carefully.  Having this step as the first
one is way too suboptimal and hurts the users (e.g. "what if we
decide that using histogram as the default is not what we want to do
in the end?").

> +static int explicit_diff_algorithm = 0;

We shouldn't (and we generally do not) initialize a variable
explicitly to "0".  Just let being in the .bss section take care of
it instead.

This one is flipped by diff_algorithm_cb(), which is a callback
function that deals with the command line option "--diff-algorithm",
so calling the variable to remember the fact that the option was
used with "explicit" in its name is very much appropriate.

Now on to the real problem I have with this patch.

What do you want the end-user experience for those who saw and
understood this warning message to be?  Especially for ones who do
not really _care_ what the default algorithm used is, and would
rather tell us "I'll let Git developers to choose the best algorithm
for us, do not bother me with what exact choice you made---I'll
happily use the built-in default")?

Whether they have their favorite algorithm or they are willing to go
with the built-in default, they will keep getting shown by this
message and there is no obvious way to easily squelch the message.

If we do not count "Every time you run this command, you can give
the --diff-algorithm=myers option from the command line to squelch
it" as a usable piece of advice, that is.

Stepping back a bit.

It is curious that, even though we read the merge.conflictstyle
configuration variable by calling into xdiff-interface.c, we do not
seem to pay attention to the diff.algorithm configuration variable.
Shouldn't we teach the command to do so first, as a follow-up to
your 4f7fd79e (merge-file: add --diff-algorithm option, 2023-11-20)?
If `ort` does not pay attention to it (I do not know if it does),
then perhaps that can also be fixed in the same "preparatory"
series.  It would allow us to have a way to consistently and
uniformly configure the diff algorithm to employ regardless of the
caller of the diff machinery (and if we wanted to go fancy, we could
even introduce "diff.ort.algorithm" and "diff.merge-file.algorithm"
that overrides "diff.algorithm" that in turn overrides whatever the
built-in default is).

Such a change would prepare the codebase to allow users to say "I'll
adopt the new default that will come in Git 3.0 before it happens by
setting diff.algorithm to histogram" or "I'll set diff.algorithm to
default to express that I'll go with the flow and let Git developers
to decide".  With such a preparatory change made, you can build an
equivalent of this step, but you make sure that you pay attention to
both the command line (i.e. your explicit_diff_algorithm) and also
the configuration as ways for users to express that "I've read the
warning.  Please do not repeat it."

So, in short, I do not like this patch because of two reasons:

 - It shouldn't be the first message that begins the topic of
   flipping the default diff algorithm for merge-file.

 - Even if we arrived at a consensus to migrate the default to
   'histogram' (and I do not think we even started discussing), the
   "warning" mechanism that has no easy way to squelch is not an
   acceptable component in the migration plan.
Antonin Delpeuch July 3, 2024, 6:28 p.m. UTC | #2
Hi Junio,

I'm really sorry, I thought the switch of default and migration plan had already been agreed on in our discussion of my earlier patch.
Specifically, you wrote (https://lore.kernel.org/git/xmqq7cmdpbhq.fsf@gitster.g):

    First allow to configure the
    custom algorithm from the command line option (and optionally via a
    configuration variable) and ship it in a release, start giving a
    warning if the using script did not specify the configuration or the
    command line option and used the current default and ship it in the
    next release, wait for a few releases and then finally flip the
    default, or something like that.

So I thought it would be helpful to follow-up with a patch that implements the approach you outlined.
But I totally understand that it might be worth discussing this more.
Actually, I do agree with your assessment that this warning is not great UX.

I think relying on `diff.algorithm` is a natural idea, but it might also be confusing for users.
At least to me, the name `diff.algorithm` suggests that it's the algorithm used for "git diff",
but I might not realize that it also influences how my merges are done.
It's probably common to want different algorithms for those situations as they require different speed and accuracy trade-offs.

In any case, I'm happy to withdraw this patch. Would it be helpful if I start a new thread on the mailing list, independently from this patch, to discuss if and how the default should be switched?

Best,
Antonin
Junio C Hamano July 3, 2024, 8:19 p.m. UTC | #3
Antonin Delpeuch <antonin@delpeuch.eu> writes:

> I'm really sorry, I thought the switch of default and migration
> plan had already been agreed on in our discussion of my earlier
> patch.

Ahh, OK.

So we did some time ago floated the idea.  I do not remember how
widely accepted the proposal was, though.  Having a such reference
and an explicit mention of what we have and not have yet reached
consensus on (either in cover letter or after the three-dash line)
would have been very much helpful.

> I think relying on `diff.algorithm` is a natural idea, but it
> might also be confusing for users.  At least to me, the name
> `diff.algorithm` suggests that it's the algorithm used for "git
> diff", but I might not realize that it also influences how my
> merges are done.  It's probably common to want different
> algorithms for those situations as they require different speed
> and accuracy trade-offs.

I never thought that we should get rid of one-off command line
option.  After all, we started with command line option to support
such one-off tweaks in the earlier 4f7fd79e (merge-file: add
--diff-algorithm option, 2023-11-20) for that exact reason.

The need to have one-off capability is orthogonal to the need to
allow users to choose their own default via the configuration.  We
want to have both, given that some commands other than "git diff"
already honor the `diff.THING` configuration variables.  Doesn't
"git log -p" already pay attention to "diff.algorithm" among other
"diff.THING" variables?

A possible downside I had envisioned was that depending on the
application (i.e. "diff" that produces a patch vs an internal
implementation detail of "merge-file") the users may want to choose
the value of "diff.THING" differently.  But then we can use the
"'diff.THING' is used as the default, but 'diff.frotz.THING', when
defined, overrides the choice inside the 'git frotz' program as a
more specific configuration" pattern.

In any case, honoring things like

    [diff] algorithm = default
    [diff "merge-file"] algorithm = default

in the configuration file might be a reasonable way out to prepare
that users will have a way to squelch the warning messages.

Thanks.
Elijah Newren July 5, 2024, 5:39 p.m. UTC | #4
On Wed, Jul 3, 2024 at 1:24 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Antonin Delpeuch <antonin@delpeuch.eu> writes:
>
> > I'm really sorry, I thought the switch of default and migration
> > plan had already been agreed on in our discussion of my earlier
> > patch.
>
> Ahh, OK.
>
> So we did some time ago floated the idea.  I do not remember how
> widely accepted the proposal was, though.  Having a such reference
> and an explicit mention of what we have and not have yet reached
> consensus on (either in cover letter or after the three-dash line)
> would have been very much helpful.

There's been a few discussions, the other most recent one I remember
was the thread over at
https://lore.kernel.org/git/Y+zzh80fybq8Tn66@coredump.intra.peff.net/.
(And beyond the git community, there's
https://lkml.org/lkml/2023/5/7/206 in the kernel community, and if
others know of discussions in other large developer communities I'd be
interested in links.)

The previous discussions felt to me like we were moving towards
consensus, but while I found that encouraging since I think histogram
would eventually be a better default, I did not make any actual
proposals and try to push further towards consensus because there are
a couple known issues that I think should be fixed before we consider
flipping the default.  I have some work-in-progress that was put on
the backburner a few years ago that I would like to pick up again, and
if successful, investigate how much that helps general cases in a
format that can help people make educated decisions, and then again
float the idea of changing the default.  If consensus is reached, then
we'd change the default across the board -- diff/log/merge-file/etc.
rather than just the somewhat rarely used merge-file.  At least,
that's my current plan in this area; if others think I should
investigate things in a different order or would like to see
additional steps planned into this journey, please do let me know.
Antonin Delpeuch July 5, 2024, 8:14 p.m. UTC | #5
Thanks both.

It sounds like there is a strong case for at least making "merge-file"
honour the `diff.algorithm` config variable, possibly with more
fine-grained config variables as suggested by Junio. I'll try to make a
patch for that.

Elijah, if any of your work on improving the histogram diff can be
shared (for instance by writing down a description of the issues you'd
like to address), I would be interested to have a look and perhaps help
out if doable. But as far as I can tell there seems to be a lot of fans
of the existing algorithm already, and such heuristics can never be
perfect, so hopefully we can make the switch in a not too distant future.

Best,

Antonin
Junio C Hamano July 6, 2024, 6:06 a.m. UTC | #6
Elijah Newren <newren@gmail.com> writes:

> The previous discussions felt to me like we were moving towards
> consensus, but while I found that encouraging since I think histogram
> would eventually be a better default, I did not make any actual
> proposals and try to push further towards consensus because there are
> a couple known issues that I think should be fixed before we consider
> flipping the default.

OK.

> I have some work-in-progress that was put on
> the backburner a few years ago that I would like to pick up again, and
> if successful, investigate how much that helps general cases in a
> format that can help people make educated decisions, and then again
> float the idea of changing the default.

Sounds like a good plan.

> If consensus is reached, then
> we'd change the default across the board -- diff/log/merge-file/etc.
> rather than just the somewhat rarely used merge-file.  

Everybody uses diff.c::diff_algorithm somehow, will be affected by
the diff.algorithm configuration variable, and everybody should
honor "--diff-algorithm=<choice>" command line option to override?

It sounds like a great plan.  If ort for some reason performs better
with histogram while format-patch works better with minimal, we
could extend the configuration system to make diff.<cmd>.algorithm
override the more generic diff.algorithm and do similarly silly
things.  We may want to also teach the diff plumbing commands to
ignore diff.algorithm for better reproducibility but we may need to
wait for Git 3.0 to do that.

> At least,
> that's my current plan in this area; if others think I should
> investigate things in a different order or would like to see
> additional steps planned into this journey, please do let me know.

Thanks.
Antonin Delpeuch July 6, 2024, 1:30 p.m. UTC | #7
On 06/07/2024 08:06, Junio C Hamano wrote:
> Everybody uses diff.c::diff_algorithm somehow, will be affected by
> the diff.algorithm configuration variable, and everybody should
> honor "--diff-algorithm=<choice>" command line option to override?

I have looked into writing a patch to implement this, but it looks like
the situation is quite messy. There are already half a dozen of commands
which claim to honour the diff.algorithm configuration variable, but
ignore it entirely.

In the documentation of the recursive merge strategy (for instance in
"man git-merge"), it is claimed that "recursive defaults to the
diff.algorithm config setting". As far as I can tell, both from my
reading of the code and my interactive testing, this is wrong. This
affects the "merge", "rebase" and "pull" commands, which all three
mention this configuration variable in their man page without respecting
it. Ouch!

I have looked for all commands which mention diff.algorithm in their man
page and checked whether they indeed respect it. The "diff-index",
"diff-tree" and "diff-files" commands also make this erroneous claim.
The --diff-algorithm CLI option (as well as the --histogram and
siblings) are respected, but not the diff.algorithm config variable.
Those inconsistencies seem to be caused by the inclusion of the
`diff-options.txt` file in the man pages, which leads their man pages to
documenting a bunch of config variables which are in fact ignored. From
a user perspective, I would say that those commands should indeed honour
diff.algorithm, so I would be tempted to fix the code rather than the
documentation. That being said, the man pages of those commands also
mention other options which are in fact ignored, such as
"diff.relative". I think it's likely that for some of those variables,
the contrary is desirable: remove them from the docs because they indeed
shouldn't be relied on by the command (because it does not make sense
for that particular command). I don't have (yet) a good enough
understanding of those commands and those options to judge this
immediately, but it looks like there is some cleaning up to do.

In diff.c, the code that is responsible for reading diff.* configuration
variables for commands like "log", "diff" or "diff-index" divides diff.*
variables into two categories: the ones that are "basic" (parsed by
"git_diff_basic_config") and the ones only relevant to the "ui" (parsed
by the "git_diff_ui_config" function). Surprisingly to me, the
"diff.algorithm" variable belongs to the "ui" category, and is therefore
skipped by commands such as "diff-index". It seems natural to me to move
diff.algorithm to the "basic" section. However, that alone will not fix
the problem (in my opinion much more serious) that the recursive merge
strategy ignores the variable. To fix this, the parsing of this variable
could either be added to "merge_recursive_config" (in
merge-recursive.c), or in fact directly to "git_xmerge_config" (in
xdiff-interface.c), which would then not only fix the recursive merge
strategy, but also a range of other commands such as "rerere" or
"merge-file". Given that I started looking into this with a specific
interest in "merge-file", I would obviously be tempted to fix this bug
directly at the xmerge level, and I think it would indeed make sense for
other affected commands (surely "rerere" should benefit from using merge
options that are consistent with the ones used for "git merge" or "git
rebase", no?), but I can imagine that it's too bold a step and could
have unwanted consequences I am not aware of - especially since Junio
recommended to add support for diff.algorithm at the diff.c level. What
do you think?

In any case, I would of course make sure the "ort" strategy continues to
ignore diff.algorithm for now, given its current default value.

If you want to try this out for yourself, I have set up a test
repository at https://git.kanthaus.online/antonin/testrepo. It has two
branches "master" and "theirs", which you can try to merge/rebase, for
instance "git merge -s recursive theirs" while being on master and
having "diff.algorithm" set to "histogram". The merge scenario is
designed to fail with a conflict if the myers algorithm is used, and
succeed if histogram is used. You should be able to see that "git merge
-s recursive theirs" will fail but "git merge -s recursive
-Xdiff-algorithm=histogram theirs" should work.

Best,

Antonin
Junio C Hamano July 6, 2024, 6:06 p.m. UTC | #8
Antonin Delpeuch <antonin@delpeuch.eu> writes:

> In the documentation of the recursive merge strategy (for instance in
> "man git-merge"), it is claimed that "recursive defaults to the
> diff.algorithm config setting". As far as I can tell, both from my
> reading of the code and my interactive testing, this is wrong. This
> affects the "merge", "rebase" and "pull" commands, which all three
> mention this configuration variable in their man page without respecting
> it. Ouch!

Thanks for digging.

> I have looked for all commands which mention diff.algorithm in their man
> page and checked whether they indeed respect it. The "diff-index",
> "diff-tree" and "diff-files" commands also make this erroneous claim.

It is not erroneous if we say that these 3 diff plumbing commands
ignore the configuration variable.  They should ignore end-user
configuration for reproducibility.

See also my earlier response to Elijah <xmqqed873vgn.fsf@gitster.g>
on a related topic.

> The --diff-algorithm CLI option (as well as the --histogram and
> siblings) are respected, but not the diff.algorithm config variable.

Then they are behaving exactly as designed, which is good.  We still
need to correct their documentation, though.

> Those inconsistencies seem to be caused by the inclusion of the
> `diff-options.txt` file in the man pages, which leads their man pages to
> documenting a bunch of config variables which are in fact ignored.

That is quite understandable mistake ;-) "git merge" and other
end-user facing commands should be taught to pay attention to both
the command line option and the configuration variable.  The
plumbing commands should pay attention to the command line only.

> In any case, I would of course make sure the "ort" strategy continues to
> ignore diff.algorithm for now, given its current default value.

It may make the effort easy to follow if you do this step-wise:

 (1) start with "all Porcelain commands pay attention to the same
     diff.algorithm variable.  The plumbing commands ignore
     diff.algorithm.  All commands, either Porcelain or plumbing,
     may have different default when unconfigured (like ort does)".

 (2) then add "each Porcelain command <cmd> pays attention to
     diff.<cmd>.algorithm if defined, otherwise diff.algorithm is
     used as a fallback default.  There is no diff.<cmd>.algorithm
     for plumbing commands---they are designed not to be affected by
     the configuration variables".

 (3) optionally, doing "all Porcelain commands, when not configured,
     will use the same default (ort is no longer special---everybody
     falls back to algorithm X)" may be desiable for consistency and
     simplicity, but it would probably want further discussion can
     be left outside of the topic (e.g. right now the best candidate
     for X may be histogram, but is it suitable for all commands?
     should this extend to plumbing, making diff-index for example
     to use X as the default not myers when the command line does
     not specify --diff-algorithm?)

Thanks.
diff mbox series

Patch

diff --git a/builtin/merge-file.c b/builtin/merge-file.c
index 1f987334a31..dce2676415e 100644
--- a/builtin/merge-file.c
+++ b/builtin/merge-file.c
@@ -29,6 +29,8 @@  static int label_cb(const struct option *opt, const char *arg, int unset)
 	return 0;
 }
 
+static int explicit_diff_algorithm = 0;
+
 static int set_diff_algorithm(xpparam_t *xpp,
 			      const char *alg)
 {
@@ -50,6 +52,8 @@  static int diff_algorithm_cb(const struct option *opt,
 		return error(_("option diff-algorithm accepts \"myers\", "
 			       "\"minimal\", \"patience\" and \"histogram\""));
 
+	explicit_diff_algorithm = 1;
+
 	return 0;
 }
 
@@ -103,6 +107,10 @@  int cmd_merge_file(int argc, const char **argv, const char *prefix)
 			return error_errno("failed to redirect stderr to /dev/null");
 	}
 
+	if (!explicit_diff_algorithm) {
+		warning(_("--diff-algorithm not provided, defaulting to \"myers\". This default will change to \"histogram\" in a future version."));
+	}
+
 	if (object_id)
 		setup_git_directory();
 
diff --git a/t/t6403-merge-file.sh b/t/t6403-merge-file.sh
index fb872c5a113..9d0045be955 100755
--- a/t/t6403-merge-file.sh
+++ b/t/t6403-merge-file.sh
@@ -540,8 +540,9 @@  test_expect_success 'merging C files with "myers" diff algorithm creates some sp
 	}
 	EOF
 
-	test_must_fail git merge-file -p --diff3 --diff-algorithm myers ours.c base.c theirs.c >myers_output.c &&
-	test_cmp expect.c myers_output.c
+	test_must_fail git merge-file -p --diff3 ours.c base.c theirs.c >myers_output.c 2> err &&
+	test_cmp expect.c myers_output.c &&
+	grep "diff-algorithm not provided" err
 '
 
 test_expect_success 'merging C files with "histogram" diff algorithm avoids some spurious conflicts' '