diff mbox series

[v3,6/7] merge: ensure we can actually restore pre-merge state

Message ID 887967c1f3fd6f03cf1d0bb3c19ed16819541092.1658391391.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Fix merge restore state | expand

Commit Message

Elijah Newren July 21, 2022, 8:16 a.m. UTC
From: Elijah Newren <newren@gmail.com>

Merge strategies can:
  * succeed with a clean merge
  * succeed with a conflicted merge
  * fail to handle the given type of merge

If one is thinking in terms of automatic mergeability, they would use
the word "fail" instead of "succeed" for the second bullet, but I am
focusing here on ability of the merge strategy to handle the given
inputs, not on whether the given inputs are mergeable.  The third
category is about the merge strategy failing to know how to handle the
given data; examples include:

  * Passing more than 2 branches to 'recursive' or 'ort'
  * Passing 2 or fewer branches to 'octopus'
  * Trying to do more complicated merges with 'resolve' (I believe
    directory/file conflicts will cause it to bail.)
  * Octopus running into a merge conflict for any branch OTHER than
    the final one (see the "exit 2" codepath of commit 98efc8f3d8
    ("octopus: allow manual resolve on the last round.", 2006-01-13))

That final one is particularly interesting, because it shows that the
merge strategy can muck with the index and working tree, and THEN bail
and say "sorry, this strategy cannot handle this type of merge; use
something else".

Further, we do not currently expect the individual strategies to clean
up after themselves, but instead expect builtin/merge.c to do so.  For
it to be able to, it needs to save the state before trying the merge
strategy so it can have something to restore to.  Therefore, remove the
shortcut bypassing the save_state() call.

There is another bug on the restore_state() side of things, so no
testcase will be added until the next commit when we have addressed that
issue as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/merge.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Junio C Hamano July 21, 2022, 4:31 p.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/builtin/merge.c b/builtin/merge.c
> index f807bf335bd..11bb4bab0a1 100644
> --- a/builtin/merge.c
> +++ b/builtin/merge.c
> @@ -1686,12 +1686,12 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>  	 * tree in the index -- this means that the index must be in
>  	 * sync with the head commit.  The strategies are responsible
>  	 * to ensure this.
> +	 *
> +	 * Stash away the local changes so that we can try more than one
> +	 * and/or recover from merge strategies bailing while leaving the
> +	 * index and working tree polluted.
>  	 */

Makes sense.  We may want to special-case strategies that are known
not to have the buggy "leave contaminated tree when bailing out"
behaviour to avoid waste.  I expect that more than 99.99% of the
time people are feeding a single other commit to ort or recursive,
and if these are known to be safe, a lot will be saved by not saving
"just in case".  But that can be left for later, after the series
solidifies.

Thanks.

> -	if (use_strategies_nr == 1 ||
> -	    /*
> -	     * Stash away the local changes so that we can try more than one.
> -	     */
> -	    save_state(&stash))
> +	if (save_state(&stash))
>  		oidclr(&stash);
>  
>  	for (i = 0; !merge_was_ok && i < use_strategies_nr; i++) {
Ben Humphreys March 2, 2023, 7:17 a.m. UTC | #2
Hi Junio / Elijah,

On Thu, Jul 21, 2022 at 09:31:44AM -0700, Junio C Hamano wrote:
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > diff --git a/builtin/merge.c b/builtin/merge.c
> > index f807bf335bd..11bb4bab0a1 100644
> > --- a/builtin/merge.c
> > +++ b/builtin/merge.c
> > @@ -1686,12 +1686,12 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
> >  	 * tree in the index -- this means that the index must be in
> >  	 * sync with the head commit.  The strategies are responsible
> >  	 * to ensure this.
> > +	 *
> > +	 * Stash away the local changes so that we can try more than one
> > +	 * and/or recover from merge strategies bailing while leaving the
> > +	 * index and working tree polluted.
> >  	 */
> 
> Makes sense.  We may want to special-case strategies that are known
> not to have the buggy "leave contaminated tree when bailing out"
> behaviour to avoid waste.  I expect that more than 99.99% of the
> time people are feeding a single other commit to ort or recursive,
> and if these are known to be safe, a lot will be saved by not saving
> "just in case".  But that can be left for later, after the series
> solidifies.

This may stretch your memory a bit since the above was many months ago,
but I'm wondering if you know of any effort since to build the above
described optimisations?

We've seen when Git 2.38.0 (which introduced this change) is used with
Bitbucket Server it results in a severe performance regression due to an
sharp increase in disk and CPU load. Our code that tests the mergeability
of a pull request is one such affected codepath.

If there isn't any existing efforts to build the optimisations you
mention above I will have a shot at it.

> > -	if (use_strategies_nr == 1 ||
> > -	    /*
> > -	     * Stash away the local changes so that we can try more than one.
> > -	     */
> > -	    save_state(&stash))
> > +	if (save_state(&stash))
> >  		oidclr(&stash);
> >  
> >  	for (i = 0; !merge_was_ok && i < use_strategies_nr; i++) {

Best Regards,
Ben Humphreys
Elijah Newren March 2, 2023, 3:35 p.m. UTC | #3
Hi Ben,

On Wed, Mar 1, 2023 at 11:17 PM Ben Humphreys <behumphreys@atlassian.com> wrote:
>
> Hi Junio / Elijah,
>
> On Thu, Jul 21, 2022 at 09:31:44AM -0700, Junio C Hamano wrote:
> > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> > > diff --git a/builtin/merge.c b/builtin/merge.c
> > > index f807bf335bd..11bb4bab0a1 100644
> > > --- a/builtin/merge.c
> > > +++ b/builtin/merge.c
> > > @@ -1686,12 +1686,12 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
> > >      * tree in the index -- this means that the index must be in
> > >      * sync with the head commit.  The strategies are responsible
> > >      * to ensure this.
> > > +    *
> > > +    * Stash away the local changes so that we can try more than one
> > > +    * and/or recover from merge strategies bailing while leaving the
> > > +    * index and working tree polluted.
> > >      */
> >
> > Makes sense.  We may want to special-case strategies that are known
> > not to have the buggy "leave contaminated tree when bailing out"
> > behaviour to avoid waste.  I expect that more than 99.99% of the
> > time people are feeding a single other commit to ort or recursive,
> > and if these are known to be safe, a lot will be saved by not saving
> > "just in case".  But that can be left for later, after the series
> > solidifies.
>
> This may stretch your memory a bit since the above was many months ago,
> but I'm wondering if you know of any effort since to build the above
> described optimisations?
>
> We've seen when Git 2.38.0 (which introduced this change) is used with
> Bitbucket Server it results in a severe performance regression due to an
> sharp increase in disk and CPU load. Our code that tests the mergeability
> of a pull request is one such affected codepath.
>
> If there isn't any existing efforts to build the optimisations you
> mention above I will have a shot at it.

I've got bad news for you and great news for you.

The bad news: there have not yet been any efforts to build these
optimizations mentioned above.

The great news: the fact that this affects you means you are using
non-bare clones in your mergeability checks, and being forced with
every merge to first checkout the appropriate branch, and pay for the
penalty of updating both the index and the working tree both in that
checkout and during the merge (and perhaps in doing a hard reset
afterwards) in your mergeability check, despite the fact that a
mergeability check really only needs a boolean: "does it merge
cleanly?".  Doing a full worktree-tied merge like this is really
expensive, and while the above Git changes may have made it even more
expensive for you, the real savings comes from switching to a bare
clone and not writing any working tree files or the index.  That's
available via running `git merge-tree`; see the documentation for the
--write-tree option in particular.  GitHub switched over to it last
year and GitLab should be switching soon (or may have already
completed it; I haven't checked in a bit).

You are, of course, more than welcome to build the optimizations Junio
alludes to.  It'd help out various end users.  But for improving
server side operations, I think switching to `git merge-tree` would
provide you _much_ bigger benefits.


Hope that helps,
Elijah
Junio C Hamano March 2, 2023, 4:19 p.m. UTC | #4
Elijah Newren <newren@gmail.com> writes:

> The great news: the fact that this affects you means you are using
> non-bare clones in your mergeability checks, and being forced with
> every merge to first checkout the appropriate branch, and pay for the
> penalty of updating both the index and the working tree both in that
> checkout and during the merge (and perhaps in doing a hard reset
> afterwards) in your mergeability check, despite the fact that a
> mergeability check really only needs a boolean: "does it merge
> cleanly?".  Doing a full worktree-tied merge like this is really
> expensive, and while the above Git changes may have made it even more
> expensive for you, the real savings comes from switching to a bare
> clone and not writing any working tree files or the index.  That's
> available via running `git merge-tree`; see the documentation for the
> --write-tree option in particular.  GitHub switched over to it last
> year and GitLab should be switching soon (or may have already
> completed it; I haven't checked in a bit).

Nice.
Rudy Rigot March 4, 2023, 4:18 p.m. UTC | #5
> I think switching to `git merge-tree` would provide you _much_ bigger benefits.

Supporting evidence: we have an automerge pipeline running
on our gigantic monolith at Salesforce to keep some shared
branches automatically up-to-date with their upstream. We first
built it with a version of Git older than 2.38, and we were doing
checkouts and running good old 'git merge' and 'git push'
commands. We didn't measure since it wasn't in production yet,
but from memory, each merge could take maybe 30s to a minute,
the checkouts could be really heavy sometimes. And of course,
after our initial fetch (which we still do of course), there was also an
incompressible multi-minute initial checkout.

We updated our pipeline to run Git 2.39, and we switched to doing
no checkout and running 'git merge-tree', followed by 'git commit-tree',
and then 'git update-ref'. Now, each of those is about 5 seconds,
including re-fetching the branches before starting in case they very
recently changed. It made a gigantic difference to us, thanks a lot
to everyone who worked on this.
Ben Humphreys March 6, 2023, 10:19 p.m. UTC | #6
On Thu, Mar 02, 2023 at 07:35:30AM -0800, Elijah Newren wrote:
> I've got bad news for you and great news for you.
> 
> The bad news: there have not yet been any efforts to build these
> optimizations mentioned above.
> 
> The great news: the fact that this affects you means you are using
> non-bare clones in your mergeability checks, and being forced with
> every merge to first checkout the appropriate branch, and pay for the
> penalty of updating both the index and the working tree both in that
> checkout and during the merge (and perhaps in doing a hard reset
> afterwards) in your mergeability check, despite the fact that a
> mergeability check really only needs a boolean: "does it merge
> cleanly?".  Doing a full worktree-tied merge like this is really
> expensive, and while the above Git changes may have made it even more
> expensive for you, the real savings comes from switching to a bare
> clone and not writing any working tree files or the index.  That's
> available via running `git merge-tree`; see the documentation for the
> --write-tree option in particular.  GitHub switched over to it last
> year and GitLab should be switching soon (or may have already
> completed it; I haven't checked in a bit).
> 
> You are, of course, more than welcome to build the optimizations Junio
> alludes to.  It'd help out various end users.  But for improving
> server side operations, I think switching to `git merge-tree` would
> provide you _much_ bigger benefits.

Many thanks for the detailed reply Elijah; indeed the good news
outweighs the bad news! I've started migrating to merge-tree and it
looks great. Once complete I might take a look at the other
optimizations anyway, as a fun project.

Thanks again!

Best Regards,
Ben Humphreys
diff mbox series

Patch

diff --git a/builtin/merge.c b/builtin/merge.c
index f807bf335bd..11bb4bab0a1 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1686,12 +1686,12 @@  int cmd_merge(int argc, const char **argv, const char *prefix)
 	 * tree in the index -- this means that the index must be in
 	 * sync with the head commit.  The strategies are responsible
 	 * to ensure this.
+	 *
+	 * Stash away the local changes so that we can try more than one
+	 * and/or recover from merge strategies bailing while leaving the
+	 * index and working tree polluted.
 	 */
-	if (use_strategies_nr == 1 ||
-	    /*
-	     * Stash away the local changes so that we can try more than one.
-	     */
-	    save_state(&stash))
+	if (save_state(&stash))
 		oidclr(&stash);
 
 	for (i = 0; !merge_was_ok && i < use_strategies_nr; i++) {