[00/14] Introduce new `git replay` command

Message ID	20230407072415.1360068-1-christian.couder@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@vger.kernel.org> From: Christian Couder <christian.couder@gmail.com> To: git@vger.kernel.org Cc: Junio C Hamano <gitster@pobox.com>, Patrick Steinhardt <ps@pks.im>, Johannes Schindelin <Johannes.Schindelin@gmx.de>, Elijah Newren <newren@gmail.com>, John Cai <johncai86@gmail.com>, Christian Couder <christian.couder@gmail.com> Subject: [PATCH 00/14] Introduce new `git replay` command Date: Fri, 7 Apr 2023 09:24:01 +0200 Message-Id: <20230407072415.1360068-1-christian.couder@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Introduce new `git replay` command \| expand [00/14] Introduce new `git replay` command [01/14] replay: introduce new builtin [02/14] replay: start using parse_options API [03/14] replay: die() instead of failing assert() [04/14] replay: introduce pick_regular_commit() [05/14] replay: don't simplify history [06/14] replay: add an important FIXME comment about gpg signing [07/14] replay: remove progress and info output [08/14] replay: remove HEAD related sanity check [09/14] replay: very coarse worktree updating [10/14] replay: make it a minimal server side command [11/14] replay: use standard revision ranges [12/14] replay: introduce guess_new_base() [13/14] replay: add different modes [14/14] replay: stop assuming replayed branches do not diverge

Message ID

20230407072415.1360068-1-christian.couder@gmail.com (mailing list archive)

Headers

From: Christian Couder <christian.couder@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>, Patrick Steinhardt <ps@pks.im>,
        Johannes Schindelin <Johannes.Schindelin@gmx.de>,
        Elijah Newren <newren@gmail.com>,
        John Cai <johncai86@gmail.com>,
        Christian Couder <christian.couder@gmail.com>
Subject: [PATCH 00/14] Introduce new `git replay` command 
Date: Fri,  7 Apr 2023 09:24:01 +0200
Message-Id: <20230407072415.1360068-1-christian.couder@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

Introduce new `git replay` command | expand

Message

Christian Couder April 7, 2023, 7:24 a.m. UTC

# Intro

`git replay` has initially been developed entirely by Elijah Newren
mostly between June and July 2022 at:

https://github.com/newren/git/commits/replay

I took over a few months ago to polish and upstream it as GitLab is
interested in replacing libgit2, and for that purpose needs a command
to do server side (so without using a worktree) rebases, cherry-picks
and reverts.

I reduced the number of commits and features in this first patch
series, compared to what Elijah already developed. Especially I
stopped short of replaying merge commits and replaying
interactively. These and other features might be upstreamed in the
future after this patch series has graduated.

Thanks to Elijah, Patrick Steinhardt and Dscho for early reviews and
discussions.

Based on ae73b2c8f1 (The seventh batch, 2023-04-04)

# Quick Overview (from Elijah)

`git replay`, at a basic level, can perhaps be thought of as a
"default-to-dry-run rebase" -- meaning no updates to the working tree,
or to the index, or to any references. However, it differs from
rebase in that it:

* Works for branches that aren't checked out
* Works in a bare repository
* Can replay multiple branches simultaneously (with or without common
history in the range being replayed)
* Preserves relative topology by default (merges are replayed too)
* Focuses on performance
* Has several altered defaults as a result of the above

I sometimes think of `git replay` as "fast-replay", a patch-based
analogue to the snapshot-based fast-export & fast-import tools.

# Reasons for diverging from cherry-pick & rebase (from Elijah)

There are multiple reasons to diverge from the defaults in cherry-pick and
rebase.

* Server side needs

* Both cherry-pick and rebase, via the sequencer, are heavily tied
to updating the working tree, index, some refs, and a lot of
control files with every commit replayed, and invoke a mess of
hooks[1] that might be hard to avoid for backward compatibility
reasons (at least, that's been brought up a few times on the
list).

* cherry-pick and rebase both fork various subprocesses
unnecessarily, but somewhat intrinsically in part to ensure the
same hooks are called that old scripted implementations would
have called.

* "Dry run" behavior, where there are no updates to worktree, index,
or even refs might be important.

* Should not assume users only want to operate on HEAD (see next
section)

* Decapitate HEAD-centric assumptions

* cherry-pick forces commits to be played on top of HEAD; inflexible.

* rebase assumes the range of commits to be replayed is
upstream..HEAD by default, though it allows one to replay
upstream..otherbranch -- but it still forcibly and needlessly
checks out otherbranch before starting to replay things.

* Assuming HEAD is involved severely limits replaying multiple
(possibly divergent) branches.

* Once you stop assuming HEAD has a certain meaning, there's not
much reason to have two separate commands anymore (except for the
funny extra not-necessarily-compatible options both have gained
over time).

* (Micro issue: Assuming HEAD is involved also makes it harder for
new users to learn what rebase means and does; it makes command
lines hard to parse. Not sure I want to harp on this too much, as
I have a suspicion I might be creating a tool for experts with
complicated use cases, but it's a minor quibble.)

* Performance

* jj is slaughtering us on rebase speed[2]. I would like us to become
competitive. (I dropped a few comments in the link at [2] about why
git is currently so bad.)

* From [3], there was a simple 4-patch series in linux.git that took
53 seconds to rebase. Switching to ort dropped it to 16 seconds.
While that sounds great, only 11 *milliseconds* were needed to do
the actual merges. That means almost *all* the time (>99%) was
overhead! Big offenders:

* --reapply-cherry-picks should be the default

* can_fast_forward() should be ripped out, and perhaps other extraneous
revision walks

* avoid updating working tree, index, refs, reflogs, and control
structures except when needed (e.g. hitting a conflict, or operation
finished)

* Other performance ideas:

* single-file control structures instead of directory of files

* avoid forking subprocesses unless explicitly requested (e.g.
--exec, --strategy, --run-hooks). For example, definitely do not
invoke `git commit` or `git merge`.

* Sanitize hooks:

* dispense with all per-commit hooks for sure (pre-commit,
post-commit, post-checkout).

* pre-rebase also seems to assume exactly 1 ref is written, and
invoking it repeatedly would be stupid. Plus, it's specific
to "rebase". So...ignore? (Stolee's --ref-update option for
rebase probably broke the pre-rebase assumptions already...)

* post-rewrite hook might make sense, but fast-import got
exempted, and I think of replay like a patch-based analogue
to the snapshot-based fast-import.

* When not running server side, resolve conflicts in a sparse-cone
sparse-index worktree to reduce number of files written to a
working tree. (See below as well)

* [High risk of possible premature optimization] Avoid large
numbers of newly created loose objects, when replaying large
numbers of commits. Two possibilities: (1) Consider using
tmp-objdir and pack objects from the tmp-objdir at end of
exercise, (2) Lift code from git-fast-import to immediately
stuff new objects into a pack?

* Multiple branches and non-checked out branches

* The ability to operate on non-checked out branches also implies
that we should generally be able to replay when in a dirty working
tree (exception being when we expect to update HEAD and any of the
dirty files is one that needs to be updated by the replay).

* Also, if we are operating locally on a non-checked out branch and
hit a conflict, we should have a way to resolve the conflict without
messing with the user's work on their current branch.

* Idea: new worktree with sparse cone + sparse index checkout,
containing only files in the root directory, and whatever is
necessary to get the conflicts

* Companion to above idea: control structures should be written to
$GIT_COMMON_DIR/replay-${worktree}, so users can have multiple
replay sessions, and so we know which worktrees are associated
with which replay operations.

- [1] https://lore.kernel.org/git/pull.749.v3.git.git.1586044818132.gitgitgadget@gmail.com/
- [2] https://github.com/martinvonz/jj/discussions/49
- [3] https://lore.kernel.org/git/CABPp-BE48=97k_3tnNqXPjSEfA163F8hoE+HY0Zvz1SWB2B8EA@mail.gmail.com/

# Important limitations

* The code die()s if there are any conflict. No resumability. No nice
output. No interactivity.

* No replaying merges, nor root commits. Only regular commits.

* Signed commits are not properly handled. It's not clear what to do
to such commits when replaying on the server side.

# Commit overview

* 1/14 replay: introduce new builtin

This creates a minimal `git replay` command by moving the code
from the `fast-rebase` test helper from `t/helper/` into
`builtin/` and doing some renames and a few other needed changes.

* - 2/14 replay: start using parse_options API
- 3/14 replay: die() instead of failing assert()
- 4/14 replay: introduce pick_regular_commit()
- 5/14 replay: don't simplify history
- 6/14 replay: add an important FIXME comment about gpg signing
- 7/14 replay: remove progress and info output
- 8/14 replay: remove HEAD related sanity check

These slowly change the command to make it behave more like a
regular commands and to start cleaning up its output.

* 9/14 replay: very coarse worktree updating

Make it handle conflicts in a very coarse way. This might not
work on bare repos, but it allows existing tests to pass and it's
nice to help cli users a bit when they get conflicts.

* 10/14 replay: make it a minimal server side command

After the cleaning up in previous ommits, it's now time to
radically change the way it works by stopping it to do ref
updates, to update the index and worktree, to consider HEAD as
special. Instead just make it output commands that should be
passed to `git update-ref --stdin`.

* - 11/14 replay: use standard revision ranges
- 12/14 replay: introduce guess_new_base()
- 13/14 replay: add different modes
- 14/14 replay: stop assuming replayed branches do not diverge

These finish the clean up and add new interesting features at
the same time, as well as related documentation and tests.

# Note about tests and documentation

Note that the `fast-rebase` test helper was used before this series in

t6429-merge-sequence-rename-caching.sh

So when `git replay` is created from `fast-rebase` in patch 1/14, this
test script is also converted to use `git replay`. This ensures that
`git replay` doesn't break too badly during the first 10 patches in
this patch series.

Tests and documentation are introduced specifically for `git replay`
only in 11/14 and later patches as it doesn't make much sense to
document and test behavior that we know is going to change soon. So
it's only when the command is crystalizing towards its final form that
we start documenting and testing it.

# Possibly controversial issues

* bare or not bare: this series works towards a command with the end
goal of it being usable and used on bare repos, contrary to existing
commands like `git rebase` and `git cherry-pick`, but the tests
currently don't check that, and in case of conflicts it won't
currently work on bare repos. One reason for that is that existing
tests in t6429 should continue to work, and one of these tests
requires some output in case of conflict. And it's nice for users to
get some help in case of conflict. It's also nice for users if
commands that should work on both bare and non bare repos work well
on non bare repos first as they are less likely to use them on bare
repos. So let's have a command that works well on non-bare repos
first, even if its end goal is to work fine on bare repos too. We
plan to improve things for bare repos soon after this first patch
series graduates.

* exit status: a successful, non-conflicted replay exits with code
0. When the replay has conflicts, the exit status is 1. If the
replay is not able to complete (or start) due to some kind of error,
the exit status is something other than 0 or 1. It has been
suggested in an internal review that conflicts might want to get a
more specific error code as an error code of 1 might be quite easy
to return by accident. It doesn't seem to me from their docs (which
might want to be improved, I didn't look at the code) that other
commands like `git merge` and `git rebase` exit with a special error
code in case of conflict.

* to guess or not to guess: commit 12/14 introduces the
guess_new_base() function which tries to find a base to rebase onto
when the --onto option is not provided, making this option actually
optional instead of mandatory. Given that it's an heuristic and the
command end goal is to be used on server side, we might want to
introduce this as an iterative improvement later. I still think it's
interesting to have it in for now though, as it shows that --onto
and --advance (which is introduced in the following commit) should
indeed be options. If --onto was always mandatory in the series,
people could argue that it shouldn't be an option and its argument
should always be the first (unconditional) argument of the command.

* make worktree and index changes optional: commit 10/14 stops
updating the index and worktree, but it might be better especially
for cli users to make that optional. The issue is that this would
make the command more complex while we are developing a number of
important features. It seems to me that this should rather be done
in an iterative improvement after the important features have
landed.

* when and where to add tests and docs: although t6429 has tests that
are changed to use the new command instead of the fast-rebase
test-tool command as soon as the former is introduced, there is no
specific test script and no doc for the new command until commit
11/14 when standard revision ranges are used. This is done to avoid
churn in tests and docs while the final form of the command hasn't
crystalized enough. Adding tests and doc at this point makes this
commit quite big and possibly more difficult to review than if they
were in separate commits though. On the other hand when tests and
docs are added in specific commits some reviewers say it would be
better to introduce them when the related changes are made.

* --advance and --contained: these two advanced options might not
belong to this first series and could perhaps be added in a followup
series in separate commits. On the other hand the code for
--contained seems involved with the code of --advance and it's nice
to see soon that git replay can indeed do cherry-picking and rebase
many refs at once, and this way fullfil these parts of its promise.

* replaying diverging branches: 14/14 the last patch in the series,
which allow replaying diverging branches, can be seen as a
fundamental fix or alternatively as adding an interesting
feature. So it's debatable if it should be in its own patch along
with its own tests as in this series, or if it should be merged into
a previous patch and which one.

* only 2 patches: this patch series can be seen from a high level
point of view as 1) introducing the new `git replay` command, and 2)
using `git replay` to replace, and get rid of, the fast-rebase
test-tool command. The fact that not much of the original
fast-rebase code and interface is left would agree with that point
of view. On the other hand, fast-rebase can also be seen as a first
iteration towards `git replay`. So it can also make sense to see how
`git replay` evolved from it.

Elijah Newren (14):
replay: introduce new builtin
replay: start using parse_options API
replay: die() instead of failing assert()
replay: introduce pick_regular_commit()
replay: don't simplify history
replay: add an important FIXME comment about gpg signing
replay: remove progress and info output
replay: remove HEAD related sanity check
replay: very coarse worktree updating
replay: make it a minimal server side command
replay: use standard revision ranges
replay: introduce guess_new_base()
replay: add different modes
replay: stop assuming replayed branches do not diverge

.gitignore | 1 +
Documentation/git-replay.txt | 130 +++++++
Makefile | 2 +-
builtin.h | 1 +
builtin/replay.c | 419 +++++++++++++++++++++++
command-list.txt | 1 +
git.c | 1 +
t/helper/test-fast-rebase.c | 233 -------------
t/helper/test-tool.c | 1 -
t/helper/test-tool.h | 1 -
t/t3650-replay-basics.sh | 160 +++++++++
t/t6429-merge-sequence-rename-caching.sh | 43 ++-
12 files changed, 739 insertions(+), 254 deletions(-)
create mode 100644 Documentation/git-replay.txt
create mode 100644 builtin/replay.c
delete mode 100644 t/helper/test-fast-rebase.c
create mode 100755 t/t3650-replay-basics.sh

Comments

Phillip Wood April 14, 2023, 10:12 a.m. UTC | #1

Hi Christian and Elijah

On 07/04/2023 08:24, Christian Couder wrote:
> # Intro
> 
> `git replay` has initially been developed entirely by Elijah Newren
> mostly between June and July 2022 at:
> 
> https://github.com/newren/git/commits/replay
> 
> I took over a few months ago to polish and upstream it as GitLab is
> interested in replacing libgit2, and for that purpose needs a command
> to do server side (so without using a worktree) rebases, cherry-picks
> and reverts.
> 
> I reduced the number of commits and features in this first patch
> series, compared to what Elijah already developed. Especially I
> stopped short of replaying merge commits and replaying
> interactively. These and other features might be upstreamed in the
> future after this patch series has graduated.
> 
> Thanks to Elijah, Patrick Steinhardt and Dscho for early reviews and
> discussions.
> 
> Based on ae73b2c8f1 (The seventh batch, 2023-04-04)

Thanks to both of you for working on this it looks very interesting. 
I've had a quick read over the patches and I ended up slightly confused 
as to exactly what the aim of this series is. My main confusion is 
whether "replay" is intended to be a plumbing command or a porcelain 
command. The use case above suggests plumbing and there are patches that 
take it in that direction by removing any diagnostic output and stopping 
it update any refs. But then it is marked as porcelain in 
command-list.txt and there are patches that do things like 
unconditionally updating the index and worktree when there are conflicts 
that stop it working in bare repositories. I've left some comments below

> # Quick Overview (from Elijah)
> 
> `git replay`, at a basic level, can perhaps be thought of as a
> "default-to-dry-run rebase" -- meaning no updates to the working tree,
> or to the index, or to any references.  However, it differs from
> rebase in that it:
> 
>    * Works for branches that aren't checked out
>    * Works in a bare repository
>    * Can replay multiple branches simultaneously (with or without common
>      history in the range being replayed)
>    * Preserves relative topology by default (merges are replayed too)
>    * Focuses on performance
>    * Has several altered defaults as a result of the above
> 
> I sometimes think of `git replay` as "fast-replay", a patch-based
> analogue to the snapshot-based fast-export & fast-import tools.
> 
> # Reasons for diverging from cherry-pick & rebase (from Elijah)
> 
> There are multiple reasons to diverge from the defaults in cherry-pick and
> rebase.
> 
> * Server side needs
> 
>    * Both cherry-pick and rebase, via the sequencer, are heavily tied
>      to updating the working tree, index, some refs, and a lot of
>      control files with every commit replayed, and invoke a mess of
>      hooks[1] that might be hard to avoid for backward compatibility
>      reasons (at least, that's been brought up a few times on the
>      list).
> 
>    * cherry-pick and rebase both fork various subprocesses
>      unnecessarily, but somewhat intrinsically in part to ensure the
>      same hooks are called that old scripted implementations would
>      have called.

To clarify, since 356ee4659bb (sequencer: try to commit without forking 
'git commit', 2017-11-24) cherry-pick and rebase do not fork 
subprocesses other than hooks for the cases covered by this patch series 
(i.e. they do not fork "git commit" for simple picks).

>    * "Dry run" behavior, where there are no updates to worktree, index,
>      or even refs might be important.
> 
>    * Should not assume users only want to operate on HEAD (see next
>      section)
> 
> * Decapitate HEAD-centric assumptions
> 
>    * cherry-pick forces commits to be played on top of HEAD; inflexible.
> 
>    * rebase assumes the range of commits to be replayed is
>      upstream..HEAD by default, though it allows one to replay
>      upstream..otherbranch -- but it still forcibly and needlessly
>      checks out otherbranch before starting to replay things.

I agree it would be nice to be able to restrict the range of commits 
replayed, especially when replaying merges. The comment about checking 
out other branch is out of date since 767a9c417eb (rebase -i: stop 
checking out the tip of the branch to rebase, 2020-01-24)

>    * Assuming HEAD is involved severely limits replaying multiple
>      (possibly divergent) branches.

I'm not sure how true this is anymore, since 89fc0b53fdb (rebase: update 
refs from 'update-ref' commands, 2022-07-19) the sequencer can update 
multiple branches. The issue with divergent branch is with command line 
arguments and the todo list generation rather than the capabilities of 
the sequencer.

>    * Once you stop assuming HEAD has a certain meaning, there's not
>      much reason to have two separate commands anymore (except for the
>      funny extra not-necessarily-compatible options both have gained
>      over time).

I agree having a unified command at the plumbing level certainly makes 
sense.

>    * (Micro issue: Assuming HEAD is involved also makes it harder for
>      new users to learn what rebase means and does; it makes command
>      lines hard to parse.

That's an interesting point, I wonder if operating on branches that are 
not checked out is potentially confusing for new user though.

>  Not sure I want to harp on this too much, as
>      I have a suspicion I might be creating a tool for experts with
>      complicated use cases, but it's a minor quibble.)
> 
> * Performance
> 
>    * jj is slaughtering us on rebase speed[2].  I would like us to become
>      competitive.  (I dropped a few comments in the link at [2] about why
>      git is currently so bad.)
> 
>    * From [3], there was a simple 4-patch series in linux.git that took
>      53 seconds to rebase.  Switching to ort dropped it to 16 seconds.
>      While that sounds great, only 11 *milliseconds* were needed to do
>      the actual merges.  That means almost *all* the time (>99%) was
>      overhead!  Big offenders:
> 
>      * --reapply-cherry-picks should be the default

I agree that can be a performance hit if there are a lot of upstream 
commits, but it is also a usability feature as it means we don't stop 
and ask the user what to do with the commits that have been upstreamed 
which wastes more of their time. I think maybe we want different 
defaults for the server use case than the user replaying commits or 
perhaps default to dropping commits that become empty.

>      * can_fast_forward() should be ripped out, and perhaps other extraneous
>        revision walks

We should look at doing that at least for the merge backend which has 
skip_unnecessary_picks(). I think it is useful to tell the user that the 
branch was not updated by the rebase though.

>      * avoid updating working tree, index, refs, reflogs, and control
>        structures except when needed (e.g. hitting a conflict, or operation
>        finished)

Not writing to disc unless we need to is sensible. Having said that for 
interactive rebases I do find having HEAD's reflog record all the picks 
useful to unpick what went wrong if mess something up.

>    * Other performance ideas:
> 
>      * single-file control structures instead of directory of files

I like the idea as it should make it easier to keep the on disc state 
consistent, but I'm not sure how much of an issue that is in practice as 
we only read/write the files once each time git is run. The bigger slow 
down is writing the author script, commit message, list of rewritten 
commits, todo list and done files with each pick.


>      * avoid forking subprocesses unless explicitly requested (e.g.
>        --exec, --strategy, --run-hooks).  For example, definitely do not
>        invoke `git commit` or `git merge`.

Good, that matches what the sequencer does for non-merge commits when 
we're not editing the commit message.

>      * Sanitize hooks:
> 
>        * dispense with all per-commit hooks for sure (pre-commit,
>          post-commit, post-checkout).

I agree we should not be running those (we don't run the pre-commit hook 
anyway). However we had a bug report when cherry-pick stopped running 
the "prepare-commit-msg" hook (see 
https://lore.kernel.org/git/CAKdAkRQuj1hfKeckjuR2oP+8C1i+ZR36O-+aRYif4ufaS_zs+w@mail.gmail.com/). 
That shouldn't matter for the server but we should bear it in mind when 
it comes to other use cases.

>        * pre-rebase also seems to assume exactly 1 ref is written, and
>          invoking it repeatedly would be stupid.  Plus, it's specific
>          to "rebase".  So...ignore?  (Stolee's --ref-update option for
>          rebase probably broke the pre-rebase assumptions already...)

If replay is a plumbing command then skipping the pre-rebase hook makes 
sense as scripts can call it themselves if they want to. For a porcelain 
command keeping a hook that can prevent it from rewriting commits  that 
are already upstream (which I think is one of the main uses of the 
pre-rebase hook) would be good.

>        * post-rewrite hook might make sense, but fast-import got
>          exempted, and I think of replay like a patch-based analogue
>          to the snapshot-based fast-import.

If we don't call the hook it would be good to have a option that outputs 
that information so scripts can request it if they want. Also we should 
think about if/when we want to update the notes associated with replayed 
commits.

>      * When not running server side, resolve conflicts in a sparse-cone
>        sparse-index worktree to reduce number of files written to a
>        working tree.  (See below as well)
>
>      * [High risk of possible premature optimization] Avoid large
>        numbers of newly created loose objects, when replaying large
>        numbers of commits.  Two possibilities: (1) Consider using
>        tmp-objdir and pack objects from the tmp-objdir at end of
>        exercise, (2) Lift code from git-fast-import to immediately
>        stuff new objects into a pack?
> 
> * Multiple branches and non-checked out branches
> 
>    * The ability to operate on non-checked out branches also implies
>      that we should generally be able to replay when in a dirty working
>      tree (exception being when we expect to update HEAD and any of the
>      dirty files is one that needs to be updated by the replay).
> 
>    * Also, if we are operating locally on a non-checked out branch and
>      hit a conflict, we should have a way to resolve the conflict without
>      messing with the user's work on their current branch.

That sounds tricky to do in a user friendly way.

>      * Idea: new worktree with sparse cone + sparse index checkout,
>        containing only files in the root directory, and whatever is
>        necessary to get the conflicts

If the user has not asked for a sparse checkout then this could be 
surprising. Sometimes I find it helpful to be able to poke about in 
other source files when resolving a conflict. I also often build and 
test after resolving a conflict which requires more than just the 
conflict to be checked out.

>      * Companion to above idea: control structures should be written to
>        $GIT_COMMON_DIR/replay-${worktree}, so users can have multiple
>        replay sessions, and so we know which worktrees are associated
>        with which replay operations.


We certainly want some way of making sure we only update a given ref in 
one replay session, and have checks to for whether the ref is checked 
out anywhere as we do now for rebase --update-refs. That seems to be 
lacking in the patches adding ref updating in this series.

>    - [1] https://lore.kernel.org/git/pull.749.v3.git.git.1586044818132.gitgitgadget@gmail.com/
>    - [2] https://github.com/martinvonz/jj/discussions/49
>    - [3] https://lore.kernel.org/git/CABPp-BE48=97k_3tnNqXPjSEfA163F8hoE+HY0Zvz1SWB2B8EA@mail.gmail.com/
> 
> # Important limitations
> 
> * The code die()s if there are any conflict. No resumability. No nice
>    output. No interactivity.

I can see that on a server you might not want any output, but if I run 
it locally it would be nice to have a message saying which paths have 
conflicts. Maybe we could add a --quiet flag for the server rather than 
removing the existing messages?

> * No replaying merges, nor root commits. Only regular commits.

That is a reasonable place to start.

> * Signed commits are not properly handled. It's not clear what to do
>    to such commits when replaying on the server side.

Yes on the server where you don't have access to the signing key there 
is not much need for replay to have a signing option.

> # Commit overview
> 
> * 1/14 replay: introduce new builtin
> 
>       This creates a minimal `git replay` command by moving the code
>       from the `fast-rebase` test helper from `t/helper/` into
>       `builtin/` and doing some renames and a few other needed changes.
> 
> * - 2/14 replay: start using parse_options API
>    - 3/14 replay: die() instead of failing assert()
>    - 4/14 replay: introduce pick_regular_commit()
>    - 5/14 replay: don't simplify history
>    - 6/14 replay: add an important FIXME comment about gpg signing
>    - 7/14 replay: remove progress and info output
>    - 8/14 replay: remove HEAD related sanity check
> 
>       These slowly change the command to make it behave more like a
>       regular commands and to start cleaning up its output.
> 
> * 9/14 replay: very coarse worktree updating
> 
>       Make it handle conflicts in a very coarse way. This might not
>       work on bare repos, but it allows existing tests to pass and it's
>       nice to help cli users a bit when they get conflicts.
> 
> * 10/14 replay: make it a minimal server side command
> 
>       After the cleaning up in previous ommits, it's now time to
>       radically change the way it works by stopping it to do ref
>       updates, to update the index and worktree, to consider HEAD as
>       special. Instead just make it output commands that should be
>       passed to `git update-ref --stdin`.
> 
> * - 11/14 replay: use standard revision ranges
>    - 12/14 replay: introduce guess_new_base()
>    - 13/14 replay: add different modes
>    - 14/14 replay: stop assuming replayed branches do not diverge
> 
>        These finish the clean up and add new interesting features at
>        the same time, as well as related documentation and tests.
> 
> # Note about tests and documentation
> 
> Note that the `fast-rebase` test helper was used before this series in
> 
> t6429-merge-sequence-rename-caching.sh
> 
> So when `git replay` is created from `fast-rebase` in patch 1/14, this
> test script is also converted to use `git replay`. This ensures that
> `git replay` doesn't break too badly during the first 10 patches in
> this patch series.
> 
> Tests and documentation are introduced specifically for `git replay`
> only in 11/14 and later patches as it doesn't make much sense to
> document and test behavior that we know is going to change soon. So
> it's only when the command is crystalizing towards its final form that
> we start documenting and testing it.
> 
> # Possibly controversial issues
> 
> * bare or not bare: this series works towards a command with the end
>    goal of it being usable and used on bare repos, contrary to existing
>    commands like `git rebase` and `git cherry-pick`, but the tests
>    currently don't check that, and in case of conflicts it won't
>    currently work on bare repos. One reason for that is that existing
>    tests in t6429 should continue to work, and one of these tests
>    requires some output in case of conflict. And it's nice for users to
>    get some help in case of conflict. It's also nice for users if
>    commands that should work on both bare and non bare repos work well
>    on non bare repos first as they are less likely to use them on bare
>    repos. So let's have a command that works well on non-bare repos
>    first, even if its end goal is to work fine on bare repos too. We
>    plan to improve things for bare repos soon after this first patch
>    series graduates.
> 
> * exit status: a successful, non-conflicted replay exits with code
>    0. When the replay has conflicts, the exit status is 1. If the
>    replay is not able to complete (or start) due to some kind of error,
>    the exit status is something other than 0 or 1. It has been
>    suggested in an internal review that conflicts might want to get a
>    more specific error code as an error code of 1 might be quite easy
>    to return by accident. It doesn't seem to me from their docs (which
>    might want to be improved, I didn't look at the code) that other
>    commands like `git merge` and `git rebase` exit with a special error
>    code in case of conflict.

I don't think we've ever had a special "conflict" error code but it 
would be useful for scripts if replay had one. Does replay return a 
different exit code for "merge conflicts" and "cannot merge because it 
would overwrite an untracked file"? Does it have an exit code for "the 
commit becomes empty" or are those patches unconditionally dropped?

> * to guess or not to guess: commit 12/14 introduces the
>    guess_new_base() function which tries to find a base to rebase onto
>    when the --onto option is not provided, making this option actually
>    optional instead of mandatory. Given that it's an heuristic and the
>    command end goal is to be used on server side, we might want to
>    introduce this as an iterative improvement later. I still think it's
>    interesting to have it in for now though, as it shows that --onto
>    and --advance (which is introduced in the following commit) should
>    indeed be options. If --onto was always mandatory in the series,
>    people could argue that it shouldn't be an option and its argument
>    should always be the first (unconditional) argument of the command.

I think it comes down to "what's the aim of this series?" is it focused 
on the bare repository server use case or is it trying to add a general 
purpose cli tool.

> * make worktree and index changes optional: commit 10/14 stops
>    updating the index and worktree, but it might be better especially
>    for cli users to make that optional. The issue is that this would
>    make the command more complex while we are developing a number of
>    important features. It seems to me that this should rather be done
>    in an iterative improvement after the important features have
>    landed.

I'm confused by this as patch 9 seems to start updating the index and 
worktree when there are conflicts but patch 10 stops updating the index 
and worktree if the replay is successful.

> * when and where to add tests and docs: although t6429 has tests that
>    are changed to use the new command instead of the fast-rebase
>    test-tool command as soon as the former is introduced, there is no
>    specific test script and no doc for the new command until commit
>    11/14 when standard revision ranges are used. This is done to avoid
>    churn in tests and docs while the final form of the command hasn't
>    crystalized enough. Adding tests and doc at this point makes this
>    commit quite big and possibly more difficult to review than if they
>    were in separate commits though. On the other hand when tests and
>    docs are added in specific commits some reviewers say it would be
>    better to introduce them when the related changes are made.
> 
> * --advance and --contained: these two advanced options might not
>    belong to this first series and could perhaps be added in a followup
>    series in separate commits. On the other hand the code for
>    --contained seems involved with the code of --advance and it's nice
>    to see soon that git replay can indeed do cherry-picking and rebase
>    many refs at once, and this way fullfil these parts of its promise.

Once I understood what these options did the names made sense, but I 
could not tell from the names what they were going to do. For me 
"--cherry-pick" and "--update-refs" would have been clearer. It might be 
worth splitting this patch so the individual options are added separately.

> * replaying diverging branches: 14/14 the last patch in the series,
>    which allow replaying diverging branches, can be seen as a
>    fundamental fix or alternatively as adding an interesting
>    feature. So it's debatable if it should be in its own patch along
>    with its own tests as in this series, or if it should be merged into
>    a previous patch and which one.

It might make sense to add this in the same patch as you add --contained.

> * only 2 patches: this patch series can be seen from a high level
>    point of view as 1) introducing the new `git replay` command, and 2)
>    using `git replay` to replace, and get rid of, the fast-rebase
>    test-tool command. The fact that not much of the original
>    fast-rebase code and interface is left would agree with that point
>    of view. On the other hand, fast-rebase can also be seen as a first
>    iteration towards `git replay`. So it can also make sense to see how
>    `git replay` evolved from it.

Starting with fast-rebase means one has to checkout the first patch to 
see what code we're adding to replay.c and to make sense of the later 
patches that remove code. It would be interesting to compare this series 
to one that started from scratch but I guess that would be quite a bit 
of work.

Thanks for working on this, I'm interested to see where it goes

Best Wishes

Phillip

> 
> Elijah Newren (14):
>    replay: introduce new builtin
>    replay: start using parse_options API
>    replay: die() instead of failing assert()
>    replay: introduce pick_regular_commit()
>    replay: don't simplify history
>    replay: add an important FIXME comment about gpg signing
>    replay: remove progress and info output
>    replay: remove HEAD related sanity check
>    replay: very coarse worktree updating
>    replay: make it a minimal server side command
>    replay: use standard revision ranges
>    replay: introduce guess_new_base()
>    replay: add different modes
>    replay: stop assuming replayed branches do not diverge
> 
>   .gitignore                               |   1 +
>   Documentation/git-replay.txt             | 130 +++++++
>   Makefile                                 |   2 +-
>   builtin.h                                |   1 +
>   builtin/replay.c                         | 419 +++++++++++++++++++++++
>   command-list.txt                         |   1 +
>   git.c                                    |   1 +
>   t/helper/test-fast-rebase.c              | 233 -------------
>   t/helper/test-tool.c                     |   1 -
>   t/helper/test-tool.h                     |   1 -
>   t/t3650-replay-basics.sh                 | 160 +++++++++
>   t/t6429-merge-sequence-rename-caching.sh |  43 ++-
>   12 files changed, 739 insertions(+), 254 deletions(-)
>   create mode 100644 Documentation/git-replay.txt
>   create mode 100644 builtin/replay.c
>   delete mode 100644 t/helper/test-fast-rebase.c
>   create mode 100755 t/t3650-replay-basics.sh
>

Felipe Contreras April 14, 2023, 5:39 p.m. UTC | #2

Christian Couder wrote:
> # Quick Overview (from Elijah)
> 
> `git replay`, at a basic level, can perhaps be thought of as a
> "default-to-dry-run rebase" -- meaning no updates to the working tree,
> or to the index, or to any references.

Interesting, I just ran into this problem trying to cleanup my personal git
branches.

Simply checking which branches can be cleanly rebased on top of master takes a
significant amount of time without any tricks, and using `git merge-tree` still
takes some time.

But the biggest offender is checking which patches have not yet been merged
into master, which takes 52 seconds on my machine which is by no means old.

> # Reasons for diverging from cherry-pick & rebase (from Elijah)
> 
> * Server side needs

I personally don't care about the server side, but...

>   * Both cherry-pick and rebase, via the sequencer, are heavily tied
>     to updating the working tree, index, some refs, and a lot of
>     control files with every commit replayed, and invoke a mess of
>     hooks[1] that might be hard to avoid for backward compatibility
>     reasons (at least, that's been brought up a few times on the
>     list).

This is important as an end user as well.

Since day 1 one of the important selling points of git was that operations that
could be done in milliseconds did take milliseconds.

If it can be done faster, why wouldn't I want it to be done faster?

> * Decapitate HEAD-centric assumptions

That's good, but not particularly important at the moment IMO.

> * Performance
> 
>   * jj is slaughtering us on rebase speed[2].  I would like us to become
>     competitive.  (I dropped a few comments in the link at [2] about why
>     git is currently so bad.)

Indeed.

>   * From [3], there was a simple 4-patch series in linux.git that took
>     53 seconds to rebase.

I did participate in that discussion, but Uwe Kleine-König never responded back.

In [1] he clearly noticed the problem was *before* attempting to apply any
patch. Other people mentioned the fork-point detection, but I don't think that
was the issue, my guess was that checking for the possibility of a fast-forward
was the issue.

The code was clearly doing the wrong thing for that case, but I believe it
should have been fixed by d42c9ffa0f (rebase: factor out branch_base
calculation, 2022-10-17).

It would be interesting to see if this issue can be reproduced somehow.

>     Switching to ort dropped it to 16 seconds.

No, it dropped to 16 seconds it for Elijah, not Uwe. Uwe (who had the real
repository) noticed a big reduction of around 70%, but the discrepancy of using
--onto versus not always remained.

>     While that sounds great, only 11 *milliseconds* were needed to do
>     the actual merges.  That means almost *all* the time (>99%) was
>     overhead!  Big offenders:
> 
>     * --reapply-cherry-picks should be the default
> 
>     * can_fast_forward() should be ripped out, and perhaps other extraneous
>       revision walks

Doesn't d42c9ffa0f (rebase: factor out branch_base calculation, 2022-10-17)
deal with that?

---

I think something like this is defeinitely needed, when I rewrote `git rebase`
to use `git cherry-pick` I noticed many areas of improvement, and I'm of the
opinion that `git rebase` should be rewritten from scratch.

But precisely because git focuses too much on backwards compatibility (and
often in the wrong areas), I think `git replay` should be thoroughly discussed
before accepting something we could quickly realize can be substantially
improved.

Cheers.

[1] https://lore.kernel.org/git/20210528214024.vw4huojcklrm6d27@pengutronix.de/

Elijah Newren April 15, 2023, 6:44 a.m. UTC | #3

On Fri, Apr 7, 2023 at 12:24 AM Christian Couder
<christian.couder@gmail.com> wrote:
>
> # Intro
>
> `git replay` has initially been developed entirely by Elijah Newren
> mostly between June and July 2022 at:

(Sidenote: actually, there was a good chunk in Jan & Feb 2022 as well,
and various design and idea work preceding that over a long time, some
in the form of the fast-rebase test-tool...)

> https://github.com/newren/git/commits/replay
>
> I took over a few months ago to polish and upstream it as GitLab is
> interested in replacing libgit2, and for that purpose needs a command
> to do server side (so without using a worktree) rebases, cherry-picks
> and reverts.
>
> I reduced the number of commits and features in this first patch
> series, compared to what Elijah already developed. Especially I
> stopped short of replaying merge commits and replaying
> interactively. These and other features might be upstreamed in the
> future after this patch series has graduated.

...and also cleaned up my commits which were in a WIP state.  Thanks!  :-)

> Thanks to Elijah, Patrick Steinhardt and Dscho for early reviews and
> discussions.
>
> Based on ae73b2c8f1 (The seventh batch, 2023-04-04)
>
> # Quick Overview (from Elijah)

In particular, this comes from the replay-design-notes.txt file on the
replay branch, up until the footnote links.

<snip>

My replay branch involved a whole bunch of aspirational ideas, work in
progress, and some things that worked.  As noted above, you've taken
this portion of the cover letter from my replay-design-notes.txt file
on that branch, but that file involved my in-progress thought
processes on all these ideas.  For this series, we should probably
just focus on the server-side usecases since other aspects just
weren't complete enough for use.  (I also thought the server-side
aspects of git-replay weren't good enough for use either because I
thought we'd need some conflict-related handling similar to what
git-merge-tree does, but both you and Dscho have said you aren't
worried about that, and are fine with just a simple non-zero exit
status.)

Anyway, deleting the forward-looking stuff and concentrating on the
server-side replaying of commits would mean that we can at least
delete the portion of this cover letter starting from here...

>   * Other performance ideas:
>
>     * single-file control structures instead of directory of files
>
>     * avoid forking subprocesses unless explicitly requested (e.g.
>       --exec, --strategy, --run-hooks).  For example, definitely do not
>       invoke `git commit` or `git merge`.
>
>     * Sanitize hooks:
>
>       * dispense with all per-commit hooks for sure (pre-commit,
>         post-commit, post-checkout).
>
>       * pre-rebase also seems to assume exactly 1 ref is written, and
>         invoking it repeatedly would be stupid.  Plus, it's specific
>         to "rebase".  So...ignore?  (Stolee's --ref-update option for
>         rebase probably broke the pre-rebase assumptions already...)
>
>       * post-rewrite hook might make sense, but fast-import got
>         exempted, and I think of replay like a patch-based analogue
>         to the snapshot-based fast-import.
>
>     * When not running server side, resolve conflicts in a sparse-cone
>       sparse-index worktree to reduce number of files written to a
>       working tree.  (See below as well)
>
>     * [High risk of possible premature optimization] Avoid large
>       numbers of newly created loose objects, when replaying large
>       numbers of commits.  Two possibilities: (1) Consider using
>       tmp-objdir and pack objects from the tmp-objdir at end of
>       exercise, (2) Lift code from git-fast-import to immediately
>       stuff new objects into a pack?
>
> * Multiple branches and non-checked out branches
>
>   * The ability to operate on non-checked out branches also implies
>     that we should generally be able to replay when in a dirty working
>     tree (exception being when we expect to update HEAD and any of the
>     dirty files is one that needs to be updated by the replay).
>
>   * Also, if we are operating locally on a non-checked out branch and
>     hit a conflict, we should have a way to resolve the conflict without
>     messing with the user's work on their current branch.
>
>     * Idea: new worktree with sparse cone + sparse index checkout,
>       containing only files in the root directory, and whatever is
>       necessary to get the conflicts
>
>     * Companion to above idea: control structures should be written to
>       $GIT_COMMON_DIR/replay-${worktree}, so users can have multiple
>       replay sessions, and so we know which worktrees are associated
>       with which replay operations.

...up to here.  Some of the other stuff could perhaps be trimmed as
well, though I suspect at least some of it is useful from the
perspective of letting others know of additional usecases we'd like to
support (so that design suggestions don't curtail those additional
future usecases).

>   - [1] https://lore.kernel.org/git/pull.749.v3.git.git.1586044818132.gitgitgadget@gmail.com/
>   - [2] https://github.com/martinvonz/jj/discussions/49
>   - [3] https://lore.kernel.org/git/CABPp-BE48=97k_3tnNqXPjSEfA163F8hoE+HY0Zvz1SWB2B8EA@mail.gmail.com/

This appears to be the end of the part you copied from replay-design-notes.txt

<snip>

> * 9/14 replay: very coarse worktree updating
>
>      Make it handle conflicts in a very coarse way. This might not
>      work on bare repos, but it allows existing tests to pass and it's
>      nice to help cli users a bit when they get conflicts.

I had the coarse working updating in git-replay mostly because it came
from fast-rebase.  I did also use it to poke and prod early ideas,
even though I knew that its current implementation might well be
incompatible with some of the other ideas I had.

I don't think this helps the server-side use though, and since it
potentially conflicts with some of the other goals, I'd be inclined to
say we should just drop this patch (and/or squash the next patch into
this one.)

> * 10/14 replay: make it a minimal server side command
>
>      After the cleaning up in previous ommits, it's now time to
>      radically change the way it works by stopping it to do ref
>      updates, to update the index and worktree, to consider HEAD as
>      special. Instead just make it output commands that should be
>      passed to `git update-ref --stdin`.

A squashed 9 & 10 would thus be similar to patch 8 in the sense that
its purpose was to get rid of something that made sense for
fast-rebase but which doesn't align with the current goals of this
command.

<snip>

> # Possibly controversial issues
>
> * bare or not bare: this series works towards a command with the end
>   goal of it being usable and used on bare repos, contrary to existing
>   commands like `git rebase` and `git cherry-pick`, but the tests
>   currently don't check that, and in case of conflicts it won't
>   currently work on bare repos. One reason for that is that existing
>   tests in t6429 should continue to work, and one of these tests
>   requires some output in case of conflict. And it's nice for users to
>   get some help in case of conflict. It's also nice for users if
>   commands that should work on both bare and non bare repos work well
>   on non bare repos first as they are less likely to use them on bare
>   repos. So let's have a command that works well on non-bare repos
>   first, even if its end goal is to work fine on bare repos too. We
>   plan to improve things for bare repos soon after this first patch
>   series graduates.

I think there's a lot of work to do for conflict handling, and if we
want to submit this for inclusion then we should just exclude conflict
handling entirely and adjust the tests accordingly.

> * exit status: a successful, non-conflicted replay exits with code
>   0. When the replay has conflicts, the exit status is 1. If the
>   replay is not able to complete (or start) due to some kind of error,
>   the exit status is something other than 0 or 1. It has been
>   suggested in an internal review that conflicts might want to get a
>   more specific error code as an error code of 1 might be quite easy
>   to return by accident. It doesn't seem to me from their docs (which
>   might want to be improved, I didn't look at the code) that other
>   commands like `git merge` and `git rebase` exit with a special error
>   code in case of conflict.

The merge backend does not provide any granularity finer than "had
conflicts" for its return code.  This is somewhat cemented by the
git-merge API as well, where it mandates that the return code of merge
backends be 1 for conflicts:

    /*
     * The backend exits with 1 when conflicts are
     * left to be resolved, with 2 when it does not
     * handle the given merge at all.
     */

Because of this, all of git-merge-resolve, git-merge-octopus,
git-merge-recursive, and git-merge-ort return 1 for conflicts.

Of course, we do not have to use the return code of the merge backend
as the exit status for `git replay`.  We could inspect the conflicts,
somewhat like git-merge-tree does, except that instead of printing
information about those conflicts, we could also let it guide us to
choose a new exit status.  I'm not sure we'd want to do that, but we
could.

Since an exit status of 1 is pretty thoroughly baked in elsewhere for
everything merge related, including `git-merge-tree`, I'd be inclined
to leave this series as-is and have an exit status of 1 for conflicts.

> * to guess or not to guess: commit 12/14 introduces the
>   guess_new_base() function which tries to find a base to rebase onto
>   when the --onto option is not provided, making this option actually
>   optional instead of mandatory. Given that it's an heuristic and the
>   command end goal is to be used on server side, we might want to
>   introduce this as an iterative improvement later. I still think it's
>   interesting to have it in for now though, as it shows that --onto
>   and --advance (which is introduced in the following commit) should
>   indeed be options. If --onto was always mandatory in the series,
>   people could argue that it shouldn't be an option and its argument
>   should always be the first (unconditional) argument of the command.

I do not want a positional argument for <onto>; rebase's use of
positional arguments is an example to avoid, IMO.

I also don't want people thinking this is just rebase.  Rebase is
perhaps used more than cherry-pick and as such a lot of the
documentation and design goals talk about it more, but I designed this
command very deliberately to be about handling both cherry-pick and
rebase functionality in the same command.

All that said, I remember spending a fair amount of time on the
heuristic, but don't remember if I felt I had gotten it good enough or
not.  If folks want to leave it out for a future series, that'd be
fine, so long as it doesn't lead people to suggestions that'd conflict
with the above points.

> * make worktree and index changes optional: commit 10/14 stops
>   updating the index and worktree, but it might be better especially
>   for cli users to make that optional. The issue is that this would
>   make the command more complex while we are developing a number of
>   important features. It seems to me that this should rather be done
>   in an iterative improvement after the important features have
>   landed.

I don't think worktree and index changes should be included in this
series, even as an option.  The related code as currently written is
incompatible with some other ideas expressed in the cover letter.
Granted, those ideas might change, even dramatically, but I don't want
us limiting the design space and I don't think the worktree or index
updates in this series are ready for users to use yet.  (And there
aren't any patches available, to my knowledge, that make them ready.
Those still need to be written.)

> * when and where to add tests and docs: although t6429 has tests that
>   are changed to use the new command instead of the fast-rebase
>   test-tool command as soon as the former is introduced, there is no
>   specific test script and no doc for the new command until commit
>   11/14 when standard revision ranges are used. This is done to avoid
>   churn in tests and docs while the final form of the command hasn't
>   crystalized enough. Adding tests and doc at this point makes this
>   commit quite big and possibly more difficult to review than if they
>   were in separate commits though. On the other hand when tests and
>   docs are added in specific commits some reviewers say it would be
>   better to introduce them when the related changes are made.

I don't have opinions on this one.  Maybe I will after reading through
this cleaned up series more carefully.

> * --advance and --contained: these two advanced options might not
>   belong to this first series and could perhaps be added in a followup
>   series in separate commits. On the other hand the code for
>   --contained seems involved with the code of --advance and it's nice
>   to see soon that git replay can indeed do cherry-picking and rebase
>   many refs at once, and this way fullfil these parts of its promise.

--advance is the one option I thought I had a chance of getting my
company to use server-side in the near term...

Also, without --advance, I worry reviewers will make rebase-centric
assumptions that preclude handling cherry-picks, so I'm really leery
of deferring it based on that.

I understand that --contained isn't as likely to be useful
server-side, but it was the one piece that I really enjoyed using
cli-side (especially since it pre-dated rebase --update-refs).  But
yeah, it's utility is somewhat limited until there's some kind of
conflict resolution mechanism provided.

> * replaying diverging branches: 14/14 the last patch in the series,
>   which allow replaying diverging branches, can be seen as a
>   fundamental fix or alternatively as adding an interesting
>   feature. So it's debatable if it should be in its own patch along
>   with its own tests as in this series, or if it should be merged into
>   a previous patch and which one.

I don't have opinions on this either.  Maybe I will after reading
through the series?

> * only 2 patches: this patch series can be seen from a high level
>   point of view as 1) introducing the new `git replay` command, and 2)
>   using `git replay` to replace, and get rid of, the fast-rebase
>   test-tool command. The fact that not much of the original
>   fast-rebase code and interface is left would agree with that point
>   of view. On the other hand, fast-rebase can also be seen as a first
>   iteration towards `git replay`. So it can also make sense to see how
>   `git replay` evolved from it.

I preferred the angle of starting from fast-rebase, but I don't feel
too strongly about it.  Mostly I just thought it'd make it easier for
me, and for the few other people who already reviewed fast-rebase (and
yes, I got review comments on it from a few different people back when
it was proposed).  If dividing up the patches some other way makes it
easier for others to reason about, and you want to switch it around,
go for it.

Elijah Newren April 15, 2023, 5:18 p.m. UTC | #4

Hi Phillip,

On Fri, Apr 14, 2023 at 3:13 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Christian and Elijah
>
> On 07/04/2023 08:24, Christian Couder wrote:
> > # Intro
> >
> > `git replay` has initially been developed entirely by Elijah Newren
> > mostly between June and July 2022 at:
> >
> > https://github.com/newren/git/commits/replay
> >
> > I took over a few months ago to polish and upstream it as GitLab is
> > interested in replacing libgit2, and for that purpose needs a command
> > to do server side (so without using a worktree) rebases, cherry-picks
> > and reverts.
> >
> > I reduced the number of commits and features in this first patch
> > series, compared to what Elijah already developed. Especially I
> > stopped short of replaying merge commits and replaying
> > interactively. These and other features might be upstreamed in the
> > future after this patch series has graduated.
> >
> > Thanks to Elijah, Patrick Steinhardt and Dscho for early reviews and
> > discussions.
> >
> > Based on ae73b2c8f1 (The seventh batch, 2023-04-04)
>
> Thanks to both of you for working on this it looks very interesting.
> I've had a quick read over the patches and I ended up slightly confused
> as to exactly what the aim of this series is. My main confusion is
> whether "replay" is intended to be a plumbing command or a porcelain
> command. The use case above suggests plumbing and there are patches that
> take it in that direction by removing any diagnostic output and stopping
> it update any refs. But then it is marked as porcelain in
> command-list.txt and there are patches that do things like
> unconditionally updating the index and worktree when there are conflicts
> that stop it working in bare repositories. I've left some comments below

Yeah, that's fair.  It was totally unclear in my work-in-progress
patches, as I had ideas taking it in both directions.  Then Christian
and Dscho came along and said they wanted to use it for server-side
rebasing & cherry-picking.  Christian did some good cleanup of my
patches, but I think your comment here is just reflecting that there's
still a bit more to do.

I'd say for now that we should expunge the worktree & index updating,
any conflict handling beyond exit status, mark it as plumbing, and
focus strictly on server-side use for now.

> > # Quick Overview (from Elijah)
> >
> > `git replay`, at a basic level, can perhaps be thought of as a
> > "default-to-dry-run rebase" -- meaning no updates to the working tree,
> > or to the index, or to any references.  However, it differs from
> > rebase in that it:
> >
> >    * Works for branches that aren't checked out
> >    * Works in a bare repository
> >    * Can replay multiple branches simultaneously (with or without common
> >      history in the range being replayed)
> >    * Preserves relative topology by default (merges are replayed too)
> >    * Focuses on performance
> >    * Has several altered defaults as a result of the above
> >
> > I sometimes think of `git replay` as "fast-replay", a patch-based
> > analogue to the snapshot-based fast-export & fast-import tools.
> >
> > # Reasons for diverging from cherry-pick & rebase (from Elijah)
> >
> > There are multiple reasons to diverge from the defaults in cherry-pick and
> > rebase.
> >
> > * Server side needs
> >
> >    * Both cherry-pick and rebase, via the sequencer, are heavily tied
> >      to updating the working tree, index, some refs, and a lot of
> >      control files with every commit replayed, and invoke a mess of
> >      hooks[1] that might be hard to avoid for backward compatibility
> >      reasons (at least, that's been brought up a few times on the
> >      list).
> >
> >    * cherry-pick and rebase both fork various subprocesses
> >      unnecessarily, but somewhat intrinsically in part to ensure the
> >      same hooks are called that old scripted implementations would
> >      have called.
>
> To clarify, since 356ee4659bb (sequencer: try to commit without forking
> 'git commit', 2017-11-24) cherry-pick and rebase do not fork
> subprocesses other than hooks for the cases covered by this patch series
> (i.e. they do not fork "git commit" for simple picks).

Thanks for the clarification.

Let me also add a clarification about this text you're responding to:
the document that Christian copied into this portion of the cover
letter wasn't talking about this particular series he submitted, but
about the broader design goals I was working on.  I don't want a
run_git_commit() function within replay.c, even for commit message
editing, for picking merge commits, or for handling staged changes
after resuming from an interactive rebase.  I'd even say I don't want
it for failure handling, either, though that one is less important.

> >    * "Dry run" behavior, where there are no updates to worktree, index,
> >      or even refs might be important.
> >
> >    * Should not assume users only want to operate on HEAD (see next
> >      section)
> >
> > * Decapitate HEAD-centric assumptions
> >
> >    * cherry-pick forces commits to be played on top of HEAD; inflexible.
> >
> >    * rebase assumes the range of commits to be replayed is
> >      upstream..HEAD by default, though it allows one to replay
> >      upstream..otherbranch -- but it still forcibly and needlessly
> >      checks out otherbranch before starting to replay things.
>
> I agree it would be nice to be able to restrict the range of commits
> replayed, especially when replaying merges. The comment about checking
> out other branch is out of date since 767a9c417eb (rebase -i: stop
> checking out the tip of the branch to rebase, 2020-01-24)

Guess how many years I've been gathering my ideas for git-replay.  ;-)

One thing to note here, though, is the checkout handling still isn't
fully fixed, and it may be impossible to fix at this point.  In
particular, `git rebase --abort` doesn't return people to the original
branch when otherbranch is specified, which is problematic in
conjunction with other options (see e.g.
https://lore.kernel.org/git/xmqqft5icsd9.fsf@gitster.c.googlers.com/).

> >    * Assuming HEAD is involved severely limits replaying multiple
> >      (possibly divergent) branches.
>
> I'm not sure how true this is anymore, since 89fc0b53fdb (rebase: update
> refs from 'update-ref' commands, 2022-07-19) the sequencer can update
> multiple branches. The issue with divergent branch is with command line
> arguments and the todo list generation rather than the capabilities of
> the sequencer.

Yeah, this particular paragraph I suspect is from about 18 months ago.
(See also where I responded to the update-refs series about this
https://lore.kernel.org/git/CABPp-BEOV53oBoBp4YjiRfksZMmAADanZUUemhxwn7Wor=m-nA@mail.gmail.com/).

However, I'd still say the statement is true, it just needs to be
focused on the divergent branches.  The use of HEAD (or more generally
exactly 3 commitish-es), as done in rebase, is going to make it really
awkward to introduce any kind of command line UI for handling rebasing
of multiple divergent branches.  I'm not sure it even makes sense.
git-replay has done it for over a year, and I think it makes sense
specifically because it got rid of the problematic assumptions
hard-wired into the command line UI.

> >    * Once you stop assuming HEAD has a certain meaning, there's not
> >      much reason to have two separate commands anymore (except for the
> >      funny extra not-necessarily-compatible options both have gained
> >      over time).
>
> I agree having a unified command at the plumbing level certainly makes
> sense.

:-)

> >    * (Micro issue: Assuming HEAD is involved also makes it harder for
> >      new users to learn what rebase means and does; it makes command
> >      lines hard to parse.
>
> That's an interesting point, I wonder if operating on branches that are
> not checked out is potentially confusing for new user though.

Yes, absolutely.  Especially since I'm very focused on having
operations be O(changes), not O(size-of-repo).  The amount of work
people go through to avoid backporting because checking out other
branches is too expensive is impressive.  But designing things for the
user that needs scaling to big repos isn't always the friendliest to
new users.  And I admittedly am focused on performance above other
aspects, and in some cases that might leave new users behind.

> >  Not sure I want to harp on this too much, as
> >      I have a suspicion I might be creating a tool for experts with
> >      complicated use cases, but it's a minor quibble.)

Oh, heh.  I guess I knew at the time I wrote this that people might
point out ways I was not being newbie friendly.  :-)

> > * Performance
> >
> >    * jj is slaughtering us on rebase speed[2].  I would like us to become
> >      competitive.  (I dropped a few comments in the link at [2] about why
> >      git is currently so bad.)
> >
> >    * From [3], there was a simple 4-patch series in linux.git that took
> >      53 seconds to rebase.  Switching to ort dropped it to 16 seconds.
> >      While that sounds great, only 11 *milliseconds* were needed to do
> >      the actual merges.  That means almost *all* the time (>99%) was
> >      overhead!  Big offenders:
> >
> >      * --reapply-cherry-picks should be the default
>
> I agree that can be a performance hit if there are a lot of upstream
> commits, but it is also a usability feature as it means we don't stop
> and ask the user what to do with the commits that have been upstreamed
> which wastes more of their time. I think maybe we want different
> defaults for the server use case than the user replaying commits or
> perhaps default to dropping commits that become empty.

Oh, sure, there are people who will like different options and
configurations.  Non-performant features can certainly be options, I
was mostly harping on the defaults for git-replay.  It should default
to fast, even if/when it moves beyond server-side only.

> >      * can_fast_forward() should be ripped out, and perhaps other extraneous
> >        revision walks
>
> We should look at doing that at least for the merge backend which has
> skip_unnecessary_picks(). I think it is useful to tell the user that the
> branch was not updated by the rebase though.

Which could be done as a very simple post-operation check: do pre- and
post- hashes match?  No need for a bunch of logic to check what's
going to happen that'll result in slow startup in large repositories.
Default should be fast and scale to large repos.

> >      * avoid updating working tree, index, refs, reflogs, and control
> >        structures except when needed (e.g. hitting a conflict, or operation
> >        finished)
>
> Not writing to disc unless we need to is sensible. Having said that for
> interactive rebases I do find having HEAD's reflog record all the picks
> useful to unpick what went wrong if mess something up.

Adding an option for writing HEAD's reflog at every step along the way
might be reasonable.  But I specifically was designing a new command
with speed as priority number 1.  Updating some external file with
every commit thus should not be the default for this new command.  I
know that's incompatible with how git-rebase works, but this isn't
rebase.

Actually, there's a second problem as well.  git-replay shouldn't
update _anything_ (other than adding new un-referenced
blobs/trees/commits), until it is entirely done.  In fact, even when
it's done, it doesn't update any refs by default; it simply prints out
some "update-ref" instructions for `git update-ref --stdin` (though I
was thinking of adding a flag that would make git-replay also do the
ref updates, as the very final thing the program did).

And there's a third problem.  HEAD isn't being updated in many cases,
so why should its reflog be?

> >    * Other performance ideas:
> >
> >      * single-file control structures instead of directory of files
>
> I like the idea as it should make it easier to keep the on disc state
> consistent, but I'm not sure how much of an issue that is in practice as
> we only read/write the files once each time git is run. The bigger slow
> down is writing the author script, commit message, list of rewritten
> commits, todo list and done files with each pick.

The author script, commit message, list of rewritten commits, todo
list and done files are all control structures that are part of that
directory of files I was complaining about.  As far as I can tell, you
said it's not an issue, but then immediately said the opposite.  What
did you take the "directory of files" control structures to mean?
...and is there some other way I could word this that would be better?

Also, if we speed up the other stuff enough, the writing of the
control files becomes much more important.  See the example elsewhere
where I mentioned that for a cherry-pick of a single commit that took
a minute under git-rebase, that actually creating the new toplevel
tree only required like 11 milliseconds.  If we can speed things up
dramatically enough, then all the stuff that used to just not matter,
suddenly starts mattering a lot.  (This happened a lot when I was
rewriting the merge machinery and optimizing it.)  And, I think that
writing one file is going to be faster than writing N files
(especially on spinny disks, and yes my laptop has a spinny disk).

Also, single file control structures are easier to make safe against
an ill-timed Ctrl-C, though I suspect that's so rarely a problem that
it may not be worth worrying much about.

> >      * avoid forking subprocesses unless explicitly requested (e.g.
> >        --exec, --strategy, --run-hooks).  For example, definitely do not
> >        invoke `git commit` or `git merge`.
>
> Good, that matches what the sequencer does for non-merge commits when
> we're not editing the commit message.

...and when not committing staged changes after a stopped interactive
rebase, and when not trying to ensure identical error messages when an
issue is hit.  :-)

This is good; sequencer is always improving.  If it could just stop
calling git commit in these four extra cases, we'd be rid of the
git-commit forking for good.

> >      * Sanitize hooks:
> >
> >        * dispense with all per-commit hooks for sure (pre-commit,
> >          post-commit, post-checkout).
>
> I agree we should not be running those (we don't run the pre-commit hook
> anyway).

...unless picking a merge commit or editing the commit message or...  :-)

> However we had a bug report when cherry-pick stopped running
> the "prepare-commit-msg" hook (see
> https://lore.kernel.org/git/CAKdAkRQuj1hfKeckjuR2oP+8C1i+ZR36O-+aRYif4ufaS_zs+w@mail.gmail.com/).
> That shouldn't matter for the server but we should bear it in mind when
> it comes to other use cases.

git-fast-export and git-fast-import do not run hooks, in part for
speed.  I don't think git-fast-patch, i.e. git-replay should either,
at least not by default.  I'm well aware that it's impossible to make
that kind of change to git-rebase, but I'm not trying to design
another rebase.

Another thing that might horrify you that I don't think I wrote down
is that when interactivity is added, I don't want replay to pre-parse
and validate the script, much like fast-import doesn't pre-parse and
validate its input.  Not validating would be anathema to how rebase
operates (because rebase muddies things as it goes -- the worktree and
the index AND refs AND reflogs, and it's slow, so you don't want to
start down that path until you're confident that people didn't make
simple mistakes).  As such, this is a change that I'd never even
consider suggesting for rebase, but it's exactly the kind of thing I
want for replay.

> >        * pre-rebase also seems to assume exactly 1 ref is written, and
> >          invoking it repeatedly would be stupid.  Plus, it's specific
> >          to "rebase".  So...ignore?  (Stolee's --ref-update option for
> >          rebase probably broke the pre-rebase assumptions already...)
>
> If replay is a plumbing command then skipping the pre-rebase hook makes
> sense as scripts can call it themselves if they want to. For a porcelain
> command keeping a hook that can prevent it from rewriting commits  that
> are already upstream (which I think is one of the main uses of the
> pre-rebase hook) would be good.

pre-rebase has a broken design; it already doesn't work correctly with
Stolee's --ref-update option.  So, even if we wanted to support this
kind of functionality, it certainly wouldn't be that script.

Also, the name is pre-rebase, not pre-cherry-pick, and doesn't apply
to both commands.  That means there's a second reason the design is
broken.  So, even if we wanted to support this kind of functionality,
it'd have to be a differently named script with a different design.

But, perhaps more importantly than either of the above, is that replay
doesn't change any refs, so why do we need a hook to prevent it from
changing refs?

> >        * post-rewrite hook might make sense, but fast-import got
> >          exempted, and I think of replay like a patch-based analogue
> >          to the snapshot-based fast-import.
>
> If we don't call the hook it would be good to have a option that outputs
> that information so scripts can request it if they want.

Good point, although the fact that replay doesn't update refs probably
means I didn't even need to mention these hooks.

> Also we should
> think about if/when we want to update the notes associated with replayed
> commits.

_Very_ good point.  There should definitely be an option for that,
maybe even a config option.

> >      * When not running server side, resolve conflicts in a sparse-cone
> >        sparse-index worktree to reduce number of files written to a
> >        working tree.  (See below as well)
> >
> >      * [High risk of possible premature optimization] Avoid large
> >        numbers of newly created loose objects, when replaying large
> >        numbers of commits.  Two possibilities: (1) Consider using
> >        tmp-objdir and pack objects from the tmp-objdir at end of
> >        exercise, (2) Lift code from git-fast-import to immediately
> >        stuff new objects into a pack?
> >
> > * Multiple branches and non-checked out branches
> >
> >    * The ability to operate on non-checked out branches also implies
> >      that we should generally be able to replay when in a dirty working
> >      tree (exception being when we expect to update HEAD and any of the
> >      dirty files is one that needs to be updated by the replay).
> >
> >    * Also, if we are operating locally on a non-checked out branch and
> >      hit a conflict, we should have a way to resolve the conflict without
> >      messing with the user's work on their current branch.

This stuff really shouldn't have been in the cover letter for this
series, but since it is...

> That sounds tricky to do in a user friendly way.

Yep, totally agreed.

But, I really, really want a command that operates in O(changes)
rather than O(repo size), and which can operate on non-checked-out
branches, and even if the user has a dirty working tree.

There are other choices too -- e.g. if replay is fast enough, just
fail if the user has a dirty working tree and tell them to re-issue
the command after first cleaning it up, allowing us to use their
working tree for conflict resolution.  However, the level of IDE
thrashing that happens when people check out other branches (and the
subsequent loss of productivity they experience after resolving the
simple conflict and switching back) means I'm not willing to have the
only option be resolving conflicts in the current working tree.  I
think something very different than how git-rebase and git-cherry-pick
behave currently are needed, at least as an option.

> >      * Idea: new worktree with sparse cone + sparse index checkout,
> >        containing only files in the root directory, and whatever is
> >        necessary to get the conflicts
>
> If the user has not asked for a sparse checkout then this could be
> surprising. Sometimes I find it helpful to be able to poke about in
> other source files when resolving a conflict. I also often build and
> test after resolving a conflict which requires more than just the
> conflict to be checked out.

Sure, some people may need an option to have conflicts be resolved in
the current working tree, or perhaps even in a separate full tree.
But both are such performance killers that an O(changes) option that
doesn't mess with the current working tree is very much needed for
some usecases.

But again, that's **way** outside the scope of this series.

> >      * Companion to above idea: control structures should be written to
> >        $GIT_COMMON_DIR/replay-${worktree}, so users can have multiple
> >        replay sessions, and so we know which worktrees are associated
> >        with which replay operations.
>
> We certainly want some way of making sure we only update a given ref in
> one replay session, and have checks to for whether the ref is checked
> out anywhere as we do now for rebase --update-refs. That seems to be
> lacking in the patches adding ref updating in this series.

Yeah, that was in patch 9.  It should be dropped or have patch 10
squashed into it.

At that point, this series won't contain any ref updating, but will
only have instructions that can be passed to `git update-ref --stdin`.
At that point, no such checks will be necessary for this series (as it
all belongs in `git update-ref --stdin` instead).

> > * The code die()s if there are any conflict. No resumability. No nice
> >    output. No interactivity.
>
> I can see that on a server you might not want any output, but if I run
> it locally it would be nice to have a message saying which paths have
> conflicts. Maybe we could add a --quiet flag for the server rather than
> removing the existing messages?

I think the choice of what to do for conflicts becomes really complex,
really quick.  Just look at git-merge-tree, which is a much simpler
case, and those kinds of questions blocked the series from progressing
for like half a year last year.

I don't think we're ready to answer these questions.  In fact, that's
precisely why I never submitted this series, even though I've had the
bits that Christian is pushing in a working state for about 9 months
now.

I think we should defer all conflict handling other than exit status
until later (or continue deferring this series until conflict handling
is ready, but that may be a long time).

> > * Signed commits are not properly handled. It's not clear what to do
> >    to such commits when replaying on the server side.
>
> Yes on the server where you don't have access to the signing key there
> is not much need for replay to have a signing option.

And on the client side, it's the one place where I'm almost certainly
going to be forced to accept forking a subprocess, though rest assured
that I'll grumble about it incessantly.  :-)

> > * exit status: a successful, non-conflicted replay exits with code
> >    0. When the replay has conflicts, the exit status is 1. If the
> >    replay is not able to complete (or start) due to some kind of error,
> >    the exit status is something other than 0 or 1. It has been
> >    suggested in an internal review that conflicts might want to get a
> >    more specific error code as an error code of 1 might be quite easy
> >    to return by accident. It doesn't seem to me from their docs (which
> >    might want to be improved, I didn't look at the code) that other
> >    commands like `git merge` and `git rebase` exit with a special error
> >    code in case of conflict.
>
> I don't think we've ever had a special "conflict" error code but it
> would be useful for scripts if replay had one. Does replay return a
> different exit code for "merge conflicts" and "cannot merge because it
> would overwrite an untracked file"? Does it have an exit code for "the
> commit becomes empty" or are those patches unconditionally dropped?

I'm not sure what Christian meant here, but as you say there's
certainly no different special codes in the merge machinery.

> > * --advance and --contained: these two advanced options might not
> >    belong to this first series and could perhaps be added in a followup
> >    series in separate commits. On the other hand the code for
> >    --contained seems involved with the code of --advance and it's nice
> >    to see soon that git replay can indeed do cherry-picking and rebase
> >    many refs at once, and this way fullfil these parts of its promise.
>
> Once I understood what these options did the names made sense, but I
> could not tell from the names what they were going to do. For me
> "--cherry-pick" and "--update-refs" would have been clearer. It might be
> worth splitting this patch so the individual options are added separately.

The output of `git replay`, with or without these options, is meant to
be passed as input to `git update-refs --stdin`.  As such, naming an
option --update-refs for this command just seems inherently confusing.

I also feel like --update-refs is a name coded more around
implementation details than end-user description of behavior (and as
far as implementation details are concerned, this options is not tied
to an interactive backend the way the --update-refs options in rebase
is).

In contrast, --cherry-pick might be reasonable.  But I worry it'll be
read backwards by people.  I want:

   git replay --OPTION branch upstream..commit

to make sense and I worry that OPTION==cherry-pick will cause users to
think `branch` is being cherry-picked rather than `upstream..commit`.
What this command does is advance `branch` by cherry-picking
`upstream..commit` onto it.

> > * only 2 patches: this patch series can be seen from a high level
> >    point of view as 1) introducing the new `git replay` command, and 2)
> >    using `git replay` to replace, and get rid of, the fast-rebase
> >    test-tool command. The fact that not much of the original
> >    fast-rebase code and interface is left would agree with that point
> >    of view. On the other hand, fast-rebase can also be seen as a first
> >    iteration towards `git replay`. So it can also make sense to see how
> >    `git replay` evolved from it.
>
> Starting with fast-rebase means one has to checkout the first patch to
> see what code we're adding to replay.c and to make sense of the later
> patches that remove code. It would be interesting to compare this series
> to one that started from scratch but I guess that would be quite a bit
> of work.

Yeah, I was probably wrong to suggest to Christian to use fast-rebase
as the starting point.  That is how I started my series, and multiple
people had reviewed it and it seemed like it'd be useful to leverage
that past review, but the reviews of these series so far suggest I may
have made the wrong recommendation to Christian.