diff mbox series

rebase --merge: optionally skip upstreamed commits

Message ID 20200309205523.121319-1-jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series rebase --merge: optionally skip upstreamed commits | expand

Commit Message

Jonathan Tan March 9, 2020, 8:55 p.m. UTC
When rebasing against an upstream that has had many commits since the
original branch was created:

 O -- O -- ... -- O -- O (upstream)
  \
   -- O (my-dev-branch)

because "git rebase" attempts to exclude commits that are duplicates of
upstream ones, it must read the contents of every novel upstream commit,
in addition to the tip of the upstream and the merge base. This can be a
significant performance hit, especially in a partial clone, wherein a
read of an object may end up being a fetch.

Add a flag to "git rebase" to allow suppression of this feature. This
flag only works when using the "merge" backend.

This flag changes the behavior of sequencer_make_script(), called from
do_interactive_rebase() <- run_rebase_interactive() <-
run_specific_rebase() <- cmd_rebase(). With this flag, limit_list()
(indirectly called from sequencer_make_script() through
prepare_revision_walk()) will no longer call cherry_pick_list(), and
thus PATCHSAME is no longer set. Refraining from setting PATCHSAME both
means that the intermediate commits in upstream are no longer read (as
shown by the test) and means that no PATCHSAME-caused skipping of
commits is done by sequencer_make_script(), either directly or through
make_script_with_merges().

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
More improvements for partial clone, but this is a benefit for
non-partial-clone as well, hence the way I wrote the commit message (not
focusing too much on partial clone) and the documentation.

I've chosen --skip-already-present and --no-skip-already-present to
reuse the language already existing in the documentation and to avoid a
double negative (e.g. --avoid-checking-if-already-present and
--no-avoid-checking-if-already-present) but this causes some clumsiness
in the documentation and in the code. Any suggestions for the name are
welcome.

I've only implemented this for the "merge" backend since I think that
there is an effort to migrate "rebase" to use the "merge" backend by
default, and also because "merge" uses diff internally which already has
the (per-commit) blob batch prefetching.
---
 Documentation/git-rebase.txt | 12 +++++-
 builtin/rebase.c             | 10 ++++-
 sequencer.c                  |  3 +-
 sequencer.h                  |  2 +-
 t/t3402-rebase-merge.sh      | 77 ++++++++++++++++++++++++++++++++++++
 5 files changed, 100 insertions(+), 4 deletions(-)

Comments

Taylor Blau March 10, 2020, 2:10 a.m. UTC | #1
Hi Jonathan,

This patch makes good sense to me. I left a few notes below, but they
are relatively minor, and this seems to be all in a good direction.

As a (somewhat) interesting aside, this feature would be useful to me
outside of partial clones, since I often have this workflow in my local
development wherein 'git rebase' spends quite a bit of time comparing
patches on my branch to everything new upstream.

On Mon, Mar 09, 2020 at 01:55:23PM -0700, Jonathan Tan wrote:
> When rebasing against an upstream that has had many commits since the
> original branch was created:
>
>  O -- O -- ... -- O -- O (upstream)
>   \
>    -- O (my-dev-branch)
>
> because "git rebase" attempts to exclude commits that are duplicates of
> upstream ones, it must read the contents of every novel upstream commit,
> in addition to the tip of the upstream and the merge base.

This sentence is a little confusing if you skip over the graph, since it
reads: "When rebasing against an because ... because ...". It may be
clearer if you swap the order of the last two clauses to instead be:

  it must read the contents of every novel upstream commit, in addition to
  the tip of the upstream and the merge base, because "git rebase"
  attempts to exclude commits that are duplicates of upstream ones.

> This can be a significant performance hit, especially in a partial
> clone, wherein a read of an object may end up being a fetch.
>
> Add a flag to "git rebase" to allow suppression of this feature. This
> flag only works when using the "merge" backend.
>
> This flag changes the behavior of sequencer_make_script(), called from
> do_interactive_rebase() <- run_rebase_interactive() <-
> run_specific_rebase() <- cmd_rebase(). With this flag, limit_list()
> (indirectly called from sequencer_make_script() through
> prepare_revision_walk()) will no longer call cherry_pick_list(), and
> thus PATCHSAME is no longer set. Refraining from setting PATCHSAME both
> means that the intermediate commits in upstream are no longer read (as
> shown by the test) and means that no PATCHSAME-caused skipping of
> commits is done by sequencer_make_script(), either directly or through
> make_script_with_merges().

This all sounds good to me.

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> More improvements for partial clone, but this is a benefit for
> non-partial-clone as well, hence the way I wrote the commit message (not
> focusing too much on partial clone) and the documentation.
>
> I've chosen --skip-already-present and --no-skip-already-present to
> reuse the language already existing in the documentation and to avoid a
> double negative (e.g. --avoid-checking-if-already-present and
> --no-avoid-checking-if-already-present) but this causes some clumsiness
> in the documentation and in the code. Any suggestions for the name are
> welcome.
>
> I've only implemented this for the "merge" backend since I think that
> there is an effort to migrate "rebase" to use the "merge" backend by
> default, and also because "merge" uses diff internally which already has
> the (per-commit) blob batch prefetching.

This also makes sense to me.

> ---
>  Documentation/git-rebase.txt | 12 +++++-
>  builtin/rebase.c             | 10 ++++-
>  sequencer.c                  |  3 +-
>  sequencer.h                  |  2 +-
>  t/t3402-rebase-merge.sh      | 77 ++++++++++++++++++++++++++++++++++++
>  5 files changed, 100 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
> index 0c4f038dd6..f73a82b4a9 100644
> --- a/Documentation/git-rebase.txt
> +++ b/Documentation/git-rebase.txt
> @@ -318,6 +318,15 @@ See also INCOMPATIBLE OPTIONS below.
>  +
>  See also INCOMPATIBLE OPTIONS below.
>
> +--skip-already-present::
> +--no-skip-already-present::
> +	Skip commits that are already present in the new upstream.
> +	This is the default.

I believe that you mean '--skip-already-present' is the default, here,
but the placement makes it ambiguous, since it is in a paragraph with a
header that contains both the positive and negated version of this flag.

Maybe this could changed to: s/This/--skip-already-present/'.

> ++
> +If the skip-if-already-present feature is unnecessary or undesired,
> +`--no-skip-already-present` may improve performance since it avoids
> +the need to read the contents of every commit in the new upstream.
> +
>  --rerere-autoupdate::
>  --no-rerere-autoupdate::
>  	Allow the rerere mechanism to update the index with the
> @@ -866,7 +875,8 @@ Only works if the changes (patch IDs based on the diff contents) on
>  'subsystem' did.
>
>  In that case, the fix is easy because 'git rebase' knows to skip
> -changes that are already present in the new upstream.  So if you say
> +changes that are already present in the new upstream (unless
> +`--no-skip-already-present` is given). So if you say

Extremely minor nit: there is a whitespace change on this line where the
original has two spaces between the '.' and 'So', and the new version
has only one.

>  (assuming you're on 'topic')
>  ------------
>      $ git rebase subsystem
> diff --git a/builtin/rebase.c b/builtin/rebase.c
> index 6154ad8fa5..943211e5bb 100644
> --- a/builtin/rebase.c
> +++ b/builtin/rebase.c
> @@ -88,13 +88,15 @@ struct rebase_options {
>  	struct strbuf git_format_patch_opt;
>  	int reschedule_failed_exec;
>  	int use_legacy_rebase;
> +	int skip_already_present;
>  };
>
>  #define REBASE_OPTIONS_INIT {			  	\
>  		.type = REBASE_UNSPECIFIED,	  	\
>  		.flags = REBASE_NO_QUIET, 		\
>  		.git_am_opts = ARGV_ARRAY_INIT,		\
> -		.git_format_patch_opt = STRBUF_INIT	\
> +		.git_format_patch_opt = STRBUF_INIT,	\
> +		.skip_already_present =	1		\
>  	}
>
>  static struct replay_opts get_replay_opts(const struct rebase_options *opts)
> @@ -373,6 +375,7 @@ static int run_rebase_interactive(struct rebase_options *opts,
>  	flags |= opts->rebase_cousins > 0 ? TODO_LIST_REBASE_COUSINS : 0;
>  	flags |= opts->root_with_onto ? TODO_LIST_ROOT_WITH_ONTO : 0;
>  	flags |= command == ACTION_SHORTEN_OIDS ? TODO_LIST_SHORTEN_IDS : 0;
> +	flags |= opts->skip_already_present ? TODO_LIST_SKIP_ALREADY_PRESENT : 0;
>
>  	switch (command) {
>  	case ACTION_NONE: {
> @@ -1507,6 +1510,8 @@ int cmd_rebase(int argc, const char **argv, const char *prefix)
>  		OPT_BOOL(0, "reschedule-failed-exec",
>  			 &reschedule_failed_exec,
>  			 N_("automatically re-schedule any `exec` that fails")),
> +		OPT_BOOL(0, "skip-already-present", &options.skip_already_present,
> +			 N_("skip changes that are already present in the new upstream")),

I scratched my head a little bit about why we weren't using OPT_BIT and
&flags directly here, but it matches the pattern in the surrounding, so
I think that 'OPT_BOOL' and target '&options.skip_already_present' here.

>  		OPT_END(),
>  	};
>  	int i;
> @@ -1840,6 +1845,9 @@ int cmd_rebase(int argc, const char **argv, const char *prefix)
>  			      "interactive or merge options"));
>  	}
>
> +	if (!options.skip_already_present && !is_interactive(&options))
> +		die(_("--no-skip-already-present does not work with the 'am' backend"));
> +
>  	if (options.signoff) {
>  		if (options.type == REBASE_PRESERVE_MERGES)
>  			die("cannot combine '--signoff' with "
> diff --git a/sequencer.c b/sequencer.c
> index ba90a513b9..752580c017 100644
> --- a/sequencer.c
> +++ b/sequencer.c
> @@ -4797,12 +4797,13 @@ int sequencer_make_script(struct repository *r, struct strbuf *out, int argc,
>  	int keep_empty = flags & TODO_LIST_KEEP_EMPTY;
>  	const char *insn = flags & TODO_LIST_ABBREVIATE_CMDS ? "p" : "pick";
>  	int rebase_merges = flags & TODO_LIST_REBASE_MERGES;
> +	int skip_already_present = !!(flags & TODO_LIST_SKIP_ALREADY_PRESENT);
>
>  	repo_init_revisions(r, &revs, NULL);
>  	revs.verbose_header = 1;
>  	if (!rebase_merges)
>  		revs.max_parents = 1;
> -	revs.cherry_mark = 1;
> +	revs.cherry_mark = skip_already_present;

:-). All of that plumbing just to poke at this variable. Looks good to
me.

>  	revs.limited = 1;
>  	revs.reverse = 1;
>  	revs.right_only = 1;
> diff --git a/sequencer.h b/sequencer.h
> index 393571e89a..39bb12f624 100644
> --- a/sequencer.h
> +++ b/sequencer.h
> @@ -149,7 +149,7 @@ int sequencer_remove_state(struct replay_opts *opts);
>   * `--onto`, we do not want to re-generate the root commits.
>   */
>  #define TODO_LIST_ROOT_WITH_ONTO (1U << 6)
> -
> +#define TODO_LIST_SKIP_ALREADY_PRESENT (1U << 7)

This was another spot that I thought could maybe be turned into an enum,
but it's clearly not the fault of your patch, and could easily be turned
into #leftoverbits.

>  int sequencer_make_script(struct repository *r, struct strbuf *out, int argc,
>  			  const char **argv, unsigned flags);
> diff --git a/t/t3402-rebase-merge.sh b/t/t3402-rebase-merge.sh
> index a1ec501a87..9b52739a10 100755
> --- a/t/t3402-rebase-merge.sh
> +++ b/t/t3402-rebase-merge.sh
> @@ -162,4 +162,81 @@ test_expect_success 'rebase --skip works with two conflicts in a row' '
>  	git rebase --skip
>  '
>
> +test_expect_success '--no-skip-already-present' '
> +	git init repo &&
> +
> +	# O(1-10) -- O(1-11) -- O(0-10) master
> +	#        \
> +	#         -- O(1-11) -- O(1-12) otherbranch
> +
> +	printf "Line %d\n" $(test_seq 1 10) >repo/file.txt &&
> +	git -C repo add file.txt &&
> +	git -C repo commit -m "base commit" &&
> +
> +	printf "Line %d\n" $(test_seq 1 11) >repo/file.txt &&
> +	git -C repo commit -a -m "add 11" &&
> +
> +	printf "Line %d\n" $(test_seq 0 10) >repo/file.txt &&
> +	git -C repo commit -a -m "add 0 delete 11" &&
> +
> +	git -C repo checkout -b otherbranch HEAD^^ &&
> +	printf "Line %d\n" $(test_seq 1 11) >repo/file.txt &&
> +	git -C repo commit -a -m "add 11 in another branch" &&
> +
> +	printf "Line %d\n" $(test_seq 1 12) >repo/file.txt &&
> +	git -C repo commit -a -m "add 12 in another branch" &&
> +
> +	# Regular rebase fails, because the 1-11 commit is deduplicated
> +	test_must_fail git -C repo rebase --merge master 2> err &&
> +	test_i18ngrep "error: could not apply.*add 12 in another branch" err &&
> +	git -C repo rebase --abort &&
> +
> +	# With --no-skip-already-present, it works
> +	git -C repo rebase --merge --no-skip-already-present master
> +'
> +
> +test_expect_success '--no-skip-already-present refrains from reading unneeded blobs' '
> +	git init server &&
> +
> +	# O(1-10) -- O(1-11) -- O(1-12) master
> +	#        \
> +	#         -- O(0-10) otherbranch
> +
> +	printf "Line %d\n" $(test_seq 1 10) >server/file.txt &&
> +	git -C server add file.txt &&
> +	git -C server commit -m "merge base" &&
> +
> +	printf "Line %d\n" $(test_seq 1 11) >server/file.txt &&
> +	git -C server commit -a -m "add 11" &&
> +
> +	printf "Line %d\n" $(test_seq 1 12) >server/file.txt &&
> +	git -C server commit -a -m "add 12" &&
> +
> +	git -C server checkout -b otherbranch HEAD^^ &&
> +	printf "Line %d\n" $(test_seq 0 10) >server/file.txt &&
> +	git -C server commit -a -m "add 0" &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +
> +	git clone --filter=blob:none "file://$(pwd)/server" client &&
> +	git -C client checkout origin/master &&
> +	git -C client checkout origin/otherbranch &&
> +
> +	# Sanity check to ensure that the blobs from the merge base and "add
> +	# 11" are missing
> +	git -C client rev-list --objects --all --missing=print >missing_list &&
> +	MERGE_BASE_BLOB=$(git -C server rev-parse master^^:file.txt) &&
> +	ADD_11_BLOB=$(git -C server rev-parse master^:file.txt) &&
> +	grep "\\?$MERGE_BASE_BLOB" missing_list &&
> +	grep "\\?$ADD_11_BLOB" missing_list &&
> +
> +	git -C client rebase --merge --no-skip-already-present origin/master &&
> +
> +	# The blob from the merge base had to be fetched, but not "add 11"
> +	git -C client rev-list --objects --all --missing=print >missing_list &&
> +	! grep "\\?$MERGE_BASE_BLOB" missing_list &&
> +	grep "\\?$ADD_11_BLOB" missing_list
> +'
> +
>  test_done
> --
> 2.25.1.481.gfbce0eb801-goog

The tests look good to me. Thanks for working on this!

Thanks,
Taylor
Johannes Schindelin March 10, 2020, 12:17 p.m. UTC | #2
Hi Jonathan,

On Mon, 9 Mar 2020, Jonathan Tan wrote:

> When rebasing against an upstream that has had many commits since the
> original branch was created:
>
>  O -- O -- ... -- O -- O (upstream)
>   \
>    -- O (my-dev-branch)
>
> because "git rebase" attempts to exclude commits that are duplicates of
> upstream ones, it must read the contents of every novel upstream commit,
> in addition to the tip of the upstream and the merge base. This can be a
> significant performance hit, especially in a partial clone, wherein a
> read of an object may end up being a fetch.
>
> Add a flag to "git rebase" to allow suppression of this feature. This
> flag only works when using the "merge" backend.

I wonder whether we can make this a bit more user-friendly by defaulting
to `--right-only` if there are no promised objects in the symmetric range,
and if there _are_ promised objects, to skip `--right-only`, possibly with
an advice that we did that and how to force it to download the promised
objects?

Ciao,
Dscho
Jonathan Tan March 10, 2020, 3:51 p.m. UTC | #3
> Hi Jonathan,
> 
> This patch makes good sense to me. I left a few notes below, but they
> are relatively minor, and this seems to be all in a good direction.
> 
> As a (somewhat) interesting aside, this feature would be useful to me
> outside of partial clones, since I often have this workflow in my local
> development wherein 'git rebase' spends quite a bit of time comparing
> patches on my branch to everything new upstream.

Thanks for your review, and it's great to know of a use case that this
helps.

> This sentence is a little confusing if you skip over the graph, since it
> reads: "When rebasing against an because ... because ...". It may be
> clearer if you swap the order of the last two clauses to instead be:
> 
>   it must read the contents of every novel upstream commit, in addition to
>   the tip of the upstream and the merge base, because "git rebase"
>   attempts to exclude commits that are duplicates of upstream ones.

Sounds good; will do.

> > +--skip-already-present::
> > +--no-skip-already-present::
> > +	Skip commits that are already present in the new upstream.
> > +	This is the default.
> 
> I believe that you mean '--skip-already-present' is the default, here,
> but the placement makes it ambiguous, since it is in a paragraph with a
> header that contains both the positive and negated version of this flag.
> 
> Maybe this could changed to: s/This/--skip-already-present/'.

Will do.

> >  In that case, the fix is easy because 'git rebase' knows to skip
> > -changes that are already present in the new upstream.  So if you say
> > +changes that are already present in the new upstream (unless
> > +`--no-skip-already-present` is given). So if you say
> 
> Extremely minor nit: there is a whitespace change on this line where the
> original has two spaces between the '.' and 'So', and the new version
> has only one.

OK - I'll change it to 2 spaces.

> > diff --git a/sequencer.h b/sequencer.h
> > index 393571e89a..39bb12f624 100644
> > --- a/sequencer.h
> > +++ b/sequencer.h
> > @@ -149,7 +149,7 @@ int sequencer_remove_state(struct replay_opts *opts);
> >   * `--onto`, we do not want to re-generate the root commits.
> >   */
> >  #define TODO_LIST_ROOT_WITH_ONTO (1U << 6)
> > -
> > +#define TODO_LIST_SKIP_ALREADY_PRESENT (1U << 7)
> 
> This was another spot that I thought could maybe be turned into an enum,
> but it's clearly not the fault of your patch, and could easily be turned
> into #leftoverbits.

There was a recent discussion on the list [1] about whether bitsets
should be enums, and we decided against it. But anyway we can revisit
this later if need be.

[1] https://lore.kernel.org/git/20191016193750.258148-1-jonathantanmy@google.com/

The changes Taylor suggested were minor, so I'll hold off sending
another version until there are more substantial changes requested.
Jonathan Tan March 10, 2020, 4 p.m. UTC | #4
> I wonder whether we can make this a bit more user-friendly by defaulting
> to `--right-only` if there are no promised objects in the symmetric range,
> and if there _are_ promised objects, to skip `--right-only`, possibly with
> an advice that we did that and how to force it to download the promised
> objects?

Thanks for your suggestion. I'm inclined to think that in a partial
clone, whether an object is missing or not should not affect the
behavior of a Git command (except for lazy fetching and performance),
but I can see how this is useful, at least for the purposes of
discoverability and ease of use (good diagnostics for the
non-partial-clone case, and better performance for the partial clone
case).

But in any case, I think that this can be built later on top of my
patch. Even if we have automatic detection of missing objects and
automatic selection of functionality, we will still need the CLI
arguments for manual override, so the CLI flags and functionality in
this patch are still useful.
Elijah Newren March 10, 2020, 6:56 p.m. UTC | #5
On Mon, Mar 9, 2020 at 1:58 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> When rebasing against an upstream that has had many commits since the
> original branch was created:
>
>  O -- O -- ... -- O -- O (upstream)
>   \
>    -- O (my-dev-branch)
>
> because "git rebase" attempts to exclude commits that are duplicates of
> upstream ones, it must read the contents of every novel upstream commit,
> in addition to the tip of the upstream and the merge base. This can be a
> significant performance hit, especially in a partial clone, wherein a
> read of an object may end up being a fetch.

Does this suggest that the cherry-pick detection is suboptimal and
needs to be improved?  When rebasing, it is typical that you are just
rebasing a small number of patches compared to how many exist
upstream.  As such, any upstream patch modifying files outside the set
of files modified on the rebased side is known to not be PATCHSAME
without looking at those new files.  Or is the issue just the sheer
number of upstream commits that modify only the files also modified on
the rebased side is large?

> Add a flag to "git rebase" to allow suppression of this feature. This
> flag only works when using the "merge" backend.

Interesting.  A little over a year ago we discussed not only making
such a change in behavior, but making it the default; see
https://public-inbox.org/git/nycvar.QRO.7.76.6.1901211635080.41@tvgsbejvaqbjf.bet/
(from "Ooh, that's interesting" to "VFS for Git").

Since that time, we did indeed change the handling of commits that
become empty (so that they now default to dropping), which certainly
goes well with the new behavior to skip the cherry-pick detection.

One note: I think that old thread was wrong about the apply versus the
merge backends (which were referred to as "am" and "interactive" at
the time): both the apply and the merge backends have done the
cherry-pick detection so it wasn't a difference between the two.

> This flag changes the behavior of sequencer_make_script(), called from
> do_interactive_rebase() <- run_rebase_interactive() <-
> run_specific_rebase() <- cmd_rebase(). With this flag, limit_list()
> (indirectly called from sequencer_make_script() through
> prepare_revision_walk()) will no longer call cherry_pick_list(), and
> thus PATCHSAME is no longer set. Refraining from setting PATCHSAME both
> means that the intermediate commits in upstream are no longer read (as
> shown by the test) and means that no PATCHSAME-caused skipping of
> commits is done by sequencer_make_script(), either directly or through
> make_script_with_merges().
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> More improvements for partial clone, but this is a benefit for
> non-partial-clone as well, hence the way I wrote the commit message (not
> focusing too much on partial clone) and the documentation.

Yes, I've wanted to kill that performance overhead too even without
partial clones, though I was slightly worried about cases of a known
cherry-pick no longer cleanly applying and thus forcing the user to
detect that it has become empty.  I guess that's why it's a flag
instead of the new default, but there's something inside of me that
asks why this special case is detected for the user when other
conflict cases aren't...  Not sure if I'm just pigeonholing on
performance too much.

> I've chosen --skip-already-present and --no-skip-already-present to
> reuse the language already existing in the documentation and to avoid a
> double negative (e.g. --avoid-checking-if-already-present and
> --no-avoid-checking-if-already-present) but this causes some clumsiness
> in the documentation and in the code. Any suggestions for the name are
> welcome.

I'll add comments on this below.

> I've only implemented this for the "merge" backend since I think that
> there is an effort to migrate "rebase" to use the "merge" backend by
> default, and also because "merge" uses diff internally which already has
> the (per-commit) blob batch prefetching.

I understand the first half of your reason here, but I don't follow
the second half.  The apply backend uses diff to generate the patches,
but diff isn't the relevant operation here; it's the rev-list walking,
and both call the exact same rev-list walk the last time I checked so
I'm not sure what the difference is here.  Am I misunderstanding one
or more things?

> ---
>  Documentation/git-rebase.txt | 12 +++++-
>  builtin/rebase.c             | 10 ++++-
>  sequencer.c                  |  3 +-
>  sequencer.h                  |  2 +-
>  t/t3402-rebase-merge.sh      | 77 ++++++++++++++++++++++++++++++++++++
>  5 files changed, 100 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
> index 0c4f038dd6..f73a82b4a9 100644
> --- a/Documentation/git-rebase.txt
> +++ b/Documentation/git-rebase.txt
> @@ -318,6 +318,15 @@ See also INCOMPATIBLE OPTIONS below.
>  +
>  See also INCOMPATIBLE OPTIONS below.
>
> +--skip-already-present::
> +--no-skip-already-present::
> +       Skip commits that are already present in the new upstream.
> +       This is the default.
> ++
> +If the skip-if-already-present feature is unnecessary or undesired,
> +`--no-skip-already-present` may improve performance since it avoids
> +the need to read the contents of every commit in the new upstream.
> +

I'm afraid the naming might be pretty opaque and confusing to users.
Even if we do keep the names, it might help to be clearer about the
ramifications.  And there's a missing reference to the option
incompatibility.  Perhaps something like:

--skip-cherry-pick-detection
--no-skip-cherry-pick-detection

    Whether rebase tries to determine if commits are already present
upstream, i.e. if there are commits which are cherry-picks.  If such
detection is done, any commits being rebased which are cherry-picks
will be dropped, since those commits are already found upstream.  If
such detection is not done, those commits will be re-applied, which
most likely will result in no new changes (as the changes are already
upstream) and result in the commit being dropped anyway.  cherry-pick
detection is the default, but can be expensive in repos with a large
number of upstream commits that need to be read.

See also INCOMPATIBLE OPTIONS below.

>  --rerere-autoupdate::
>  --no-rerere-autoupdate::
>         Allow the rerere mechanism to update the index with the
> @@ -866,7 +875,8 @@ Only works if the changes (patch IDs based on the diff contents) on
>  'subsystem' did.
>
>  In that case, the fix is easy because 'git rebase' knows to skip
> -changes that are already present in the new upstream.  So if you say
> +changes that are already present in the new upstream (unless
> +`--no-skip-already-present` is given). So if you say
>  (assuming you're on 'topic')
>  ------------
>      $ git rebase subsystem
> diff --git a/builtin/rebase.c b/builtin/rebase.c
> index 6154ad8fa5..943211e5bb 100644
> --- a/builtin/rebase.c
> +++ b/builtin/rebase.c
> @@ -88,13 +88,15 @@ struct rebase_options {
>         struct strbuf git_format_patch_opt;
>         int reschedule_failed_exec;
>         int use_legacy_rebase;
> +       int skip_already_present;
>  };
>
>  #define REBASE_OPTIONS_INIT {                          \
>                 .type = REBASE_UNSPECIFIED,             \
>                 .flags = REBASE_NO_QUIET,               \
>                 .git_am_opts = ARGV_ARRAY_INIT,         \
> -               .git_format_patch_opt = STRBUF_INIT     \
> +               .git_format_patch_opt = STRBUF_INIT,    \
> +               .skip_already_present = 1               \
>         }
>
>  static struct replay_opts get_replay_opts(const struct rebase_options *opts)
> @@ -373,6 +375,7 @@ static int run_rebase_interactive(struct rebase_options *opts,
>         flags |= opts->rebase_cousins > 0 ? TODO_LIST_REBASE_COUSINS : 0;
>         flags |= opts->root_with_onto ? TODO_LIST_ROOT_WITH_ONTO : 0;
>         flags |= command == ACTION_SHORTEN_OIDS ? TODO_LIST_SHORTEN_IDS : 0;
> +       flags |= opts->skip_already_present ? TODO_LIST_SKIP_ALREADY_PRESENT : 0;
>
>         switch (command) {
>         case ACTION_NONE: {
> @@ -1507,6 +1510,8 @@ int cmd_rebase(int argc, const char **argv, const char *prefix)
>                 OPT_BOOL(0, "reschedule-failed-exec",
>                          &reschedule_failed_exec,
>                          N_("automatically re-schedule any `exec` that fails")),
> +               OPT_BOOL(0, "skip-already-present", &options.skip_already_present,
> +                        N_("skip changes that are already present in the new upstream")),
>                 OPT_END(),
>         };
>         int i;
> @@ -1840,6 +1845,9 @@ int cmd_rebase(int argc, const char **argv, const char *prefix)
>                               "interactive or merge options"));
>         }
>
> +       if (!options.skip_already_present && !is_interactive(&options))
> +               die(_("--no-skip-already-present does not work with the 'am' backend"));
> +

with the *apply* backend, not the 'am' one (the backend was renamed in
commit 10cdb9f38a ("rebase: rename the two primary rebase backends",
2020-02-15))

>         if (options.signoff) {
>                 if (options.type == REBASE_PRESERVE_MERGES)
>                         die("cannot combine '--signoff' with "
> diff --git a/sequencer.c b/sequencer.c
> index ba90a513b9..752580c017 100644
> --- a/sequencer.c
> +++ b/sequencer.c
> @@ -4797,12 +4797,13 @@ int sequencer_make_script(struct repository *r, struct strbuf *out, int argc,
>         int keep_empty = flags & TODO_LIST_KEEP_EMPTY;
>         const char *insn = flags & TODO_LIST_ABBREVIATE_CMDS ? "p" : "pick";
>         int rebase_merges = flags & TODO_LIST_REBASE_MERGES;
> +       int skip_already_present = !!(flags & TODO_LIST_SKIP_ALREADY_PRESENT);
>
>         repo_init_revisions(r, &revs, NULL);
>         revs.verbose_header = 1;
>         if (!rebase_merges)
>                 revs.max_parents = 1;
> -       revs.cherry_mark = 1;
> +       revs.cherry_mark = skip_already_present;
>         revs.limited = 1;
>         revs.reverse = 1;
>         revs.right_only = 1;
> diff --git a/sequencer.h b/sequencer.h
> index 393571e89a..39bb12f624 100644
> --- a/sequencer.h
> +++ b/sequencer.h
> @@ -149,7 +149,7 @@ int sequencer_remove_state(struct replay_opts *opts);
>   * `--onto`, we do not want to re-generate the root commits.
>   */
>  #define TODO_LIST_ROOT_WITH_ONTO (1U << 6)
> -
> +#define TODO_LIST_SKIP_ALREADY_PRESENT (1U << 7)
>
>  int sequencer_make_script(struct repository *r, struct strbuf *out, int argc,
>                           const char **argv, unsigned flags);
> diff --git a/t/t3402-rebase-merge.sh b/t/t3402-rebase-merge.sh
> index a1ec501a87..9b52739a10 100755
> --- a/t/t3402-rebase-merge.sh
> +++ b/t/t3402-rebase-merge.sh
> @@ -162,4 +162,81 @@ test_expect_success 'rebase --skip works with two conflicts in a row' '
>         git rebase --skip
>  '
>
> +test_expect_success '--no-skip-already-present' '
> +       git init repo &&
> +
> +       # O(1-10) -- O(1-11) -- O(0-10) master
> +       #        \
> +       #         -- O(1-11) -- O(1-12) otherbranch
> +
> +       printf "Line %d\n" $(test_seq 1 10) >repo/file.txt &&
> +       git -C repo add file.txt &&
> +       git -C repo commit -m "base commit" &&
> +
> +       printf "Line %d\n" $(test_seq 1 11) >repo/file.txt &&
> +       git -C repo commit -a -m "add 11" &&
> +
> +       printf "Line %d\n" $(test_seq 0 10) >repo/file.txt &&
> +       git -C repo commit -a -m "add 0 delete 11" &&
> +
> +       git -C repo checkout -b otherbranch HEAD^^ &&
> +       printf "Line %d\n" $(test_seq 1 11) >repo/file.txt &&
> +       git -C repo commit -a -m "add 11 in another branch" &&
> +
> +       printf "Line %d\n" $(test_seq 1 12) >repo/file.txt &&
> +       git -C repo commit -a -m "add 12 in another branch" &&
> +
> +       # Regular rebase fails, because the 1-11 commit is deduplicated
> +       test_must_fail git -C repo rebase --merge master 2> err &&
> +       test_i18ngrep "error: could not apply.*add 12 in another branch" err &&
> +       git -C repo rebase --abort &&
> +
> +       # With --no-skip-already-present, it works
> +       git -C repo rebase --merge --no-skip-already-present master
> +'
> +
> +test_expect_success '--no-skip-already-present refrains from reading unneeded blobs' '
> +       git init server &&
> +
> +       # O(1-10) -- O(1-11) -- O(1-12) master
> +       #        \
> +       #         -- O(0-10) otherbranch
> +
> +       printf "Line %d\n" $(test_seq 1 10) >server/file.txt &&
> +       git -C server add file.txt &&
> +       git -C server commit -m "merge base" &&
> +
> +       printf "Line %d\n" $(test_seq 1 11) >server/file.txt &&
> +       git -C server commit -a -m "add 11" &&
> +
> +       printf "Line %d\n" $(test_seq 1 12) >server/file.txt &&
> +       git -C server commit -a -m "add 12" &&
> +
> +       git -C server checkout -b otherbranch HEAD^^ &&
> +       printf "Line %d\n" $(test_seq 0 10) >server/file.txt &&
> +       git -C server commit -a -m "add 0" &&
> +
> +       test_config -C server uploadpack.allowfilter 1 &&
> +       test_config -C server uploadpack.allowanysha1inwant 1 &&
> +
> +       git clone --filter=blob:none "file://$(pwd)/server" client &&
> +       git -C client checkout origin/master &&
> +       git -C client checkout origin/otherbranch &&
> +
> +       # Sanity check to ensure that the blobs from the merge base and "add
> +       # 11" are missing
> +       git -C client rev-list --objects --all --missing=print >missing_list &&
> +       MERGE_BASE_BLOB=$(git -C server rev-parse master^^:file.txt) &&
> +       ADD_11_BLOB=$(git -C server rev-parse master^:file.txt) &&
> +       grep "\\?$MERGE_BASE_BLOB" missing_list &&
> +       grep "\\?$ADD_11_BLOB" missing_list &&
> +
> +       git -C client rebase --merge --no-skip-already-present origin/master &&
> +
> +       # The blob from the merge base had to be fetched, but not "add 11"
> +       git -C client rev-list --objects --all --missing=print >missing_list &&
> +       ! grep "\\?$MERGE_BASE_BLOB" missing_list &&
> +       grep "\\?$ADD_11_BLOB" missing_list
> +'
> +
>  test_done
> --
> 2.25.1.481.gfbce0eb801-goog

Should there be a config setting to flip the default?  And should
feature.experimental and/or feature.manyFiles enable it by default?
Jonathan Tan March 10, 2020, 10:56 p.m. UTC | #6
> On Mon, Mar 9, 2020 at 1:58 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > When rebasing against an upstream that has had many commits since the
> > original branch was created:
> >
> >  O -- O -- ... -- O -- O (upstream)
> >   \
> >    -- O (my-dev-branch)
> >
> > because "git rebase" attempts to exclude commits that are duplicates of
> > upstream ones, it must read the contents of every novel upstream commit,
> > in addition to the tip of the upstream and the merge base. This can be a
> > significant performance hit, especially in a partial clone, wherein a
> > read of an object may end up being a fetch.
> 
> Does this suggest that the cherry-pick detection is suboptimal and
> needs to be improved?  When rebasing, it is typical that you are just
> rebasing a small number of patches compared to how many exist
> upstream.  As such, any upstream patch modifying files outside the set
> of files modified on the rebased side is known to not be PATCHSAME
> without looking at those new files.

That's true - and this would drastically reduce the fetches necessary in
partial clone, perhaps enough that we no longer need this check.

In the absence of partial clone, this also might improve performance
sufficiently, such that we no longer need my new option. (Or it might
not.)

> Or is the issue just the sheer
> number of upstream commits that modify only the files also modified on
> the rebased side is large?
> 
> > Add a flag to "git rebase" to allow suppression of this feature. This
> > flag only works when using the "merge" backend.
> 
> Interesting.  A little over a year ago we discussed not only making
> such a change in behavior, but making it the default; see
> https://public-inbox.org/git/nycvar.QRO.7.76.6.1901211635080.41@tvgsbejvaqbjf.bet/
> (from "Ooh, that's interesting" to "VFS for Git").

Thanks for the pointer! Dscho did propose again to make it the default
[1] and I replied that this can be done later [2].

[1] https://lore.kernel.org/git/nycvar.QRO.7.76.6.2003101315100.46@tvgsbejvaqbjf.bet/
[2] https://lore.kernel.org/git/20200310160035.20252-1-jonathantanmy@google.com/

> Yes, I've wanted to kill that performance overhead too even without
> partial clones, though I was slightly worried about cases of a known
> cherry-pick no longer cleanly applying and thus forcing the user to
> detect that it has become empty.  I guess that's why it's a flag
> instead of the new default, but there's something inside of me that
> asks why this special case is detected for the user when other
> conflict cases aren't...  Not sure if I'm just pigeonholing on
> performance too much.

I haven't dug into this, but the email you linked [3] shows that this
behavior was once-upon-a-time relied upon ("For example, when I did not
use GitGitGadget yet to submit patches..."). So I don't think we should
change it.

[3] https://public-inbox.org/git/nycvar.QRO.7.76.6.1901211635080.41@tvgsbejvaqbjf.bet/


> > I've only implemented this for the "merge" backend since I think that
> > there is an effort to migrate "rebase" to use the "merge" backend by
> > default, and also because "merge" uses diff internally which already has
> > the (per-commit) blob batch prefetching.
> 
> I understand the first half of your reason here, but I don't follow
> the second half.  The apply backend uses diff to generate the patches,
> but diff isn't the relevant operation here; it's the rev-list walking,
> and both call the exact same rev-list walk the last time I checked so
> I'm not sure what the difference is here.  Am I misunderstanding one
> or more things?

Maybe just ignore the second half :-)

I thought and wrote the second half because I noticed that somewhere in
the "am"-related code, blobs were being fetched one by one, but no such
thing was happening when I used the "merge" backend. The rev-list
walking doesn't access blobs, I believe.

> > +--skip-already-present::
> > +--no-skip-already-present::
> > +       Skip commits that are already present in the new upstream.
> > +       This is the default.
> > ++
> > +If the skip-if-already-present feature is unnecessary or undesired,
> > +`--no-skip-already-present` may improve performance since it avoids
> > +the need to read the contents of every commit in the new upstream.
> > +
> 
> I'm afraid the naming might be pretty opaque and confusing to users.
> Even if we do keep the names, it might help to be clearer about the
> ramifications.  And there's a missing reference to the option
> incompatibility.  Perhaps something like:
> 
> --skip-cherry-pick-detection
> --no-skip-cherry-pick-detection
> 
>     Whether rebase tries to determine if commits are already present
> upstream, i.e. if there are commits which are cherry-picks.  If such
> detection is done, any commits being rebased which are cherry-picks
> will be dropped, since those commits are already found upstream.  If
> such detection is not done, those commits will be re-applied, which
> most likely will result in no new changes (as the changes are already
> upstream) and result in the commit being dropped anyway.  cherry-pick
> detection is the default, but can be expensive in repos with a large
> number of upstream commits that need to be read.
> 
> See also INCOMPATIBLE OPTIONS below.

I understand that commits being already present in upstream is usually
due to cherry-picking, but I don't think that's always the case, so
perhaps there is some imprecision here. But this might be better - in
particular, documentation and code will not be so clumsy (the "no", or
0, is the status quo, and the lack of "no", or 1, requires special
handling).

> > +       if (!options.skip_already_present && !is_interactive(&options))
> > +               die(_("--no-skip-already-present does not work with the 'am' backend"));
> > +
> 
> with the *apply* backend, not the 'am' one (the backend was renamed in
> commit 10cdb9f38a ("rebase: rename the two primary rebase backends",
> 2020-02-15))

Thanks. Will do.

> Should there be a config setting to flip the default?  And should
> feature.experimental and/or feature.manyFiles enable it by default?

As above, I think this can be done in a separate patch.
Jonathan Tan March 12, 2020, 6:04 p.m. UTC | #7
> > Does this suggest that the cherry-pick detection is suboptimal and
> > needs to be improved?  When rebasing, it is typical that you are just
> > rebasing a small number of patches compared to how many exist
> > upstream.  As such, any upstream patch modifying files outside the set
> > of files modified on the rebased side is known to not be PATCHSAME
> > without looking at those new files.
> 
> That's true - and this would drastically reduce the fetches necessary in
> partial clone, perhaps enough that we no longer need this check.
> 
> In the absence of partial clone, this also might improve performance
> sufficiently, such that we no longer need my new option. (Or it might
> not.)

I took a further look at this. patch-ids.c and its caller
(cherry_pick_list() in revision.c) implement duplicate checking by first
generating full diff outputs for the commits in the shorter side,
putting them in a hashmap keyed by the SHA-1 of the diff output (and
values being the commit itself), and then generating full diff outputs
for the commits in the longer side and checking them against the
hashmap. When processing the shorter side, we could also generate
filename-only diffs and put their hashes into a hashset; so when
processing the longer side, we could generate the filename-only diff
first (without reading any blobs) and checking them against our new
hashset, and only if it appears in our new hashset, then do we generate
the full diff (thus reading blobs).

One issue with this is unpredictability to the user (since which blobs
get read depend on which side is longer), but that seems resolvable by
not doing any length checks but always reading the blobs on the right
side (that is, the non-upstream side).

So I would say that yes, the cherry-pick detection is suboptimal and
could be improved. So the question is...what to do with my patch? An
argument could be made that my patch should be dropped because an
improvement in cherry-pick detection would eliminate the need for the
option I'm introducing in my patch, but after some thought, I think that
this option will still be useful even with cherry-pick detection. If we
move in a direction where not only blobs but also trees (or even
commits) are omitted, we'll definitely want this new option. And even if
a user is not using partial clone at all, I think it is still useful to
suppress both the filtering of commits (e.g. when upstream has a commit
then revert, it would be reasonable to cherry-pick the same commit on
top) and reduce disk reads (although I don't know if this will be the
case in practice).
Elijah Newren March 12, 2020, 10:40 p.m. UTC | #8
On Thu, Mar 12, 2020 at 11:04 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > > Does this suggest that the cherry-pick detection is suboptimal and
> > > needs to be improved?  When rebasing, it is typical that you are just
> > > rebasing a small number of patches compared to how many exist
> > > upstream.  As such, any upstream patch modifying files outside the set
> > > of files modified on the rebased side is known to not be PATCHSAME
> > > without looking at those new files.
> >
> > That's true - and this would drastically reduce the fetches necessary in
> > partial clone, perhaps enough that we no longer need this check.
> >
> > In the absence of partial clone, this also might improve performance
> > sufficiently, such that we no longer need my new option. (Or it might
> > not.)
>
> I took a further look at this. patch-ids.c and its caller
> (cherry_pick_list() in revision.c) implement duplicate checking by first
> generating full diff outputs for the commits in the shorter side,
> putting them in a hashmap keyed by the SHA-1 of the diff output (and
> values being the commit itself), and then generating full diff outputs
> for the commits in the longer side and checking them against the
> hashmap. When processing the shorter side, we could also generate
> filename-only diffs and put their hashes into a hashset; so when
> processing the longer side, we could generate the filename-only diff
> first (without reading any blobs) and checking them against our new
> hashset, and only if it appears in our new hashset, then do we generate
> the full diff (thus reading blobs).
>
> One issue with this is unpredictability to the user (since which blobs
> get read depend on which side is longer), but that seems resolvable by
> not doing any length checks but always reading the blobs on the right
> side (that is, the non-upstream side).
>
> So I would say that yes, the cherry-pick detection is suboptimal and
> could be improved.

Sweet, thanks for doing this investigative work!  Sounds promising.

> So the question is...what to do with my patch? An
> argument could be made that my patch should be dropped because an
> improvement in cherry-pick detection would eliminate the need for the
> option I'm introducing in my patch, but after some thought, I think that
> this option will still be useful even with cherry-pick detection.

The option may be totally justified here, but can I go on a possible
tangent and vent for a little bit?  We seem to introduce options an
awful lot.  While options can be valuable they also have a cost, and I
think we tend not to acknowledge that.  Some of the negatives:

- Developers add options when they run into bugs instead of fixing
bugs (usually they don't realize that the behavior in question is a
bug, but that's exacerbated by a willingness to add options and not
consider costs)
- Developers add options without considering combinations of options
and what they mean (though it's hard to fault them because considering
combinations becomes harder and harder with the more options we have;
it's a negative feedback cycle)
- Growth in number of options leads to code that is hard or impossible
to refactor based on a maze of competing options with myriad edge and
corner cases that are fundamentally broken
- Users get overloaded by the sheer number of options and minor distinctions

The fourth case is probably obvious, so let me just include some
examples of the first three cases above:
* Commits b00bf1c9a8 ("git-rebase: make --allow-empty-message the
default", 2018-06-27), 22a69fda19 ("git-rebase.txt: update description
of --allow-empty-message", 2020-01-16), and d48e5e21da ("rebase
(interactive-backend): make --keep-empty the default", 2020-02-15)
noted that options that previously existed were just workarounds to
buggy behavior and the flags should have been always on.
* Commit e86bbcf987 ("clean: disambiguate the definition of -d",
2019-09-17) showed a pretty hairy case where the combination of
options led to cases where I not only didn't know how to implement
correct behavior, I didn't even know how to state what the desired
behavior for end-users was.  Despite a few different reports over a
year and a half that I had a series that fixed some known issues for
users the series languished because I couldn't get an answer on what
was right.  See also
https://lore.kernel.org/git/20190905154735.29784-1-newren@gmail.com/
* See the huge "Behavioral differences" section of the git-rebase
manpage, and a combination of rants from me on dir.c:
  - https://lore.kernel.org/git/20190905154735.29784-1-newren@gmail.com/
  - https://lore.kernel.org/git/CABPp-BFG3FkTkC=L1v97LUksndkOmCN8ZhNJh5eoNdquE7v9DA@mail.gmail.com/
  - https://lore.kernel.org/git/pull.676.v3.git.git.1576571586.gitgitgadget@gmail.com/
  - The commit message of
https://lore.kernel.org/git/d3136ef52f3306d465a5a6004cdc9ba5b1ae4148.1580495486.git.gitgitgadget@gmail.com/

>  If we
> move in a direction where not only blobs but also trees (or even
> commits) are omitted, we'll definitely want this new option.

Why would this new option be needed if we omitted trees?   If trees
are omitted based on something like sparse-checkouts, then they are
omitted based on path; shouldn't we be able to avoid walking trees
just by noting they modified some path outside a requested sparse
checkout?

I want grep, log, etc. to behave within the cone of a sparse checkout,
which means that I need trees of upstream branches within the relevant
paths anyway.  But theoretically I should certainly be able to avoid
walking trees outside those paths.

>  And even if
> a user is not using partial clone at all, I think it is still useful to
> suppress both the filtering of commits (e.g. when upstream has a commit
> then revert, it would be reasonable to cherry-pick the same commit on
> top) and reduce disk reads (although I don't know if this will be the
> case in practice).

That sounds like yet another argument that the behavior you're arguing
for should be the default, not a flag we make the users pick to
workaround bugs.  Yes, sometimes weird behaviors beget usecases (cue
link to xkcd's comic on emacs spacebar overheating) and we need to
provide transition plans, but I think this might be a case where
transitioning makes sense.  From a high level, here's my guess
(emphasis on guess) at the history:

* am checked for upstream patches, because apply would get confused
trying to apply an already applied patch
* legacy-interactive-rebase would check for upstream patches as a
performance optimization because having to shell out to a separate
cherry-pick process for each commit is slow (and may have also been
done partially to match am, even though am did it as a workaround)

And now we're in the state where:
* The check-for-upstream bits hurt performance, significantly enough
that we have three different reports of folks not liking it (you, me,
and Taylor)
* It actively does the wrong thing in cases such as revert + re-apply
sequences, which exist in practice (and exist a lot more than they
should, but they absolutely do exist)

We've made changes in other places (e.g. opening an editor for merge
or rebase, push.default, etc.); is there any reason a similar change
wouldn't be justified here?
Elijah Newren March 14, 2020, 8:04 a.m. UTC | #9
On Thu, Mar 12, 2020 at 3:40 PM Elijah Newren <newren@gmail.com> wrote:
>
>  On Thu, Mar 12, 2020 at 11:04 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > > > Does this suggest that the cherry-pick detection is suboptimal and
> > > > needs to be improved?  When rebasing, it is typical that you are just
> > > > rebasing a small number of patches compared to how many exist
> > > > upstream.  As such, any upstream patch modifying files outside the set
> > > > of files modified on the rebased side is known to not be PATCHSAME
> > > > without looking at those new files.
> > >
> > > That's true - and this would drastically reduce the fetches necessary in
> > > partial clone, perhaps enough that we no longer need this check.
> > >
> > > In the absence of partial clone, this also might improve performance
> > > sufficiently, such that we no longer need my new option. (Or it might
> > > not.)
> >
> > I took a further look at this. patch-ids.c and its caller
> > (cherry_pick_list() in revision.c) implement duplicate checking by first
> > generating full diff outputs for the commits in the shorter side,
> > putting them in a hashmap keyed by the SHA-1 of the diff output (and
> > values being the commit itself), and then generating full diff outputs
> > for the commits in the longer side and checking them against the
> > hashmap. When processing the shorter side, we could also generate
> > filename-only diffs and put their hashes into a hashset; so when
> > processing the longer side, we could generate the filename-only diff
> > first (without reading any blobs) and checking them against our new
> > hashset, and only if it appears in our new hashset, then do we generate
> > the full diff (thus reading blobs).
> >
> > One issue with this is unpredictability to the user (since which blobs
> > get read depend on which side is longer), but that seems resolvable by
> > not doing any length checks but always reading the blobs on the right
> > side (that is, the non-upstream side).
> >
> > So I would say that yes, the cherry-pick detection is suboptimal and
> > could be improved.
>
> Sweet, thanks for doing this investigative work!  Sounds promising.
>
> > So the question is...what to do with my patch? An
> > argument could be made that my patch should be dropped because an
> > improvement in cherry-pick detection would eliminate the need for the
> > option I'm introducing in my patch, but after some thought, I think that
> > this option will still be useful even with cherry-pick detection.
>
> The option may be totally justified here, but can I go on a possible
> tangent and vent for a little bit?  We seem to introduce options an
> awful lot.  While options can be valuable they also have a cost, and I
> think we tend not to acknowledge that.  Some of the negatives:
>
> - Developers add options when they run into bugs instead of fixing
> bugs (usually they don't realize that the behavior in question is a
> bug, but that's exacerbated by a willingness to add options and not
> consider costs)
> - Developers add options without considering combinations of options
> and what they mean (though it's hard to fault them because considering
> combinations becomes harder and harder with the more options we have;
> it's a negative feedback cycle)
> - Growth in number of options leads to code that is hard or impossible
> to refactor based on a maze of competing options with myriad edge and
> corner cases that are fundamentally broken
> - Users get overloaded by the sheer number of options and minor distinctions
>
> The fourth case is probably obvious, so let me just include some
> examples of the first three cases above:
> * Commits b00bf1c9a8 ("git-rebase: make --allow-empty-message the
> default", 2018-06-27), 22a69fda19 ("git-rebase.txt: update description
> of --allow-empty-message", 2020-01-16), and d48e5e21da ("rebase
> (interactive-backend): make --keep-empty the default", 2020-02-15)
> noted that options that previously existed were just workarounds to
> buggy behavior and the flags should have been always on.
> * Commit e86bbcf987 ("clean: disambiguate the definition of -d",
> 2019-09-17) showed a pretty hairy case where the combination of
> options led to cases where I not only didn't know how to implement
> correct behavior, I didn't even know how to state what the desired
> behavior for end-users was.  Despite a few different reports over a
> year and a half that I had a series that fixed some known issues for
> users the series languished because I couldn't get an answer on what
> was right.  See also
> https://lore.kernel.org/git/20190905154735.29784-1-newren@gmail.com/
> * See the huge "Behavioral differences" section of the git-rebase
> manpage, and a combination of rants from me on dir.c:
>   - https://lore.kernel.org/git/20190905154735.29784-1-newren@gmail.com/
>   - https://lore.kernel.org/git/CABPp-BFG3FkTkC=L1v97LUksndkOmCN8ZhNJh5eoNdquE7v9DA@mail.gmail.com/
>   - https://lore.kernel.org/git/pull.676.v3.git.git.1576571586.gitgitgadget@gmail.com/
>   - The commit message of
> https://lore.kernel.org/git/d3136ef52f3306d465a5a6004cdc9ba5b1ae4148.1580495486.git.gitgitgadget@gmail.com/
>
> >  If we
> > move in a direction where not only blobs but also trees (or even
> > commits) are omitted, we'll definitely want this new option.
>
> Why would this new option be needed if we omitted trees?   If trees
> are omitted based on something like sparse-checkouts, then they are
> omitted based on path; shouldn't we be able to avoid walking trees
> just by noting they modified some path outside a requested sparse
> checkout?
>
> I want grep, log, etc. to behave within the cone of a sparse checkout,
> which means that I need trees of upstream branches within the relevant
> paths anyway.  But theoretically I should certainly be able to avoid
> walking trees outside those paths.
>
> >  And even if
> > a user is not using partial clone at all, I think it is still useful to
> > suppress both the filtering of commits (e.g. when upstream has a commit
> > then revert, it would be reasonable to cherry-pick the same commit on
> > top) and reduce disk reads (although I don't know if this will be the
> > case in practice).
>
> That sounds like yet another argument that the behavior you're arguing
> for should be the default, not a flag we make the users pick to
> workaround bugs.  Yes, sometimes weird behaviors beget usecases (cue
> link to xkcd's comic on emacs spacebar overheating) and we need to
> provide transition plans, but I think this might be a case where
> transitioning makes sense.  From a high level, here's my guess
> (emphasis on guess) at the history:
>
> * am checked for upstream patches, because apply would get confused
> trying to apply an already applied patch
> * legacy-interactive-rebase would check for upstream patches as a
> performance optimization because having to shell out to a separate
> cherry-pick process for each commit is slow (and may have also been
> done partially to match am, even though am did it as a workaround)

Or maybe that wasn't the reasoning?  I'm having a hard time parsing the
history to verify:

a6ec3c1599 ("git-rebase: Use --ignore-if-in-upstream option when
            executing git-format-patch.", 2006-10-03)
  '''This reduces the number of conflicts when rebasing after a series of
     patches to the same piece of code is committed upstream.'''

96ffe892e3 ("rebase -i: ignore patches that are already in the
            upstream", 2007-08-01)
  '''Non-interactive rebase had this from the beginning -- match it by
     using --cherry-pick option to rev-list.'''

and related: 1e0dacdbdb ("rebase: omit patch-identical commits with
    --fork-point", 2014-07-16)

> And now we're in the state where:
> * The check-for-upstream bits hurt performance, significantly enough
> that we have three different reports of folks not liking it (you, me,
> and Taylor)
> * It actively does the wrong thing in cases such as revert + re-apply
> sequences, which exist in practice (and exist a lot more than they
> should, but they absolutely do exist)

Though these are definitely still problems with
check-if-upstream-already behavior.

> We've made changes in other places (e.g. opening an editor for merge
> or rebase, push.default, etc.); is there any reason a similar change
> wouldn't be justified here?

After another day of thought, and my attempt to figure out the reason
above, perhaps my assumptions about the reason behind the original
behavior, any my assumptions about the sanity of switching the default
might not be as grounded as I thought.  Thus, my worries about yet
another flag may be overstated as well, at least in this case.
Jonathan Tan March 17, 2020, 3:03 a.m. UTC | #10
> > We've made changes in other places (e.g. opening an editor for merge
> > or rebase, push.default, etc.); is there any reason a similar change
> > wouldn't be justified here?
> 
> After another day of thought, and my attempt to figure out the reason
> above, perhaps my assumptions about the reason behind the original
> behavior, any my assumptions about the sanity of switching the default
> might not be as grounded as I thought.  Thus, my worries about yet
> another flag may be overstated as well, at least in this case.

Thanks for the further investigation. For what it's worth, I do agree
that we should think of the cost of introducing an option before
introducing it. Admittedly in my write-up [1], I mentioned my
investigation into speeding up existing behavior enough to not need my
new feature, but didn't mention the possibility of just changing the
existing behavior. (But it seemed to me that this existing behavior is
presented as a desirable feature, so I didn't think of changing it.)

[1] https://lore.kernel.org/git/20200312180427.192096-1-jonathantanmy@google.com/

This question (from your other email [2]) is probably moot if we're
going to introduce this option anyway, but just to answer it:

> >  If we
> > move in a direction where not only blobs but also trees (or even
> > commits) are omitted, we'll definitely want this new option.
> 
> Why would this new option be needed if we omitted trees?   If trees
> are omitted based on something like sparse-checkouts, then they are
> omitted based on path; shouldn't we be able to avoid walking trees
> just by noting they modified some path outside a requested sparse
> checkout?
> 
> I want grep, log, etc. to behave within the cone of a sparse checkout,
> which means that I need trees of upstream branches within the relevant
> paths anyway.  But theoretically I should certainly be able to avoid
> walking trees outside those paths.

I haven't given much thought to it, but the diffing mechanism will need
to receive a whitelist of paths and, if it ever needs to traverse
outside those, will need to abort with "there's a difference outside
this whitelist". I don't know if it supports such a thing now.

[2] https://lore.kernel.org/git/CABPp-BE83ZhezkgmwatxAhqh4rptMUggcjSwBeiSByyPTUi6Lw@mail.gmail.com/

I'll give some time for others to respond, and will send out a v2 with
your and Taylor's suggestions implemented.
diff mbox series

Patch

diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 0c4f038dd6..f73a82b4a9 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -318,6 +318,15 @@  See also INCOMPATIBLE OPTIONS below.
 +
 See also INCOMPATIBLE OPTIONS below.
 
+--skip-already-present::
+--no-skip-already-present::
+	Skip commits that are already present in the new upstream.
+	This is the default.
++
+If the skip-if-already-present feature is unnecessary or undesired,
+`--no-skip-already-present` may improve performance since it avoids
+the need to read the contents of every commit in the new upstream.
+
 --rerere-autoupdate::
 --no-rerere-autoupdate::
 	Allow the rerere mechanism to update the index with the
@@ -866,7 +875,8 @@  Only works if the changes (patch IDs based on the diff contents) on
 'subsystem' did.
 
 In that case, the fix is easy because 'git rebase' knows to skip
-changes that are already present in the new upstream.  So if you say
+changes that are already present in the new upstream (unless
+`--no-skip-already-present` is given). So if you say
 (assuming you're on 'topic')
 ------------
     $ git rebase subsystem
diff --git a/builtin/rebase.c b/builtin/rebase.c
index 6154ad8fa5..943211e5bb 100644
--- a/builtin/rebase.c
+++ b/builtin/rebase.c
@@ -88,13 +88,15 @@  struct rebase_options {
 	struct strbuf git_format_patch_opt;
 	int reschedule_failed_exec;
 	int use_legacy_rebase;
+	int skip_already_present;
 };
 
 #define REBASE_OPTIONS_INIT {			  	\
 		.type = REBASE_UNSPECIFIED,	  	\
 		.flags = REBASE_NO_QUIET, 		\
 		.git_am_opts = ARGV_ARRAY_INIT,		\
-		.git_format_patch_opt = STRBUF_INIT	\
+		.git_format_patch_opt = STRBUF_INIT,	\
+		.skip_already_present =	1		\
 	}
 
 static struct replay_opts get_replay_opts(const struct rebase_options *opts)
@@ -373,6 +375,7 @@  static int run_rebase_interactive(struct rebase_options *opts,
 	flags |= opts->rebase_cousins > 0 ? TODO_LIST_REBASE_COUSINS : 0;
 	flags |= opts->root_with_onto ? TODO_LIST_ROOT_WITH_ONTO : 0;
 	flags |= command == ACTION_SHORTEN_OIDS ? TODO_LIST_SHORTEN_IDS : 0;
+	flags |= opts->skip_already_present ? TODO_LIST_SKIP_ALREADY_PRESENT : 0;
 
 	switch (command) {
 	case ACTION_NONE: {
@@ -1507,6 +1510,8 @@  int cmd_rebase(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "reschedule-failed-exec",
 			 &reschedule_failed_exec,
 			 N_("automatically re-schedule any `exec` that fails")),
+		OPT_BOOL(0, "skip-already-present", &options.skip_already_present,
+			 N_("skip changes that are already present in the new upstream")),
 		OPT_END(),
 	};
 	int i;
@@ -1840,6 +1845,9 @@  int cmd_rebase(int argc, const char **argv, const char *prefix)
 			      "interactive or merge options"));
 	}
 
+	if (!options.skip_already_present && !is_interactive(&options))
+		die(_("--no-skip-already-present does not work with the 'am' backend"));
+
 	if (options.signoff) {
 		if (options.type == REBASE_PRESERVE_MERGES)
 			die("cannot combine '--signoff' with "
diff --git a/sequencer.c b/sequencer.c
index ba90a513b9..752580c017 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -4797,12 +4797,13 @@  int sequencer_make_script(struct repository *r, struct strbuf *out, int argc,
 	int keep_empty = flags & TODO_LIST_KEEP_EMPTY;
 	const char *insn = flags & TODO_LIST_ABBREVIATE_CMDS ? "p" : "pick";
 	int rebase_merges = flags & TODO_LIST_REBASE_MERGES;
+	int skip_already_present = !!(flags & TODO_LIST_SKIP_ALREADY_PRESENT);
 
 	repo_init_revisions(r, &revs, NULL);
 	revs.verbose_header = 1;
 	if (!rebase_merges)
 		revs.max_parents = 1;
-	revs.cherry_mark = 1;
+	revs.cherry_mark = skip_already_present;
 	revs.limited = 1;
 	revs.reverse = 1;
 	revs.right_only = 1;
diff --git a/sequencer.h b/sequencer.h
index 393571e89a..39bb12f624 100644
--- a/sequencer.h
+++ b/sequencer.h
@@ -149,7 +149,7 @@  int sequencer_remove_state(struct replay_opts *opts);
  * `--onto`, we do not want to re-generate the root commits.
  */
 #define TODO_LIST_ROOT_WITH_ONTO (1U << 6)
-
+#define TODO_LIST_SKIP_ALREADY_PRESENT (1U << 7)
 
 int sequencer_make_script(struct repository *r, struct strbuf *out, int argc,
 			  const char **argv, unsigned flags);
diff --git a/t/t3402-rebase-merge.sh b/t/t3402-rebase-merge.sh
index a1ec501a87..9b52739a10 100755
--- a/t/t3402-rebase-merge.sh
+++ b/t/t3402-rebase-merge.sh
@@ -162,4 +162,81 @@  test_expect_success 'rebase --skip works with two conflicts in a row' '
 	git rebase --skip
 '
 
+test_expect_success '--no-skip-already-present' '
+	git init repo &&
+
+	# O(1-10) -- O(1-11) -- O(0-10) master
+	#        \
+	#         -- O(1-11) -- O(1-12) otherbranch
+
+	printf "Line %d\n" $(test_seq 1 10) >repo/file.txt &&
+	git -C repo add file.txt &&
+	git -C repo commit -m "base commit" &&
+
+	printf "Line %d\n" $(test_seq 1 11) >repo/file.txt &&
+	git -C repo commit -a -m "add 11" &&
+
+	printf "Line %d\n" $(test_seq 0 10) >repo/file.txt &&
+	git -C repo commit -a -m "add 0 delete 11" &&
+
+	git -C repo checkout -b otherbranch HEAD^^ &&
+	printf "Line %d\n" $(test_seq 1 11) >repo/file.txt &&
+	git -C repo commit -a -m "add 11 in another branch" &&
+
+	printf "Line %d\n" $(test_seq 1 12) >repo/file.txt &&
+	git -C repo commit -a -m "add 12 in another branch" &&
+
+	# Regular rebase fails, because the 1-11 commit is deduplicated
+	test_must_fail git -C repo rebase --merge master 2> err &&
+	test_i18ngrep "error: could not apply.*add 12 in another branch" err &&
+	git -C repo rebase --abort &&
+
+	# With --no-skip-already-present, it works
+	git -C repo rebase --merge --no-skip-already-present master
+'
+
+test_expect_success '--no-skip-already-present refrains from reading unneeded blobs' '
+	git init server &&
+
+	# O(1-10) -- O(1-11) -- O(1-12) master
+	#        \
+	#         -- O(0-10) otherbranch
+
+	printf "Line %d\n" $(test_seq 1 10) >server/file.txt &&
+	git -C server add file.txt &&
+	git -C server commit -m "merge base" &&
+
+	printf "Line %d\n" $(test_seq 1 11) >server/file.txt &&
+	git -C server commit -a -m "add 11" &&
+
+	printf "Line %d\n" $(test_seq 1 12) >server/file.txt &&
+	git -C server commit -a -m "add 12" &&
+
+	git -C server checkout -b otherbranch HEAD^^ &&
+	printf "Line %d\n" $(test_seq 0 10) >server/file.txt &&
+	git -C server commit -a -m "add 0" &&
+
+	test_config -C server uploadpack.allowfilter 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+
+	git clone --filter=blob:none "file://$(pwd)/server" client &&
+	git -C client checkout origin/master &&
+	git -C client checkout origin/otherbranch &&
+
+	# Sanity check to ensure that the blobs from the merge base and "add
+	# 11" are missing
+	git -C client rev-list --objects --all --missing=print >missing_list &&
+	MERGE_BASE_BLOB=$(git -C server rev-parse master^^:file.txt) &&
+	ADD_11_BLOB=$(git -C server rev-parse master^:file.txt) &&
+	grep "\\?$MERGE_BASE_BLOB" missing_list &&
+	grep "\\?$ADD_11_BLOB" missing_list &&
+
+	git -C client rebase --merge --no-skip-already-present origin/master &&
+
+	# The blob from the merge base had to be fetched, but not "add 11"
+	git -C client rev-list --objects --all --missing=print >missing_list &&
+	! grep "\\?$MERGE_BASE_BLOB" missing_list &&
+	grep "\\?$ADD_11_BLOB" missing_list
+'
+
 test_done