diff mbox series

[v2] rebase --merge: optionally skip upstreamed commits

Message ID 20200318173051.25875-1-jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series [v2] rebase --merge: optionally skip upstreamed commits | expand

Commit Message

Jonathan Tan March 18, 2020, 5:30 p.m. UTC
When rebasing against an upstream that has had many commits since the
original branch was created:

 O -- O -- ... -- O -- O (upstream)
  \
   -- O (my-dev-branch)

it must read the contents of every novel upstream commit, in addition to
the tip of the upstream and the merge base, because "git rebase"
attempts to exclude commits that are duplicates of upstream ones. This
can be a significant performance hit, especially in a partial clone,
wherein a read of an object may end up being a fetch.

Add a flag to "git rebase" to allow suppression of this feature. This
flag only works when using the "merge" backend.

This flag changes the behavior of sequencer_make_script(), called from
do_interactive_rebase() <- run_rebase_interactive() <-
run_specific_rebase() <- cmd_rebase(). With this flag, limit_list()
(indirectly called from sequencer_make_script() through
prepare_revision_walk()) will no longer call cherry_pick_list(), and
thus PATCHSAME is no longer set. Refraining from setting PATCHSAME both
means that the intermediate commits in upstream are no longer read (as
shown by the test) and means that no PATCHSAME-caused skipping of
commits is done by sequencer_make_script(), either directly or through
make_script_with_merges().

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
New in V2: changed parameter name, used Taylor's commit message
suggestions, and used Elijah's documentation suggestions.
---
 Documentation/git-rebase.txt | 20 +++++++++-
 builtin/rebase.c             |  7 ++++
 sequencer.c                  |  3 +-
 sequencer.h                  |  2 +-
 t/t3402-rebase-merge.sh      | 77 ++++++++++++++++++++++++++++++++++++
 5 files changed, 106 insertions(+), 3 deletions(-)

Comments

Junio C Hamano March 18, 2020, 6:47 p.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> When rebasing against an upstream that has had many commits since the
> original branch was created:
>
>  O -- O -- ... -- O -- O (upstream)
>   \
>    -- O (my-dev-branch)
>
> it must read the contents of every novel upstream commit, in addition to
> the tip of the upstream and the merge base, because "git rebase"
> attempts to exclude commits that are duplicates of upstream ones. This
> can be a significant performance hit, especially in a partial clone,
> wherein a read of an object may end up being a fetch.

OK.  I presume that we do this by comparing patch IDs?

Total disabling would of course is OK as a feature, especially for
the first cut, but I wonder if it would be a reasonable idea to use
some heuristic to keep the current "filter the same change" feature
as much as possible but optimize it by filtering the novel upstream
commits without hitting their trees and blobs (I am assuming that
you at least are aware of and have the commit objects on the
upstream side).

The most false-negative-prone approach is just to compare the
<author ident, author timestamp> of a candidate upstream commit with
what you have---if that author does not appear on my-dev-branch, it
is very unlikely that your change has been accepted upstream.  Of
course, two people who independently discover the same solution is
not all that rare, so it does risk false-negative to take too little
clue from the commits to compare, but at least it is not worse than
what you are proposing here ;-)  And if one of your commits on
my-dev-branch _might_ be identical to one of the novel upstream ones,
at that point, we could dig deeper to actually compute the patch ID
by fetching the upstream's tree.

That's all totally outside the scope of this patch.  It is just a
random thought to see if anybody wants to pursue to make the topic
even better, possible after it lands.

> New in V2: changed parameter name, used Taylor's commit message
> suggestions, and used Elijah's documentation suggestions.

Hmph, what was it called earlier?  My gut reaction without much
thinking finds --no-skip-* a bit confusing double-negation and
suspect "--[no-]detect-cherry-pick" (which defaults to true for
backward compatibility) may feel more natural, but I suspect (I do
not recall details of the discussion on v1) it has been already
discussed and people found --no-skip-* is OK (in which case I won't
object)?

I also wonder if --detect-cherry-pick=(yes|no|auto) may give a
better end-user experience, with "auto" meaning "do run patch-ID
based filtering, but if we know it will be expensive (e.g. the
repository is sparsely cloned), please skip it".  That way, there
may appear other reasons that makes patch-ID computation expensive
now or in the fiture, and the users are automatically covered.
Jonathan Tan March 18, 2020, 7:28 p.m. UTC | #2
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > When rebasing against an upstream that has had many commits since the
> > original branch was created:
> >
> >  O -- O -- ... -- O -- O (upstream)
> >   \
> >    -- O (my-dev-branch)
> >
> > it must read the contents of every novel upstream commit, in addition to
> > the tip of the upstream and the merge base, because "git rebase"
> > attempts to exclude commits that are duplicates of upstream ones. This
> > can be a significant performance hit, especially in a partial clone,
> > wherein a read of an object may end up being a fetch.
> 
> OK.  I presume that we do this by comparing patch IDs?

Yes.

> Total disabling would of course is OK as a feature, especially for
> the first cut, but I wonder if it would be a reasonable idea to use
> some heuristic to keep the current "filter the same change" feature
> as much as possible but optimize it by filtering the novel upstream
> commits without hitting their trees and blobs (I am assuming that
> you at least are aware of and have the commit objects on the
> upstream side).
> 
> The most false-negative-prone approach is just to compare the
> <author ident, author timestamp> of a candidate upstream commit with
> what you have---if that author does not appear on my-dev-branch, it
> is very unlikely that your change has been accepted upstream.  Of
> course, two people who independently discover the same solution is
> not all that rare, so it does risk false-negative to take too little
> clue from the commits to compare, but at least it is not worse than
> what you are proposing here ;-)  And if one of your commits on
> my-dev-branch _might_ be identical to one of the novel upstream ones,
> at that point, we could dig deeper to actually compute the patch ID
> by fetching the upstream's tree.

As far as I know, the existing patch ID behavior is only based on the
patch contents, so if there was any author name or time rewriting (or if
two people independently discovered the same solution, as you wrote),
then the behavior would be different. Apart from that, this does sound
like a cheap thing to compare before comparing the diff.

Elijah Newren suggested and I investigated another approach of using a
filename-only diff as a first approximation. The relevant quotations and
explanations are in my email here [1].

[1] https://lore.kernel.org/git/20200312180427.192096-1-jonathantanmy@google.com/

> That's all totally outside the scope of this patch.  It is just a
> random thought to see if anybody wants to pursue to make the topic
> even better, possible after it lands.

OK.

> > New in V2: changed parameter name, used Taylor's commit message
> > suggestions, and used Elijah's documentation suggestions.
> 
> Hmph, what was it called earlier?  My gut reaction without much
> thinking finds --no-skip-* a bit confusing double-negation and
> suspect "--[no-]detect-cherry-pick" (which defaults to true for
> backward compatibility) may feel more natural, but I suspect (I do
> not recall details of the discussion on v1) it has been already
> discussed and people found --no-skip-* is OK (in which case I won't
> object)?

It was earlier called "--{,no-}skip-already-present" (with the opposite
meaning, and thus, --skip-already-present is the default), so the double
negative has always existed. "--detect-cherry-pick" might be a better
idea...I'll wait to see if anybody else has an opinion.

> I also wonder if --detect-cherry-pick=(yes|no|auto) may give a
> better end-user experience, with "auto" meaning "do run patch-ID
> based filtering, but if we know it will be expensive (e.g. the
> repository is sparsely cloned), please skip it".  That way, there
> may appear other reasons that makes patch-ID computation expensive
> now or in the fiture, and the users are automatically covered.

It might be better to have predictability, and for "auto", I don't know
if we can have a simple and explainable set of rules as to when to use
patch-ID-based filtering - for example, in a partial clone with no
blobs, I would normally want no patch-ID-based filtering, but in a
partial clone with only a blob size limit, I probably will still want
patch-ID-based filtering.
Junio C Hamano March 18, 2020, 7:55 p.m. UTC | #3
Jonathan Tan <jonathantanmy@google.com> writes:

>> Hmph, what was it called earlier?  My gut reaction without much
>> thinking finds --no-skip-* a bit confusing double-negation and
>> suspect "--[no-]detect-cherry-pick" (which defaults to true for
>> backward compatibility) may feel more natural, but I suspect (I do
>> not recall details of the discussion on v1) it has been already
>> discussed and people found --no-skip-* is OK (in which case I won't
>> object)?
>
> It was earlier called "--{,no-}skip-already-present" (with the opposite
> meaning, and thus, --skip-already-present is the default), so the double
> negative has always existed. "--detect-cherry-pick" might be a better
> idea...I'll wait to see if anybody else has an opinion.

While "--[no-]detect-cherry-pick" is much better in avoiding double
negation, it is a horrible name---we do not tell the users what we
do after we detect cherry pick ("--[no-]skip-cherry-pick-detection"
does not tell us, either).  

Compared to them, "--[no-]skip-already-present" is much better, even
though there is double negation.  

How about a name along the lines of "--[no-]keep-duplicate", then?

>> I also wonder if --detect-cherry-pick=(yes|no|auto) may give a
>> better end-user experience, with "auto" meaning "do run patch-ID
>> based filtering, but if we know it will be expensive (e.g. the
>> repository is sparsely cloned), please skip it".  That way, there
>> may appear other reasons that makes patch-ID computation expensive
>> now or in the fiture, and the users are automatically covered.
>
> It might be better to have predictability, and for "auto", I don't know
> if we can have a simple and explainable set of rules as to when to use
> patch-ID-based filtering - for example, in a partial clone with no
> blobs, I would normally want no patch-ID-based filtering, but in a
> partial clone with only a blob size limit, I probably will still want
> patch-ID-based filtering.

Perhaps.  You could have something more specific than "auto".  The
main point was instead of "--[no-]$knob", "--$knob=(yes|no|...)" is
much easier to extend.  I simply do not know if we will see need to
extend the vocabulary in the near future (to which you guys who are
more interested in sparse clones would have much better insight than
I do).

Thanks.
Junio C Hamano March 18, 2020, 8:20 p.m. UTC | #4
Jonathan Tan <jonathantanmy@google.com> writes:

> @@ -1840,6 +1844,9 @@ int cmd_rebase(int argc, const char **argv, const char *prefix)
>  			      "interactive or merge options"));
>  	}
>  
> +	if (options.skip_cherry_pick_detection && !is_interactive(&options))
> +		die(_("--skip-cherry-pick-detection does not work with the 'apply' backend"));
> +

I presume this is, as before, built directly on v2.25.0; thanks for
keeping the original base while iterating.

Just a note to myself and those who are experimenting with the
patch.  When merged to the more recent codebase, is_interactive()
here will have to become is_merge().
Elijah Newren March 18, 2020, 8:41 p.m. UTC | #5
On Wed, Mar 18, 2020 at 12:55 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> >> Hmph, what was it called earlier?  My gut reaction without much
> >> thinking finds --no-skip-* a bit confusing double-negation and
> >> suspect "--[no-]detect-cherry-pick" (which defaults to true for
> >> backward compatibility) may feel more natural, but I suspect (I do
> >> not recall details of the discussion on v1) it has been already
> >> discussed and people found --no-skip-* is OK (in which case I won't
> >> object)?
> >
> > It was earlier called "--{,no-}skip-already-present" (with the opposite
> > meaning, and thus, --skip-already-present is the default), so the double
> > negative has always existed. "--detect-cherry-pick" might be a better
> > idea...I'll wait to see if anybody else has an opinion.
>
> While "--[no-]detect-cherry-pick" is much better in avoiding double
> negation, it is a horrible name---we do not tell the users what we
> do after we detect cherry pick ("--[no-]skip-cherry-pick-detection"
> does not tell us, either).

I like --[no-]detect-cherry-pick.  I'm on board with using "keep"
instead of "skip" to avoid double negation.

> Compared to them, "--[no-]skip-already-present" is much better, even
> though there is double negation.

This one seems especially bad from a discoverability and
understandability viewpoint though.  It's certainly nice if options
are fully self-documenting, but sometimes that would require full
paragraphs for the option name.  Focusing on what is done with the
option at the expense of discovering which options are relevant to
your case or at the expense of enabling users to create a mental model
for when options might be meaningful is something that I think is very
detrimental to usability.  I think users who want such an option would
have a very hard time finding this based on its name, and people who
want completely unrelated features would be confused enough by it that
they feel compelled to read its description and attempt to parse it
and guess how it's related.  In contrast, --[no-]detect-cherry-pick is
a bit clearer to both groups of people for whether it is useful, and
the group who wants it can read up the description to get the details.

> How about a name along the lines of "--[no-]keep-duplicate", then?

This name is much better than --[no-]keep-already-present would be
because "duplicate" is a far better indicator than "already-present"
of the intended meaning.  But I'm still worried the name "duplicate"
isn't going to be enough of a clue to individuals about whether they
will need this options or not.  Perhaps --[no-]keep-cherry-pick?

> >> I also wonder if --detect-cherry-pick=(yes|no|auto) may give a
> >> better end-user experience, with "auto" meaning "do run patch-ID
> >> based filtering, but if we know it will be expensive (e.g. the
> >> repository is sparsely cloned), please skip it".  That way, there
> >> may appear other reasons that makes patch-ID computation expensive
> >> now or in the fiture, and the users are automatically covered.
> >
> > It might be better to have predictability, and for "auto", I don't know
> > if we can have a simple and explainable set of rules as to when to use
> > patch-ID-based filtering - for example, in a partial clone with no
> > blobs, I would normally want no patch-ID-based filtering, but in a
> > partial clone with only a blob size limit, I probably will still want
> > patch-ID-based filtering.
>
> Perhaps.  You could have something more specific than "auto".  The
> main point was instead of "--[no-]$knob", "--$knob=(yes|no|...)" is
> much easier to extend.  I simply do not know if we will see need to
> extend the vocabulary in the near future (to which you guys who are
> more interested in sparse clones would have much better insight than
> I do).

I also struggle to understand when auto would be used.  But beyond
that, I'm still a little uneasy with where we seem to be ending up
(even if no fault of this patch):

1) Behavior has long been --keep-cherry-pick, and in various cases
that behavior can reduce conflicts users have to deal with.
2) Both Junio and I independently guessed that the cherry-pick
detection logic is poorly performing and could be improved; Jonathan
confirmed with some investigative work.  We've all suggested punting
for now, though.
3) I think we can make the sequencer machinery fast enough that the
cherry-pick detection is going to be the slowest part by a good margin
even in normal cases, not just sparse clones or the cases Taylor or I
had in mind.  So I think it's going to stick out like a sore thumb for
a lot more people (though maybe they're all happy because it's faster
overall?).
4) Jonathan provided some good examples of cases where the
--keep-cherry-pick behavior isn't just slow, but leads to actually
wrong answers (a revert followed by an un-revert).

I particularly don't like the idea of something being the default when
it can both cause wrong behavior and present a huge performance
problem that folks have to learn to workaround, especially when based
only on the tradeoff of sometimes reducing the amount of work we push
back on the user.  Maybe that's just inevitable, but does anyone have
any words that will make me feel better about this?

Elijah
Junio C Hamano March 18, 2020, 11:39 p.m. UTC | #6
Elijah Newren <newren@gmail.com> writes:

> 4) Jonathan provided some good examples of cases where the
> --keep-cherry-pick behavior isn't just slow, but leads to actually
> wrong answers (a revert followed by an un-revert).

That one cuts both ways, doesn't it?  If your change that upstream
once thought was good (and got accepted) turned out to be bad and
they reverted, you do not want to blindly reapply it to break the
codebase again, and with the "drop duplicate" logic, it would lead
to a wrong answer silently.

So from correctness point of view, I do not think you can make any
argument either way.
Elijah Newren March 19, 2020, 12:17 a.m. UTC | #7
On Wed, Mar 18, 2020 at 4:39 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > 4) Jonathan provided some good examples of cases where the
> > --keep-cherry-pick behavior isn't just slow, but leads to actually
> > wrong answers (a revert followed by an un-revert).
>
> That one cuts both ways, doesn't it?  If your change that upstream
> once thought was good (and got accepted) turned out to be bad and
> they reverted, you do not want to blindly reapply it to break the
> codebase again, and with the "drop duplicate" logic, it would lead
> to a wrong answer silently.
>
> So from correctness point of view, I do not think you can make any
> argument either way.

Good point.  Thanks, that helps.
Jonathan Tan March 26, 2020, 5:50 p.m. UTC | #8
> New in V2: changed parameter name, used Taylor's commit message
> suggestions, and used Elijah's documentation suggestions.

I think the discussion has shifted away from whether this functionality
is desirable (or desirable and we should implement this functionality
without any CLI option) to the name and nature of the CLI option. Before
I send out a new version, what do you think of using this name and
documenting it this way:

  --keep-cherry-pick=(always|never)::
          Control rebase's behavior towards commits in the working
          branch that are already present upstream, i.e. cherry-picks.
  +
  If 'never', these commits will be dropped. Because this necessitates
  reading all upstream commits, this can be expensive in repos with a
  large number of upstream commits that need to be read.
  +
  If 'always', all commits (including these) will be re-applied. This
  allows rebase to forgo reading all upstream commits, potentially 
  improving performance.
  +
  The default is 'never'.
  +
  See also INCOMPATIBLE OPTIONS below.

I've tried to use everyone's suggestions: Junio's suggestions to use the
"keep" name (instead of "detect", so that we also communicate what we do
with the result of our detection) and the non-boolean option (for
extensibility later if we need it), and Elijah's suggestion to use
"cherry-pick" instead of "duplicate". If this sounds good, I'll update
the patch and send out a new version.
Elijah Newren March 26, 2020, 7:17 p.m. UTC | #9
On Thu, Mar 26, 2020 at 10:50 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > New in V2: changed parameter name, used Taylor's commit message
> > suggestions, and used Elijah's documentation suggestions.
>
> I think the discussion has shifted away from whether this functionality
> is desirable (or desirable and we should implement this functionality
> without any CLI option) to the name and nature of the CLI option. Before
> I send out a new version, what do you think of using this name and
> documenting it this way:
>
>   --keep-cherry-pick=(always|never)::
>           Control rebase's behavior towards commits in the working
>           branch that are already present upstream, i.e. cherry-picks.
>   +
>   If 'never', these commits will be dropped. Because this necessitates
>   reading all upstream commits, this can be expensive in repos with a
>   large number of upstream commits that need to be read.
>   +
>   If 'always', all commits (including these) will be re-applied. This
>   allows rebase to forgo reading all upstream commits, potentially
>   improving performance.
>   +
>   The default is 'never'.
>   +
>   See also INCOMPATIBLE OPTIONS below.
>
> I've tried to use everyone's suggestions: Junio's suggestions to use the
> "keep" name (instead of "detect", so that we also communicate what we do
> with the result of our detection) and the non-boolean option (for
> extensibility later if we need it), and Elijah's suggestion to use
> "cherry-pick" instead of "duplicate". If this sounds good, I'll update
> the patch and send out a new version.

Sounds good to me.  Thanks!
Junio C Hamano March 26, 2020, 7:27 p.m. UTC | #10
Jonathan Tan <jonathantanmy@google.com> writes:

>> New in V2: changed parameter name, used Taylor's commit message
>> suggestions, and used Elijah's documentation suggestions.
>
> I think the discussion has shifted away from whether this functionality
> is desirable (or desirable and we should implement this functionality
> without any CLI option) to the name and nature of the CLI option. Before
> I send out a new version, what do you think of using this name and
> documenting it this way:
>
>   --keep-cherry-pick=(always|never)::
>   ...
>   The default is 'never'.
>   +
>   See also INCOMPATIBLE OPTIONS below.

Sounds much better to me.  I do not mind --[no-]keep-cherry-pick,
either, by the way.  I know I raised the possibility of having to
make it non-bool later, but since then I haven't thought of a good
third option myself anyway, so...

Thanks for keeping the ball rolling.
diff mbox series

Patch

diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 0c4f038dd6..4629eb573f 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -318,6 +318,20 @@  See also INCOMPATIBLE OPTIONS below.
 +
 See also INCOMPATIBLE OPTIONS below.
 
+--skip-cherry-pick-detection::
+--no-skip-cherry-pick-detection::
+	Whether rebase tries to determine if commits are already present
+	upstream, i.e. if there are commits which are cherry-picks.  If such
+	detection is done, any commits being rebased which are cherry-picks
+	will be dropped, since those commits are already found upstream.  If
+	such detection is not done, those commits will be re-applied, which
+	most likely will result in no new changes (as the changes are already
+	upstream) and result in the commit being dropped anyway.  cherry-pick
+	detection is the default, but can be expensive in repos with a large
+	number of upstream commits that need to be read.
++
+See also INCOMPATIBLE OPTIONS below.
+
 --rerere-autoupdate::
 --no-rerere-autoupdate::
 	Allow the rerere mechanism to update the index with the
@@ -568,6 +582,9 @@  In addition, the following pairs of options are incompatible:
  * --keep-base and --onto
  * --keep-base and --root
 
+Also, the --skip-cherry-pick-detection option requires the use of the merge
+backend (e.g., through --merge).
+
 BEHAVIORAL DIFFERENCES
 -----------------------
 
@@ -866,7 +883,8 @@  Only works if the changes (patch IDs based on the diff contents) on
 'subsystem' did.
 
 In that case, the fix is easy because 'git rebase' knows to skip
-changes that are already present in the new upstream.  So if you say
+changes that are already present in the new upstream (unless
+`--skip-cherry-pick-detection` is given). So if you say
 (assuming you're on 'topic')
 ------------
     $ git rebase subsystem
diff --git a/builtin/rebase.c b/builtin/rebase.c
index 6154ad8fa5..100b8872af 100644
--- a/builtin/rebase.c
+++ b/builtin/rebase.c
@@ -88,6 +88,7 @@  struct rebase_options {
 	struct strbuf git_format_patch_opt;
 	int reschedule_failed_exec;
 	int use_legacy_rebase;
+	int skip_cherry_pick_detection;
 };
 
 #define REBASE_OPTIONS_INIT {			  	\
@@ -373,6 +374,7 @@  static int run_rebase_interactive(struct rebase_options *opts,
 	flags |= opts->rebase_cousins > 0 ? TODO_LIST_REBASE_COUSINS : 0;
 	flags |= opts->root_with_onto ? TODO_LIST_ROOT_WITH_ONTO : 0;
 	flags |= command == ACTION_SHORTEN_OIDS ? TODO_LIST_SHORTEN_IDS : 0;
+	flags |= opts->skip_cherry_pick_detection ? TODO_LIST_SKIP_CHERRY_PICK_DETECTION : 0;
 
 	switch (command) {
 	case ACTION_NONE: {
@@ -1507,6 +1509,8 @@  int cmd_rebase(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "reschedule-failed-exec",
 			 &reschedule_failed_exec,
 			 N_("automatically re-schedule any `exec` that fails")),
+		OPT_BOOL(0, "skip-cherry-pick-detection", &options.skip_cherry_pick_detection,
+			 N_("skip changes that are already present in the new upstream")),
 		OPT_END(),
 	};
 	int i;
@@ -1840,6 +1844,9 @@  int cmd_rebase(int argc, const char **argv, const char *prefix)
 			      "interactive or merge options"));
 	}
 
+	if (options.skip_cherry_pick_detection && !is_interactive(&options))
+		die(_("--skip-cherry-pick-detection does not work with the 'apply' backend"));
+
 	if (options.signoff) {
 		if (options.type == REBASE_PRESERVE_MERGES)
 			die("cannot combine '--signoff' with "
diff --git a/sequencer.c b/sequencer.c
index ba90a513b9..8b2cae3b69 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -4797,12 +4797,13 @@  int sequencer_make_script(struct repository *r, struct strbuf *out, int argc,
 	int keep_empty = flags & TODO_LIST_KEEP_EMPTY;
 	const char *insn = flags & TODO_LIST_ABBREVIATE_CMDS ? "p" : "pick";
 	int rebase_merges = flags & TODO_LIST_REBASE_MERGES;
+	int skip_cherry_pick_detection = flags & TODO_LIST_SKIP_CHERRY_PICK_DETECTION;
 
 	repo_init_revisions(r, &revs, NULL);
 	revs.verbose_header = 1;
 	if (!rebase_merges)
 		revs.max_parents = 1;
-	revs.cherry_mark = 1;
+	revs.cherry_mark = !skip_cherry_pick_detection;
 	revs.limited = 1;
 	revs.reverse = 1;
 	revs.right_only = 1;
diff --git a/sequencer.h b/sequencer.h
index 393571e89a..a54ea696c2 100644
--- a/sequencer.h
+++ b/sequencer.h
@@ -149,7 +149,7 @@  int sequencer_remove_state(struct replay_opts *opts);
  * `--onto`, we do not want to re-generate the root commits.
  */
 #define TODO_LIST_ROOT_WITH_ONTO (1U << 6)
-
+#define TODO_LIST_SKIP_CHERRY_PICK_DETECTION (1U << 7)
 
 int sequencer_make_script(struct repository *r, struct strbuf *out, int argc,
 			  const char **argv, unsigned flags);
diff --git a/t/t3402-rebase-merge.sh b/t/t3402-rebase-merge.sh
index a1ec501a87..290c79e0f6 100755
--- a/t/t3402-rebase-merge.sh
+++ b/t/t3402-rebase-merge.sh
@@ -162,4 +162,81 @@  test_expect_success 'rebase --skip works with two conflicts in a row' '
 	git rebase --skip
 '
 
+test_expect_success '--skip-cherry-pick-detection' '
+	git init repo &&
+
+	# O(1-10) -- O(1-11) -- O(0-10) master
+	#        \
+	#         -- O(1-11) -- O(1-12) otherbranch
+
+	printf "Line %d\n" $(test_seq 1 10) >repo/file.txt &&
+	git -C repo add file.txt &&
+	git -C repo commit -m "base commit" &&
+
+	printf "Line %d\n" $(test_seq 1 11) >repo/file.txt &&
+	git -C repo commit -a -m "add 11" &&
+
+	printf "Line %d\n" $(test_seq 0 10) >repo/file.txt &&
+	git -C repo commit -a -m "add 0 delete 11" &&
+
+	git -C repo checkout -b otherbranch HEAD^^ &&
+	printf "Line %d\n" $(test_seq 1 11) >repo/file.txt &&
+	git -C repo commit -a -m "add 11 in another branch" &&
+
+	printf "Line %d\n" $(test_seq 1 12) >repo/file.txt &&
+	git -C repo commit -a -m "add 12 in another branch" &&
+
+	# Regular rebase fails, because the 1-11 commit is deduplicated
+	test_must_fail git -C repo rebase --merge master 2> err &&
+	test_i18ngrep "error: could not apply.*add 12 in another branch" err &&
+	git -C repo rebase --abort &&
+
+	# With --skip-cherry-pick-detection, it works
+	git -C repo rebase --merge --skip-cherry-pick-detection master
+'
+
+test_expect_success '--skip-cherry-pick-detection refrains from reading unneeded blobs' '
+	git init server &&
+
+	# O(1-10) -- O(1-11) -- O(1-12) master
+	#        \
+	#         -- O(0-10) otherbranch
+
+	printf "Line %d\n" $(test_seq 1 10) >server/file.txt &&
+	git -C server add file.txt &&
+	git -C server commit -m "merge base" &&
+
+	printf "Line %d\n" $(test_seq 1 11) >server/file.txt &&
+	git -C server commit -a -m "add 11" &&
+
+	printf "Line %d\n" $(test_seq 1 12) >server/file.txt &&
+	git -C server commit -a -m "add 12" &&
+
+	git -C server checkout -b otherbranch HEAD^^ &&
+	printf "Line %d\n" $(test_seq 0 10) >server/file.txt &&
+	git -C server commit -a -m "add 0" &&
+
+	test_config -C server uploadpack.allowfilter 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+
+	git clone --filter=blob:none "file://$(pwd)/server" client &&
+	git -C client checkout origin/master &&
+	git -C client checkout origin/otherbranch &&
+
+	# Sanity check to ensure that the blobs from the merge base and "add
+	# 11" are missing
+	git -C client rev-list --objects --all --missing=print >missing_list &&
+	MERGE_BASE_BLOB=$(git -C server rev-parse master^^:file.txt) &&
+	ADD_11_BLOB=$(git -C server rev-parse master^:file.txt) &&
+	grep "\\?$MERGE_BASE_BLOB" missing_list &&
+	grep "\\?$ADD_11_BLOB" missing_list &&
+
+	git -C client rebase --merge --skip-cherry-pick-detection origin/master &&
+
+	# The blob from the merge base had to be fetched, but not "add 11"
+	git -C client rev-list --objects --all --missing=print >missing_list &&
+	! grep "\\?$MERGE_BASE_BLOB" missing_list &&
+	grep "\\?$ADD_11_BLOB" missing_list
+'
+
 test_done