diff mbox series

patch-ids: handle duplicate hashmap entries

Message ID X/3FwNPHqZqY+hv0@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit abcfbf8603b16bb874f09c0bf8a94bc0030a0409
Headers show
Series patch-ids: handle duplicate hashmap entries | expand

Commit Message

Jeff King Jan. 12, 2021, 3:52 p.m. UTC
On Tue, Jan 12, 2021 at 03:34:38PM +0000, Arnaud Morin wrote:

> So instead of:
> id = has_commit_patch_id(commit, &ids);
> 
> We should use some sort of iterator, in order to call:
> id->commit->object.flags |= cherry_flag;
> 
> for all commits which are related to this patch?

Yep, exactly.

Here it is as a patch. :)

-- >8 --
Subject: [PATCH] patch-ids: handle duplicate hashmap entries

This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
hand-rolled hashmap implementation, 2016-07-29) in which

  git rev-list --cherry-pick A...B

will fail to suppress commits reachable from A even if a commit with
matching patch-id appears in B.

Around the time of that commit, the algorithm for "--cherry-pick" looked
something like this:

  0. Traverse all of the commits, marking them as being on the left or
     right side of the symmetric difference.

  1. Iterate over the left-hand commits, inserting a patch-id struct for
     each into a hashmap, and pointing commit->util to the patch-id
     struct.

  2. Iterate over the right-hand commits, checking which are present in
     the hashmap. If so, we exclude the commit from the output _and_ we
     mark the patch-id as "seen".

  3. Iterate again over the left-hand commits, checking whether
     commit->util->seen is set; if so, exclude them from the output.

At the end, we'll have eliminated commits from both sides that have a
matching patch-id on the other side. But there's a subtle assumption
here: for any given patch-id, we must have exactly one struct
representing it. If two commits from A both have the same patch-id and
we allow duplicates in the hashmap, then we run into a problem:

  a. In step 1, we insert two patch-id structs into the hashmap.

  b. In step 2, our lookups will find only one of these structs, so only
     one "seen" flag is marked.

  c. In step 3, one of the commits in A will have its commit->util->seen
     set, but the other will not. We'll erroneously output the latter.

Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
it used hashmap_add(), which explicitly does allow duplicates.

At that point, the solution would have been easy: when we are about to
add a duplicate, skip doing so and return the existing entry which
matches. But it gets more complicated.

In 683f17ec44 (patch-ids: replace the seen indicator with a commit
pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
when the right-hand side finds a matching patch_id from the left-hand
side, we can directly mark the left-hand patch_id->commit to be omitted.
Solving that would be easy, too; there's a one-to-many relationship of
patch-ids to commits, so we just need to keep a list.

But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
patch IDs, 2016-07-29) built on that by lazily computing the full
patch-ids. So we don't even know when adding to the hashmap whether two
commits truly have the same id. We'd have to tentatively assign them a
list, and then possibly split them apart (possibly into N new structs)
at the moment we compute the real patch-ids. This could work, but it's
complicated and error-prone.

Instead, let's accept that we may store duplicates, and teach the lookup
side to be more clever. Rather than asking for a single matching
patch-id, it will need to iterate over all matching patch-ids. This does
mean examining every entry in a single hash bucket, but the worst-case
for a hash lookup was already doing that.

We'll keep the hashmap details out of the caller by providing a simple
iteration interface. We can retain the simple has_commit_patch_id()
interface for the other callers, but we'll simplify its return value
into an integer, rather than returning the patch_id struct. That way
they won't be tempted to look at the "commit" field of the return value
without iterating.

Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 patch-ids.c                          | 14 +++++++++++++-
 patch-ids.h                          | 20 +++++++++++++++++++-
 revision.c                           |  6 ++++--
 t/t6007-rev-list-cherry-pick-file.sh | 12 ++++++++++++
 4 files changed, 48 insertions(+), 4 deletions(-)

Comments

Arnaud Morin Jan. 12, 2021, 4:24 p.m. UTC | #1
YAY!

Thanks a lot :)
Junio C Hamano Jan. 12, 2021, 7:13 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

> At the end, we'll have eliminated commits from both sides that have a
> matching patch-id on the other side. But there's a subtle assumption
> here: for any given patch-id, we must have exactly one struct
> representing it. If two commits from A both have the same patch-id and
> we allow duplicates in the hashmap, then we run into a problem:

In practical terms, for one side of the history to have two
identical patches, the most likely scenerio for that to happen is to
have a patch, its reversion, and its reapplication, intermixed in
its history with other commits, e.g.

    ---o---o---M---o---o---W---o---o---M---o--- ...
        \
         o---o---o---M---o--- ...

where "M" are commits that does an identical change, and "W"
(upside-down "M") is its reversion.  On the top history, "M" was
introduced, the bottom history cherry-picked, the top history found
problems in it and reverted with "W", and later (perhaps with the
help of preparatory patches on top of "W"), the top history now
considers that "M" is safe, so reapplies it.

And a "--cherry-pick" output that excludes "identical patches" that
appear on both sides on such a history would hide all "M"'s, leaving
a history like

    ---o---o-------o---o---W---o---o-------o--- ...
        \
         o---o---o-------o--- ...

But is this result what the user really wanted to see, I have to
wonder.

I do not see any problem in the patch itself.  We used to hide only
one "M" from the history on the top half in the picture, leaving one
"M" and "W", while hiding the sole "M" from the bottom half.  Now if
we want to no longer show any "M", the updated code would correctly
hide all of them.

It just feels to me that the resulting view of the history look
weird, leaving only the reversion of a patch that has never been
applied in the first place.
Arnaud Morin Jan. 13, 2021, 9:24 a.m. UTC | #3
Without this patch, that's even worst, consistency is broken.
Let me explain.

With your history example:

     ---o---o---M---o---o---W---o---o---M---o--- branch
         \
          o---o---o---M---o--- master

# WITHOUT PATCH
If we imagine that master is having more commits count than branch.
The result of rev-list will be like you described:
$ git rev-list --left-right --cherry-pick branch...master
<M
<W

In other words, it's showing both W and M.

BUT, if we imagine now that master is having less commits count than branch.
$ git rev-list --left-right --cherry-pick branch...master
<W

It's only showing W!

So the result is not consistent, depending on which branch is having
more commits.

# WITH PATCH
With the patch, everything is consistent, and only W is kept in ouptut,
no matter the size of history:
$ git.p rev-list --left-right --cherry-pick branch...master
<W

Cheers,
Jeff King Jan. 13, 2021, 12:59 p.m. UTC | #4
On Wed, Jan 13, 2021 at 09:24:48AM +0000, Arnaud Morin wrote:

> Without this patch, that's even worst, consistency is broken.
> Let me explain.
> 
> With your history example:
> 
>      ---o---o---M---o---o---W---o---o---M---o--- branch
>          \
>           o---o---o---M---o--- master
> 
> # WITHOUT PATCH
> If we imagine that master is having more commits count than branch.
> The result of rev-list will be like you described:
> $ git rev-list --left-right --cherry-pick branch...master
> <M
> <W
> 
> In other words, it's showing both W and M.
> 
> BUT, if we imagine now that master is having less commits count than branch.
> $ git rev-list --left-right --cherry-pick branch...master
> <W
> 
> It's only showing W!
> 
> So the result is not consistent, depending on which branch is having
> more commits.

Right. There's definitely a question of whether --cherry-pick is the
most useful thing in such a situation, but the current behavior was
simply broken. It did not behave as advertised, and it treated one side
of the history different from the other. So whether we want to do
anything else to help that case, I think this at least makes
--cherry-pick sane. :)

Here are two related histories to think about.

This is the same history, but the revert and re-do happens on both
sides (and this is actually the setup in the new test):

      ---o---o---M---o---o---W---o---o---M---o--- branch
          \
           o---o---M---o---o---W---o---o---M---o--- master

There we really _do_ want to clear out both M's (really, all four),
because we're doing so _from both sides_. And that is ultimately the
point of the cherry-pick option: show commits that were picked from one
side to the other.

Another interesting case is when the first "M" is actually bundled into
another commit, like:

      ---o---o---M+X---o---o---W---o---o---M---o--- branch
          \
           o---o---o--M---o--- master

where "M+X" has the changes from "M" plus some other changes. It will
get a different patch-id, and --cherry-pick would show both M+X and W,
but not M for either side. This is a limitation of the patch-id concept:
we are looking at whole commits, not overall changes.

I suspect for most operations we care less about "remove all
cherry-picks from both sides", but rather "give me one side, omitting
commits that are already on the other side" (because you want to rebase
or cherry-pick some subset).

-Peff
Junio C Hamano Jan. 13, 2021, 7:28 p.m. UTC | #5
Arnaud Morin <arnaud.morin@gmail.com> writes:

> Without this patch, that's even worst, consistency is broken.
> Let me explain.
>
> With your history example:
>
>      ---o---o---M---o---o---W---o---o---M---o--- branch
>          \
>           o---o---o---M---o--- master
>
> # WITHOUT PATCH
> If we imagine that master is having more commits count than branch.
> The result of rev-list will be like you described:
> $ git rev-list --left-right --cherry-pick branch...master
> <M
> <W
>
> In other words, it's showing both W and M.

So, at least they cancel out and the reader can tell that the net
effect was none --- that is "sort of understandable" result.

> BUT, if we imagine now that master is having less commits count than branch.
> $ git rev-list --left-right --cherry-pick branch...master
> <W
>
> It's only showing W!

Which is what I felt misleading.

> # WITH PATCH
> With the patch, everything is consistent, and only W is kept in ouptut,
> no matter the size of history:
> $ git.p rev-list --left-right --cherry-pick branch...master
> <W

So with the patch, the former case is made the same as the latter
misleading result, which would mean that they are consistently
misleading?

I guess that it is better to be misleading all the time than being
"sort of understandable" half the time and misleading the rest of
time.  At least that is consistent.

Thanks.
Junio C Hamano Jan. 13, 2021, 8:21 p.m. UTC | #6
Jeff King <peff@peff.net> writes:

> Right. There's definitely a question of whether --cherry-pick is the
> most useful thing in such a situation, but the current behavior was
> simply broken. It did not behave as advertised, and it treated one side
> of the history different from the other. So whether we want to do
> anything else to help that case, I think this at least makes
> --cherry-pick sane. :)

Yes; I think "sane" does not always equal to "useful", though ;-)

> Here are two related histories to think about.
> ...
> I suspect for most operations we care less about "remove all
> cherry-picks from both sides", but rather "give me one side, omitting
> commits that are already on the other side" (because you want to rebase
> or cherry-pick some subset).

Yes again.  It means "--cherry-pick" inherently is not symmetric.
One side is the reference side (i.e. mainline), and the goal is to
remove from the other side what is in the reference side.

What we are seeing in this discussion is that "--cherry-pick" and
symmetric difference do not conceptually play well together.

But the behaviour with the patch makes the most sense when
cherry-pick is applied to a symmetric difference (provided that
doing so makes sense in the first place, that is).
Jeff King Jan. 13, 2021, 8:33 p.m. UTC | #7
On Wed, Jan 13, 2021 at 12:21:30PM -0800, Junio C Hamano wrote:

> > I suspect for most operations we care less about "remove all
> > cherry-picks from both sides", but rather "give me one side, omitting
> > commits that are already on the other side" (because you want to rebase
> > or cherry-pick some subset).
> 
> Yes again.  It means "--cherry-pick" inherently is not symmetric.
> One side is the reference side (i.e. mainline), and the goal is to
> remove from the other side what is in the reference side.
> 
> What we are seeing in this discussion is that "--cherry-pick" and
> symmetric difference do not conceptually play well together.
> 
> But the behaviour with the patch makes the most sense when
> cherry-pick is applied to a symmetric difference (provided that
> doing so makes sense in the first place, that is).

I didn't realize --cherry-pick would work without a symmetric
difference. The documentation says:

  Omit any commit that introduces the same change as another commit on
  the “other side” when the set of commits are limited with symmetric
  difference.

And indeed, I think it silently does nothing with:

  git rev-list --cherry-pick A..B

(because there is nothing on the "other" side to match).

So you do need some symmetric difference in order to define the two
sets, but you might only want to see one of the sides.  And that is
basically what --cherry does. But having looked at the implementation of
cherry_pick_list(), it is quite happy to swap the sides internally. I
guess if we were going to make the output unsymmetric, the first thing
would be to stop doing that swap. :)

For the specifics of reverts and replays, though, I think the weirdness
has nothing to do with left/right, or showing one side. It's the
mismatched count. So if we were to make any change, it would be to keep
a count of how many times each commit appears on the other side, and
cancel them one for one.

I.e., this:

  o--M--W--M--o--
   \
    o--M--o---

might plausibly want to show "M" once on the upper branch. But:

  o--M--W--M--o--
   \
    o--M--W--M--o--

should never show M, I wouldn't think.

I'm not sure if that would be violating what existing callers expect
from "--cherry-pick", though (i.e., if it would need another name, or an
extra option to trigger).

-Peff
diff mbox series

Patch

diff --git a/patch-ids.c b/patch-ids.c
index 21973e4933..f51021a0cb 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -89,7 +89,7 @@  static int init_patch_id_entry(struct patch_id *patch,
 	return 0;
 }
 
-struct patch_id *has_commit_patch_id(struct commit *commit,
+struct patch_id *patch_id_iter_first(struct commit *commit,
 				     struct patch_ids *ids)
 {
 	struct patch_id patch;
@@ -104,6 +104,18 @@  struct patch_id *has_commit_patch_id(struct commit *commit,
 	return hashmap_get_entry(&ids->patches, &patch, ent, NULL);
 }
 
+struct patch_id *patch_id_iter_next(struct patch_id *cur,
+				    struct patch_ids *ids)
+{
+	return hashmap_get_next_entry(&ids->patches, cur, ent);
+}
+
+int has_commit_patch_id(struct commit *commit,
+			struct patch_ids *ids)
+{
+	return !!patch_id_iter_first(commit, ids);
+}
+
 struct patch_id *add_commit_patch_id(struct commit *commit,
 				     struct patch_ids *ids)
 {
diff --git a/patch-ids.h b/patch-ids.h
index 03bb04e707..ab6c6a6804 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -23,7 +23,25 @@  int commit_patch_id(struct commit *commit, struct diff_options *options,
 		    struct object_id *oid, int, int);
 int init_patch_ids(struct repository *, struct patch_ids *);
 int free_patch_ids(struct patch_ids *);
+
+/* Add a patch_id for a single commit to the set. */
 struct patch_id *add_commit_patch_id(struct commit *, struct patch_ids *);
-struct patch_id *has_commit_patch_id(struct commit *, struct patch_ids *);
+
+/* Returns true if the patch-id of "commit" is present in the set. */
+int has_commit_patch_id(struct commit *commit, struct patch_ids *);
+
+/*
+ * Iterate over all commits in the set whose patch id matches that of
+ * "commit", like:
+ *
+ *   struct patch_id *cur;
+ *   for (cur = patch_id_iter_first(commit, ids);
+ *        cur;
+ *        cur = patch_id_iter_next(cur, ids) {
+ *           ... look at cur->commit
+ *   }
+ */
+struct patch_id *patch_id_iter_first(struct commit *commit, struct patch_ids *);
+struct patch_id *patch_id_iter_next(struct patch_id *cur, struct patch_ids *);
 
 #endif /* PATCH_IDS_H */
diff --git a/revision.c b/revision.c
index 9dff845bed..7ec9b634e3 100644
--- a/revision.c
+++ b/revision.c
@@ -1241,12 +1241,14 @@  static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
 		/*
 		 * Have we seen the same patch id?
 		 */
-		id = has_commit_patch_id(commit, &ids);
+		id = patch_id_iter_first(commit, &ids);
 		if (!id)
 			continue;
 
 		commit->object.flags |= cherry_flag;
-		id->commit->object.flags |= cherry_flag;
+		do {
+			id->commit->object.flags |= cherry_flag;
+		} while ((id = patch_id_iter_next(id, &ids)));
 	}
 
 	free_patch_ids(&ids);
diff --git a/t/t6007-rev-list-cherry-pick-file.sh b/t/t6007-rev-list-cherry-pick-file.sh
index f0268372d2..8bf5ae23c2 100755
--- a/t/t6007-rev-list-cherry-pick-file.sh
+++ b/t/t6007-rev-list-cherry-pick-file.sh
@@ -245,6 +245,18 @@  test_expect_success '--count --left-right' '
 	test_cmp expect actual
 '
 
+test_expect_success '--cherry-pick with duplicates on each side' '
+	git checkout -b dup-orig &&
+	test_commit dup-base &&
+	git revert dup-base &&
+	git cherry-pick dup-base &&
+	git checkout -b dup-side HEAD~3 &&
+	test_tick &&
+	git cherry-pick -3 dup-orig &&
+	git rev-list --cherry-pick dup-orig...dup-side >actual &&
+	test_must_be_empty actual
+'
+
 # Corrupt the object store deliberately to make sure
 # the object is not even checked for its existence.
 remove_loose_object () {