diff mbox series

[v3,3/7] builtin: patch-id: fix patch-id with binary diffs

Message ID 2164212892712930cb34223499bb3e03bf2c2392.1665737804.git.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit ef4ed401b63a364a7f2a1d33b18159b94e50af72
Headers show
Series patch-id fixes and improvements | expand

Commit Message

Jerry Zhang Oct. 14, 2022, 8:56 a.m. UTC
From: Jerry Zhang <Jerry@skydio.com>

"git patch-id" currently doesn't produce correct output if the
incoming diff has any binary files. Add logic to
get_one_patchid to handle the different possible styles of binary
diff. This attempts to keep resulting patch-ids identical to what
would be produced by the counterpart logic in diff.c, that is it
produces the id by hashing the a and b oids in succession.

In general we handle binary diffs by first caching the object ids from
the "index" line and using those if we then find an indication
that the diff is binary.

The input could contain patches generated with "git diff --binary". This
currently breaks the parse logic and results in multiple patch-ids
output for a single commit. Here we have to skip the contents of the
patch itself since those do not go into the patch id. --binary
implies --full-index so the object ids are always available.

When the diff is generated with --full-index there is no patch content
to skip over.

When a diff is generated without --full-index or --binary, it will
contain abbreviated object ids. This will still result in a sufficiently
unique patch-id when hashed, but does not match internal patch id
output. We'll call this ok for now as we already need specialized
arguments to diff in order to match internal patch id (namely -U3).

Signed-off-by: Jerry Zhang <Jerry@skydio.com>
---
 builtin/patch-id.c  | 36 ++++++++++++++++++++++++++++++++++--
 t/t4204-patch-id.sh | 29 ++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 3 deletions(-)

Comments

Junio C Hamano Oct. 14, 2022, 5:13 p.m. UTC | #1
"Jerry Zhang via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Jerry Zhang <Jerry@skydio.com>
>
> "git patch-id" currently doesn't produce correct output if the
> incoming diff has any binary files. Add logic to
> get_one_patchid to handle the different possible styles of binary
> diff. This attempts to keep resulting patch-ids identical to what
> would be produced by the counterpart logic in diff.c, that is it
> produces the id by hashing the a and b oids in succession.

It is sad that we have two separate implementations in the first
place.  Do you see if it is feasible to unify the implementation
by reusing one from the other (answering this is not a requirement
for this patch to be looked at)?
Junio C Hamano Oct. 14, 2022, 9:12 p.m. UTC | #2
"Jerry Zhang via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Jerry Zhang <Jerry@skydio.com>
>
> "git patch-id" currently doesn't produce correct output if the
> incoming diff has any binary files. Add logic to
> get_one_patchid to handle the different possible styles of binary
> diff. This attempts to keep resulting patch-ids identical to what
> would be produced by the counterpart logic in diff.c, that is it
> produces the id by hashing the a and b oids in succession.

I thought I saw that a previous step touched diff.c to change how
patch ID for a binary diff is computed to match what patch-id
command computes?  Now we also have to change patch-id?  In the end
output from both may match, but which one between diff and patch-id
have we standardised on?

Puzzled...
Jerry Zhang Oct. 14, 2022, 10:33 p.m. UTC | #3
On Fri, Oct 14, 2022 at 10:13 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Jerry Zhang via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Jerry Zhang <Jerry@skydio.com>
> >
> > "git patch-id" currently doesn't produce correct output if the
> > incoming diff has any binary files. Add logic to
> > get_one_patchid to handle the different possible styles of binary
> > diff. This attempts to keep resulting patch-ids identical to what
> > would be produced by the counterpart logic in diff.c, that is it
> > produces the id by hashing the a and b oids in succession.
>
> It is sad that we have two separate implementations in the first
> place.  Do you see if it is feasible to unify the implementation
> by reusing one from the other (answering this is not a requirement
> for this patch to be looked at)?
Yeah I wondered this myself, it's tricky because they are actually doing
opposite things: the diff.c logic is adding diff metadata before doing the
patch-id, while the patch-id logic is parsing out the diff metadata. We could
refactor it, but would have to be careful not to accidentally change the output
semantics.

Another possible path to "unifying" the logic would be to add a
"--patch-id" mode
to "git diff' that produces the patch-id of what would be the diff,
rather than the diff
itself. For the usecases that involve piping "git diff" into "git
patch-id", this would
require not needing the separate patch-id tool at all. Of course
people also like
to run "patch-id" on the output of "format-patch" after the fact so
this isn't a perfect
solution either.

Speaking of which, do you have some context as to why we promise that
"git patch-id"
output will remain the same across git versions? Were there cases in
the past where
people actually made persistent databases of patch-ids, or complained
about the output
changing? I ask because this requirement makes it difficult to make
big changes, and
there aren't any tests to verify consistent output between git
versions. Also git itself
is already a persistent database of patches, so I'm not sure why
someone would choose
to implement a new system for this.
>
Jerry Zhang Oct. 14, 2022, 10:34 p.m. UTC | #4
On Fri, Oct 14, 2022 at 2:12 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Jerry Zhang via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Jerry Zhang <Jerry@skydio.com>
> >
> > "git patch-id" currently doesn't produce correct output if the
> > incoming diff has any binary files. Add logic to
> > get_one_patchid to handle the different possible styles of binary
> > diff. This attempts to keep resulting patch-ids identical to what
> > would be produced by the counterpart logic in diff.c, that is it
> > produces the id by hashing the a and b oids in succession.
>
> I thought I saw that a previous step touched diff.c to change how
> patch ID for a binary diff is computed to match what patch-id
> command computes?  Now we also have to change patch-id?  In the end
> output from both may match, but which one between diff and patch-id
> have we standardised on?
Er yeah let me see if I can simplify.

Before:
Internal patch-id w/ unstable + binary was correct
Internal patch-id w/ stable + binary was broken
builtin patch-id w/ binary was broken

After:
Internal patch-id w/ unstable + binary is correct
Internal patch-id w/ stable + binary is now correct
builtin patch-id w/ binary is now correct

So the "standard" actually came from the one working case from
"before", which was the diff.c logic + unstable. I based all new logic
on that because it seemed reasonable and correct. Since "internal
w/unstable" is never exposed externally, it's perhaps true that i
could have invented a totally new format and standardized on that.
Hashing the oids in succession is pretty much representative of a
binary patch though, so I don't think there's much to be improved on.
>
> Puzzled...
Junio C Hamano Oct. 17, 2022, 3:23 p.m. UTC | #5
Jerry Zhang <jerry@skydio.com> writes:

>> I thought I saw that a previous step touched diff.c to change how
>> patch ID for a binary diff is computed to match what patch-id
>> command computes?  Now we also have to change patch-id?  In the end
>> output from both may match, but which one between diff and patch-id
>> have we standardised on?
> Er yeah let me see if I can simplify.
>
> Before:
> Internal patch-id w/ unstable + binary was correct
> Internal patch-id w/ stable + binary was broken
> builtin patch-id w/ binary was broken
>
> After:
> Internal patch-id w/ unstable + binary is correct
> Internal patch-id w/ stable + binary is now correct
> builtin patch-id w/ binary is now correct
>
> So the "standard" actually came from the one working case from
> "before", which was the diff.c logic + unstable.

OK.

The question was meant to help you improve the log message, as it is
something a future reader of "git log" would wonder after reading
them.  I think including something that makes it easy for readers to
arrive at the summary above themselves by reading the log message
would be a very much welcome change.

Thanks.
diff mbox series

Patch

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 881fcf32732..e7a31123142 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -61,6 +61,8 @@  static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 {
 	int patchlen = 0, found_next = 0;
 	int before = -1, after = -1;
+	int diff_is_binary = 0;
+	char pre_oid_str[GIT_MAX_HEXSZ + 1], post_oid_str[GIT_MAX_HEXSZ + 1];
 	git_hash_ctx ctx;
 
 	the_hash_algo->init_fn(&ctx);
@@ -88,14 +90,44 @@  static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 
 		/* Parsing diff header?  */
 		if (before == -1) {
-			if (starts_with(line, "index "))
+			if (starts_with(line, "GIT binary patch") ||
+			    starts_with(line, "Binary files")) {
+				diff_is_binary = 1;
+				before = 0;
+				the_hash_algo->update_fn(&ctx, pre_oid_str,
+							 strlen(pre_oid_str));
+				the_hash_algo->update_fn(&ctx, post_oid_str,
+							 strlen(post_oid_str));
+				if (stable)
+					flush_one_hunk(result, &ctx);
 				continue;
-			else if (starts_with(line, "--- "))
+			} else if (skip_prefix(line, "index ", &p)) {
+				char *oid1_end = strstr(line, "..");
+				char *oid2_end = NULL;
+				if (oid1_end)
+					oid2_end = strstr(oid1_end, " ");
+				if (!oid2_end)
+					oid2_end = line + strlen(line) - 1;
+				if (oid1_end != NULL && oid2_end != NULL) {
+					*oid1_end = *oid2_end = '\0';
+					strlcpy(pre_oid_str, p, GIT_MAX_HEXSZ + 1);
+					strlcpy(post_oid_str, oid1_end + 2, GIT_MAX_HEXSZ + 1);
+				}
+				continue;
+			} else if (starts_with(line, "--- "))
 				before = after = 1;
 			else if (!isalpha(line[0]))
 				break;
 		}
 
+		if (diff_is_binary) {
+			if (starts_with(line, "diff ")) {
+				diff_is_binary = 0;
+				before = -1;
+			}
+			continue;
+		}
+
 		/* Looking for a valid hunk header?  */
 		if (before == 0 && after == 0) {
 			if (starts_with(line, "@@ -")) {
diff --git a/t/t4204-patch-id.sh b/t/t4204-patch-id.sh
index a730c0db985..cdc5191aa8d 100755
--- a/t/t4204-patch-id.sh
+++ b/t/t4204-patch-id.sh
@@ -42,7 +42,7 @@  calc_patch_id () {
 }
 
 get_top_diff () {
-	git log -p -1 "$@" -O bar-then-foo --
+	git log -p -1 "$@" -O bar-then-foo --full-index --
 }
 
 get_patch_id () {
@@ -61,6 +61,33 @@  test_expect_success 'patch-id detects inequality' '
 	get_patch_id notsame &&
 	! test_cmp patch-id_main patch-id_notsame
 '
+test_expect_success 'patch-id detects equality binary' '
+	cat >.gitattributes <<-\EOF &&
+	foo binary
+	bar binary
+	EOF
+	get_patch_id main &&
+	get_patch_id same &&
+	git log -p -1 --binary main >top-diff.output &&
+	calc_patch_id <top-diff.output main_binpatch &&
+	git log -p -1 --binary same >top-diff.output &&
+	calc_patch_id <top-diff.output same_binpatch &&
+	test_cmp patch-id_main patch-id_main_binpatch &&
+	test_cmp patch-id_same patch-id_same_binpatch &&
+	test_cmp patch-id_main patch-id_same &&
+	test_when_finished "rm .gitattributes"
+'
+
+test_expect_success 'patch-id detects inequality binary' '
+	cat >.gitattributes <<-\EOF &&
+	foo binary
+	bar binary
+	EOF
+	get_patch_id main &&
+	get_patch_id notsame &&
+	! test_cmp patch-id_main patch-id_notsame &&
+	test_when_finished "rm .gitattributes"
+'
 
 test_expect_success 'patch-id supports git-format-patch output' '
 	get_patch_id main &&