Message ID | 20240710093610.GA2076910@coredump.intra.peff.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC/PATCH] add-patch: handle splitting hunks with diff.suppressBlankEmpty | expand |
Hi Peff On 10/07/2024 10:36, Jeff King wrote: > Subject: add-patch: handle splitting hunks with diff.suppressBlankEmpty > When "add -p" parses diffs, it looks for context lines starting with a > single space. But when diff.suppressBlankEmpty is in effect, an empty > context line will omit the space, giving us a true empty line. This > confuses the parser, which is unable to split based on such a line. > It's tempting to say that we should just make sure that we generate a > diff without that option. But we may parse diffs not only generated by > Git, but ones that users have manually edited. And POSIX calls the > decision of whether to print the space here "implementation-defined". Do we ever parse an edited hunk with this code? I'm not sure there is a way of splitting a hunk after it has been edited as I don't think we ever display it again. > So let's handle both cases: a context line either starts with a space or > consists of a totally empty line. > > Reported-by: Ilya Tumaykin <itumaykin@gmail.com> > Signed-off-by: Jeff King <peff@peff.net> > --- > I'm a little worried that this creates ambiguities, since I don't think > we are careful about following the hunk header's line counts. Imagine > you had an input like this: > > @@ -1,2 +1,2 @@> > -old > +new > stuff > > some garbage > > We obviously know that "some garbage" is not a context line and is just > trailing junk, because it does not begin with "-", "+" or space. But > what about the blank line in between? It looks like an empty context > line, but we can only know that it is not by respecting the counts in > the hunk header. > > I don't think we'd ever generate this ourselves, but could somebody > manually edit a hunk into this shape? When I tried it in practice, it > looks like we fail to apply the result even before my patch, though. I'm> not sure why that is. If I put "some garbage" without the blank line, we > correctly realize it should be discarded. It's possible I'm just holding > it wrong. When we recount the hunk after it has been edited we ignore lines that don't begin with '+', '-', ' ', or '\n' so if you add some garbage at the end of the hunk the recounted hunk header excludes it when it gets applied. I think your patch looks good. I did wonder if we wanted to fix this by normalizing context lines instead as shown in the diff below. That might make it less likely to miss adding "|| '\n'" in future code that is looking for a context line but I don't have a strong preference either way. Best Wishes Phillip ---- >8 ---- diff --git a/add-patch.c b/add-patch.c index d8ea05ff108..795aa772b7a 100644 --- a/add-patch.c +++ b/add-patch.c @@ -400,6 +400,12 @@ static void complete_file(char marker, struct hunk *hunk) hunk->splittable_into++; } +/* Empty context lines may omit the leading ' ' */ +static int normalize_marker(char marker) +{ + return marker == '\n' ? ' ' : marker; +} + static int parse_diff(struct add_p_state *s, const struct pathspec *ps) { struct strvec args = STRVEC_INIT; @@ -485,6 +491,7 @@ static int parse_diff(struct add_p_state *s, const struct pathspec *ps) while (p != pend) { char *eol = memchr(p, '\n', pend - p); const char *deleted = NULL, *mode_change = NULL; + char ch = normalize_marker(*p); if (!eol) eol = pend; @@ -532,7 +539,7 @@ static int parse_diff(struct add_p_state *s, const struct pathspec *ps) * Start counting into how many hunks this one can be * split */ - marker = *p; + marker = ch; } else if (hunk == &file_diff->head && starts_with(p, "new file")) { file_diff->added = 1; @@ -586,10 +593,10 @@ static int parse_diff(struct add_p_state *s, const struct pathspec *ps) (int)(eol - (plain->buf + file_diff->head.start)), plain->buf + file_diff->head.start); - if ((marker == '-' || marker == '+') && *p == ' ') + if ((marker == '-' || marker == '+') && ch == ' ') hunk->splittable_into++; - if (marker && *p != '\\') - marker = *p; + if (marker && ch != '\\') + marker = ch; p = eol == pend ? pend : eol + 1; hunk->end = p - plain->buf; @@ -953,7 +960,7 @@ static int split_hunk(struct add_p_state *s, struct file_diff *file_diff, context_line_count = 0; while (splittable_into > 1) { - ch = s->plain.buf[current]; + ch = normalize_marker(s->plain.buf[current]); if (!ch) BUG("buffer overrun while splitting hunks"); diff --git a/t/t3701-add-interactive.sh b/t/t3701-add-interactive.sh index 5d78868ac16..385c246baf0 100755 --- a/t/t3701-add-interactive.sh +++ b/t/t3701-add-interactive.sh @@ -1164,4 +1164,31 @@ test_expect_success 'reset -p with unmerged files' ' test_must_be_empty staged ' +test_expect_success 'hunk splitting works with diff.suppressBlankEmpty' ' + cat >expect <<-\EOF && + diff --git a/file b/file + index 777b174..ebc9684 100755 + --- a/file + +++ b/file + @@ -2,6 +2,6 @@ p + q + r + + -d + -e + -f + +s + +t + +u + EOF + + test_config diff.suppressBlankEmpty true && + test_write_lines a b c "" d e f >file && + git add file && + test_write_lines p q r "" s t u >file && + test_write_lines s y n q | git add -p && + git diff >actual && + diff_cmp expect actual +' + test_done
Jeff King <peff@peff.net> writes: > I'm a little worried that this creates ambiguities, since I don't think > we are careful about following the hunk header's line counts. Imagine > you had an input like this: > > @@ -1,2 +1,2 @@ > -old > +new > stuff > > some garbage True. Especially if you allow editing of hunks, the only thing you could make available to you are the version you gave the user and the version you got back from the user, but comparing them would not help resolve the ambiguity to correct the hunk header all that much. "diff" edit mode various editors may have _could_ help as they know each and every keystroke by the user and how the modified patch was constructed to guess the intention better than a mere comparison of before- and after- shape of the patch (but the last time I checked, diff edit mode of GNU Emacs did not adjust the hunk header correctly in some corner cases). > I don't think we'd ever generate this ourselves, but could somebody > manually edit a hunk into this shape? When I tried it in practice, it > looks like we fail to apply the result even before my patch, though. I'm > not sure why that is. If I put "some garbage" without the blank line, we > correctly realize it should be discarded. It's possible I'm just holding > it wrong. I've given up on the "hunk edit" doing wrong things to a patch already. The "edit" codepath used to be a lot less careless before Phillip whipped it into a much better shape with the series that ended at 436d18f2 (Merge branch 'pw/add-p-recount', 2018-03-14), that introduced recount_edited_hunk() that special cases "+/-/ ". It already knows that a completely empty line in a patch is an empty and unmodified line (which was ported from f4d35a6b (add -p: fix counting empty context lines in edited patches, 2018-06-11)), so this patch does not have to do anything new there. But "recounting" will be fooled by garbage left in the edited result, so I think we have done all we can do to resolve possible ambiguities. The patch under discussion is not adding anything new and making it any worse, I would say. > diff --git a/add-patch.c b/add-patch.c > index 6e176cd21a..7beead1d0a 100644 > --- a/add-patch.c > +++ b/add-patch.c > @@ -588,7 +588,7 @@ static int parse_diff(struct add_p_state *s, const struct pathspec *ps) > (int)(eol - (plain->buf + file_diff->head.start)), > plain->buf + file_diff->head.start); > > - if ((marker == '-' || marker == '+') && *p == ' ') > + if ((marker == '-' || marker == '+') && (*p == ' ' || *p == '\n')) > hunk->splittable_into++; > if (marker && *p != '\\') > marker = *p; > @@ -964,7 +964,7 @@ static int split_hunk(struct add_p_state *s, struct file_diff *file_diff, > * Is this the first context line after a chain of +/- lines? > * Then record the start of the next split hunk. > */ > - if ((marker == '-' || marker == '+') && ch == ' ') { > + if ((marker == '-' || marker == '+') && (ch == ' ' || ch == '\n')) { > first = 0; > hunk[1].start = current; > if (colored) > @@ -979,14 +979,14 @@ static int split_hunk(struct add_p_state *s, struct file_diff *file_diff, > * Then just increment the appropriate counter and continue > * with the next line. > */ > - if (marker != ' ' || (ch != '-' && ch != '+')) { > + if ((marker != ' ' && marker != '\n') || (ch != '-' && ch != '+')) { > next_hunk_line: > /* Comment lines are attached to the previous line */ > if (ch == '\\') > ch = marker ? marker : ' '; > > /* current hunk not done yet */ > - if (ch == ' ') > + if (ch == ' ' || ch == '\n') > context_line_count++; > else if (ch == '-') > header->old_count++; > diff --git a/t/t3701-add-interactive.sh b/t/t3701-add-interactive.sh > index 5d78868ac1..92c8e6dc8c 100755 > --- a/t/t3701-add-interactive.sh > +++ b/t/t3701-add-interactive.sh > @@ -1164,4 +1164,36 @@ test_expect_success 'reset -p with unmerged files' ' > test_must_be_empty staged > ' > > +test_expect_success 'splitting handles diff.suppressBlankEmpty' ' > + test_when_finished "git reset --hard" && > + cat >file <<-\EOF && > + 1 > + 2 > + > + 3 > + 4 > + EOF > + git add file && > + > + cat >file <<-\EOF && > + one > + two > + > + three > + four > + EOF > + test_write_lines s n y | > + git -c diff.suppressBlankEmpty=true add -p && > + > + git cat-file blob :file >actual && > + cat >expect <<-\EOF && > + 1 > + 2 > + > + three > + four > + EOF > + test_cmp expect actual > +' > + > test_done
On Wed, Jul 10, 2024 at 02:46:30PM +0100, Phillip Wood wrote: > > It's tempting to say that we should just make sure that we generate a > > diff without that option. But we may parse diffs not only generated by > > Git, but ones that users have manually edited. And POSIX calls the > > decision of whether to print the space here "implementation-defined". > > Do we ever parse an edited hunk with this code? I'm not sure there is a > way of splitting a hunk after it has been edited as I don't think we > ever display it again. Hmm, I just assumed this code would see the edited hunk, but now I'm not sure. Note that the changes required do go outside of split_hunk(); the initial parse_diff() needs to decide whether the hunk is splittable in the first place (as an aside, that puzzled me at first why just changing split_hunk() was enough for the case that started this thread, but not the one in the included test. The difference is that without the empty line as context, the hunk in the test wouldn't be splittable at all). But looking closer: yes, I do think parse_diff() is used only for the initial patch. So we really would only see git-generated patches here. Which I think takes away my ambiguity concern, but does mean the commit message is wrong. > > I don't think we'd ever generate this ourselves, but could somebody > > manually edit a hunk into this shape? When I tried it in practice, it > > looks like we fail to apply the result even before my patch, though. I'm > > not sure why that is. If I put "some garbage" without the blank line, we > > correctly realize it should be discarded. It's possible I'm just holding > > it wrong. > > When we recount the hunk after it has been edited we ignore lines that > don't begin with '+', '-', ' ', or '\n' so if you add some garbage at > the end of the hunk the recounted hunk header excludes it when it gets > applied. OK, that makes sense. And we could never rely on the hunk header in the edited hunk anyway, since the whole point is that we have to recount it. So the user must accept that an extra blank line is potential context (and that goes all the way back to bcdd297b78 (built-in add -p: implement hunk editing, 2019-12-13). > I think your patch looks good. I did wonder if we wanted to fix this > by normalizing context lines instead as shown in the diff below. That > might make it less likely to miss adding "|| '\n'" in future code that > is looking for a context line but I don't have a strong preference > either way. Yeah, I had a similar thought, but it got messy because we have to deal with the source buffer. But the extra "char ch" you added in the patch below fixes that. I think the result is better. Looking at the blank-line handling in recount_edited_hunk(), we also handle a CRLF empty line there. Should we do so here, too? If so, then it would just be a matter of touching normalize_marker() in your patch. Do you want to just re-send your patch with a commit message to replace mine? (Feel free to steal the non-wrong parts of my message ;) ). > ---- >8 ---- > diff --git a/add-patch.c b/add-patch.c > index d8ea05ff108..795aa772b7a 100644 > --- a/add-patch.c > +++ b/add-patch.c > @@ -400,6 +400,12 @@ static void complete_file(char marker, struct hunk *hunk) > hunk->splittable_into++; > } > +/* Empty context lines may omit the leading ' ' */ > +static int normalize_marker(char marker) > +{ > + return marker == '\n' ? ' ' : marker; > +} > + > static int parse_diff(struct add_p_state *s, const struct pathspec *ps) Minor nit: missing blank line between functions. -Peff
On Wed, Jul 10, 2024 at 10:06:28AM -0700, Junio C Hamano wrote: > > I don't think we'd ever generate this ourselves, but could somebody > > manually edit a hunk into this shape? When I tried it in practice, it > > looks like we fail to apply the result even before my patch, though. I'm > > not sure why that is. If I put "some garbage" without the blank line, we > > correctly realize it should be discarded. It's possible I'm just holding > > it wrong. > > I've given up on the "hunk edit" doing wrong things to a patch > already. > > The "edit" codepath used to be a lot less careless before Phillip > whipped it into a much better shape with the series that ended at > 436d18f2 (Merge branch 'pw/add-p-recount', 2018-03-14), that > introduced recount_edited_hunk() that special cases "+/-/ ". It > already knows that a completely empty line in a patch is an empty > and unmodified line (which was ported from f4d35a6b (add -p: fix > counting empty context lines in edited patches, 2018-06-11)), so > this patch does not have to do anything new there. Yeah, I was being paranoid without actually thinking through the implications. As Phillip noted, we do not even run the code I changed on the edited hunk (which already does handle the empty line). So not only could I not break it, but it is already doing the right thing thanks to that earlier work. > But "recounting" will be fooled by garbage left in the edited > result, so I think we have done all we can do to resolve possible > ambiguities. The patch under discussion is not adding anything new > and making it any worse, I would say. Yep, agreed. So modulo a slight inaccuracy in the commit message, I think my patch is OK. That said, I like what Phillip showed in his reply. If he wants to wrap that up into a patch, I think it could replace mine. -Peff
Hi Peff On 11/07/2024 22:26, Jeff King wrote: > On Wed, Jul 10, 2024 at 02:46:30PM +0100, Phillip Wood wrote: > >> I think your patch looks good. I did wonder if we wanted to fix this >> by normalizing context lines instead as shown in the diff below. That >> might make it less likely to miss adding "|| '\n'" in future code that >> is looking for a context line but I don't have a strong preference >> either way. > > Yeah, I had a similar thought, but it got messy because we have to deal > with the source buffer. But the extra "char ch" you added in the patch > below fixes that. I think the result is better. > > Looking at the blank-line handling in recount_edited_hunk(), we also > handle a CRLF empty line there. Should we do so here, too? If so, then > it would just be a matter of touching normalize_marker() in your patch. > > Do you want to just re-send your patch with a commit message to replace > mine? (Feel free to steal the non-wrong parts of my message ;) ). Thanks, I'll do that Best Wishes Phillip >> ---- >8 ---- >> diff --git a/add-patch.c b/add-patch.c >> index d8ea05ff108..795aa772b7a 100644 >> --- a/add-patch.c >> +++ b/add-patch.c >> @@ -400,6 +400,12 @@ static void complete_file(char marker, struct hunk *hunk) >> hunk->splittable_into++; >> } >> +/* Empty context lines may omit the leading ' ' */ >> +static int normalize_marker(char marker) >> +{ >> + return marker == '\n' ? ' ' : marker; >> +} >> + >> static int parse_diff(struct add_p_state *s, const struct pathspec *ps) > > Minor nit: missing blank line between functions. > > -Peff
On Sat, Jul 13, 2024 at 02:19:39PM +0100, phillip.wood123@gmail.com wrote: > > Do you want to just re-send your patch with a commit message to replace > > mine? (Feel free to steal the non-wrong parts of my message ;) ). > > Thanks, I'll do that Thanks. Note that mine is in 'next', but I think it would not be a big deal to revert and replace (I'm not sure it is even on track for 2.46 anyway). -Peff
On 13/07/2024 22:21, Jeff King wrote: > On Sat, Jul 13, 2024 at 02:19:39PM +0100, phillip.wood123@gmail.com wrote: > >>> Do you want to just re-send your patch with a commit message to replace >>> mine? (Feel free to steal the non-wrong parts of my message ;) ). >> >> Thanks, I'll do that > > Thanks. Note that mine is in 'next', but I think it would not be a big > deal to revert and replace (I'm not sure it is even on track for 2.46 > anyway). Thanks for the heads-up, I see that Junio has reverted it in the latest what's cooking email Phillip > -Peff >
On 11/07/2024 22:26, Jeff King wrote: > On Wed, Jul 10, 2024 at 02:46:30PM +0100, Phillip Wood wrote: > >>> It's tempting to say that we should just make sure that we generate a >>> diff without that option. But we may parse diffs not only generated by >>> Git, but ones that users have manually edited. And POSIX calls the >>> decision of whether to print the space here "implementation-defined". >> >> Do we ever parse an edited hunk with this code? I'm not sure there is a >> way of splitting a hunk after it has been edited as I don't think we >> ever display it again. > > Hmm, I just assumed this code would see the edited hunk, but now I'm not > sure. Note that the changes required do go outside of split_hunk(); the > initial parse_diff() needs to decide whether the hunk is splittable in > the first place (as an aside, that puzzled me at first why just changing > split_hunk() was enough for the case that started this thread, but not > the one in the included test. The difference is that without the empty > line as context, the hunk in the test wouldn't be splittable at all). > > But looking closer: yes, I do think parse_diff() is used only for the > initial patch. That's true but I've realized that we do in fact allow the user to re-display and split an edited hunk. If there are two changes in a file then you can edit the first hunk and when prompted about the second press 'K' to go back to the first one and then split it with 's' (I messed up my test for this before sending my previous mail as I changed two separate files rather than putting two changes in a single file). So split_hunk() can encounter empty context lines even without diff.suppressBlankEmpty being set as lots of editors strip the leading space when the rest of the line is empty[1]. As we haven't had any bug reports about that I suspect people are not splitting the hunks they edited which I guess makes sense. I think there is another bug lurking for anyone who does try to split an edited hunk as we don't update `hunk->splittable_into` after it has been edited and the edit might change a deleted of an empty line to a context line or vice versa. We should make sure any garbage at the end of the hunk is discarded as well so we don't show it when we display the edited hunk. Best Wishes Phillip [1] When I added the code to recount the hunk header rather than relying on "git apply --recount" initially it did not support empty context lines and we received quite a few bug reports pretty quickly after it was released. > So we really would only see git-generated patches here. > Which I think takes away my ambiguity concern, but does mean the commit > message is wrong. > >>> I don't think we'd ever generate this ourselves, but could somebody >>> manually edit a hunk into this shape? When I tried it in practice, it >>> looks like we fail to apply the result even before my patch, though. I'm >>> not sure why that is. If I put "some garbage" without the blank line, we >>> correctly realize it should be discarded. It's possible I'm just holding >>> it wrong. >> >> When we recount the hunk after it has been edited we ignore lines that >> don't begin with '+', '-', ' ', or '\n' so if you add some garbage at >> the end of the hunk the recounted hunk header excludes it when it gets >> applied. > > OK, that makes sense. And we could never rely on the hunk header in the > edited hunk anyway, since the whole point is that we have to recount it. > So the user must accept that an extra blank line is potential context > (and that goes all the way back to bcdd297b78 (built-in add -p: > implement hunk editing, 2019-12-13). > >> I think your patch looks good. I did wonder if we wanted to fix this >> by normalizing context lines instead as shown in the diff below. That >> might make it less likely to miss adding "|| '\n'" in future code that >> is looking for a context line but I don't have a strong preference >> either way. > > Yeah, I had a similar thought, but it got messy because we have to deal > with the source buffer. But the extra "char ch" you added in the patch > below fixes that. I think the result is better. > > Looking at the blank-line handling in recount_edited_hunk(), we also > handle a CRLF empty line there. Should we do so here, too? If so, then > it would just be a matter of touching normalize_marker() in your patch. > > Do you want to just re-send your patch with a commit message to replace > mine? (Feel free to steal the non-wrong parts of my message ;) ). > >> ---- >8 ---- >> diff --git a/add-patch.c b/add-patch.c >> index d8ea05ff108..795aa772b7a 100644 >> --- a/add-patch.c >> +++ b/add-patch.c >> @@ -400,6 +400,12 @@ static void complete_file(char marker, struct hunk *hunk) >> hunk->splittable_into++; >> } >> +/* Empty context lines may omit the leading ' ' */ >> +static int normalize_marker(char marker) >> +{ >> + return marker == '\n' ? ' ' : marker; >> +} >> + >> static int parse_diff(struct add_p_state *s, const struct pathspec *ps) > > Minor nit: missing blank line between functions. > > -Peff >
diff without that option. But we may parse diffs not only generated by Git, but ones that users have manually edited. And POSIX calls the decision of whether to print the space here "implementation-defined". So let's handle both cases: a context line either starts with a space or consists of a totally empty line. Reported-by: Ilya Tumaykin <itumaykin@gmail.com> Signed-off-by: Jeff King <peff@peff.net> --- I'm a little worried that this creates ambiguities, since I don't think we are careful about following the hunk header's line counts. Imagine you had an input like this: @@ -1,2 +1,2 @@ -old +new stuff some garbage We obviously know that "some garbage" is not a context line and is just trailing junk, because it does not begin with "-", "+" or space. But what about the blank line in between? It looks like an empty context line, but we can only know that it is not by respecting the counts in the hunk header. I don't think we'd ever generate this ourselves, but could somebody manually edit a hunk into this shape? When I tried it in practice, it looks like we fail to apply the result even before my patch, though. I'm not sure why that is. If I put "some garbage" without the blank line, we correctly realize it should be discarded. It's possible I'm just holding it wrong. add-patch.c | 8 ++++---- t/t3701-add-interactive.sh | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+), 4 deletions(-) diff --git a/add-patch.c b/add-patch.c index 6e176cd21a..7beead1d0a 100644 --- a/add-patch.c +++ b/add-patch.c @@ -588,7 +588,7 @@ static int parse_diff(struct add_p_state *s, const struct pathspec *ps) (int)(eol - (plain->buf + file_diff->head.start)), plain->buf + file_diff->head.start); - if ((marker == '-' || marker == '+') && *p == ' ') + if ((marker == '-' || marker == '+') && (*p == ' ' || *p == '\n')) hunk->splittable_into++; if (marker && *p != '\\') marker = *p; @@ -964,7 +964,7 @@ static int split_hunk(struct add_p_state *s, struct file_diff *file_diff, * Is this the first context line after a chain of +/- lines? * Then record the start of the next split hunk. */ - if ((marker == '-' || marker == '+') && ch == ' ') { + if ((marker == '-' || marker == '+') && (ch == ' ' || ch == '\n')) { first = 0; hunk[1].start = current; if (colored) @@ -979,14 +979,14 @@ static int split_hunk(struct add_p_state *s, struct file_diff *file_diff, * Then just increment the appropriate counter and continue * with the next line. */ - if (marker != ' ' || (ch != '-' && ch != '+')) { + if ((marker != ' ' && marker != '\n') || (ch != '-' && ch != '+')) { next_hunk_line: /* Comment lines are attached to the previous line */ if (ch == '\\') ch = marker ? marker : ' '; /* current hunk not done yet */ - if (ch == ' ') + if (ch == ' ' || ch == '\n') context_line_count++; else if (ch == '-') header->old_count++; diff --git a/t/t3701-add-interactive.sh b/t/t3701-add-interactive.sh index 5d78868ac1..92c8e6dc8c 100755 --- a/t/t3701-add-interactive.sh +++ b/t/t3701-add-interactive.sh @@ -1164,4 +1164,36 @@ test_expect_success 'reset -p with unmerged files' ' test_must_be_empty staged ' +test_expect_success 'splitting handles diff.suppressBlankEmpty' ' + test_when_finished "git reset --hard" && + cat >file <<-\EOF && + 1 + 2 + + 3 + 4 + EOF + git add file && + + cat >file <<-\EOF && + one + two + + three + four + EOF + test_write_lines s n y | + git -c diff.suppressBlankEmpty=true add -p && + + git cat-file blob :file >actual && + cat >expect <<-\EOF && + 1 + 2 + + three + four + EOF + test_cmp expect actual +' + test_done