Message ID | 20240915221055.904107-1-bence@ferdinandy.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v3] set-head: no update without change and better output for --auto | expand |
The message I am responding to is sent as the first message to start a brand-new thread, but for future reference, it makes it easier for everybody involved if you sent a new iteration of a patch (or a patch series) as a reply/follow-up to the last round. If it is too cumbersome to do for any reason, as an alternative, you can give a reference to previous rounds as URLs to the list archive, i.e. (v1) https://lore.kernel.org/git/20240910203129.2251090-1-bence@ferdinandy.com/ (v2) https://lore.kernel.org/git/20240910203835.2288291-1-bence@ferdinandy.com/ > Currently, even if there is no actual change to remote/HEAD calling > remote set-head will overwrite the appropriate file and if set to --auto > will also print a message saying "remote/HEAD set to branch", which > implies something was changed. > > Change the behaviour of remote set-head so that the reference is only > updated if it actually needs to change. I sense a patch with a racy TOCTOU change coming ;-) > Change the output of --auto, so the output actually reflects what was > done: a) set a previously unset HEAD, b) change HEAD because remote > changed or c) no updates. Make the output easily parsable, by using > a slightly clunky wording that allows all three outputs to have the same > structure and number of words. This would be useful. > This patch was originally sent along when I thought set-head was > going to be invoked by fetch, but the discussion on the RFC > concluded that it should be not. This opened the possibility to make > it more explicit. I am not quite sure what the discussion concluded "should be not", though. In a message from you https://lore.kernel.org/git/D43G2CGX2N7L.ZRETD4HLIH0E@ferdinandy.com/ what we agreed was that "git fetch" does not need a new option, but we also agreed that it would be a good idea for "git fetch" to create, refs/remotes/$remote/HEAD when it does not exist without being told. I take it that you still want to see such a change to "git fetch" eventually happen, but decided to tackle the other half of the original two-patch series first with this patch? > static int set_head(int argc, const char **argv, const char *prefix) > { > - int i, opt_a = 0, opt_d = 0, result = 0; > - struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT; > + int i, opt_a = 0, opt_d = 0, is_ref_changed = 0, result = 0; Shall we simply call it ref_changed instead, so that I do not have to wonder which is better between has_ref_changed and is_ref_changed? ;-) > + struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT, buf3 = STRBUF_INIT; > char *head_name = NULL; > > struct option options[] = { > @@ -1440,13 +1440,19 @@ static int set_head(int argc, const char **argv, const char *prefix) > > if (head_name) { > strbuf_addf(&buf2, "refs/remotes/%s/%s", argv[0], head_name); > + refs_read_symbolic_ref(get_main_ref_store(the_repository),buf.buf,&buf3); > + is_ref_changed = strcmp(buf2.buf,buf3.buf); > /* make sure it's valid */ > if (!refs_ref_exists(get_main_ref_store(the_repository), buf2.buf)) > result |= error(_("Not a valid ref: %s"), buf2.buf); > - else if (refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head")) > + else if (is_ref_changed && refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head")) > result |= error(_("Could not setup %s"), buf.buf); Two and a half small issues: - Do not omit " " after "," and avoid overlong lines by folding lines at an appropriate place (usually after an operator at higher point in parse tree, i.e. after "is_ref_changed &&"). - This is inherently racy, isn't it? We read the _current_ value. After we do so, but before we write _our_ value, another process may update it, so we'd end up overwriting the value they wrote. - To compute is_ref_changed, we look at buf3.buf but what happens if such a ref does not exist, exists but is not a symbolic ref, or is a hierarchy under which other refs hang (e.g., a funny ref "refs/remotes/origin/HEAD/main" exists)? Does the strcmp() compare a valid buf2.buf and an uninitialized junk in buf3.buf and give a random result? Shouldn't the code checking the return value from the refs_read_symbolic_ref() call, and if we did so, would we learn enough to tell among the cases where (1) it is a symref, (2) it is a regular ref, (3) no such ref exists, and (4) the refs layer hit an error to prevent us from giving one of the previous 3 answers? If we really wanted to resolve the raciness, I think we need to enhance the set of values that create_symref() can optionally return to its callers so that the caller can tell what the old value was. I am not sure offhand how involved such an update would be, though. > + else if (opt_a && !strcmp(buf3.buf,"")) > + printf("%s/HEAD was unset: set to %s\n", argv[0], head_name); I suspect that this does not account for a case where the "read_symbolic_ref()" we did earlier failed for a reason other than "the ref did not exist". > + else if (opt_a && is_ref_changed) > + printf("%s/HEAD was changed: set to %s\n", argv[0], head_name); > else if (opt_a) > - printf("%s/HEAD set to %s\n", argv[0], head_name); > + printf("%s/HEAD was unchanged: set to %s\n", argv[0], head_name); > free(head_name); > } Quoting the values we obtained from the environment or the user may be nicer, e.g. 'refs/remotes/origin/HEAD' is now set to 'main'. 'refs/remotes/origin/HEAD' is unchanged and points at 'main'. This is a tangent outside the immediate topic of this patch, but I wonder if we need "--quiet" mode. Thanks.
On Mon Sep 16, 2024 at 20:36, Junio C Hamano <gitster@pobox.com> wrote: > The message I am responding to is sent as the first message to start > a brand-new thread, but for future reference, it makes it easier for > everybody involved if you sent a new iteration of a patch (or a > patch series) as a reply/follow-up to the last round. Sorry about that, I've only ever used git-send-email on sourcehut before, where they ask you _not_ to use --in-reply-to because of their UI and got used it... I'll do that next time! > > If it is too cumbersome to do for any reason, as an alternative, you > can give a reference to previous rounds as URLs to the list archive, > i.e. > > (v1) https://lore.kernel.org/git/20240910203129.2251090-1-bence@ferdinandy.com/ > (v2) https://lore.kernel.org/git/20240910203835.2288291-1-bence@ferdinandy.com/ Tbf, I did put this in the commit notes :) > > > Currently, even if there is no actual change to remote/HEAD calling > > remote set-head will overwrite the appropriate file and if set to --auto > > will also print a message saying "remote/HEAD set to branch", which > > implies something was changed. > > > > Change the behaviour of remote set-head so that the reference is only > > updated if it actually needs to change. > > I sense a patch with a racy TOCTOU change coming ;-) It seems that indeed :( [snip] > I am not quite sure what the discussion concluded "should be not", > though. In a message from you > > https://lore.kernel.org/git/D43G2CGX2N7L.ZRETD4HLIH0E@ferdinandy.com/ > > what we agreed was that "git fetch" does not need a new option, but > we also agreed that it would be a good idea for "git fetch" to > create, refs/remotes/$remote/HEAD when it does not exist without > being told. > > I take it that you still want to see such a change to "git fetch" > eventually happen, but decided to tackle the other half of the > original two-patch series first with this patch? Yes, but it did also turn up, that calling `git remote set-head --auto` directly from `git fetch` as I did in the original RFC requires the user to authenticate twice, which is not the best UX. So now I'm hammering away at adding a "set_head" directly to fetch (I've sort-of figured it out already, but it is definitely more involved than this one, EDIT: since reading below about racyness, maybe this one is actually more complicated :D)). It's just that now I can handle the two patches independently. > > > static int set_head(int argc, const char **argv, const char *prefix) > > { > > - int i, opt_a = 0, opt_d = 0, result = 0; > > - struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT; > > + int i, opt_a = 0, opt_d = 0, is_ref_changed = 0, result = 0; > > Shall we simply call it ref_changed instead, so that I do not have > to wonder which is better between has_ref_changed and is_ref_changed? ;-) Makes sense :) > > > + struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT, buf3 = STRBUF_INIT; > > char *head_name = NULL; > > > > struct option options[] = { > > @@ -1440,13 +1440,19 @@ static int set_head(int argc, const char **argv, const char *prefix) > > > > if (head_name) { > > strbuf_addf(&buf2, "refs/remotes/%s/%s", argv[0], head_name); > > > + refs_read_symbolic_ref(get_main_ref_store(the_repository),buf.buf,&buf3); > > + is_ref_changed = strcmp(buf2.buf,buf3.buf); > > /* make sure it's valid */ > > if (!refs_ref_exists(get_main_ref_store(the_repository), buf2.buf)) > > result |= error(_("Not a valid ref: %s"), buf2.buf); > > - else if (refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head")) > > + else if (is_ref_changed && refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head")) > > result |= error(_("Could not setup %s"), buf.buf); > > Two and a half small issues: > > - Do not omit " " after "," and avoid overlong lines by folding > lines at an appropriate place (usually after an operator at > higher point in parse tree, i.e. after "is_ref_changed &&"). > > - This is inherently racy, isn't it? We read the _current_ value. > After we do so, but before we write _our_ value, another process > may update it, so we'd end up overwriting the value they wrote. Honestly, I've just sort of assumed git in general protects against parallel writes, but I guess I'm wrong :D We'd need to block writes to the reference while we're doing this, but I'm not quite sure how to achieve that. > > - To compute is_ref_changed, we look at buf3.buf but what happens > if such a ref does not exist, exists but is not a symbolic ref, > or is a hierarchy under which other refs hang (e.g., a funny ref > "refs/remotes/origin/HEAD/main" exists)? Does the strcmp() > compare a valid buf2.buf and an uninitialized junk in buf3.buf > and give a random result? Shouldn't the code checking the return > value from the refs_read_symbolic_ref() call, and if we did so, > would we learn enough to tell among the cases where (1) it is a > symref, (2) it is a regular ref, (3) no such ref exists, and (4) > the refs layer hit an error to prevent us from giving one of the > previous 3 answers? > > If we really wanted to resolve the raciness, I think we need to > enhance the set of values that create_symref() can optionally return > to its callers so that the caller can tell what the old value was. > > I am not sure offhand how involved such an update would be, though. Hmm, so there's a lot to unpack for me here. Is create_symref already atomic so if it could give back the previous value we could decide based on that what to print (and forget about not writing the ref when there's no need)? Anyhow, I'll explore this part of the code and see if I can understand what's going on. [snip] > Quoting the values we obtained from the environment or the user may > be nicer, e.g. > > 'refs/remotes/origin/HEAD' is now set to 'main'. > 'refs/remotes/origin/HEAD' is unchanged and points at 'main'. Ack. I also gather from this that we should be a bit more verbose rather than trying to make it easy to parse. I also forgot to free up buf3, which I'll do for the next round. > > This is a tangent outside the immediate topic of this patch, but I > wonder if we need "--quiet" mode. Even with the planned changed to fetch, if one want to be completely up-to-date, they'd be running fetch and remote set-head and since the former has a quiet I think it would make sense. At least: adding it would be more straight forward than solving raciness :D Best, Bence
On Mon Sep 16, 2024 at 20:36, Junio C Hamano <gitster@pobox.com> wrote: [snip] > - This is inherently racy, isn't it? We read the _current_ value. > After we do so, but before we write _our_ value, another process > may update it, so we'd end up overwriting the value they wrote. So I've been thinking my first patch need not be so ambitious. The current behaviour is to indiscriminately overwrite remote/HEAD. So what if we dial this back a bit, leave the indiscriminate overwrite in place (the added benefit of that is pretty small anyway) and only improve the printed output, which was the main goal anyway? This is still slightly racy, since between the read and the write the status of remote/HEAD could still change potentially, so the information received by the user may be _slightly_ off, as we may be printing that the command just created remote/HEAD when it was actually created a split millisecond earlier by another command, but the printed end result will be always correct. And since the output is aimed at humans really, I'd say even if some other process did create that remote/HEAD between read and write, it's still actually true that before running set-head it did not exist and now it does and is set to X. In short: any raciness left should not have practical implications if we always actually write remote/HEAD. What do you think? Best, Bence
"Bence Ferdinandy" <bence@ferdinandy.com> writes: > On Mon Sep 16, 2024 at 20:36, Junio C Hamano <gitster@pobox.com> wrote: > > [snip] > >> - This is inherently racy, isn't it? We read the _current_ value. >> After we do so, but before we write _our_ value, another process >> may update it, so we'd end up overwriting the value they wrote. > > So I've been thinking my first patch need not be so ambitious. The current > behaviour is to indiscriminately overwrite remote/HEAD. So what if we dial this > back a bit, leave the indiscriminate overwrite in place (the added benefit of > that is pretty small anyway) and only improve the printed output, which was the > main goal anyway? Yes. You could ignore races and it does not degrade correctness of the end result (i.e., you will still get the same "oops, that is not what I wrote, but somebody else overwritten what I wrote" and vice versa). Your new messages that try to differenciate "we noticed you are writing the same so we refrained from doing anything" and "we wrote this new value" can racily incorrect the same way. In other words, the message may claim that we detected a no-op change from value A to new value A hence we didn't do anything, but after the detection somebody may have wrote another value B from sideways, so the last message that hit the end user's eyes may imply the resulting value ought to be A but it is not because somebody else silently changed it to B. If we did not skip writing using racy logic, we would have written value A and reported that we updated it to value A, but the other party who silently wrote B may do so after all that happens, and the end result the user would see is the same (i.e. we tell the user the last value we expect it to be is A, but the final value is B). So, as long as we can explain the behaviour to the end-user well, I do not care too deeply. My impression was that avoiding it by just taking advantage of the atomicity afforded by create_symref() looked like a low hanging fruit, but that can be done by somebody who are curious to see how involved such a change actually is and can be safely left as a #leftoverbit ;-). Thanks.
On Tue Sep 17, 2024 at 22:51, Junio C Hamano <gitster@pobox.com> wrote: > > So, as long as we can explain the behaviour to the end-user well, I > do not care too deeply. My impression was that avoiding it by just > taking advantage of the atomicity afforded by create_symref() looked > like a low hanging fruit, but that can be done by somebody who are > curious to see how involved such a change actually is and can be > safely left as a #leftoverbit ;-). > > Thanks. Now that I've poked around a bit (cf. https://lore.kernel.org/git/20240919121335.298856-2-bence@ferdinandy.com/T/#u) that fruit does seem to hang lower :) My idea is the following: Add a new member to the ref_update struct that records the value of the ref _before_ the update transaction (in the files-backend), that can then be accessed post transaction_commit in refs_update_symref. If we want to pass this info up to the caller of refs_update_symref then I guess the only option is to pass an extra buf where we can write this, or if we're fine with update_symref itself doing the printing we probably need a bool to turn this on/off. My PoC seems to work and pass tests, although judging from previous interactions I'm probably overlooking a couple of edge cases. Does this seem reasonable? Best, Bence
diff --git a/builtin/remote.c b/builtin/remote.c index 0acc547d69..4b116dfc19 100644 --- a/builtin/remote.c +++ b/builtin/remote.c @@ -1400,8 +1400,8 @@ static int show(int argc, const char **argv, const char *prefix) static int set_head(int argc, const char **argv, const char *prefix) { - int i, opt_a = 0, opt_d = 0, result = 0; - struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT; + int i, opt_a = 0, opt_d = 0, is_ref_changed = 0, result = 0; + struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT, buf3 = STRBUF_INIT; char *head_name = NULL; struct option options[] = { @@ -1440,13 +1440,19 @@ static int set_head(int argc, const char **argv, const char *prefix) if (head_name) { strbuf_addf(&buf2, "refs/remotes/%s/%s", argv[0], head_name); + refs_read_symbolic_ref(get_main_ref_store(the_repository),buf.buf,&buf3); + is_ref_changed = strcmp(buf2.buf,buf3.buf); /* make sure it's valid */ if (!refs_ref_exists(get_main_ref_store(the_repository), buf2.buf)) result |= error(_("Not a valid ref: %s"), buf2.buf); - else if (refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head")) + else if (is_ref_changed && refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head")) result |= error(_("Could not setup %s"), buf.buf); + else if (opt_a && !strcmp(buf3.buf,"")) + printf("%s/HEAD was unset: set to %s\n", argv[0], head_name); + else if (opt_a && is_ref_changed) + printf("%s/HEAD was changed: set to %s\n", argv[0], head_name); else if (opt_a) - printf("%s/HEAD set to %s\n", argv[0], head_name); + printf("%s/HEAD was unchanged: set to %s\n", argv[0], head_name); free(head_name); } diff --git a/t/t5505-remote.sh b/t/t5505-remote.sh index 532035933f..6b61f043d3 100755 --- a/t/t5505-remote.sh +++ b/t/t5505-remote.sh @@ -434,7 +434,7 @@ test_expect_success 'set-head --auto has no problem w/multiple HEADs' ' cd test && git fetch two "refs/heads/*:refs/remotes/two/*" && git remote set-head --auto two >output 2>&1 && - echo "two/HEAD set to main" >expect && + echo "two/HEAD was unset: set to main" >expect && test_cmp expect output ) '
Currently, even if there is no actual change to remote/HEAD calling remote set-head will overwrite the appropriate file and if set to --auto will also print a message saying "remote/HEAD set to branch", which implies something was changed. Change the behaviour of remote set-head so that the reference is only updated if it actually needs to change. Change the output of --auto, so the output actually reflects what was done: a) set a previously unset HEAD, b) change HEAD because remote changed or c) no updates. Make the output easily parsable, by using a slightly clunky wording that allows all three outputs to have the same structure and number of words. Signed-off-by: Bence Ferdinandy <bence@ferdinandy.com> --- Notes: v1-v2: was RFC in https://lore.kernel.org/git/20240910203835.2288291-1-bence@ferdinandy.com/ v3: This patch was originally sent along when I thought set-head was going to be invoked by fetch, but the discussion on the RFC concluded that it should be not. This opened the possibility to make it more explicit. Note: although I feel both things the patch does are really just cosmetic, an argument could be made for breaking it into two, one for the no-op part and one for the --auto print update. builtin/remote.c | 14 ++++++++++---- t/t5505-remote.sh | 2 +- 2 files changed, 11 insertions(+), 5 deletions(-)