diff mbox series

[v3] set-head: no update without change and better output for --auto

Message ID 20240915221055.904107-1-bence@ferdinandy.com (mailing list archive)
State New
Headers show
Series [v3] set-head: no update without change and better output for --auto | expand

Commit Message

Bence Ferdinandy Sept. 15, 2024, 10:09 p.m. UTC
Currently, even if there is no actual change to remote/HEAD calling
remote set-head will overwrite the appropriate file and if set to --auto
will also print a message saying "remote/HEAD set to branch", which
implies something was changed.

Change the behaviour of remote set-head so that the reference is only
updated if it actually needs to change.

Change the output of --auto, so the output actually reflects what was
done: a) set a previously unset HEAD, b) change HEAD because remote
changed or c) no updates. Make the output easily parsable, by using
a slightly clunky wording that allows all three outputs to have the same
structure and number of words.

Signed-off-by: Bence Ferdinandy <bence@ferdinandy.com>
---

Notes:
    v1-v2:
    
        was RFC in https://lore.kernel.org/git/20240910203835.2288291-1-bence@ferdinandy.com/
    
    v3:
    
        This patch was originally sent along when I thought set-head was
        going to be invoked by fetch, but the discussion on the RFC
        concluded that it should be not. This opened the possibility to make
        it more explicit.
    
        Note: although I feel both things the patch does are really just
        cosmetic, an argument could be made for breaking it into two, one
        for the no-op part and one for the --auto print update.

 builtin/remote.c  | 14 ++++++++++----
 t/t5505-remote.sh |  2 +-
 2 files changed, 11 insertions(+), 5 deletions(-)

Comments

Junio C Hamano Sept. 16, 2024, 6:36 p.m. UTC | #1
The message I am responding to is sent as the first message to start
a brand-new thread, but for future reference, it makes it easier for
everybody involved if you sent a new iteration of a patch (or a
patch series) as a reply/follow-up to the last round.

If it is too cumbersome to do for any reason, as an alternative, you
can give a reference to previous rounds as URLs to the list archive,
i.e.

 (v1) https://lore.kernel.org/git/20240910203129.2251090-1-bence@ferdinandy.com/
 (v2) https://lore.kernel.org/git/20240910203835.2288291-1-bence@ferdinandy.com/

> Currently, even if there is no actual change to remote/HEAD calling
> remote set-head will overwrite the appropriate file and if set to --auto
> will also print a message saying "remote/HEAD set to branch", which
> implies something was changed.
>
> Change the behaviour of remote set-head so that the reference is only
> updated if it actually needs to change.

I sense a patch with a racy TOCTOU change coming ;-)

> Change the output of --auto, so the output actually reflects what was
> done: a) set a previously unset HEAD, b) change HEAD because remote
> changed or c) no updates. Make the output easily parsable, by using
> a slightly clunky wording that allows all three outputs to have the same
> structure and number of words.

This would be useful.

>         This patch was originally sent along when I thought set-head was
>         going to be invoked by fetch, but the discussion on the RFC
>         concluded that it should be not. This opened the possibility to make
>         it more explicit.

I am not quite sure what the discussion concluded "should be not",
though.  In a message from you

  https://lore.kernel.org/git/D43G2CGX2N7L.ZRETD4HLIH0E@ferdinandy.com/

what we agreed was that "git fetch" does not need a new option, but
we also agreed that it would be a good idea for "git fetch" to
create, refs/remotes/$remote/HEAD when it does not exist without
being told.

I take it that you still want to see such a change to "git fetch"
eventually happen, but decided to tackle the other half of the
original two-patch series first with this patch?

>  static int set_head(int argc, const char **argv, const char *prefix)
>  {
> -	int i, opt_a = 0, opt_d = 0, result = 0;
> -	struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT;
> +	int i, opt_a = 0, opt_d = 0, is_ref_changed = 0, result = 0;

Shall we simply call it ref_changed instead, so that I do not have
to wonder which is better between has_ref_changed and is_ref_changed? ;-)

> +	struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT, buf3 = STRBUF_INIT;
>  	char *head_name = NULL;
>  
>  	struct option options[] = {
> @@ -1440,13 +1440,19 @@ static int set_head(int argc, const char **argv, const char *prefix)
>  
>  	if (head_name) {
>  		strbuf_addf(&buf2, "refs/remotes/%s/%s", argv[0], head_name);

> +		refs_read_symbolic_ref(get_main_ref_store(the_repository),buf.buf,&buf3);
> +		is_ref_changed = strcmp(buf2.buf,buf3.buf);
>  		/* make sure it's valid */
>  		if (!refs_ref_exists(get_main_ref_store(the_repository), buf2.buf))
>  			result |= error(_("Not a valid ref: %s"), buf2.buf);
> -		else if (refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head"))
> +		else if (is_ref_changed && refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head"))
>  			result |= error(_("Could not setup %s"), buf.buf);

Two and a half small issues:

 - Do not omit " " after "," and avoid overlong lines by folding
   lines at an appropriate place (usually after an operator at
   higher point in parse tree, i.e. after "is_ref_changed &&").

 - This is inherently racy, isn't it?  We read the _current_ value.
   After we do so, but before we write _our_ value, another process
   may update it, so we'd end up overwriting the value they wrote.

 - To compute is_ref_changed, we look at buf3.buf but what happens
   if such a ref does not exist, exists but is not a symbolic ref,
   or is a hierarchy under which other refs hang (e.g., a funny ref
   "refs/remotes/origin/HEAD/main" exists)?  Does the strcmp()
   compare a valid buf2.buf and an uninitialized junk in buf3.buf
   and give a random result?  Shouldn't the code checking the return
   value from the refs_read_symbolic_ref() call, and if we did so,
   would we learn enough to tell among the cases where (1) it is a
   symref, (2) it is a regular ref, (3) no such ref exists, and (4)
   the refs layer hit an error to prevent us from giving one of the
   previous 3 answers?

If we really wanted to resolve the raciness, I think we need to
enhance the set of values that create_symref() can optionally return
to its callers so that the caller can tell what the old value was.

I am not sure offhand how involved such an update would be, though.

> +		else if (opt_a && !strcmp(buf3.buf,""))
> +			printf("%s/HEAD was unset: set to %s\n", argv[0], head_name);

I suspect that this does not account for a case where the
"read_symbolic_ref()" we did earlier failed for a reason other than
"the ref did not exist".

> +		else if (opt_a && is_ref_changed)
> +			printf("%s/HEAD was changed: set to %s\n", argv[0], head_name);
>  		else if (opt_a)
> -			printf("%s/HEAD set to %s\n", argv[0], head_name);
> +			printf("%s/HEAD was unchanged: set to %s\n", argv[0], head_name);
>  		free(head_name);
>  	}

Quoting the values we obtained from the environment or the user may
be nicer, e.g.

    'refs/remotes/origin/HEAD' is now set to 'main'.
    'refs/remotes/origin/HEAD' is unchanged and points at 'main'.

This is a tangent outside the immediate topic of this patch, but I
wonder if we need "--quiet" mode.

Thanks.
Bence Ferdinandy Sept. 16, 2024, 7:44 p.m. UTC | #2
On Mon Sep 16, 2024 at 20:36, Junio C Hamano <gitster@pobox.com> wrote:
> The message I am responding to is sent as the first message to start
> a brand-new thread, but for future reference, it makes it easier for
> everybody involved if you sent a new iteration of a patch (or a
> patch series) as a reply/follow-up to the last round.

Sorry about that, I've only ever used git-send-email on sourcehut before, where
they ask you _not_ to use --in-reply-to because of their UI and got used it...
I'll do that next time!

>
> If it is too cumbersome to do for any reason, as an alternative, you
> can give a reference to previous rounds as URLs to the list archive,
> i.e.
>
>  (v1) https://lore.kernel.org/git/20240910203129.2251090-1-bence@ferdinandy.com/
>  (v2) https://lore.kernel.org/git/20240910203835.2288291-1-bence@ferdinandy.com/

Tbf, I did put this in the commit notes :)

>
> > Currently, even if there is no actual change to remote/HEAD calling
> > remote set-head will overwrite the appropriate file and if set to --auto
> > will also print a message saying "remote/HEAD set to branch", which
> > implies something was changed.
> >
> > Change the behaviour of remote set-head so that the reference is only
> > updated if it actually needs to change.
>
> I sense a patch with a racy TOCTOU change coming ;-)

It seems that indeed :(

[snip]

> I am not quite sure what the discussion concluded "should be not",
> though.  In a message from you
>
>   https://lore.kernel.org/git/D43G2CGX2N7L.ZRETD4HLIH0E@ferdinandy.com/
>
> what we agreed was that "git fetch" does not need a new option, but
> we also agreed that it would be a good idea for "git fetch" to
> create, refs/remotes/$remote/HEAD when it does not exist without
> being told.
>
> I take it that you still want to see such a change to "git fetch"
> eventually happen, but decided to tackle the other half of the
> original two-patch series first with this patch?

Yes, but it did also turn up, that calling `git remote set-head --auto`
directly from `git fetch` as I did in the original RFC requires the user to
authenticate twice, which is not the best UX. So now I'm hammering away at
adding a "set_head" directly to fetch (I've sort-of figured it out already, but
it is definitely more involved than this one, EDIT: since reading below about
racyness, maybe this one is actually more complicated :D)). It's just that now
I can handle the two patches independently.


>
> >  static int set_head(int argc, const char **argv, const char *prefix)
> >  {
> > -	int i, opt_a = 0, opt_d = 0, result = 0;
> > -	struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT;
> > +	int i, opt_a = 0, opt_d = 0, is_ref_changed = 0, result = 0;
>
> Shall we simply call it ref_changed instead, so that I do not have
> to wonder which is better between has_ref_changed and is_ref_changed? ;-)

Makes sense :)

>
> > +	struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT, buf3 = STRBUF_INIT;
> >  	char *head_name = NULL;
> >  
> >  	struct option options[] = {
> > @@ -1440,13 +1440,19 @@ static int set_head(int argc, const char **argv, const char *prefix)
> >  
> >  	if (head_name) {
> >  		strbuf_addf(&buf2, "refs/remotes/%s/%s", argv[0], head_name);
>
> > +		refs_read_symbolic_ref(get_main_ref_store(the_repository),buf.buf,&buf3);
> > +		is_ref_changed = strcmp(buf2.buf,buf3.buf);
> >  		/* make sure it's valid */
> >  		if (!refs_ref_exists(get_main_ref_store(the_repository), buf2.buf))
> >  			result |= error(_("Not a valid ref: %s"), buf2.buf);
> > -		else if (refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head"))
> > +		else if (is_ref_changed && refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head"))
> >  			result |= error(_("Could not setup %s"), buf.buf);
>
> Two and a half small issues:
>
>  - Do not omit " " after "," and avoid overlong lines by folding
>    lines at an appropriate place (usually after an operator at
>    higher point in parse tree, i.e. after "is_ref_changed &&").
>
>  - This is inherently racy, isn't it?  We read the _current_ value.
>    After we do so, but before we write _our_ value, another process
>    may update it, so we'd end up overwriting the value they wrote.

Honestly, I've just sort of assumed git in general protects against parallel
writes, but I guess I'm wrong :D We'd need to block writes to the reference
while we're doing this, but I'm not quite sure how to achieve that.

>
>  - To compute is_ref_changed, we look at buf3.buf but what happens
>    if such a ref does not exist, exists but is not a symbolic ref,
>    or is a hierarchy under which other refs hang (e.g., a funny ref
>    "refs/remotes/origin/HEAD/main" exists)?  Does the strcmp()
>    compare a valid buf2.buf and an uninitialized junk in buf3.buf
>    and give a random result?  Shouldn't the code checking the return
>    value from the refs_read_symbolic_ref() call, and if we did so,
>    would we learn enough to tell among the cases where (1) it is a
>    symref, (2) it is a regular ref, (3) no such ref exists, and (4)
>    the refs layer hit an error to prevent us from giving one of the
>    previous 3 answers?
>
> If we really wanted to resolve the raciness, I think we need to
> enhance the set of values that create_symref() can optionally return
> to its callers so that the caller can tell what the old value was.
>
> I am not sure offhand how involved such an update would be, though.

Hmm, so there's a lot to unpack for me here. Is create_symref already atomic so
if it could give back the previous value we could decide based on that what to
print (and forget about not writing the ref when there's no need)? Anyhow, I'll
explore this part of the code and see if I can understand what's going on.

[snip]

> Quoting the values we obtained from the environment or the user may
> be nicer, e.g.
>
>     'refs/remotes/origin/HEAD' is now set to 'main'.
>     'refs/remotes/origin/HEAD' is unchanged and points at 'main'.

Ack. I also gather from this that we should be a bit more verbose rather than
trying to make it easy to parse.

I also forgot to free up buf3, which I'll do for the next round.

>
> This is a tangent outside the immediate topic of this patch, but I
> wonder if we need "--quiet" mode.

Even with the planned changed to fetch, if one want to be completely
up-to-date, they'd be running fetch and remote set-head and since the former
has a quiet I think it would make sense. At least: adding it would be more
straight forward than solving raciness :D

Best,
Bence
Bence Ferdinandy Sept. 17, 2024, 8:32 p.m. UTC | #3
On Mon Sep 16, 2024 at 20:36, Junio C Hamano <gitster@pobox.com> wrote:

[snip]

>  - This is inherently racy, isn't it?  We read the _current_ value.
>    After we do so, but before we write _our_ value, another process
>    may update it, so we'd end up overwriting the value they wrote.

So I've been thinking my first patch need not be so ambitious. The current
behaviour is to indiscriminately overwrite remote/HEAD. So what if we dial this
back a bit, leave the indiscriminate overwrite in place (the added benefit of
that is pretty small anyway) and only improve the printed output, which was the
main goal anyway? 

This is still slightly racy, since between the read and the write the status of
remote/HEAD could still change potentially, so the information received by the
user may be _slightly_ off, as we may be printing that the command just created
remote/HEAD when it was actually created a split millisecond earlier by another
command, but the printed end result will be always correct. And since the
output is aimed at humans really, I'd say even if some other process did create
that remote/HEAD between read and write, it's still actually true that before
running set-head it did not exist and now it does and is set to X.

In short: any raciness left should not have practical implications if we always
actually write remote/HEAD.

What do you think?

Best,
Bence
Junio C Hamano Sept. 17, 2024, 8:51 p.m. UTC | #4
"Bence Ferdinandy" <bence@ferdinandy.com> writes:

> On Mon Sep 16, 2024 at 20:36, Junio C Hamano <gitster@pobox.com> wrote:
>
> [snip]
>
>>  - This is inherently racy, isn't it?  We read the _current_ value.
>>    After we do so, but before we write _our_ value, another process
>>    may update it, so we'd end up overwriting the value they wrote.
>
> So I've been thinking my first patch need not be so ambitious. The current
> behaviour is to indiscriminately overwrite remote/HEAD. So what if we dial this
> back a bit, leave the indiscriminate overwrite in place (the added benefit of
> that is pretty small anyway) and only improve the printed output, which was the
> main goal anyway? 

Yes.  You could ignore races and it does not degrade correctness of
the end result (i.e., you will still get the same "oops, that is not
what I wrote, but somebody else overwritten what I wrote" and vice
versa).

Your new messages that try to differenciate "we noticed you are
writing the same so we refrained from doing anything" and "we wrote
this new value" can racily incorrect the same way.  In other words,
the message may claim that we detected a no-op change from value A
to new value A hence we didn't do anything, but after the detection
somebody may have wrote another value B from sideways, so the last
message that hit the end user's eyes may imply the resulting value
ought to be A but it is not because somebody else silently changed
it to B.  If we did not skip writing using racy logic, we would have
written value A and reported that we updated it to value A, but the
other party who silently wrote B may do so after all that happens,
and the end result the user would see is the same (i.e. we tell the
user the last value we expect it to be is A, but the final value is
B).

So, as long as we can explain the behaviour to the end-user well, I
do not care too deeply.  My impression was that avoiding it by just
taking advantage of the atomicity afforded by create_symref() looked
like a low hanging fruit, but that can be done by somebody who are
curious to see how involved such a change actually is and can be
safely left as a #leftoverbit ;-).

Thanks.
Bence Ferdinandy Sept. 19, 2024, 8:53 p.m. UTC | #5
On Tue Sep 17, 2024 at 22:51, Junio C Hamano <gitster@pobox.com> wrote:
>
> So, as long as we can explain the behaviour to the end-user well, I
> do not care too deeply.  My impression was that avoiding it by just
> taking advantage of the atomicity afforded by create_symref() looked
> like a low hanging fruit, but that can be done by somebody who are
> curious to see how involved such a change actually is and can be
> safely left as a #leftoverbit ;-).
>
> Thanks.


Now that I've poked around a bit (cf.
https://lore.kernel.org/git/20240919121335.298856-2-bence@ferdinandy.com/T/#u)
that fruit does seem to hang lower :)

My idea is the following: Add a new member to the ref_update struct that
records the value of the ref _before_ the update transaction (in the
files-backend), that can then be accessed post transaction_commit in
refs_update_symref. If we want to pass this info up to the caller of
refs_update_symref then I guess the only option is to pass an extra buf where
we can write this, or if we're fine with update_symref itself doing the
printing we probably need a bool to turn this on/off.

My PoC seems to work and pass tests, although judging from previous
interactions I'm probably overlooking a couple of edge cases. 

Does this seem reasonable?

Best,
Bence
diff mbox series

Patch

diff --git a/builtin/remote.c b/builtin/remote.c
index 0acc547d69..4b116dfc19 100644
--- a/builtin/remote.c
+++ b/builtin/remote.c
@@ -1400,8 +1400,8 @@  static int show(int argc, const char **argv, const char *prefix)
 
 static int set_head(int argc, const char **argv, const char *prefix)
 {
-	int i, opt_a = 0, opt_d = 0, result = 0;
-	struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT;
+	int i, opt_a = 0, opt_d = 0, is_ref_changed = 0, result = 0;
+	struct strbuf buf = STRBUF_INIT, buf2 = STRBUF_INIT, buf3 = STRBUF_INIT;
 	char *head_name = NULL;
 
 	struct option options[] = {
@@ -1440,13 +1440,19 @@  static int set_head(int argc, const char **argv, const char *prefix)
 
 	if (head_name) {
 		strbuf_addf(&buf2, "refs/remotes/%s/%s", argv[0], head_name);
+		refs_read_symbolic_ref(get_main_ref_store(the_repository),buf.buf,&buf3);
+		is_ref_changed = strcmp(buf2.buf,buf3.buf);
 		/* make sure it's valid */
 		if (!refs_ref_exists(get_main_ref_store(the_repository), buf2.buf))
 			result |= error(_("Not a valid ref: %s"), buf2.buf);
-		else if (refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head"))
+		else if (is_ref_changed && refs_update_symref(get_main_ref_store(the_repository), buf.buf, buf2.buf, "remote set-head"))
 			result |= error(_("Could not setup %s"), buf.buf);
+		else if (opt_a && !strcmp(buf3.buf,""))
+			printf("%s/HEAD was unset: set to %s\n", argv[0], head_name);
+		else if (opt_a && is_ref_changed)
+			printf("%s/HEAD was changed: set to %s\n", argv[0], head_name);
 		else if (opt_a)
-			printf("%s/HEAD set to %s\n", argv[0], head_name);
+			printf("%s/HEAD was unchanged: set to %s\n", argv[0], head_name);
 		free(head_name);
 	}
 
diff --git a/t/t5505-remote.sh b/t/t5505-remote.sh
index 532035933f..6b61f043d3 100755
--- a/t/t5505-remote.sh
+++ b/t/t5505-remote.sh
@@ -434,7 +434,7 @@  test_expect_success 'set-head --auto has no problem w/multiple HEADs' '
 		cd test &&
 		git fetch two "refs/heads/*:refs/remotes/two/*" &&
 		git remote set-head --auto two >output 2>&1 &&
-		echo "two/HEAD set to main" >expect &&
+		echo "two/HEAD was unset: set to main" >expect &&
 		test_cmp expect output
 	)
 '