diff mbox series

upload-pack: fix exit code when denying fetch of unreachable object ID

Message ID fe028981d353158e9840eb035194ca15e6a2c15e.1692165840.git.ps@pks.im (mailing list archive)
State New, archived
Headers show
Series upload-pack: fix exit code when denying fetch of unreachable object ID | expand

Commit Message

Patrick Steinhardt Aug. 16, 2023, 6:06 a.m. UTC
In 7ba7c52d76 (upload-pack: fix race condition in error messages,
2023-08-10), we have fixed a race in t5516-fetch-push.sh where sometimes
error messages got intermingled. This was done by splitting up the call
to `die()` such that we print the error message before writing to the
remote side, followed by a call to `exit(1)` afterwards.

This causes a subtle regression though as `die()` causes us to exit with
exit code 128, whereas we now call `exit(1)`. It's not really clear
whether we want to guarantee any specific error code in this case, and
neither do we document anything like that. But on the other hand, it
seems rather clear that this is an unintended side effect of the change
given that this change in behaviour was not mentioned at all.

Fix this regression by exiting with 128 again and tighten one of our
tests to catch such unintended side effects.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

We have found this issue in our CI pipeline in Gitaly, which explicitly
checks for the error code. It is very much debatable whether we should
even be doing that in the first place given that error codes are not
even explicitly documented here. But I think this is worth fixing anyway
given that it seems like an unintended side effect to me and might be
biting others, as well.

If you folks agree with my reasoning, then I think we should pick this
up before Git v2.42.0 is released.

Patrick

 t/t5703-upload-pack-ref-in-want.sh | 2 +-
 upload-pack.c                      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Junio C Hamano Aug. 16, 2023, 4:16 p.m. UTC | #1
Patrick Steinhardt <ps@pks.im> writes:

> In 7ba7c52d76 (upload-pack: fix race condition in error messages,
> 2023-08-10), we have fixed a race in t5516-fetch-push.sh where sometimes
> error messages got intermingled. This was done by splitting up the call
> to `die()` such that we print the error message before writing to the
> remote side, followed by a call to `exit(1)` afterwards.
> ...
> We have found this issue in our CI pipeline in Gitaly, which explicitly
> checks for the error code. It is very much debatable whether we should
> even be doing that in the first place given that error codes are not
> even explicitly documented here. But I think this is worth fixing anyway
> given that it seems like an unintended side effect to me and might be
> biting others, as well.
>
> If you folks agree with my reasoning, then I think we should pick this
> up before Git v2.42.0 is released.

Thanks.

The change to the code sounds sensible in that it is a move to
restore the status quo, and we know that the original never intended
to "fix" the exit status from 128 to 1.  The test change etches the
status quo in stone, which is a bit more than that and might be
debatable, but when we someday formally declare that users should
not be relying on the exit status that are not documented, we would
hunt for the uses of test_expect_code in the tests and turn this one
back, and probably do the same to others that are too strict on the
exact exit status, so I think that half of the change is OK, at
least for now.

Comments?

>  t/t5703-upload-pack-ref-in-want.sh | 2 +-
>  upload-pack.c                      | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/t/t5703-upload-pack-ref-in-want.sh b/t/t5703-upload-pack-ref-in-want.sh
> index df74f80061..fe6e056700 100755
> --- a/t/t5703-upload-pack-ref-in-want.sh
> +++ b/t/t5703-upload-pack-ref-in-want.sh
> @@ -483,7 +483,7 @@ test_expect_success 'server is initially ahead - no ref in want' '
>  	rm -rf local &&
>  	cp -r "$LOCAL_PRISTINE" local &&
>  	inconsistency main $(test_oid numeric) &&
> -	test_must_fail git -C local fetch 2>err &&
> +	test_expect_code 128 git -C local fetch 2>err &&
>  	test_i18ngrep "fatal: remote error: upload-pack: not our ref" err
>  '
>  
> diff --git a/upload-pack.c b/upload-pack.c
> index ece111c629..15f3318d6d 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -782,7 +782,7 @@ static void check_non_tip(struct upload_pack_data *data)
>  			packet_writer_error(&data->writer,
>  					    "upload-pack: not our ref %s",
>  					    oid_to_hex(&o->oid));
> -			exit(1);
> +			exit(128);
>  		}
>  	}
>  }
Junio C Hamano Aug. 16, 2023, 4:44 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> The change to the code sounds sensible in that it is a move to
> restore the status quo, and we know that the original never intended
> to "fix" the exit status from 128 to 1.  The test change etches the
> status quo in stone, which is a bit more than that and might be
> debatable, but when we someday formally declare that users should
> not be relying on the exit status that are not documented, we would
> hunt for the uses of test_expect_code in the tests and turn this one
> back, and probably do the same to others that are too strict on the
> exact exit status, so I think that half of the change is OK, at
> least for now.
>
> Comments?

An alternative to make this "fix" without setting any policy is to
do this.  That is, to remove the change to the test part and then to
rephrase the tail end of the proposed commit log message.

I can go either way.  I personally prefer our tests not to be overly
strict about behaviors they test, especially the ones we do not
document.

1:  77d0f01405 ! 1:  5f33a843de upload-pack: fix exit code when denying fetch of unreachable object ID
    @@ Commit message
         seems rather clear that this is an unintended side effect of the change
         given that this change in behaviour was not mentioned at all.
     
    -    Fix this regression by exiting with 128 again and tighten one of our
    -    tests to catch such unintended side effects.
    +    Restore the status-quo by exiting with 128.  The test in t5703 to
    +    ensure that "git fetch" fails by using test_must_fail, which does
    +    not care between exiting 1 and 128, so this changes will not affect
    +    any test.
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
         Signed-off-by: Junio C Hamano <gitster@pobox.com>
     
    - ## t/t5703-upload-pack-ref-in-want.sh ##
    -@@ t/t5703-upload-pack-ref-in-want.sh: test_expect_success 'server is initially ahead - no ref in want' '
    - 	rm -rf local &&
    - 	cp -r "$LOCAL_PRISTINE" local &&
    - 	inconsistency main $(test_oid numeric) &&
    --	test_must_fail git -C local fetch 2>err &&
    -+	test_expect_code 128 git -C local fetch 2>err &&
    - 	test_i18ngrep "fatal: remote error: upload-pack: not our ref" err
    - '
    - 
    -
      ## upload-pack.c ##
     @@ upload-pack.c: static void check_non_tip(struct upload_pack_data *data)
      			packet_writer_error(&data->writer,
Junio C Hamano Aug. 16, 2023, 5:04 p.m. UTC | #3
Patrick earlier found that Gitaly's CI pipeline was being overly
picky and complained about the unintended change of the exit code of
"git fetch" in the affected codepath from 128 to 1 in a recent
change that went to 'next', made by 7ba7c52d (upload-pack: fix race
condition in error messages, 2023-08-10).

The thing is, we follow that exiting with 0 is success, and exiting
with non-zero means failure, and we generally do not specify which
non-zero value is used for the latter in our documentation.  This
particular case (i.e. "git fetch") certainly does not say anything
about how failure is signaled to the calling program.  "git help
git" does not have "EXIT CODES" section, and it is assumed that the
"common sense" of older generation (gee, this project is more than
18 years old) that exiting with 0 is success and non-zero is failure
is shared among its users, which might not be warranted these days.

We could either

 * Be more prescriptive and add "EXIT CODES" section to each and
   every document to describe how we fail in the current code.

or

 * Describe "In general, 0 is success, non-zero is failure, but some
   commands may signal more than that with its non-zero exit codes"
   in "git help git", and add "EXIT CODES" section to the manual
   page of the commands whose exit codes matter (there are a
   handful, like "git diff --exit-code" that explicitly says "1" is
   the signal that it found difference as opposed to it failing).

I'd prefer if community agrees that we should do the latter, but I
am OK if the community consensus goes the other way.

If we were to go the former, the task becomes larger but it would be
embarrassingly parallelizable.  Folks with extra time on their hand
can lend their eyes to tackle each command, find what exit code we
use in the current code, add "EXIT CODES" section to the documentation,
and extend test coverage to ensure their findings will stay unless
we deliberately change it in the future.

If we were to go the latter, the task will be significantly smaller.
We need to come up with a careful phrasing to add to "git help git"
and/or "git help cli" to give the general principle of zero vs
non-zero whose exact values are left unspecified and should not be
depended upon.  We also need to identify the special cases where the
exact values have meanings (like the "git diff --exit-code" example
above), describe them by adding "EXIT CODES" section to the manual
pages of these selected commands, and extend test coverage to ensure
these values are kept intact across future changes.

Comments?
Junio C Hamano Aug. 17, 2023, 5:12 a.m. UTC | #4
Derrick Stolee <derrickstolee@github.com> writes:

[jc: the message I am responding to may not be on the list archive,
 as it was multipart/alternative with text/html in it, but I think
 the main point from you can be seen by others only from the parts
 I quoted here].

> While I don't think we should document that the exit code has
> a special meaning for the builtin, adding the test will help
> prevent another accidental change in the future. If the patch is
> worth taking (to fix the accidental change) then I think the test
> should stay, so we don't make this change accidentally again.

I think my stance is a bit more nuanced, in that the first half of
the patch to make us exit with 128 is worth taking, simply because
we did not have to and did not intend to change the exit status, but
the other half of the patch, using test_expect_code in the test
suite, sends a wrong message that somehow exact value of non-zero
exit status in this particular case matters.

To put it another way, if your patch to shuffle the calls for two
error messages, concluded with a call to exit(), were written in the
ideal world, you would have passed 128 to exit(), *and* you wouldn't
have added any test that says "fetch should exit with 128 and not 1
when it fails".  I aimed to massage Patrick's patch so that the
original patch from you will become that patch in the ideal world
when it is squashed in.

> To my view, test cases can change in the future as long as
> there is good justification in doing so. Having existing tests
> helps to demonstrate a change in behavior.

I agree with that 100%, but the thing is that the error shuffling
patch will not escape 'next' until the upcoming release, at which
time we can rewind and redo 'next'.  I think the first half of
Patrick's fix would be a good material to squash into that patch,
which would make the result identical to the one that would have
been written in the ideal world I described above.

And the other half would not have a place to be in that patch in the
ideal world.  IOW, there is no "change in behaviour" we want to
demonstrate here, as we will pretend nothing bad happened after the
upcoming release ;-)

Thanks.
Jeff King Aug. 17, 2023, 5:27 a.m. UTC | #5
On Wed, Aug 16, 2023 at 09:44:04AM -0700, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > The change to the code sounds sensible in that it is a move to
> > restore the status quo, and we know that the original never intended
> > to "fix" the exit status from 128 to 1.  The test change etches the
> > status quo in stone, which is a bit more than that and might be
> > debatable, but when we someday formally declare that users should
> > not be relying on the exit status that are not documented, we would
> > hunt for the uses of test_expect_code in the tests and turn this one
> > back, and probably do the same to others that are too strict on the
> > exact exit status, so I think that half of the change is OK, at
> > least for now.
> >
> > Comments?
> 
> An alternative to make this "fix" without setting any policy is to
> do this.  That is, to remove the change to the test part and then to
> rephrase the tail end of the proposed commit log message.
> 
> I can go either way.  I personally prefer our tests not to be overly
> strict about behaviors they test, especially the ones we do not
> document.

FWIW, my gut feeling agrees with you. I do not mind restoring the
previous "128" exit code for consistency and continuity, and there is no
reason to prefer "1" here instead. But I don't know that it's something
we should be promising or keeping track of with a test.

In fact, I would say that most of these die() calls in upload-pack are
pointless. The stderr of the server-side process is frequently not even
visible to the user (because it is on the other side of an http, etc,
connection). So in practice the message is going to /dev/null or perhaps
polluting some daemon log.

From the perspective of a server operator, I would even go so far as to
suggest that upload-pack should return "0" here. The client broke
protocol, but we told them over the correct channel and then exited
cleanly. There is no error on the server side. One could argue that a
large-scale server operator may want to keep track of protocol breaks
like this, because they could be a sign of a bug. But they'd probably be
better off with instrumenting upload-pack (or any proxy that sits in
front of it) to count messages coming over the ERR channel.

I've never really made patches in that direction because for the most
part the existing die() calls aren't hurting anything, and nobody cares
much about the exit code either way. But adding a more specific test
feels like going in the wrong direction (yes, I know t5516 is already
checking for a failed code, but modulo the race problems we have with
reading ERR packets, we really would be better off checking what the
client sees, or talking directly to upload-pack itself, the way t5530
does).

-Peff
Jeff King Aug. 17, 2023, 5:36 a.m. UTC | #6
On Wed, Aug 16, 2023 at 10:04:28AM -0700, Junio C Hamano wrote:

> We could either
> 
>  * Be more prescriptive and add "EXIT CODES" section to each and
>    every document to describe how we fail in the current code.
> 
> or
> 
>  * Describe "In general, 0 is success, non-zero is failure, but some
>    commands may signal more than that with its non-zero exit codes"
>    in "git help git", and add "EXIT CODES" section to the manual
>    page of the commands whose exit codes matter (there are a
>    handful, like "git diff --exit-code" that explicitly says "1" is
>    the signal that it found difference as opposed to it failing).
> 
> I'd prefer if community agrees that we should do the latter, but I
> am OK if the community consensus goes the other way.

I left some notes on upload-pack specifically elsewhere in the thread,
in which I argue that we should definitely not lock ourselves into its
current behavior.

But in the more general sense, yeah, I think that trying to document
specific exit codes for each command is a mistake. It is not just "let's
find which exit codes they use and document them". I suspect it is a
rat's nest of unplanned behaviors that come from unexpected die() calls
deep in the stack. We would not necessarily want to make promises about
what is happening in those, nor do I think it would even be sensible to
find every possible exit.

We _could_ document "128 means something really unexpected happened and
we called die() deep in the code". But even that seems misleading to me,
as we also die() for everyday shallow things (like "the name you gave is
not valid"). The value really means very little in practice, and the
biggest reason not to change it is that we know it doesn't conflict with
any codes that programs _do_ promise are meaningful (like "1" from "diff
--exit-code").

So saying "0 is success, non-zero is failure, and some commands may
document specific codes" is the closest thing to the reality of what we
as developers know and have planned.  (Of course another project is not
just to figure out the possible situations/codes but to catalogue and
organize them. But that seems like an order of magnitude more work, if
not several orders).

-Peff
Oswald Buddenhagen Aug. 17, 2023, 9:24 a.m. UTC | #7
On Wed, Aug 16, 2023 at 10:04:28AM -0700, Junio C Hamano wrote:
>"git help git" does not have "EXIT CODES" section, and it is assumed 
>that the "common sense" of older generation [...] that exiting with 0 
>is success and non-zero is failure is shared among its users, which 
>might not be warranted these days.
>
well, that's actually a standard (exit(3) has things to say about it), 
and given how shell scripts treat exit codes, there is no wiggle room 
here. just about every shell intro tutorial explains it.

>We could either
>
> * Be more prescriptive and add "EXIT CODES" section to each and
>   every document to describe how we fail in the current code.
>
>or
>
> * Describe "In general, 0 is success, non-zero is failure, but some
>   commands may signal more than that with its non-zero exit codes"
>   in "git help git", and add "EXIT CODES" section to the manual
>   page of the commands whose exit codes matter (there are a
>   handful, like "git diff --exit-code" that explicitly says "1" is
>   the signal that it found difference as opposed to it failing).
>
i'd go with the second, with some minor modifications:
- 1 is the by far most common non-zero error code (and it matches 
   EXIT_FAILURE on all relevant systems), so it's ok to state that. it 
   may be wise to actually check that commands don't deviate from it 
   needlessly.
- the canonical name of the section appears to be "EXIT STATUS"

regards
Patrick Steinhardt Aug. 17, 2023, 10:07 a.m. UTC | #8
On Wed, Aug 16, 2023 at 10:12:18PM -0700, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
> [jc: the message I am responding to may not be on the list archive,
>  as it was multipart/alternative with text/html in it, but I think
>  the main point from you can be seen by others only from the parts
>  I quoted here].
> 
> > While I don't think we should document that the exit code has
> > a special meaning for the builtin, adding the test will help
> > prevent another accidental change in the future. If the patch is
> > worth taking (to fix the accidental change) then I think the test
> > should stay, so we don't make this change accidentally again.
> 
> I think my stance is a bit more nuanced, in that the first half of
> the patch to make us exit with 128 is worth taking, simply because
> we did not have to and did not intend to change the exit status, but
> the other half of the patch, using test_expect_code in the test
> suite, sends a wrong message that somehow exact value of non-zero
> exit status in this particular case matters.
> 
> To put it another way, if your patch to shuffle the calls for two
> error messages, concluded with a call to exit(), were written in the
> ideal world, you would have passed 128 to exit(), *and* you wouldn't
> have added any test that says "fetch should exit with 128 and not 1
> when it fails".  I aimed to massage Patrick's patch so that the
> original patch from you will become that patch in the ideal world
> when it is squashed in.

I tend to agree with Derrick -- if we think that it is important enough
to restore the exit code, whether that change was intentional or not,
then I think it makes sense to also add a test. The benefit of that test
wouldn't be to say "This is cast into stone", but rather to indicate to
the developer that a change that they have just been doing has an
unintentional side effect.

The problem I see with my own stance though is that if you extend it to
the extreme, every single `test_must_fail` would need to do exact error
code checking. The benefit of this would be kind of dubious though as
long as we do not decide to attach meaning to specific error codes.

In general I often wish that we had better ways to transport the
circumstances of why a specific command has failed to the caller. In
Gitaly, we often have to fall back to parsing the standard error stream
of a command in order to figure out the failure cause, which does not
exactly feel great given that these are rather intended to be consumed
by a user rather than a program.

Whether that information should be transported via exit codes though...
I don't know. An exit code can only convey so much information and they
often feel fragile to me. Documenting them explicitly would of course
already go a long way, but that wouldn't quite help the fact that this
mechanism still can't convey more information than "The command has
failed because of a specific root cause". Many commands perform more
than a single unit of work though, so even if we know the root cause we
still wouldn't necessarily know where exactly it has failed. 

One way to fix this would be to give commands a way to return structured
error data to the caller instead of relying on exit codes. But that is
of course a bigger topic, and I feel like I'm digressing.

Patrick

> > To my view, test cases can change in the future as long as
> > there is good justification in doing so. Having existing tests
> > helps to demonstrate a change in behavior.
> 
> I agree with that 100%, but the thing is that the error shuffling
> patch will not escape 'next' until the upcoming release, at which
> time we can rewind and redo 'next'.  I think the first half of
> Patrick's fix would be a good material to squash into that patch,
> which would make the result identical to the one that would have
> been written in the ideal world I described above.
> 
> And the other half would not have a place to be in that patch in the
> ideal world.  IOW, there is no "change in behaviour" we want to
> demonstrate here, as we will pretend nothing bad happened after the
> upcoming release ;-)
> 
> Thanks.
Junio C Hamano Aug. 17, 2023, 4:03 p.m. UTC | #9
Jeff King <peff@peff.net> writes:

> We _could_ document "128 means something really unexpected happened and
> we called die() deep in the code". But even that seems misleading to me,
> as we also die() for everyday shallow things (like "the name you gave is
> not valid"). The value really means very little in practice, and the
> biggest reason not to change it is that we know it doesn't conflict with
> any codes that programs _do_ promise are meaningful (like "1" from "diff
> --exit-code").

Yeah, I forgot to say that we should mention 128 to tell the users
that it is a meaningless positive number chosen to signal a general
error and it is set sufficiently high so that it won't conflict with
a range of low positive numbers certain subcommands use to convey
specific meaning in their errors.  And you said it nicely above.

With that clarified, my vote still goes to the "do not overly tied
to what the current implementation happens to do" route.

Thanks.
diff mbox series

Patch

diff --git a/t/t5703-upload-pack-ref-in-want.sh b/t/t5703-upload-pack-ref-in-want.sh
index df74f80061..fe6e056700 100755
--- a/t/t5703-upload-pack-ref-in-want.sh
+++ b/t/t5703-upload-pack-ref-in-want.sh
@@ -483,7 +483,7 @@  test_expect_success 'server is initially ahead - no ref in want' '
 	rm -rf local &&
 	cp -r "$LOCAL_PRISTINE" local &&
 	inconsistency main $(test_oid numeric) &&
-	test_must_fail git -C local fetch 2>err &&
+	test_expect_code 128 git -C local fetch 2>err &&
 	test_i18ngrep "fatal: remote error: upload-pack: not our ref" err
 '
 
diff --git a/upload-pack.c b/upload-pack.c
index ece111c629..15f3318d6d 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -782,7 +782,7 @@  static void check_non_tip(struct upload_pack_data *data)
 			packet_writer_error(&data->writer,
 					    "upload-pack: not our ref %s",
 					    oid_to_hex(&o->oid));
-			exit(1);
+			exit(128);
 		}
 	}
 }