t5562: do not reuse output files

Message ID	20181124070428.18571-1-max@max630.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> From: Max Kirillov <max@max630.net> To: Carlo Arenas <carenas@gmail.com>, git@vger.kernel.org Cc: Max Kirillov <max@max630.net>, peff@peff.net Subject: [PATCH] t5562: do not reuse output files Date: Sat, 24 Nov 2018 09:04:28 +0200 Message-Id: <20181124070428.18571-1-max@max630.net> In-Reply-To: <CAPUEspi67=Kt=mx21bjG2oCATnU+byO5nkvbMdQkN03yBGZMsA@mail.gmail.com> References: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	t5562: do not reuse output files \| expand t5562: do not reuse output files

Max Kirillov Nov. 24, 2018, 7:04 a.m. UTC

Some expected failures of git-http-backend leave running its children
(receive-pack or upload-pack) which still hold opened descriptors
to act.err and with some probability they live long enough to write
their failure messages after next test has already truncated
the files. This causes occasional failures of the test script.

Avoid the issue by unlinking the older files before writing to them.

Reported-by: Carlo Arenas <carenas@gmail.com>
Helped-by: Carlo Arenas <carenas@gmail.com>
Signed-off-by: Max Kirillov <max@max630.net>
---
Thanks for the analysis. I seem to have guessed the reason.
This patch should prevent it.

I think the tests should somehow make sure there are no such late
processes. I can see 2 options:
* somehow find out in the tests all children and wait for them. I have no idea how.
* make http-backend close handle to its child and wait for it to exit before dying.
  This would not prevent childrenc in general, because http-backend may be killed,
  but not in our expected failure cases

Actually, don't the children receive some SIGHUP? Maybe thy should. However, it
would still take some time for them to handle it, so it does not fully solve the issue
 t/t5562-http-backend-content-length.sh | 4 ++++
 1 file changed, 4 insertions(+)

Junio C Hamano Nov. 24, 2018, 7:34 a.m. UTC | #1

Max Kirillov <max@max630.net> writes:

> diff --git a/t/t5562-http-backend-content-length.sh b/t/t5562-http-backend-content-length.sh
> index 90d890d02f..bb53f82c0c 100755
> --- a/t/t5562-http-backend-content-length.sh
> +++ b/t/t5562-http-backend-content-length.sh
> @@ -25,6 +25,8 @@ test_http_env() {
>  	handler_type="$1"
>  	request_body="$2"
>  	shift
> +	(rm -f act.out || true) &&
> +	(rm -f act.err || true) &&

Why "||true"?  If the named file doesn't exist, "rm -f" would
succeed, and if it does exist but somehow we fail to remove, then
these added lines are not preveting the next part from reusing,
i.e. they are not doing what they are supposed to be doing, so we
should detect such a failure (if happens) as an error, no?

IOW, shouldn't it just be more like

	+	rm -f act.out act.err &&

The same comment applies to the other hunk.

>  	env \
>  		CONTENT_TYPE="application/x-git-$handler_type-pack-request" \
>  		QUERY_STRING="/repo.git/git-$handler_type-pack" \
> @@ -155,6 +157,8 @@ test_expect_success 'CONTENT_LENGTH overflow ssite_t' '
>  '
>  
>  test_expect_success 'empty CONTENT_LENGTH' '
> +	(rm -f act.out || true) &&
> +	(rm -f act.err || true) &&
>  	env \
>  		QUERY_STRING="service=git-receive-pack" \
>  		PATH_TRANSLATED="$PWD"/.git/info/refs \

Junio C Hamano Nov. 24, 2018, 7:47 a.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

> Max Kirillov <max@max630.net> writes:
>
>> diff --git a/t/t5562-http-backend-content-length.sh b/t/t5562-http-backend-content-length.sh
>> index 90d890d02f..bb53f82c0c 100755
>> --- a/t/t5562-http-backend-content-length.sh
>> +++ b/t/t5562-http-backend-content-length.sh
>> @@ -25,6 +25,8 @@ test_http_env() {
>>  	handler_type="$1"
>>  	request_body="$2"
>>  	shift
>> +	(rm -f act.out || true) &&
>> +	(rm -f act.err || true) &&
>
> Why "||true"?  If the named file doesn't exist, "rm -f" would
> succeed, and if it does exist but somehow we fail to remove, then
> these added lines are not preveting the next part from reusing,
> i.e. they are not doing what they are supposed to be doing,...

Another thing.  The analysis in your log message talks about a stray
process holding open filehandles to these files.  An attempt to remove
them in such a stat would fail on some systems, so "||true" would not
help, would it?  It just hides the failure to remove, and when ||true
is useful in hiding th failure _is_ when such a stray process is still
there, waiting to corrupt the output of the next request and breaking
the test, no?

I do agree that forcing the parent to wait, like you described in
the comment, would be far more preferrable, but until that happens,
a better workaround might be to write into unique output filenames
(act1.out, act2.out, etc.); that way, you do not have to worry about
the output file for the next request getting clobbered by a stale
process handling the previous request.  But at the same time,
wouldn't this suggest that the test or the previous request may see
an incomplete output, as the analysed problem is that such a process
is writing to the output file very late, while we are preparing to
test the enxt request, which meas we have already checked the output
file for the previous request, right?  So even without ||true, I am
not sure how true a "solution" this change is to the issue.

Max Kirillov Nov. 24, 2018, 7:58 a.m. UTC | #3

On Sat, Nov 24, 2018 at 04:47:19PM +0900, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> a better workaround might be to write into unique output filenames
> (act1.out, act2.out, etc.); that way, you do not have to worry about
> the output file for the next request getting clobbered by a stale
> process handling the previous request.

Yes I agree

> But at the same time,
> wouldn't this suggest that the test or the previous request may see
> an incomplete output,

Yes it may miss the child's message. But in this case of failed
http-backend process, there should be already one "fatal:"
message in the act.err from the parent, and missing another
one does not change the outcome.

Jeff King Nov. 24, 2018, 12:14 p.m. UTC | #4

On Sat, Nov 24, 2018 at 04:47:19PM +0900, Junio C Hamano wrote:

> I do agree that forcing the parent to wait, like you described in
> the comment, would be far more preferrable, [...]

Stray processes can sometimes have funny effects on an outer test
harness, too. E.g., I think I've seen hangs running t5562 under prove,
because some process is holding open a pipe descriptor. This would
probably fix that, too.

> but until that happens,[...]

But if we can't do that immediately for some reason, I do agree with
everything else you said here. ;)

-Peff

Max Kirillov Nov. 24, 2018, 1:03 p.m. UTC | #5

On Sat, Nov 24, 2018 at 04:47:19PM +0900, Junio C Hamano wrote:
> I do agree that forcing the parent to wait, like you described in
> the comment, would be far more preferrable,

It looks like it can be done as simple as:

--- a/http-backend.c
+++ b/http-backend.c
@@ -486,6 +486,8 @@ static void run_service(const char **argv, int buffer_input)
        if (buffer_input || gzipped_request || req_len >= 0)
                cld.in = -1;
        cld.git_cmd = 1;
+       cld.clean_on_exit = 1;
+       cld.wait_after_clean = 1;
        if (start_command(&cld))
                exit(1);

at least according to strate it does what it should.

Junio C Hamano Nov. 26, 2018, 2:06 a.m. UTC | #6

Max Kirillov <max@max630.net> writes:

> On Sat, Nov 24, 2018 at 04:47:19PM +0900, Junio C Hamano wrote:
>> I do agree that forcing the parent to wait, like you described in
>> the comment, would be far more preferrable,
>
> It looks like it can be done as simple as:
>
> --- a/http-backend.c
> +++ b/http-backend.c
> @@ -486,6 +486,8 @@ static void run_service(const char **argv, int buffer_input)
>         if (buffer_input || gzipped_request || req_len >= 0)
>                 cld.in = -1;
>         cld.git_cmd = 1;
> +       cld.clean_on_exit = 1;
> +       cld.wait_after_clean = 1;
>         if (start_command(&cld))
>                 exit(1);
>
> at least according to strate it does what it should.

Sounds sane.

I am offhand not sure what the right value of wait_after_clean for
this codepath be, though.  46df6906 ("execv_dashed_external: wait
for child on signal death", 2017-01-06) made this non-default but
turned it on for dashed externals (especially to help the case where
they spawn a pager), as the parent has nothing other than to wait
for the child to exit in that codepath.  Does the same reasoning
apply here, too?

This is a meta point, but I wonder if there is an easy way to "grep"
for uses of run-command interface that do *not* set clean_on_exit.
The pager that can outlive us long after we exit, kept alive while
the user views our output, is an example cited by afe19ff7
("run-command: optionally kill children on exit", 2012-01-07), and
while I am wondering if the default should hae been to kill the
children instead, such a "grep" would have been very useful to know
what codepaths would be affected if we flipped the default.

Max Kirillov Nov. 28, 2018, 4:17 a.m. UTC | #7

On Mon, Nov 26, 2018 at 11:06:40AM +0900, Junio C Hamano wrote:
> I am offhand not sure what the right value of wait_after_clean for
> this codepath be, though.  46df6906 ("execv_dashed_external: wait
> for child on signal death", 2017-01-06) made this non-default but
> turned it on for dashed externals (especially to help the case where
> they spawn a pager), as the parent has nothing other than to wait
> for the child to exit in that codepath.  Does the same reasoning
> apply here, too?

As far as I understand, the reason to _not_ wait was that
the child might still be expecting closed stdin even after
receiving SIGTERM, so parent would wait forever. But this is
not clearly the case here. And otherwise there should be no
reason to not wait, as long as we are interested in
synchronous exiting of the child.

In my Linux experiments the child was exiting because of signal
earlier than because of closed stdin, but who knows, maybe
with some bad luck the signal would be delivered after the
closed stdin, and we would still have the issue. After all,
at Linux it was mostly working even without fix.

t5562: do not reuse output files

Commit Message

Comments

Patch