diff mbox series

[v6,16/21] mingw: try to work around issues with the test cleanup

Message ID 991b41afa4a83a73f59a72504d269d64d12ecf8f.1548771561.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Offer to run CI/PR builds in Azure Pipelines | expand

Commit Message

Linus Arver via GitGitGadget Jan. 29, 2019, 2:19 p.m. UTC
From: Johannes Schindelin <johannes.schindelin@gmx.de>

It seems that every once in a while in the Git for Windows SDK, there
are some transient file locking issues preventing the test clean up to
delete the trash directory. Let's be gentle and try again five seconds
later, and only error out if it still fails the second time.

This change helps Windows, and does not hurt any other platform
(normally, it is highly unlikely that said deletion fails, and if it
does, normally it will fail again even 5 seconds later).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/test-lib.sh | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Ævar Arnfjörð Bjarmason March 10, 2021, 12:40 p.m. UTC | #1
On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> It seems that every once in a while in the Git for Windows SDK, there
> are some transient file locking issues preventing the test clean up to
> delete the trash directory. Let's be gentle and try again five seconds
> later, and only error out if it still fails the second time.
>
> This change helps Windows, and does not hurt any other platform
> (normally, it is highly unlikely that said deletion fails, and if it
> does, normally it will fail again even 5 seconds later).
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  t/test-lib.sh | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/t/test-lib.sh b/t/test-lib.sh
> index f31a1c8f79..9c0ca5effb 100644
> --- a/t/test-lib.sh
> +++ b/t/test-lib.sh
> @@ -1104,7 +1104,11 @@ test_done () {
>  			error "Tests passed but trash directory already removed before test cleanup; aborting"
>  
>  			cd "$TRASH_DIRECTORY/.." &&
> -			rm -fr "$TRASH_DIRECTORY" ||
> +			rm -fr "$TRASH_DIRECTORY" || {
> +				# try again in a bit
> +				sleep 5;
> +				rm -fr "$TRASH_DIRECTORY"
> +			} ||
>  			error "Tests passed but test cleanup failed; aborting"
>  		fi
>  		test_at_end_hook_

I saw this sleep while reading some test-lib.sh code, doesn't this break
df4c0d1a79 (test-lib: abort when can't remove trash directory,
2017-04-20) for non-Windows platforms?

Your CL for v3 suggests this was only encountered in Azure VMs:
https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/

Aside from this obscure issue, wouldn't it make more sense to have some
optional "I'm under CI" flag to skip the teardown one test at a time as
we're probably about to shut off the transitory VM soon?

I skip some tests, but the test suite creates ~950MB of trash for
me. Maybe cheaper for some to just keep that around and have it all
removed at the end.
Johannes Schindelin March 19, 2021, 2:20 p.m. UTC | #2
Hi Ævar,

thanks for digging out this old thread, I really could have done never
thinking about it again!

On Wed, 10 Mar 2021, Ævar Arnfjörð Bjarmason wrote:

>
> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > It seems that every once in a while in the Git for Windows SDK, there
> > are some transient file locking issues preventing the test clean up to
> > delete the trash directory. Let's be gentle and try again five seconds
> > later, and only error out if it still fails the second time.
> >
> > This change helps Windows, and does not hurt any other platform
> > (normally, it is highly unlikely that said deletion fails, and if it
> > does, normally it will fail again even 5 seconds later).
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> >  t/test-lib.sh | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> > index f31a1c8f79..9c0ca5effb 100644
> > --- a/t/test-lib.sh
> > +++ b/t/test-lib.sh
> > @@ -1104,7 +1104,11 @@ test_done () {
> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
> >
> >  			cd "$TRASH_DIRECTORY/.." &&
> > -			rm -fr "$TRASH_DIRECTORY" ||
> > +			rm -fr "$TRASH_DIRECTORY" || {
> > +				# try again in a bit
> > +				sleep 5;
> > +				rm -fr "$TRASH_DIRECTORY"
> > +			} ||
> >  			error "Tests passed but test cleanup failed; aborting"
> >  		fi
> >  		test_at_end_hook_
>
> I saw this sleep while reading some test-lib.sh code, doesn't this break
> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> 2017-04-20) for non-Windows platforms?

It does not really break it, it just delays the inevitable failure.

> Your CL for v3 suggests this was only encountered in Azure VMs:
> https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/

If by "CL" you refer to the cover letter, then I might have made it sound
as if it was only encountered in the Azure Pipelines agents. I vaguely
seem to remember seeing something like this quite often on my personal
machine, too, though. Most likely Microsoft Defender going a little wild.

> Aside from this obscure issue, wouldn't it make more sense to have some
> optional "I'm under CI" flag to skip the teardown one test at a time as
> we're probably about to shut off the transitory VM soon?

No, I'm not under CI, and I did encounter these issues. And they abruptly
stopped with the patch you apparently still want to discuss ;-)

> I skip some tests, but the test suite creates ~950MB of trash for
> me. Maybe cheaper for some to just keep that around and have it all
> removed at the end.

I don't understand this statement. Or was there a question in it that
you'd like me to answer?

Ciao,
Johannes
Ævar Arnfjörð Bjarmason March 20, 2021, 6:12 p.m. UTC | #3
On Fri, Mar 19 2021, Johannes Schindelin wrote:

> Hi Ævar,
>
> thanks for digging out this old thread, I really could have done never
> thinking about it again!

I didn't think I'd dug up something so traumatic :)

> On Wed, 10 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
>
>>
>> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>>
>> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
>> >
>> > It seems that every once in a while in the Git for Windows SDK, there
>> > are some transient file locking issues preventing the test clean up to
>> > delete the trash directory. Let's be gentle and try again five seconds
>> > later, and only error out if it still fails the second time.
>> >
>> > This change helps Windows, and does not hurt any other platform
>> > (normally, it is highly unlikely that said deletion fails, and if it
>> > does, normally it will fail again even 5 seconds later).
>> >
>> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > ---
>> >  t/test-lib.sh | 6 +++++-
>> >  1 file changed, 5 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/t/test-lib.sh b/t/test-lib.sh
>> > index f31a1c8f79..9c0ca5effb 100644
>> > --- a/t/test-lib.sh
>> > +++ b/t/test-lib.sh
>> > @@ -1104,7 +1104,11 @@ test_done () {
>> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
>> >
>> >  			cd "$TRASH_DIRECTORY/.." &&
>> > -			rm -fr "$TRASH_DIRECTORY" ||
>> > +			rm -fr "$TRASH_DIRECTORY" || {
>> > +				# try again in a bit
>> > +				sleep 5;
>> > +				rm -fr "$TRASH_DIRECTORY"
>> > +			} ||
>> >  			error "Tests passed but test cleanup failed; aborting"
>> >  		fi
>> >  		test_at_end_hook_
>>
>> I saw this sleep while reading some test-lib.sh code, doesn't this break
>> df4c0d1a79 (test-lib: abort when can't remove trash directory,
>> 2017-04-20) for non-Windows platforms?
>
> It does not really break it, it just delays the inevitable failure.
>
>> Your CL for v3 suggests this was only encountered in Azure VMs:
>> https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/
>
> If by "CL" you refer to the cover letter, then I might have made it sound
> as if it was only encountered in the Azure Pipelines agents. I vaguely

Yes, the cover letter.

> seem to remember seeing something like this quite often on my personal
> machine, too, though. Most likely Microsoft Defender going a little wild.
>
>> Aside from this obscure issue, wouldn't it make more sense to have some
>> optional "I'm under CI" flag to skip the teardown one test at a time as
>> we're probably about to shut off the transitory VM soon?
>
> No, I'm not under CI, and I did encounter these issues. And they abruptly
> stopped with the patch you apparently still want to discuss ;-)
>
>> I skip some tests, but the test suite creates ~950MB of trash for
>> me. Maybe cheaper for some to just keep that around and have it all
>> removed at the end.
>
> I don't understand this statement. Or was there a question in it that
> you'd like me to answer?

If the fix was purely for Azure's CI setup I was suggesting that a
better solution might be to keep the trash around until the end. But
that's not the only setup as you note here, so let's discard that
suggestion.

In any case, your patch clearly undoes whatever canary for gc issues
df4c0d1a792 was trying to put into the test-lib, but didn't say so in
its commit message.

So I figured it was something that was missed at the time, and that I
should send a quick E-Mail to both authors to see if anyone cared, maybe
nobody does.

It's just something I ran into while reviewing test-lib.sh for some
unrelated changes I was making...
Junio C Hamano March 20, 2021, 8:10 p.m. UTC | #4
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> In any case, your patch clearly undoes whatever canary for gc issues
> df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> its commit message.
>
> So I figured it was something that was missed at the time, and that I
> should send a quick E-Mail to both authors to see if anyone cared, maybe
> nobody does.
>
> It's just something I ran into while reviewing test-lib.sh for some
> unrelated changes I was making...

Good eyes.

I am not quite sure if we are better off catching a transitory
failure, which could go away when retried in 5 seconds, to remove
the test directory (in which case, the change in question is a clear
regression from what df4c0d1a (test-lib: abort when can't remove
trash directory, 2017-04-20) intended to catch), or we only want to
catch permanent failures (in which case the patch in question is a
clear improvement).

In either way, the decision and the rationale behind it should be in
the log message.

Thanks.
Johannes Schindelin March 24, 2021, 12:01 p.m. UTC | #5
Hi Ævar,

On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Mar 19 2021, Johannes Schindelin wrote:
>
> >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> >>
> >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> >
> >> > It seems that every once in a while in the Git for Windows SDK, there
> >> > are some transient file locking issues preventing the test clean up to
> >> > delete the trash directory. Let's be gentle and try again five seconds
> >> > later, and only error out if it still fails the second time.
> >> >
> >> > This change helps Windows, and does not hurt any other platform
> >> > (normally, it is highly unlikely that said deletion fails, and if it
> >> > does, normally it will fail again even 5 seconds later).
> >> >
> >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > ---
> >> >  t/test-lib.sh | 6 +++++-
> >> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> >> > index f31a1c8f79..9c0ca5effb 100644
> >> > --- a/t/test-lib.sh
> >> > +++ b/t/test-lib.sh
> >> > @@ -1104,7 +1104,11 @@ test_done () {
> >> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
> >> >
> >> >  			cd "$TRASH_DIRECTORY/.." &&
> >> > -			rm -fr "$TRASH_DIRECTORY" ||
> >> > +			rm -fr "$TRASH_DIRECTORY" || {
> >> > +				# try again in a bit
> >> > +				sleep 5;
> >> > +				rm -fr "$TRASH_DIRECTORY"
> >> > +			} ||
> >> >  			error "Tests passed but test cleanup failed; aborting"
> >> >  		fi
> >> >  		test_at_end_hook_
> >>
> >> I saw this sleep while reading some test-lib.sh code, doesn't this break
> >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> >> 2017-04-20) for non-Windows platforms?
> >
> > It does not really break it, it just delays the inevitable failure.

I still think this is the best answer to this (implicit) question:

> In any case, your patch clearly undoes whatever canary for gc issues
> df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> its commit message.

I was not _really_ paying attention to that commit when I implemented the
work-around you mentioned above. At the same time I think it does _not_
undo the canary. If the trash directory cannot be removed via `rm -fr`,
and if that is an indicator for something fishy going on, chances are that
the second `rm -fr` a couple seconds later will _also_ fail, and we still
get that error message.

The only reason why the second `rm` should succeed, at least that I can
think of, is that something on Windows blocked those files from being
deleted, and it is no longer blocking after a couple seconds, and that
usually means that an anti-malware scanned those files.

Ciao,
Dscho
SZEDER Gábor March 24, 2021, 9:20 p.m. UTC | #6
On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
> Hi Ævar,
> 
> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
> 
> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
> >
> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> > >>
> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >> >
> > >> > It seems that every once in a while in the Git for Windows SDK, there
> > >> > are some transient file locking issues preventing the test clean up to
> > >> > delete the trash directory. Let's be gentle and try again five seconds
> > >> > later, and only error out if it still fails the second time.
> > >> >
> > >> > This change helps Windows, and does not hurt any other platform
> > >> > (normally, it is highly unlikely that said deletion fails, and if it
> > >> > does, normally it will fail again even 5 seconds later).
> > >> >
> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >> > ---
> > >> >  t/test-lib.sh | 6 +++++-
> > >> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > >> >
> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> > >> > index f31a1c8f79..9c0ca5effb 100644
> > >> > --- a/t/test-lib.sh
> > >> > +++ b/t/test-lib.sh
> > >> > @@ -1104,7 +1104,11 @@ test_done () {
> > >> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
> > >> >
> > >> >  			cd "$TRASH_DIRECTORY/.." &&
> > >> > -			rm -fr "$TRASH_DIRECTORY" ||
> > >> > +			rm -fr "$TRASH_DIRECTORY" || {
> > >> > +				# try again in a bit
> > >> > +				sleep 5;
> > >> > +				rm -fr "$TRASH_DIRECTORY"
> > >> > +			} ||
> > >> >  			error "Tests passed but test cleanup failed; aborting"
> > >> >  		fi
> > >> >  		test_at_end_hook_
> > >>
> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> > >> 2017-04-20) for non-Windows platforms?
> > >
> > > It does not really break it, it just delays the inevitable failure.
> 
> I still think this is the best answer to this (implicit) question:
> 
> > In any case, your patch clearly undoes whatever canary for gc issues
> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> > its commit message.
> 
> I was not _really_ paying attention to that commit when I implemented the
> work-around you mentioned above. At the same time I think it does _not_
> undo the canary. If the trash directory cannot be removed via `rm -fr`,
> and if that is an indicator for something fishy going on, chances are that
> the second `rm -fr` a couple seconds later will _also_ fail, and we still
> get that error message.
> 
> The only reason why the second `rm` should succeed, at least that I can
> think of, is that something on Windows blocked those files from being
> deleted, and it is no longer blocking after a couple seconds, and that
> usually means that an anti-malware scanned those files.

Both commits referenced in df4c0d1a79's log message fixed races
between 'test_done's cleanup and a still running background 'git gc',
and df4c0d1a79 was meant to draw our attention to similar issues in
the future.  And it did:

  https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
  
So no, the failure is not inevitable, there are other reasons why the
second 'rm' might still succeed after the first failed, even just a
fraction of a second later.  And yes, that 'sleep' added in the patch
above did unquestionably break that canary,
Ævar Arnfjörð Bjarmason March 24, 2021, 10:57 p.m. UTC | #7
On Wed, Mar 24 2021, SZEDER Gábor wrote:

> On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
>> Hi Ævar,
>> 
>> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
>> 
>> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
>> >
>> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>> > >>
>> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > >> >
>> > >> > It seems that every once in a while in the Git for Windows SDK, there
>> > >> > are some transient file locking issues preventing the test clean up to
>> > >> > delete the trash directory. Let's be gentle and try again five seconds
>> > >> > later, and only error out if it still fails the second time.
>> > >> >
>> > >> > This change helps Windows, and does not hurt any other platform
>> > >> > (normally, it is highly unlikely that said deletion fails, and if it
>> > >> > does, normally it will fail again even 5 seconds later).
>> > >> >
>> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > >> > ---
>> > >> >  t/test-lib.sh | 6 +++++-
>> > >> >  1 file changed, 5 insertions(+), 1 deletion(-)
>> > >> >
>> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
>> > >> > index f31a1c8f79..9c0ca5effb 100644
>> > >> > --- a/t/test-lib.sh
>> > >> > +++ b/t/test-lib.sh
>> > >> > @@ -1104,7 +1104,11 @@ test_done () {
>> > >> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
>> > >> >
>> > >> >  			cd "$TRASH_DIRECTORY/.." &&
>> > >> > -			rm -fr "$TRASH_DIRECTORY" ||
>> > >> > +			rm -fr "$TRASH_DIRECTORY" || {
>> > >> > +				# try again in a bit
>> > >> > +				sleep 5;
>> > >> > +				rm -fr "$TRASH_DIRECTORY"
>> > >> > +			} ||
>> > >> >  			error "Tests passed but test cleanup failed; aborting"
>> > >> >  		fi
>> > >> >  		test_at_end_hook_
>> > >>
>> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
>> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
>> > >> 2017-04-20) for non-Windows platforms?
>> > >
>> > > It does not really break it, it just delays the inevitable failure.
>> 
>> I still think this is the best answer to this (implicit) question:
>> 
>> > In any case, your patch clearly undoes whatever canary for gc issues
>> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
>> > its commit message.
>> 
>> I was not _really_ paying attention to that commit when I implemented the
>> work-around you mentioned above. At the same time I think it does _not_
>> undo the canary. If the trash directory cannot be removed via `rm -fr`,
>> and if that is an indicator for something fishy going on, chances are that
>> the second `rm -fr` a couple seconds later will _also_ fail, and we still
>> get that error message.
>> 
>> The only reason why the second `rm` should succeed, at least that I can
>> think of, is that something on Windows blocked those files from being
>> deleted, and it is no longer blocking after a couple seconds, and that
>> usually means that an anti-malware scanned those files.
>
> Both commits referenced in df4c0d1a79's log message fixed races
> between 'test_done's cleanup and a still running background 'git gc',
> and df4c0d1a79 was meant to draw our attention to similar issues in
> the future.  And it did:
>
>   https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
>   
> So no, the failure is not inevitable, there are other reasons why the
> second 'rm' might still succeed after the first failed, even just a
> fraction of a second later.  And yes, that 'sleep' added in the patch
> above did unquestionably break that canary,

Having read that thread now I agree, but I also left with a "who cares?"
and "so let's keep the sleep then?".

I.e. is this a problem that any of the software we're maintaining is
going to care about in the wild, it's not like people are expecting gc,
repack, fast-import etc. to behave well in the face of rm -rfing the
directory they're operating on.

So it seems like just an issue that crops up because of how our test
suite manages and removes per-test trash directories. So it seems better
to:

 1. Just keep that "sleep a bit" and retry hack

 2. Maybe on some/most platforms we can use cgroups or whatever passes
    for a reliable "I started a process tree starting at this PID, kill
    -9 the whole thing please" before cleanup these days.
SZEDER Gábor March 26, 2021, 7:54 p.m. UTC | #8
On Wed, Mar 24, 2021 at 11:57:46PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Mar 24 2021, SZEDER Gábor wrote:
> 
> > On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
> >> Hi Ævar,
> >> 
> >> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
> >> 
> >> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
> >> >
> >> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> >> > >>
> >> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > >> >
> >> > >> > It seems that every once in a while in the Git for Windows SDK, there
> >> > >> > are some transient file locking issues preventing the test clean up to
> >> > >> > delete the trash directory. Let's be gentle and try again five seconds
> >> > >> > later, and only error out if it still fails the second time.
> >> > >> >
> >> > >> > This change helps Windows, and does not hurt any other platform
> >> > >> > (normally, it is highly unlikely that said deletion fails, and if it
> >> > >> > does, normally it will fail again even 5 seconds later).
> >> > >> >
> >> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > >> > ---
> >> > >> >  t/test-lib.sh | 6 +++++-
> >> > >> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >> > >> >
> >> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> >> > >> > index f31a1c8f79..9c0ca5effb 100644
> >> > >> > --- a/t/test-lib.sh
> >> > >> > +++ b/t/test-lib.sh
> >> > >> > @@ -1104,7 +1104,11 @@ test_done () {
> >> > >> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
> >> > >> >
> >> > >> >  			cd "$TRASH_DIRECTORY/.." &&
> >> > >> > -			rm -fr "$TRASH_DIRECTORY" ||
> >> > >> > +			rm -fr "$TRASH_DIRECTORY" || {
> >> > >> > +				# try again in a bit
> >> > >> > +				sleep 5;
> >> > >> > +				rm -fr "$TRASH_DIRECTORY"
> >> > >> > +			} ||
> >> > >> >  			error "Tests passed but test cleanup failed; aborting"
> >> > >> >  		fi
> >> > >> >  		test_at_end_hook_
> >> > >>
> >> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
> >> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> >> > >> 2017-04-20) for non-Windows platforms?
> >> > >
> >> > > It does not really break it, it just delays the inevitable failure.
> >> 
> >> I still think this is the best answer to this (implicit) question:
> >> 
> >> > In any case, your patch clearly undoes whatever canary for gc issues
> >> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> >> > its commit message.
> >> 
> >> I was not _really_ paying attention to that commit when I implemented the
> >> work-around you mentioned above. At the same time I think it does _not_
> >> undo the canary. If the trash directory cannot be removed via `rm -fr`,
> >> and if that is an indicator for something fishy going on, chances are that
> >> the second `rm -fr` a couple seconds later will _also_ fail, and we still
> >> get that error message.
> >> 
> >> The only reason why the second `rm` should succeed, at least that I can
> >> think of, is that something on Windows blocked those files from being
> >> deleted, and it is no longer blocking after a couple seconds, and that
> >> usually means that an anti-malware scanned those files.
> >
> > Both commits referenced in df4c0d1a79's log message fixed races
> > between 'test_done's cleanup and a still running background 'git gc',
> > and df4c0d1a79 was meant to draw our attention to similar issues in
> > the future.  And it did:
> >
> >   https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
> >   
> > So no, the failure is not inevitable, there are other reasons why the
> > second 'rm' might still succeed after the first failed, even just a
> > fraction of a second later.  And yes, that 'sleep' added in the patch
> > above did unquestionably break that canary,
> 
> Having read that thread now I agree, but I also left with a "who cares?"
> and "so let's keep the sleep then?".
> 
> I.e. is this a problem that any of the software we're maintaining is
> going to care about in the wild, it's not like people are expecting gc,
> repack, fast-import etc. to behave well in the face of rm -rfing the
> directory they're operating on.
> 
> So it seems like just an issue that crops up because of how our test
> suite manages and removes per-test trash directories.

Not at all.  The real problem is that some stray background git
process is *still* actively writing to the test repository when the
test script is already supposed to be finished.

> So it seems better
> to:
> 
>  1. Just keep that "sleep a bit" and retry hack
> 
>  2. Maybe on some/most platforms we can use cgroups or whatever passes
>     for a reliable "I started a process tree starting at this PID, kill
>     -9 the whole thing please" before cleanup these days.

What really seems better:

  3. Keep that "sleep a bit" hack only on platforms that can't delete
     opened files.
diff mbox series

Patch

diff --git a/t/test-lib.sh b/t/test-lib.sh
index f31a1c8f79..9c0ca5effb 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1104,7 +1104,11 @@  test_done () {
 			error "Tests passed but trash directory already removed before test cleanup; aborting"
 
 			cd "$TRASH_DIRECTORY/.." &&
-			rm -fr "$TRASH_DIRECTORY" ||
+			rm -fr "$TRASH_DIRECTORY" || {
+				# try again in a bit
+				sleep 5;
+				rm -fr "$TRASH_DIRECTORY"
+			} ||
 			error "Tests passed but test cleanup failed; aborting"
 		fi
 		test_at_end_hook_