diff mbox series

[1/8] generic/038: kill background threads on interrupt

Message ID 20220524073411.1943480-2-david@fromorbit.com (mailing list archive)
State New, archived
Headers show
Series fstests: _cleanup() overrides are a mess | expand

Commit Message

Dave Chinner May 24, 2022, 7:34 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When I ctrl-c g/038, it either does nothing or it leaves processes
running in the background. It is not cleaning up it's background
processes correctly, so add kill vectors into the cleanup. Make sure
we only kill in the cleanup trap if the background processes are
running.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 tests/generic/038 | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Comments

Amir Goldstein May 24, 2022, 9:41 a.m. UTC | #1
On Tue, May 24, 2022 at 12:12 PM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> When I ctrl-c g/038, it either does nothing or it leaves processes
> running in the background. It is not cleaning up it's background
> processes correctly, so add kill vectors into the cleanup. Make sure
> we only kill in the cleanup trap if the background processes are
> running.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  tests/generic/038 | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/tests/generic/038 b/tests/generic/038
> index c6cea94e..0462ea13 100755
> --- a/tests/generic/038
> +++ b/tests/generic/038
> @@ -36,6 +36,10 @@ _begin_fstest auto stress trim
>  # Override the default cleanup function.
>  _cleanup()
>  {
> +       [ -n "${create_pids}" ] && kill ${create_pids[@]}
> +       [ -n "${fallocate_pids}" ] && kill ${fallocate_pids[@]}
> +       [ -n "${trim_pids}" ] && kill ${trim_pids[@]}
> +       wait

Following the pattern of recently fixed generic/019,
Please redirect stderr of kill to /dev/null
and I think we would rather not wait in cleanup callbacks
which could potentially block forever?

>         rm -fr $tmp

I suppose another patch is going to replace that with the proper _cleanup()?
Patches from vger have been VERY slowly trickling into my mailbox.

>  }
>
> @@ -47,6 +51,8 @@ _supported_fs generic
>  _require_scratch
>  _require_xfs_io_command "falloc"
>
> +echo "Silence is golden"
> +
>  # Keep allocating and deallocating 1G of data space with the goal of creating
>  # and deleting 1 block group constantly. The intention is to race with the
>  # fstrim loop below.
> @@ -121,6 +127,7 @@ _scratch_mount
>  _require_fs_space $SCRATCH_MNT $((10 * 1024 * 1024))
>  _require_batched_discard $SCRATCH_MNT
>
> +
>  for ((i = 0; i < $((4 * $LOAD_FACTOR)); i++)); do
>         trim_loop &
>         trim_pids[$i]=$!
> @@ -136,12 +143,9 @@ create_files "foobar"
>  kill ${fallocate_pids[@]}
>  kill ${trim_pids[@]}
>  wait
> +unset create_pids

This one should be moved up to after the wait in  create_files()

Thanks,
Amir.
Dave Chinner May 24, 2022, 12:10 p.m. UTC | #2
On Tue, May 24, 2022 at 12:41:25PM +0300, Amir Goldstein wrote:
> On Tue, May 24, 2022 at 12:12 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > When I ctrl-c g/038, it either does nothing or it leaves processes
> > running in the background. It is not cleaning up it's background
> > processes correctly, so add kill vectors into the cleanup. Make sure
> > we only kill in the cleanup trap if the background processes are
> > running.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  tests/generic/038 | 16 ++++++++++------
> >  1 file changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/tests/generic/038 b/tests/generic/038
> > index c6cea94e..0462ea13 100755
> > --- a/tests/generic/038
> > +++ b/tests/generic/038
> > @@ -36,6 +36,10 @@ _begin_fstest auto stress trim
> >  # Override the default cleanup function.
> >  _cleanup()
> >  {
> > +       [ -n "${create_pids}" ] && kill ${create_pids[@]}
> > +       [ -n "${fallocate_pids}" ] && kill ${fallocate_pids[@]}
> > +       [ -n "${trim_pids}" ] && kill ${trim_pids[@]}
> > +       wait
> 
> Following the pattern of recently fixed generic/019,
> Please redirect stderr of kill to /dev/null

No. Redirecting errors to /dev/null to hide them is poor behaviour
- g/019 does not test if the pids variables are still valid even
though they get unset. Hence the normal exit trap
runs an empty `kill` command which outputs a help message. The
redirect to /dev/null hides this broken behaviour - it's poor,
buggy code, and an example how buggy and broken a lot of the cleanup
code actually is.

The code above will not run kill with an empty pid list, hence it
should not fail when it is run. If it does fail then we've got a
problem that needs to be captured and analysed and so redirecting
that error to /dev/null is exactly the wrong thing to be doing.

> and I think we would rather not wait in cleanup callbacks
> which could potentially block forever?

We do that in many, many tests - that's how we stop the test leaving
background processes running after the test exits. And if it hangs
here waiting on a hung process, then as a developer using this to
find bugs in my code, I really do want the test to hang hard so I
notice it and can capture the failure and analyse it.

Leaving a warm corpse to post-morten is the desired behaviour of a
failed test.

> >         rm -fr $tmp
> 
> I suppose another patch is going to replace that with the proper _cleanup()?
> Patches from vger have been VERY slowly trickling into my mailbox.

Not in this patch series. When I tackle the generic tests I'll
address these bugs. This patch is just fixing broken ctrl-c
behaviour.

Cheers,

Dave.
Amir Goldstein May 24, 2022, 12:30 p.m. UTC | #3
On Tue, May 24, 2022 at 3:10 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, May 24, 2022 at 12:41:25PM +0300, Amir Goldstein wrote:
> > On Tue, May 24, 2022 at 12:12 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > When I ctrl-c g/038, it either does nothing or it leaves processes
> > > running in the background. It is not cleaning up it's background
> > > processes correctly, so add kill vectors into the cleanup. Make sure
> > > we only kill in the cleanup trap if the background processes are
> > > running.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  tests/generic/038 | 16 ++++++++++------
> > >  1 file changed, 10 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/tests/generic/038 b/tests/generic/038
> > > index c6cea94e..0462ea13 100755
> > > --- a/tests/generic/038
> > > +++ b/tests/generic/038
> > > @@ -36,6 +36,10 @@ _begin_fstest auto stress trim
> > >  # Override the default cleanup function.
> > >  _cleanup()
> > >  {
> > > +       [ -n "${create_pids}" ] && kill ${create_pids[@]}
> > > +       [ -n "${fallocate_pids}" ] && kill ${fallocate_pids[@]}
> > > +       [ -n "${trim_pids}" ] && kill ${trim_pids[@]}
> > > +       wait
> >
> > Following the pattern of recently fixed generic/019,
> > Please redirect stderr of kill to /dev/null
>
> No. Redirecting errors to /dev/null to hide them is poor behaviour
> - g/019 does not test if the pids variables are still valid even
> though they get unset. Hence the normal exit trap
> runs an empty `kill` command which outputs a help message. The
> redirect to /dev/null hides this broken behaviour - it's poor,
> buggy code, and an example how buggy and broken a lot of the cleanup
> code actually is.
>
> The code above will not run kill with an empty pid list, hence it
> should not fail when it is run. If it does fail then we've got a
> problem that needs to be captured and analysed and so redirecting
> that error to /dev/null is exactly the wrong thing to be doing.
>
> > and I think we would rather not wait in cleanup callbacks
> > which could potentially block forever?
>
> We do that in many, many tests - that's how we stop the test leaving

Yes, I've seen that.

> background processes running after the test exits. And if it hangs
> here waiting on a hung process, then as a developer using this to
> find bugs in my code, I really do want the test to hang hard so I
> notice it and can capture the failure and analyse it.

Makes sense to me, but I think the practice for cleanup should be
(like in fsstress_cleanup) to send SIGKILL and wait in case
the processes are ignoring or already handling SIGTERM.

>
> Leaving a warm corpse to post-morten is the desired behaviour of a
> failed test.
>
> > >         rm -fr $tmp
> >
> > I suppose another patch is going to replace that with the proper _cleanup()?
> > Patches from vger have been VERY slowly trickling into my mailbox.
>
> Not in this patch series. When I tackle the generic tests I'll
> address these bugs. This patch is just fixing broken ctrl-c
> behaviour.
>

Great.
So w.r.t this patch there are only minor comments of kill -9
and move unset of create_pids.
Take it or leave it, either way you may add:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.
diff mbox series

Patch

diff --git a/tests/generic/038 b/tests/generic/038
index c6cea94e..0462ea13 100755
--- a/tests/generic/038
+++ b/tests/generic/038
@@ -36,6 +36,10 @@  _begin_fstest auto stress trim
 # Override the default cleanup function.
 _cleanup()
 {
+	[ -n "${create_pids}" ] && kill ${create_pids[@]}
+	[ -n "${fallocate_pids}" ] && kill ${fallocate_pids[@]}
+	[ -n "${trim_pids}" ] && kill ${trim_pids[@]}
+	wait
 	rm -fr $tmp
 }
 
@@ -47,6 +51,8 @@  _supported_fs generic
 _require_scratch
 _require_xfs_io_command "falloc"
 
+echo "Silence is golden"
+
 # Keep allocating and deallocating 1G of data space with the goal of creating
 # and deleting 1 block group constantly. The intention is to race with the
 # fstrim loop below.
@@ -121,6 +127,7 @@  _scratch_mount
 _require_fs_space $SCRATCH_MNT $((10 * 1024 * 1024))
 _require_batched_discard $SCRATCH_MNT
 
+
 for ((i = 0; i < $((4 * $LOAD_FACTOR)); i++)); do
 	trim_loop &
 	trim_pids[$i]=$!
@@ -136,12 +143,9 @@  create_files "foobar"
 kill ${fallocate_pids[@]}
 kill ${trim_pids[@]}
 wait
+unset create_pids
+unset fallocate_pids
+unset trim_pids
 
-# The fstests framework will now check for fs consistency with fsck.
-# The trimming was racy and caused some btree nodes to get full of zeroes on
-# disk, which obviously caused fs metadata corruption. The race often lead
-# to missing free space entries in a block group's free space cache too.
-
-echo "Silence is golden"
 status=0
 exit